Articles tagged c at null program

2026 has been the most pivotal year in my career… and it's only March

2026-03-29T21:38:22Z

In February I left my employer after nearly two decades of service. In the moment I was optimistic, yet unsure I made the right choice. Dust settled, I’m now absolutely sure I chose correctly. I’m happier and better for it. There were multiple factors, but it’s not mere chance it coincides with these early months of the automation of software engineering. I left an employer that is years behind in adopting AI for one actively supporting and encouraging it. As of March, in my professional capacity I no longer write code myself. My current situation was unimaginable to me only a year ago. Like it or not, this is the future of software engineering. Turns out I like it, and having tasted the future I don’t want to go back to the old ways.

In case you’re worried, this is still me. These are my own words. Writing is thinking, and it would defeat the purpose for an AI to write in my place on my personal blog. That’s not going to change.

I still spend much time reading and understanding code, and using most of the same development tools. It’s more like being a manager, orchestrating a nebulous team of inhumanly-fast, nameless assistants. Instead of dicing the vegetables, I conjure a helper to do it while I continue to run the kitchen. I haven’t managed people in some 20 years now, but I can feel those old muscles being put to use again as I improve at this new role. Will these kitchens still need human chefs like me by the end of the decade? Unclear, and it’s something we all need to prepare for.

My situation gave me an experience onboarding with AI assistance: a fast process given a near-instant, infinitely-patient helper answering any question about the code. By the second week I was making substantial, wide contributions to the large C++ code base. It’s difficult to attach a quantifiable factor like 2x, 5x, 10x, etc. faster, but I can say for certain this wouldn’t have been possible without AI. The bottlenecks have shifted from producing code, which now takes relatively no time at all, to other points, and we’re all still trying to figure it out.

My personal programming has transformed as well. Everything I said about AI in late 2024 is, as I predicted, utterly obsolete. There’s a huge, growing gap between open weight models and the frontier. Models you can run yourself are toys. In general, almost any AI product or service worth your attention costs money. The free stuff is, at minimum, months behind. Most people only use limited, free services, so there’s a broad unawareness of just how far AI has advanced. AI is now highly skilled at programming, and better than me at almost every programming task, with inhumanly-low defect rates. The remaining issues are mainly steering problems: If AI code doesn’t do what I need, likely the AI writing it didn’t understand what I needed.

I’ll still write code myself from time to time for fun — minimalist, with my style and techniques — the same way I play shogi on the weekends for fun. However, artisan production is uneconomical in the presence of industrialization. AI makes programming so cheap that only the rich will write code by hand.

A small part of me is sad at what is lost. A bigger part is excited about the possibilities of the future. I’ve always had more ideas than time or energy to pursue them. With AI at my command, the problem changes shape. I can comfortably take on complexity from which I previously shied away, and I can take a shot at any idea sufficiently formed in my mind to prompt an AI — a whole skill of its own that I’m actively developing.

For instance, a couple weeks ago I put AI to work on a problem, and it produced a working solution for me after ~12 hours of continuous, autonomous work, literally while I slept. The past month w64devkit has burst with activity, almost entirely AI-driven. Some of it architectural changes I’ve wanted for years, but would require hours of tedious work, and so I never got around to it. AI knocked it out in minutes, with the new architecture opening new opportunities. It’s also taken on most of the cognitive load of maintenance.

Quilt.cpp

So far my biggest successful undertaking is Quilt.cpp, a C++ clone of Quilt, an early, actively-used source control system for patch management. Git is a glaring omission from the almost complete w64devkit, due to platform and build issues. I’ve thought Quilt could fill some of that source control hole, except the original is written in Bash, Perl, and GNU Coreutils, even more of a challenge than Git. Since Quilt is conceptually simple, and I could lean on busybox-w32 diff and patch, I’ve considered writing my own implementation, just as I did pkg-config, but I never found the energy to do it.

Then I got good enough with AI to knock out a near feature-complete clone in about four days, including a built-in diff and patch so it doesn’t actually depend on external tools (except invoking $EDITOR). On Windows it’s a ~1.6MB standalone EXE, to be included in future w64devkit releases. The source is distributed as an amalgamation, a single file quilt.cpp per its namesake:

$ c++ -std=c++20 -O2 -s -o quilt.exe quilt.cpp
$ ./quilt.exe --help
Usage: quilt [--quiltrc file] <command> [options] [args]

Commands:
  new        Create a new empty patch
  add        Add files to the topmost patch
  push       Apply patches to the source tree
  pop        Remove applied patches from the stack
  refresh    Regenerate a patch from working tree changes
  diff       Show the diff of the topmost or a specified patch
  series     List all patches in the series
  applied    List applied patches
  unapplied  List patches not yet applied
  top        Show the topmost applied patch
  next       Show the next patch after the top or a given patch
  previous   Show the patch before the top or a given patch
  delete     Remove a patch from the series
  rename     Rename a patch
  import     Import an external patch into the series
  header     Print or modify a patch header
  files      List files modified by a patch
  patches    List patches that modify a given file
  edit       Add files to the topmost patch and open an editor
  revert     Discard working tree changes to files in a patch
  remove     Remove files from the topmost patch
  fold       Fold a diff from stdin into the topmost patch
  fork       Create a copy of the topmost patch under a new name
  annotate   Show which patch modified each line of a file
  graph      Print a dot dependency graph of applied patches
  mail       Generate an mbox file from a range of patches
  grep       Search source files (not implemented)
  setup      Set up a source tree from a series file (not implemented)
  shell      Open a subshell (not implemented)
  snapshot   Save a snapshot of the working tree for later diff
  upgrade    Upgrade quilt metadata to the current format
  init       Initialize quilt metadata in the current directory

Use "quilt <command> --help" for details on a specific command.

It supports Windows and POSIX, and runs ~5x faster than the original. AI developed it on Windows, Linux, and macOS: It’s best when the AI can close the debug loop and tackle problems autonomously without involving a human slowpoke. The handful of “not implemented” parts aren’t because they’re too hard — each would probably take an AI ~10 minutes — but deliberate decisions of taste.

There’s an irony that the reason I could produce Quilt.cpp with such ease is also a reason I don’t really need it anymore.

I changed the output of quilt mail to be more Git-compatible. The mbox produced by Quilt.cpp can be imported into Git with a plain git am:

$ quilt mail --mbox feature-branch.mbox
$ git am feature-branch.mbox

The idea being that I could work on a machine without Git (e.g. Windows XP), and copy/mail the mbox to another machine where Git can absorb it as though it were in Git the whole time. git format-patch to quilt import sends commits in the opposite direction, useful for manually testing Quilt.cpp on real change sets.

To be clear, I could not have done this if the original Quilt did not exist as a working program. I began with an AI generating a conformance suite based on the original, its documentation, and other online documentation, validating that suite against the original implementation (see -DQUILT_TEST_EXECUTABLE). Then I had another AI code to the tests, on architectural guidance from me, with -D_GLIBCXX_DEBUG and sanitizers as guardrails. That was day one. The next three days were lots of refining and iteration as I discovered the gaps in the test suite. I’d prompt AI to compare Quilt.cpp to the original Quilt man page, add tests for missing features, validate the new tests against the original Quilt, then run several agents to fix the tests. While they worked I’d try the latest build and note any bugs. As of this writing, the result is about equal parts test and non-test, ~9KLoC each.
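The "validate against the original" step is essentially a differential check: feed the same arguments to the reference and to the clone, then compare what comes out. Here is a minimal sketch of that idea, not the actual Quilt.cpp harness. The helper names `run_case` and `same_behavior` are hypothetical, and the three "programs" are stand-in Python one-liners rather than real Quilt builds.

```python
import subprocess
import sys

def run_case(exe, args, stdin=b""):
    """Run one command and capture what a conformance test would compare."""
    proc = subprocess.run([*exe, *args], input=stdin,
                          capture_output=True, check=False)
    return proc.returncode, proc.stdout, proc.stderr

def same_behavior(reference, candidate, args, stdin=b""):
    """True when both programs agree on exit status and standard output."""
    ref = run_case(reference, args, stdin)
    new = run_case(candidate, args, stdin)
    return ref[:2] == new[:2]

# Stand-ins for the two implementations (hypothetical): each is an argv prefix.
reference = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
candidate = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
broken    = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
```

A test that fails against the reference is a bad test, so each new case gets checked against the reference before the clone is asked to pass it.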

I’m likely to use this technique to clone other tools with implementations unsuitable for my purposes. I learned quite a bit from this first attempt.

Why C++ instead of my usual choice of C? As we know, conventional C is highly error-prone. Even AI has trouble with it. In the ~9k lines of C++ that is Quilt.cpp, I am only aware of three memory safety errors by the AI. Two were null-terminated string issues with strtol, where the AI was essentially writing C instead of C++, after which I directed the AI to use std::from_chars and drop as much direct libc use as possible. (The other was an unlikely branch with std::vector::back on an empty vector.) We can rescue C with better techniques like arena allocation, counted strings, and slices, but while (current) state of the art AI understands these things, it cannot work effectively with them in C. I’ve tried. So I picked C++, and from my professional work I know AI is better at C++ than me.

Also like a manager, I have not read most of the code, and instead focused on results, so you might say this was “vibe-coded.” It is thoroughly tested, though I’m sure there are still bugs to be ironed out, especially on the more esoteric features I haven’t tried by hand yet.

Let’s discuss tools

After opposing CMake for years, you may have noticed the latest w64devkit now includes CMake and Ninja. What happened? Preparing for my anticipated employment change, this past December I read Professional CMake. I realized that my practical problems with CMake were that nearly everyone uses it incorrectly. Most CMake builds are a disaster, but my new-found knowledge allows me to navigate the common mistakes. Only high profile open source projects manage to put together proper CMake builds. Otherwise the internet is loaded with CMake misinformation. Similar to AI, if you’re not paying for CMake knowledge then it’s likely wrong or misleading. So I highly recommend that book!

Frontier AI is very good with CMake. When a project has a CMake build that isn’t too badly broken, just tell AI to fix it, without any specifics, and build problems disappear in mere minutes without having to think about it. It’s awesome. Combine it with the previous discussion about tests making AI so much more effective, and that it also knows CTest well, and you’ve got a killer formula. I’m more effective with CTest myself merely from observing how AI uses it. AI (currently) cannot use debuggers, so putting powerful, familiar testing tools in its hands helps a lot, versus the usual bespoke, debugger-friendly solutions I prefer.

Similar to solving CMake problems: Have a hairy merge conflict? Just ask AI to resolve it. It’s like magic. I no longer fear merge conflicts.

So part of my motivation for adding CMake to w64devkit was anticipation of projects like Quilt.cpp, where they’d be available to AI, or at least so I could use the tools the AI used to build/test myself. It’s already paid for itself, and there’s more to come.

For agent software, on personal projects I’m using Claude Code. It’s a great value, cheaper than paying API rates but requires working around 5-hour limit windows. I started with Pro (US$20/mo), but I’m getting so much out of it that as of this writing I’m on 5x Max (US$100/mo) simply to have enough to explore all my ideas. Be warned: Anthropic software is quite buggy, more so than industry average, and it’s obvious that they never even start, let alone test, some of their released software on disfavored platforms (Windows, Android). Don’t expect to use Claude Code effectively for native Windows platform development, which sadly includes w64devkit. Hopefully that’s fixed someday. I suspect Anthropic hit a bottleneck on QA, and unable to fit AI in that role they don’t bother. You can theoretically report bugs on GitHub, but they’re just ignored and closed. (Why don’t they have AI agents jumping on this wealth of bug reports?)

At work I’m using Cursor where I get a choice of models. My favorite for March has been GPT-5.4, which in my experience beats Opus 4.6 on Claude Code by a small margin. It’s immediately obvious that Cursor is better agent software than Claude Code. It’s more robust, more featureful, and with a clearer UI than Claude Code. It has no trouble on Windows and can drive w64devkit flawlessly. It’s also more expensive than Claude Code. My employer currently spends ~US$250/mo on my AI tokens, dirt cheap considering what they’re getting out of it. I have bottlenecks elsewhere that keep me from spending even more.

Neither Cursor nor Claude Code are open source, so what are the purists to do, even if they’re willing to pay API rates for tokens? Sadly I have no answers for you. I haven’t gotten any open source agent software actually working, and it seems they may lack the necessary secret sauce.

Update: Several folks suggested I give OpenCode another shot, and this time I got over the configuration hurdle. Single executable, slick interface, and unlike Claude Code, I observed no bugs in my brief trial. Give that a shot if you’re looking for an open source client.

The future is going to be weird. My experience is only a peek at what’s to come, and my head is still spinning. However, the more I adapt to the changes, the better I feel. If you’re feeling anxious like I was, don’t flinch from improving your own AI knowledge and experience.

Frankenwine: Multiple personas in a Wine process

2026-01-19T21:51:38Z

I came across a recent article on making Linux system calls from a Wine process. Windows programs running under Wine are still normal Linux processes and may interact with the Linux kernel like any other process. None of this was surprising, and the demonstration works just as I expect. Still, it got the wheels spinning and I realized an almost practical application: build my pkg-config implementation such that on Windows pkg-config.exe behaves as a native pkg-config, but when run under Wine this same binary takes the persona of a Linux program and becomes a cross toolchain pkg-config, bypassing Win32 and talking directly with the Linux kernel. Cosmopolitan Libc cleverly does this out-of-the-box, but in this article we’ll mash together a couple existing sources with a bit of glue.

The results are in the merge-demo branch of u-config, and took hardly any work:

$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)

A platform layer, main_wine.c, is a merge of two existing platform layers, one of which required unavoidable tweaks. We’ll get to those details in a moment. First we’ll need to detect if we’re running under Wine, and the best solution I found was to locate ntdll!wine_get_version. If this function exists, we’re in Wine. That works out to a pretty one-liner because ntdll.dll is already loaded:

bool running_on_wine()
{
    return GetProcAddress(GetModuleHandleA("ntdll"), "wine_get_version");
}

An x86-64 Linux syscall wrapper with thorough inline assembly:

ptrdiff_t syscall3(int n, ptrdiff_t a, ptrdiff_t b, ptrdiff_t c)
{
    ptrdiff_t r;
    asm volatile (
        "syscall"
        : "=a"(r)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return r;
}

ptrdiff_t write(int fd, void *buf, ptrdiff_t len)
{
    return syscall3(SYS_write, fd, (ptrdiff_t)buf, len);
}

I’d normally use long for all these integers because Linux is LP64 (long is pointer-sized), but Windows is LLP64 (only long long is 64 bits). It’s so bizarre to interface with Linux from LLP64, and this will have consequences later. With these pieces we can see the basic shape of a split personality program:

    if (running_on_wine()) {
        write(1, "hello, wine\n", 12);
    } else {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        WriteFile(h, "hello, windows\n", 15, 0, 0);
    }

We can cram two programs into this binary and select which program at run time depending on what we see. In typical programs locating and calling into glibc would be a challenge, particularly with the incompatible ABIs involved. We’re avoiding it here by interfacing directly with the kernel.

Application to u-config

Luckily u-config has completely-optional platform layers implemented with Linux system calls. The POSIX platform layer works fine, and that’s what distributions should generally use, but these bonus platforms are unhosted and do not require libc. That means we can shove it into a Windows build with relatively little trouble.

Before we do that, let’s think about what we’re doing. Debian has great cross toolchain support, including Mingw-w64. There are even a few Windows libraries in the Debian package repository, such as zlib, and we can build Windows programs against them. If you’re cross-building and using pkg-config, you ought to use the cross toolchain pkg-config, which in GNU ecosystems gets an architecture prefix like the other cross tools. Debian cross toolchains each include a cross pkg-config, and it sometimes almost works correctly! Here’s what I get on Debian 13:

$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz

Note the architecture in the -I and -L options. It really is querying the cross sysroot. Though these paths are in the cross sysroot, and so should not be listed by pkg-config. It’s suboptimal and indicates this pkg-config is probably misconfigured. In other cases it’s far from correct:

$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...

A tool prefixed x86_64-w64-mingw32- should not produce paths containing x86_64-linux-gnu (the host architecture in this case). Our version won’t have these issues.
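For a sense of what "should not be listed" means, here is a rough sketch of the filtering a correctly configured pkg-config performs: flags naming directories the toolchain already searches are dropped. This is an illustration in Python, not u-config's implementation, and `drop_system_paths` is a hypothetical name.

```python
def drop_system_paths(flags, system_include, system_lib):
    """Remove -I/-L flags pointing at directories the toolchain already searches."""
    kept = []
    for flag in flags:
        if flag.startswith("-I") and flag[2:] in system_include:
            continue
        if flag.startswith("-L") and flag[2:] in system_lib:
            continue
        kept.append(flag)
    return kept

# The Debian output from above, filtered against the cross sysroot.
flags = ["-I/usr/x86_64-w64-mingw32/include",
         "-L/usr/x86_64-w64-mingw32/lib",
         "-lz"]
filtered = drop_system_paths(
    flags,
    system_include={"/usr/x86_64-w64-mingw32/include"},
    system_lib={"/usr/x86_64-w64-mingw32/lib"},
)
```

With the cross sysroot registered as the system paths, only `-lz` survives.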

The u-config platform interface is five functions:

filemap os_mapfile(os *, arena *, s8 path);  // read whole files
s8node *os_listing(os *, arena *, s8 path);  // list directories
void    os_write(os *, i32 fd, s8);          // standard out/err
void    os_fail(os *);                       // non-zero exit

void uconfig(config *);

Platforms implement the first four functions, and call uconfig() with the platform’s configuration, context pointer (os *), command line arguments, environment, and some memory (all in the config object). My strategy is to link two platforms into the binary, and the first challenge is they both define os_write, etc. I did not plan nor intend for one binary to contain more than one platform layer. Unity builds offer a fix without changing a single line of code:

#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include "main_windows.c"
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail

#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include "main_linux_amd64.c"
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail

This dirty but effective trick may look familiar. It also doesn’t interfere with the other builds. Next I define the real platform functions as a dispatch based on our run-time situation:

b32 wine_detected;

filemap os_mapfile(os *ctx, arena *a, s8 path)
{
    if (wine_detected) {
        return linux_mapfile(ctx, a, path);
    } else {
        return win32_mapfile(ctx, a, path);
    }
}

If I were serious about keeping this experiment, I’d lift os as I did the functions (as win32_os, linux_os) and include wine_detected in the context, eliminating this global variable. That cannot be done with simple hacks and macros.

The next challenge is that I wrote the Linux platform layer assuming LP64, and so it uses long instead of an equivalent platform-agnostic type like ptrdiff_t. I never thought this would be an issue because this source literally contains asm blocks and no conditional compilation, yet here we are. Lesson learned. I wanted to try an extremely janky #define on long to fix it, but this source file has a couple long long that won’t play along. These multi-token type names of C are antithetical to its preprocessor! So I adjusted the source manually instead.

The Windows and Linux platform entry points are completely different, both in name and form, and so co-exist naturally. The merged platform layer is a new entry point that will pass control to the appropriate entry point:

void entrypoint(ptrdiff_t *stack);  // Linux
void __stdcall mainCRTStartup();    // Windows

On Linux stack is the initial value of the stack pointer, which points to argc, argv, envp, and auxv. We’ll need to construct an artificial “stack” for the Linux platform layer to harvest. On Windows this is the process entry point, and it will find the rest on its own as a normal Windows process. Ultimately this ended up simpler than I expected:

void __stdcall merge_entrypoint()
{
    wine_detected = running_on_wine();
    if (wine_detected) {
        u8 *fakestack[CMDLINE_ARGV_MAX+1];
        c16 *cmd = GetCommandLineW();
        fakestack[0] = (u8 *)(iz)cmdline_to_argv8(cmd, fakestack+1);
        // TODO: append envp to the fake stack
        entrypoint((iz *)fakestack);
    } else {
        mainCRTStartup();
    }
}

Where cmdline_to_argv8 is my Windows argument parser, already used by u-config, and I reserve one element at the front to store argc. Since this is just a proof-of-concept I didn’t bother fabricating and pushing envp onto the fake stack. The Linux entry point doesn’t need auxv and can be omitted. Once in the Linux entry point it’s essentially a Linux process from then on, except the x64 calling convention is still in use internally.
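If the layout of that initial stack is unfamiliar, a toy model may help. The sketch below uses a Python list as the sequence of words a Linux process finds at its initial stack pointer: argc, the argv pointers, a null terminator, the envp pointers, and another null (auxv would follow and is left out here). It models the general Linux layout, not u-config's code, and `build_stack` and `parse_stack` are hypothetical names.

```python
def build_stack(argv, envp=()):
    """Lay out words the way a Linux process finds them at its initial stack pointer."""
    return [len(argv), *argv, None, *envp, None]

def parse_stack(stack):
    """Recover argc, argv, and envp by walking the words in order."""
    argc = stack[0]
    argv = stack[1:1 + argc]
    assert stack[1 + argc] is None   # argv terminator
    envp = []
    for word in stack[2 + argc:]:
        if word is None:             # envp terminator
            break
        envp.append(word)
    return argc, argv, envp

stack = build_stack(["pkg-config", "--libs", "zlib"], ["HOME=/root"])
```

An entry point that only needs argc and argv can stop reading after the first terminator.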

Finally, I configure the Linux platform layer for Debian’s cross sysroot:

#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "/usr/x86_64-w64-mingw32/lib"

And that’s it! We have our platform merge. Build (w64devkit):

$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c

On Debian use x86_64-w64-mingw32-gcc for cc. The -e linker option selects the new, higher level entry point. After installing Wine binfmt, here’s how it looks on Debian:

$ ./pkg-config.exe --cflags --libs zlib
-lz

That’s the correct output, but is it using the cross sysroot? Ask it to include the -I argument despite it being in the cross sysroot:

$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz

Looking good! It passes the pc_path test, too:

$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig

Running this same binary on Windows after installing zlib in w64devkit:

$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz

Also:

$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig

My Frankenwine is a success!

WebAssembly as a Python extension platform

2026-01-01T21:21:19Z

Software above some complexity level tends to sport an extension language, becoming a kind of software platform itself. Lua fills this role well, and of course there’s JavaScript for web technologies. WebAssembly generalizes this, and any Wasm-targeting programming language can extend a Wasm-hosting application. It has more friction than supplying a script in a text file, but extension authors can write in their language of choice, and use more polished development tools — debugging, testing, etc. — than typically available for a typical extension language. Python is traditionally extended through native code behind a C interface, but it’s recently become practical to extend Python with Wasm. That is we can ship an architecture-independent Wasm blob inside a Python library, and use it without requiring a native toolchain on the host system. Let’s discuss two different use cases and their pitfalls.

Normally we’d extend Python in order to access an external interface that Python cannot access on its own. Wasm runs in a sandbox with no access to the outside world whatsoever, so it obviously isn’t useful for that case. Extensions may also grant Python more speed, which is one of Wasm’s main selling points. We can also use Wasm to access embeddable capabilities written in a different programming language which do not require external access.

My preferred non-WASI Wasm runtime is Volodymyr Shymanskyy’s wasm3. It’s plain old C and very friendly to embedding in the same way as, say, SQLite. Performance is middling, though a C program running on wasm3 is still quite a bit faster than an equivalent Python program. It has Python bindings, pywasm3, but it’s distributed only in source code form. That is, the host machine must have a C toolchain in order to use pywasm3, which defeats my purposes here. If there’s a C toolchain, I might as well just use that instead of going through Wasm.

For the use cases in this article, the best option is wasmtime-py. The distribution includes binaries for Windows, macOS, and Linux on x86-64 and ARM64, which covers nearly all Python installations. Hosts require nothing more than a Python interpreter, no native toolchains. It’s almost as good as having Wasm built into Python itself. In my tests it’s 3x–10x faster than wasm3, so for my first use case the situation is even better. The catch is that it currently weighs ~18MiB (installed), and in the future will likely rival the Python interpreter itself. The API also breaks on a monthly basis, so you’re signing up for the upgrade treadmill lest your own program perishes to bitrot after a couple of years. This article is about version 40.

Usage examples and gotchas

The official examples don’t do anything non-trivial or interesting, and so to figure things out I had to study the documentation, which does not offer many hints. Basic setup looks like this:

import functools
import wasmtime

store    = wasmtime.Store()
module   = wasmtime.Module.from_file(store.engine, "example.wasm")
instance = wasmtime.Instance(store, module, ())
exports  = instance.exports(store)

memory = exports["memory"].get_buffer_ptr(store)
func1  = functools.partial(exports["func1"], store)
func2  = functools.partial(exports["func2"], store)
func3  = functools.partial(exports["func3"], store)

A store is an allocation region from which we allocate all Wasm objects. It is not possible to free individual objects except to discard the whole store. Quite sensible, honestly. What’s not sensible is how often I have to repeat myself, passing the store back into every object in order to use it. These objects are associated with exactly one store and cannot be used with different stores. Use the wrong store and it panics: It’s already keeping track internally! I do not understand why the interface works this way. So to make things simpler, I use functools.partial to bind the store parameter and so get the interface I expect.

The get_buffer_ptr object is a buffer protocol object, and if you’re moving anything other than bytes that’s probably what you want to use to access memory. The usual caveats apply for this object: If you change the memory size you probably want to grab a fresh buffer object. For bytes (e.g. buffers and strings) I prefer the read and write methods.

Because multi-value is still in an experimental state in the Wasm ecosystem, you will likely not pass structs with Wasm. Anything more complicated than scalars will require pointers and copying data in and out of Wasm linear memory. This involves the usual trap that catches nearly everyone: Wasm interfaces make no distinction between pointers and integers, and Wasm runtimes generally interpret all integers as signed. What that means is your pointers are signed unless you take action. Addresses start at 0, so this is bad, bad news.

malloc = functools.partial(exports["func1"], store)

hello = b"hello"
pointer = malloc(len(hello))
assert pointer
memory = exports["memory"].write(store, hello, pointer)  # WRONG!

To make matters worse, wasmtime-py adds its own footgun: The read and write methods adopt the questionable Python convention of negative indices acting from the end. If malloc returns a pointer in the upper half of memory, the negative pointer will pass the bounds check inside write because negative is valid, then quietly store to the wrong address! Doh!
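To see how quietly this goes wrong, here is a scale model. It is not wasmtime-py: a plain bytearray stands in for linear memory, and addresses are shrunk to 16 bits (purely an assumption so the model stays small) instead of 32. The mechanism is the same: an upper-half address arrives as a negative number, and negative indexing happily accepts it.

```python
# Scale model: pretend addresses are 16 bits wide instead of 32.
memory = bytearray(0xC000)        # 48KiB of "linear memory"

def as_signed16(value):
    """Reinterpret a 16-bit pattern as signed, as a runtime does for i32."""
    return value - 0x10000 if value & 0x8000 else value

address = 0x8010                  # a valid address in the upper half
pointer = as_signed16(address)    # what the host receives: a negative number

memory[pointer] = 0xAA            # no exception: negative index counts from the end
wrong = len(memory) + pointer     # where the byte actually went
```

The store lands at `wrong`, not at `address`, and nothing complains.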

I wondered how common this error is, so I searched online. I could find only one non-trivial wasmtime-py use in the wild, in a sandboxed PDF reader. It falls into the negative pointer trap as I expected. Not only that, it’s a buffer overflow into Python’s memory space:

            buf_ptr = malloc(store, len(pdf_data))
            mem_data = memory.data_ptr(store)

            for i, byte in enumerate(pdf_data):
                mem_data[buf_ptr + i] = byte

The data_ptr method returns a non-bounds-checked raw ctypes pointer, so this is actually a double mistake. First, it shouldn’t trust pointers coming out of Wasm if it cares at all about sandboxing. The second is the potential negative pointer, which in this case would write outside of the Wasm memory and in Python’s memory, hopefully seg-faulting.

What’s one to do? Every pointer coming out of Wasm must be truncated with a mask:

pointer = malloc(...) & 0xffffffff   # correct for wasm32!

This interprets the result as unsigned. 64-bit Wasm needs a 64-bit mask, though in practice you will never get a valid negative pointer from 64-bit Wasm. This rule applies to JavaScript as well, where the idiom is:

let pointer = malloc(...) >>> 0

Wasm runtimes cannot help — they lack the necessary information — and this is perhaps a fundamental flaw in Wasm’s design. Once you know about it you see this mistake happening everywhere.
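A small sketch of the mask at work, with no Wasm runtime involved. The signed value is manufactured with the struct package to mimic what a host sees when a wasm32 function returns an upper-half address; `u32` is a hypothetical helper name.

```python
import struct

def u32(value):
    """Reinterpret a possibly-negative i32 result as an unsigned 32-bit address."""
    return value & 0xffffffff

# An upper-half address as the host sees it after signed interpretation.
raw = struct.unpack("<i", struct.pack("<I", 0x80000010))[0]
fixed = u32(raw)
```

Wrapping every pointer-returning export in such a helper at the point where it is bound keeps the mask from being forgotten at individual call sites.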

Now that you have a proper address, you can apply it to a buffer protocol view of memory. If you’re using NumPy there are various ways to interact with this memory by wrapping it in NumPy types, though only if you’re on a little endian host. (If you’re on a big endian machine, just give up on running Wasm anyway.) The first use case I have in mind typically involves copying plain Python values in and out. The struct package is quite handy here:

vec2   = malloc(...) & 0xffffffff
memory = exports["memory"].get_buffer_ptr(store)
struct.pack_into("<ii", memory, vec2, x, y)  # assuming two little-endian 32-bit ints

It fills a similar role to JavaScript DataView. If you’re copying lots of numbers, with CPython it’s faster to construct a custom format string rather than use a loop:

nums: list[int] = ...
struct.pack_into(f"<{len(nums)}i", memory, buf, *nums)

To copy structures back out, use struct.unpack_from. If you’re moving strings, you’ll need to .encode() and .decode() to convert to and from bytes, which are well-suited to read and write.
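Here is a round trip showing those pieces together. It is only a sketch: a bytearray stands in for the buffer protocol view of Wasm memory, and the addresses are made-up constants rather than results from a guest allocator.

```python
import struct

memory = bytearray(64)            # stand-in for a buffer protocol view of Wasm memory
vec2 = 16                         # hypothetical address, already masked to unsigned

struct.pack_into("<ii", memory, vec2, 3, -4)        # two little-endian i32
x, y = struct.unpack_from("<ii", memory, vec2)

nums = [10, 20, 30]
buf = 32                          # another hypothetical address
struct.pack_into(f"<{len(nums)}i", memory, buf, *nums)
out = list(struct.unpack_from(f"<{len(nums)}i", memory, buf))

text = "héllo".encode()           # str -> bytes (UTF-8) before copying in
memory[0:len(text)] = text
back = bytes(memory[0:len(text)]).decode()
```

The explicit `<` in each format string matters: it pins the layout to little endian with no padding, which is what Wasm linear memory uses regardless of the host.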

In practice with real Wasm programs you’re going to be interacting with the “guest” allocator from the outside, to request memory into which you copy inputs for a function. In my examples I’ve used malloc because it requires no elaboration, but as usual a bump allocator solves this so much better, especially because it doesn’t require stuffing a whole general purpose allocator inside the Wasm program. Have one global arena (no other threads will be sharing that Wasm instance), rapid fire a bunch of allocations as needed without any concern for memory management in the “host”, call the function, which might allocate a result from that arena, then reset the arena to clean up. In essence a stack for passing values in and out.
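The arena itself would live inside the Wasm program, typically written in C, but the logic is small enough to model in a few lines. The following is a Python sketch of the bookkeeping only, with hypothetical names and a made-up region; it hands out addresses and never touches memory.

```python
class Arena:
    """Bump allocator over a fixed region [base, base+size) of linear memory."""

    def __init__(self, base, size):
        self.base = base
        self.end = base + size
        self.next = base

    def alloc(self, size, align=8):
        """Return an aligned address, or 0 when the region is exhausted."""
        start = (self.next + align - 1) & -align   # round up; align is a power of two
        if size < 0 or start + size > self.end:
            return 0
        self.next = start + size
        return start

    def reset(self):
        """Free everything at once by rewinding to the start."""
        self.next = self.base

arena = Arena(base=1024, size=256)
a = arena.alloc(5, align=1)    # first allocation starts at the base
b = arena.alloc(8)             # rounded up to the next multiple of 8
full = arena.alloc(1000)       # does not fit, so this returns 0
arena.reset()
c = arena.alloc(16)            # the base again after the reset
```

Exported from the guest, `alloc` and `reset` are the whole memory management interface the host needs: allocate inputs, call the function, read the result, reset.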

WebAssembly as faster Python

Suppose we noticed a computational hot spot in our Python program in a pure Python function (e.g. not calling out to an extension). Optimizing this function would be wise. Based on my experiments if I re-implement that function in C, compile it to Wasm, then run that bit of Wasm in place of the original function, I can expect around a 10x speed-up. In general C is more like 100x faster than Python, and the overhead of interfacing with Wasm — copying stuff in and out, etc. — can be high, but not so high as to not be profitable. This improves further if I can change the interface, e.g. require callers to use the buffer protocol.

Thanks to wasmtime-py, I could introduce this change without fussing with cross-compilers to build distribution binaries, nor require a toolchain on the target, just a hefty Python package. Might be worth it.

My main experimental benchmark is a variation on my solution to the “Two Sum” problem, which I originally wrote for JavaScript, then extended to pywasm3 and later wasmtime-py. It’s simple, just interesting enough, and representative of the sort of Wasm drop-in I have in mind. It has the same interface, but implements it with Wasm.

# Original Pythonic interface
def twosum(nums: list[int], target: int) -> tuple[int, int] | None:
    ...

# Stateful Wasm interface
class TwoSumWasm():
    def __init__(self):
        store    = wasmtime.Store()
        module   = wasmtime.Module.from_file(store.engine, ...)
        instance = wasmtime.Instance(store, module, ())
        ...

    def twosum(self, nums, target):
        # ... use wasm instance ...

There’s some state to it with the Wasm instance in tow. If you hide that by making it global you’ll need to synchronize your threads around it. In a multi-threaded program perhaps these would be lazily-constructed thread locals. I haven’t had to solve this yet.

However, the weakness of the wasmtime “store” really shows: Notice how compilation and instantiation are bound together in one store? ~~I cannot compile once and then create disposable instances on the fly~~, e.g. as required for each run of a WASI program. Every instance permanently extends the compilation store. In practice we must wastefully re-compile the Wasm program for each disposable instance. Despite appearances, compilation and instantiation are not actually distinct steps, as they are in JavaScript’s Wasm API. wasmtime.Instance accepts a store as its first argument, suggesting use of a different store for instantiation. That would solve this problem, but as of this writing it must be the same store used to compile the module. ~~This is a fatal flaw for certain real use cases, particularly WASI.~~

Update: Wolfgang Meier points out the serialize and deserialize methods, which detach a compiled module from its store, allowing for independent instantiations. I tried it, and it’s a practical workaround. Overhead is low; no validation when deserializing. My benchmark now does it for future reference, as I expect it to be my typical use case.

WebAssembly as embedded capabilities

Loup Vaillant’s Monocypher is a wonderful cryptography library. Lean, efficient, and embedding-friendly, so much so it’s distributed in amalgamated form. It requires no libc or runtime, so we can compile it straight to Wasm with almost any Clang toolchain:

$ clang --target=wasm32 -nostdlib -O2 -Wl,--no-entry -Wl,--export-all \
        -o monocypher.wasm monocypher.c

It’s not “Wasm-aware” so I need --export-all to expose the interface. This is swell because, as a single translation unit, anything with external linkage is the interface. Though remember what I said about interacting with the guest allocator? This has no allocator, nor should it. It’s not so usable in this form because we’d need to manage memory from the outside. Do-able, but it’s easy to improve by adding a couple more functions, sticking to a single translation unit:

#include "monocypher.c"

extern char  __heap_base[];
static char *heap_used;
static char *heap_high;

void *bump_alloc(ptrdiff_t size)
{
    // ...
}

void bump_reset()
{
    ptrdiff_t len = heap_used - __heap_base;
    __builtin_memset(__heap_base, 0, len);  // wipe keys, etc.
    heap_used = __heap_base;
}

I’ve discussed __heap_base before, which is part of the ABI. We’ll push keys, inputs, etc. onto this “stack”, run our cryptography routine, copy out the result, then reset the bump allocator, which wipes out all sensitive data. Often memset is insufficient (typically it’s zero-then-free, and compilers see the lifetime about to end), but no lifetime ends here, and stores to this “heap” memory are externally observable as far as the abstract machine can tell. (Otherwise we couldn’t reliably copy out our results!)
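For a feel of how little code such an allocator needs, here is a sketch of one possible allocate/reset pair. It is not the elided body from the listing: the names arena_alloc and arena_reset are made up, and a fixed 64KiB array stands in for the region at __heap_base so that it also runs outside of Wasm.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Stand-in for the region starting at __heap_base (assumption: a fixed
// 64KiB array instead of linker-provided memory).
static char  heap[1 << 16];
static char *heap_used = heap;

// Hypothetical allocator: 16-byte aligned, returns null when full.
void *arena_alloc(ptrdiff_t size)
{
    ptrdiff_t pad   = -(uintptr_t)heap_used & 15;
    ptrdiff_t avail = heap + sizeof(heap) - heap_used - pad;
    if (size < 0 || size > avail) {
        return 0;
    }
    char *r = heap_used + pad;
    heap_used = r + size;
    return r;
}

// Wipe everything handed out so far, then rewind to the start.
void arena_reset(void)
{
    memset(heap, 0, (size_t)(heap_used - heap));
    heap_used = heap;
}
```

After a reset the next allocation lands on the same address as the first one did, which is the stack-like behavior described above.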

There’s a lot to this API, but I’m only going to look at the AEAD interface. We “lock” up some data in an encrypted box, write any unencrypted label we’d like on the outside. Then later we can unlock the box, which will only open for us if neither the contents of the box nor the label were tampered with. That’s some solid API design:

void crypto_aead_lock(uint8_t       *cipher_text,
                      uint8_t        mac  [16],
                      const uint8_t  key  [32],
                      const uint8_t  nonce[24],
                      const uint8_t *ad,         size_t ad_size,
                      const uint8_t *plain_text, size_t text_size);
int crypto_aead_unlock(uint8_t       *plain_text,
                       const uint8_t  mac  [16],
                       const uint8_t  key  [32],
                       const uint8_t  nonce[24],
                       const uint8_t *ad,          size_t ad_size,
                       const uint8_t *cipher_text, size_t text_size);

By compiling to Wasm we can access this functionality from Python almost like it was pure Python, and interact with other systems using Monocypher.

Since Monocypher does not interact with the outside world on its own, it relies on callers to use their system’s CSPRNG to create those nonces and keys, which we’ll do using the secrets built-in package:

class Monocypher:
    def __init__(self):
        ...
        self._read   = functools.partial(memory.read, store)
        self._write  = functools.partial(memory.write, store)
        self.__alloc = functools.partial(exports["bump_alloc"], store)
        self._reset  = functools.partial(exports["bump_reset"], store)
        self._lock   = functools.partial(exports["crypto_aead_lock"], store)
        self._unlock = functools.partial(exports["crypto_aead_unlock"], store)
        self._csprng = secrets.SystemRandom()

    def _alloc(self, n):
        return self.__alloc(n) & 0xffffffff

    def generate_key(self):
        return self._csprng.randbytes(32)

    def generate_nonce(self):
        return self._csprng.randbytes(24)

    ...

With a solid foundation, all that follows comes easily. A finally guarantees secrets are always removed from Wasm memory, and the rest is just about copying bytes around:

    def aead_lock(self, text, key, ad = b""):
        assert len(key) == 32
        try:
            macptr   = self._alloc(16)
            keyptr   = self._alloc(32)
            nonceptr = self._alloc(24)
            adptr    = self._alloc(len(ad))
            textptr  = self._alloc(len(text))

            self._write(key, keyptr)
            nonce = self.generate_nonce()
            self._write(nonce, nonceptr)
            self._write(ad,    adptr)
            self._write(text,  textptr)

            self._lock(
                textptr,
                macptr,
                keyptr,
                nonceptr,
                adptr, len(ad),
                textptr, len(text),
            )
            return (
                self._read(macptr, macptr+16),
                nonce,
                self._read(textptr, textptr+len(text)),
            )
        finally:
            self._reset()

And aead_unlock is basically the same in reverse, but throws if the box fails to unlock, perhaps due to tampering:

    def aead_unlock(self, text, mac, key, nonce, ad = b""):
        assert len(mac) == 16
        assert len(key) == 32
        assert len(nonce) == 24
        try:
            macptr   = self._alloc(16)
            keyptr   = self._alloc(32)
            nonceptr = self._alloc(24)
            adptr    = self._alloc(len(ad))
            textptr  = self._alloc(len(text))

            self._write(mac, macptr)
            self._write(key, keyptr)
            self._write(nonce, nonceptr)
            self._write(ad, adptr)
            self._write(text, textptr)

            if self._unlock(
                textptr,
                macptr,
                keyptr,
                nonceptr,
                adptr, len(ad),
                textptr, len(text),
            ):
                raise ValueError("AEAD mismatch")
            return self._read(textptr, textptr+len(text))
        finally:
            self._reset()

Usage:

mc = Monocypher()
key = mc.generate_key()
message = "Hello, world!"
mac, nonce, encrypted = mc.aead_lock(message.encode(), key)

Transmit mac, nonce, and encrypted to the other party (or your future self), who already has the key:

decrypted = mc.aead_unlock(encrypted, mac, key, nonce)

Find the complete source in my scratch repository.

While I have a few reservations about wasmtime-py, it fascinates me how well this all works. It’s been my hammer in search of a nail for some time now.

Freestyle linked lists tricks

2025-12-31T11:59:59Z

Linked lists are a basic data structure building block, with especially flexible allocation behavior. They’re not just a useful starting point, but sometimes a sound foundation for future growth. I’m going to start with the beginner stuff, then without disrupting the original linked list, enhance it with new capabilities.

Linked list basics

For the sake of an interesting example, I will demonstrate with the same concept as last time I talked about data structures: a collection of key/value strings, in the form of environment variables. This time in linked list form:

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

uint64_t hash64(Str);
bool     equals(Str, Str);

typedef struct Env Env;
struct Env {
    Env *next;
    Str  key;
    Str  value;
};

It will be sourced from some string, formatted like the env program:

    Str input = S(
        "EDITOR=vim\n"
        "HOME=/home/user\n"
        "PATH=/bin:/usr/bin\n"
        "SHELL=/bin/bash\n"
        "TERM=xterm-256color\n"
        "USER=user\n"
        "SHELL=/bin/sh\n"   // <- repeated entry
    );

And all the parser heavy lifting will be done by our ever-handy cut function:

typedef struct {
    Str tail;
    Str head;
} Cut;

Cut cut(Str, char);
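Only the prototype appears here, so as a rough sketch, this is one way cut could behave given the struct layout above. It is an assumption, not the original definition: head is everything before the first delimiter, tail is everything after it, and without a delimiter the whole input becomes head and tail is empty.

```c
#include <stddef.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct {
    Str tail;
    Str head;
} Cut;

// Hypothetical cut: split s around the first occurrence of c.
Cut cut(Str s, char c)
{
    ptrdiff_t i = 0;
    while (i < s.len && s.data[i] != c) {
        i++;
    }
    Cut r = {0};
    r.head.data = s.data;
    r.head.len  = i;
    if (i < s.len) {
        // Delimiter found: tail starts just past it.
        r.tail.data = s.data + i + 1;
        r.tail.len  = s.len - i - 1;
    }
    return r;
}
```

Because tail is the first member, an initializer like `Cut line = {s}` seeds the tail with the whole input, and each call peels one more line off the front.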

The simplest way to build up a linked list is like a stack, pushing objects into the front. Zero-initialized head pointer, point the new node at it, then make that node the new head element:

Env *parse_reversed(Str s, Arena *a)
{
    Env *head = 0;  // 1
    for (Cut line = {s}; line.tail.len;) {
        line = cut(line.tail, '\n');
        Cut  pair  = cut(line.head, '=');
        Env *env   = new(a, 1, Env);
        env->key   = pair.head;
        env->value = pair.tail;
        env->next  = head;  // 2
        head = env;  // 3
    }
    return head;
}

That’s it, a complete linked list implementation in three lines of code. No big deal. Because of the bump allocator, nodes are packed in order in memory, so the usual cache objections for linked lists do not apply. LIFO semantics mean the linked list is in reverse order from the source order. If we’re doing a linear scan through the linked list, the last entry in the source wins, which may be what you wanted:

Str lookup_linear(Env *env, Str key)
{
    for (Env *var = env; var; var = var->next) {
        if (equals(key, var->key)) {
            return var->value;
        }
    }
    return (Str){};
}

    // ...
    Env *env  = parse_reversed(input, &scratch);
    Str value = lookup_linear(env, S("SHELL"));  // <- "/bin/sh"

It’s just one more line of code to maintain the original order, using a very simple double-pointer technique:

Env *parse_ordered(Str s, Arena *a)
{
    Env  *head = 0;  // 1
    Env **tail = &head;  // 2
    for (Cut line = {s}; line.tail.len;) {
        // ...
        *tail = env;  // 3
        tail = &env->next;  // 4
    }
    return head;
}

No branches necessary, nor dummy nodes. A pointer to the last pointer in the list works even for empty lists. The tail pointer is unneeded once the list is complete. This form has queue behavior.
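To see the double-pointer technique in isolation, here is a small self-contained sketch that links a pre-allocated array of nodes in order. The Node type and link_ordered are made up for illustration and skip the parsing and the arena:

```c
#include <stddef.h>

typedef struct Node Node;
struct Node {
    Node *next;
    int   value;
};

// Append nodes[0..n) in order. `tail` always points at the pointer that
// should receive the next node: first the head itself, afterwards the
// `next` field of the last node appended.
Node *link_ordered(Node *nodes, ptrdiff_t n)
{
    Node  *head = 0;
    Node **tail = &head;
    for (ptrdiff_t i = 0; i < n; i++) {
        nodes[i].next = 0;
        *tail = &nodes[i];
        tail  = &nodes[i].next;
    }
    return head;
}
```

With n equal to zero the loop never runs and the function returns a null head, which is the empty-list case handled without a branch.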

Faster look-up with a tree

If you’re doing many look-ups, or if the list is long, those linear scans to find items in the list are not ideal. We can introduce an intrusive hash map, in the form of a hash trie, by adding two more pointers to the linked list:

typedef struct Env Env;
struct Env {
    Env *next;
    Env *child[2];  // <- hash map linkage
    Str  key;
    Str  value;
};

I’ve found it’s simplest to construct a node into the hash map, then link it onto the list tail. That constructor looks like this:

Env *new_env(Arena *a, Env **env, Str key, Str value)
{
    for (uint64_t h = hash64(key); *env; h <<= 1) {
        env = &(*env)->child[h>>63];
    }
    *env = new(a, 1, Env);
    (*env)->key = key;
    (*env)->value = value;
    return *env;
}

Then we swap that into the head/tail version in place of the original new macro call:

Env *parse_mapped(Str s, Arena *a)
{
    Env  *head = 0;
    Env **tail = &head;
    for (Cut line = {s}; line.tail.len;) {
        // ...
        Env *env = new_env(a, &head, pair.head, pair.tail);
        *tail = env;
        tail = &env->next;
    }
    return head;
}

This is now a linked list and a hash map at the same time, built-up piece by piece without any resizing. We still have the original linked list, but we can now search it in log time. The look-up function resembles the constructor:

Str lookup_logn(Env *env, Str key)
{
    for (uint64_t h = hash64(key); env; h <<= 1) {
        if (equals(key, env->key)) {
            return env->value;
        }
        env = env->child[h>>63];
    }
    return (Str){};
}

Because of the FIFO semantics, it finds the first match in the source:

    Env *env   = parse_mapped(input, &scratch);
    Str  value = lookup_logn(env, S("SHELL"));  // <- /bin/bash
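Both the constructor and the look-up lean on hash64 and equals, which were only declared earlier. As a sketch, here is one plausible pair of definitions. The choice of 64-bit FNV-1a is my assumption for illustration; any hash that mixes well into the high bits would serve.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

// Assumed hash function: 64-bit FNV-1a over the string's bytes.
uint64_t hash64(Str s)
{
    uint64_t h = 0xcbf29ce484222325u;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= (unsigned char)s.data[i];
        h *= 0x100000001b3u;
    }
    return h;
}

// Two strings are equal when their lengths and bytes match. The length
// check short-circuits so memcmp never sees a null pointer.
bool equals(Str a, Str b)
{
    return a.len == b.len &&
           (!a.len || !memcmp(a.data, b.data, (size_t)a.len));
}
```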

The other matches are also in the tree, and we can find those as well by continuing traversal. That is, it’s already a multi-map. This particular interface can’t pick up where it left off, but we can build one that does using an iterator/cursor:

typedef struct {
    uint64_t hash;
    Str      key;
    Env     *env;
} EnvIter;

EnvIter new_enviter(Env *env, Str key)
{
    return (EnvIter){hash64(key), key, env};
}

Str enviter_next(EnvIter *it)
{
    while (it->env) {
        Env *cur = it->env;
        it->env = it->env->child[it->hash>>63];
        it->hash <<= 1;
        if (equals(it->key, cur->key)) {
            return cur->value;
        }
    }
    return (Str){};
}

Update: Thanks to Daniel Kareh for a correction.

Then we can use a loop to visit every match in source order:

    Env *env = parse_mapped(input, &scratch);
    for (EnvIter it = new_enviter(env, S("SHELL"));;) {
        Str value = enviter_next(&it);
        if (!value.data) break;
        // ...
    }

Faster look-up with an index table

If the list is static once constructed, or if look-ups happen much more frequently than the list grows, we can find list items even faster by constructing an index table over the list: an MSI hash table. This table avoids redundancy by sharing structure with the list. Because it’s a flat table, if we keep adding to the list then eventually we’ll need to reconstruct a larger table when it becomes overloaded.

The table itself has a very simple structure, just an array and its size, expressed as a power-of-two exponent:

typedef struct {
    Env **slots;
    int   exp;
} EnvTable;

We do not need the child nodes, and so linked list nodes are untouched. That is, it’s not intrusive. In fact, we can build any arbitrary number of tables over a list, perhaps indexing different properties for different sorts of queries. The idea is that we build the list first, then create the table:

EnvTable new_table(Arena *a, Env *env)
{
    // Compute list length
    ptrdiff_t len = 0;
    for (Env *var = env; var; var = var->next) {
        len++;
    }

    // Then compute an appropriate table size
    EnvTable table = {};
    table.exp = 3;
    ptrdiff_t one = 1;
    for (; (one<<table.exp) - (one<<(table.exp-3)) < len; table.exp++) {}
    table.slots = new(a, one<<table.exp, Env *);

    // Then insert linked list items into the table
    for (Env *var = env; var; var = var->next) {
        uint64_t hash = hash64(var->key);
        size_t   mask = ((size_t)1 << table.exp) - 1;
        size_t   step = (size_t)(hash >> (64 - table.exp)) | 1;
        for (size_t i = (size_t)hash;;) {
            i = (i + step) & mask;
            if (!table.slots[i]) {
                table.slots[i] = var;
                break;
            }
        }
    }

    return table;
}
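The sizing loop in the middle is worth a second look. It picks the smallest exponent, starting at 3, for which seven eighths of the slots can hold the whole list, so at least one eighth of the table stays empty. Pulled out into a stand-alone function (table_exp is a name made up for this sketch):

```c
#include <stddef.h>

// Smallest exponent, starting at 3, such that 7/8 of the 2^exp slots
// can hold len entries. Same loop condition as the sizing step above.
int table_exp(ptrdiff_t len)
{
    int exp = 3;
    ptrdiff_t one = 1;
    while ((one << exp) - (one << (exp - 3)) < len) {
        exp++;
    }
    return exp;
}
```

The seven-entry example input gets an exponent of 3: an 8-slot table with exactly one slot left empty. An eighth entry would bump it to 16 slots.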

Note how the insertion loop only searches for an empty slot, not for a matching entry. That’s because this too is a multi-map, also with elements in insertion order. Look-ups are constant time:

Str lookup_constant(EnvTable table, Str key)
{
    uint64_t hash = hash64(key);
    size_t   mask = ((size_t)1 << table.exp) - 1;
    size_t   step = (size_t)(hash >> (64 - table.exp)) | 1;
    for (size_t i = (size_t)hash;;) {
        i = (i + step) & mask;
        if (!table.slots[i]) {
            return (Str){};
        } else if (equals(table.slots[i]->key, key)) {
            return table.slots[i]->value;
        }
    }
}

It finds the earliest match in the list, meaning an index over the “reverse” list will find the last entry in the source. The indexed-over property is the input to hash64 and equals. By using a different input to these functions we could build another table on, say, value length if that’s a property on which we needed to find elements efficiently. Again, for multi-map iteration we need some kind of iterator or cursor:

typedef struct {
    EnvTable table;
    Str      key;
    size_t   step;
    size_t   i;
} TableIter;

TableIter new_tableiter(EnvTable table, Str key)
{
    uint64_t hash = hash64(key);
    size_t   step = (size_t)(hash >> (64 - table.exp)) | 1;
    size_t   idx  = (size_t)hash;
    return (TableIter){table, key, step, idx};
}

Str table_next(TableIter *it)
{
    size_t mask  = ((size_t)1 << it->table.exp) - 1;
    Env  **slots = it->table.slots;
    for (;;) {
        it->i = (it->i + it->step) & mask;
        if (!slots[it->i]) {
            return (Str){};
        } else if (equals(slots[it->i]->key, it->key)) {
            return slots[it->i]->value;
        }
    }
}

Its usage looks just like the other multi-map:

    Env *env = parse_ordered(input, &scratch);
    EnvTable table = new_table(&scratch, env);
    for (TableIter it = new_tableiter(table, S("SHELL"));;) {
        Str value = table_next(&it);
        if (!value.data) break;
        // ...
    }

With these techniques at hand, I can start with linked lists when they are convenient, and later add needed features without fundamentally changing the underlying data structure. None of this requires runtime support, and so it fits comfortably on embedded systems, tiny WebAssembly programs, etc. All the above code is available ready to run: list.c.

Unix "find" expressions compiled to bytecode

2025-12-23T04:20:22Z

In preparation for a future project, I was thinking about the unix find utility. It operates on file system hierarchies, with basic operations selected and filtered using a specialized expression language. Users compose operations using unary and binary operators, grouping with parentheses for precedence. find may apply the expression to a great many files, so compiling it into a bytecode, resolving as much as possible ahead of time, and minimizing the per-element work, seems like a prudent implementation strategy. With some thought, I worked out a technique to do so, which was simpler than I expected, and I’m pleased with the results. I was later surprised all the real world find implementations I examined use tree-walk interpreters instead. This article describes how my compiler works, with a runnable example, and lists ideas for improvements.

For a quick overview, the syntax looks like this:

$ find [-H|-L] path... [expression...]

Technically at least one path is required, but most implementations imply . when none are provided. If no expression is supplied, the default is -print, e.g. print everything under each listed path. This prints the whole tree, including directories, under the current directory:

$ find .

To only print files, we could use -type f:

$ find . -type f -a -print

Where -a is the logical AND binary operator. -print always evaluates to true. It’s never necessary to write -a, and adjacent operations are implicitly joined with -a. We can keep chaining them, such as finding all executable files:

$ find . -type f -executable -print

If no -exec, -ok, or -print (or similar side-effect extensions like -print0 or -delete) are present, the whole expression is wrapped in an implicit ( expr ) -print. So we could also write this:

$ find . -type f -executable

Use -o for logical OR. To print all files with the executable bit or with a .exe extension:

$ find . -type f \( -executable -o -name '*.exe' \)

I needed parentheses because -o has lower precedence than -a, and because parentheses are shell metacharacters I also needed to escape them for the shell. It’s a shame find didn’t use [ and ] instead! There’s also a unary logical NOT operator, !. To print all non-executable files:

$ find . -type f ! -executable

Binary operators are short-circuiting, so this:

$ find -type d -a -exec du -sh {} +

Only lists the sizes of directories: when -type d fails, the whole expression evaluates to false without evaluating -exec. Or equivalently with -o:

$ find ! -type d -o -exec du -sh {} +

If it’s not a directory then the left-hand side evaluates to true, and the right-hand side is not evaluated. All three implementations I examined (GNU, BSD, BusyBox) have a -regex extension, and eagerly compile the regular expression even if the operation is never evaluated:

$ find . -print -o -regex [
find: bad regex '[': Invalid regular expression

I was surprised by this because it doesn’t seem to be in the spirit of the original utility (“The second expression shall not be evaluated if the first expression is true.”), and I’m used to the idea of short-circuit validation for the right-hand side of a logical expression. Recompiling for each evaluation would be unwise, but it could happen lazily such that an invalid regular expression only causes an error if it’s actually used. No big deal, just a curiosity.

Bytecode design

A bytecode interpreter needs to track just one result at a time, making it a single register machine, with a 1-bit register at that. I came up with these five opcodes:

halt
not
braf   LABEL
brat   LABEL
action NAME [ARGS...]

Obviously halt stops the program. While I could just let it “run off the end” it’s useful to have an actual instruction so that I can attach a label and jump to it. The not opcode negates the register. braf is “branch if false”, jumping (via relative immediate) to the labeled (in printed form) instruction if the register is false. brat is “branch if true”. Together they implement the -a and -o operators. In practice there are no loops and jumps are always forward: find is not Turing complete.
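An interpreter for this machine is only a handful of lines. The following is a sketch under a few assumptions of mine, not code from findc. The Insn layout is invented, branch offsets count the instructions to skip after the branch, and actions are faked by a table of precomputed results (results[i] is what action i would report for the current file).

```c
#include <stdbool.h>

typedef enum { OP_HALT, OP_NOT, OP_BRAF, OP_BRAT, OP_ACTION } Op;

typedef struct {
    Op  op;
    int arg;  // branches: instructions to skip; actions: action index
} Insn;

// Run a program for one file. results[i] stands in for evaluating
// action i, and ran[i] records whether action i was actually executed.
// Returns the final value of the 1-bit register.
bool run(const Insn *prog, const bool *results, bool *ran)
{
    bool reg = true;
    for (int pc = 0;; pc++) {
        Insn in = prog[pc];
        switch (in.op) {
        case OP_HALT:
            return reg;
        case OP_NOT:
            reg = !reg;
            break;
        case OP_BRAF:
            if (!reg) pc += in.arg;
            break;
        case OP_BRAT:
            if (reg) pc += in.arg;
            break;
        case OP_ACTION:
            ran[in.arg] = true;
            reg = results[in.arg];
            break;
        }
    }
}
```

Since every jump is forward, each instruction runs at most once per file and the loop always reaches halt.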

In a real implementation each possible action (-name, -ok, -print, -type, etc.) would get a dedicated opcode. This requires implementing each operator, at least in part, in order to correctly parse the whole find expression. For now I’m just focused on the bytecode compiler, so this opcode is a stand-in, and it kind of pretends based on looks. Each action sets the register, and actions like -print always set it to true. My compiler is called findc (“find compiler”).

Update: Or try the online demo via Wasm! This version includes a peephole optimizer I wrote after publishing this article.

I assume readers of this program are familiar with the push macro and the Slice macro. Because of the latter it requires a very recent C compiler, like GCC 15 (e.g. via w64devkit) or Clang 22. Try out some find commands and see how they appear as bytecode. The simplest case is also optimal:

$ findc
// path: .
        action  -print
        halt

Print the path then halt. Simple. Stepping it up:

$ findc -type f -executable
// path: .
        action  -type f
        braf    L1
        action  -executable
L1:     braf    L2
        action  -print
L2:     halt

If the path is not a file, it skips over the rest of the program by way of the second branch instruction. It’s correct, but already we can see room for improvement. This would be better:

        action  -type f
        braf    L1
        action  -executable
        braf    L1
        action  -print
L1:     halt

More complex still:

$ findc -type f \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L1
        action  -name *.exe
L1:     braf    L2
        action  -print
L2:     halt

Inside the parentheses, if -executable succeeds, the right-hand side is skipped. Though the brat jumps straight to a braf. It would be better to jump ahead one more instruction:

        action  -type f
        braf    L2
        action  -executable
        brat    L1
        action  -name *.exe
        braf    L2
L1:     action  -print
L2:     halt

Silly things aren’t optimized either:

$ findc ! ! -executable
// path: .
        action  -executable
        not
        not
        braf    L1
        action  -print
L1:     halt

Two not in a row cancel out, and so these instructions could be eliminated. Overall this compiler could benefit from a peephole optimizer, scanning over the program repeatedly, making small improvements until no more can be made:

Delete not-not.
A brat to a braf re-targets ahead one instruction, and vice versa.
Jumping onto an identical jump adopts its target for itself.
A not-braf might convert to a brat, and vice versa.
Delete side-effect-free instructions before halt (e.g. not-halt).
Exploit always-true actions, e.g. -print-braf can drop the branch.

Writing a bunch of peephole pattern matchers sounds kind of fun. Though my compiler would first need a slightly richer representation in order to detect and fix up changes to branches. One more for the road:

$ findc -type f ! \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L2
        action  -name *.exe
L2:     not
L1:     braf    L3
        action  -print
L3:     halt

The unoptimal jumps hint at my compiler’s structure. If you’re feeling up for a challenge, pause here to consider how you’d build this compiler, and how it might produce these particular artifacts.

Parsing and compiling

Before I even considered the shape of the bytecode I knew I needed to convert find infix into a compiler-friendly postfix. That is, this:

-type f -a ! ( -executable -o -name *.exe )

Becomes:

-type f -executable -name *.exe -o ! -a

Which, importantly, erases the parentheses. This comes in as an argv array, so it’s already tokenized for us by the shell or runtime. The classic shunting-yard algorithm solves this problem easily enough. We have an output queue that goes into the compiler, and a token stack for tracking -a, -o, !, and (. Then we walk argv in order:

Actions go straight into the output queue.
If we see one of the special stack tokens we push it onto the stack, first popping operators with greater precedence into the queue, stopping at (.
If we see ) we pop the stack into the output queue until we see (.

When we’re out of tokens, pop the remaining stack into the queue. My parser synthesizes -a where it’s implied, so the compiler always sees logical AND. If the expression contains no -exec, -ok, or -print, after processing is complete the parser puts -print then -a into the queue, which effectively wraps the whole expression in ( expr ) -print. By clearing the stack first, the real expression is effectively wrapped in parentheses, so no parenthesis tokens need to be synthesized.
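The walk over argv can be sketched as follows. This is a simplified illustration rather than the parser from findc. It assumes a well-formed expression, treats every action as a single token (so "-type f" arrives as one string), and leaves out the implicit -print. The names to_postfix, prec, and MAXTOK are made up. Binary operators pop anything of equal or greater precedence, which makes them left-associative.

```c
#include <string.h>

enum { MAXTOK = 64 };

static int prec(const char *t)
{
    if (!strcmp(t, "!"))  return 3;
    if (!strcmp(t, "-a")) return 2;
    if (!strcmp(t, "-o")) return 1;
    return 0;  // "(" or an action
}

// Convert n infix tokens to postfix, writing to out (room for 2*n
// entries). Returns the output length, or -1 if n exceeds MAXTOK.
int to_postfix(const char **in, int n, const char **out)
{
    const char *stack[2*MAXTOK];
    int top = 0, len = 0;
    int prev_end = 0;  // did the previous token end an operand?
    if (n > MAXTOK) return -1;

    for (int i = 0; i < n; i++) {
        const char *t = in[i];
        int binary = !strcmp(t, "-a") || !strcmp(t, "-o");
        int close  = !strcmp(t, ")");
        const char *op = binary ? t : 0;
        if (prev_end && !binary && !close) {
            op = "-a";  // adjacent operands: synthesize the implied AND
        }
        if (op) {
            while (top && strcmp(stack[top-1], "(") &&
                   prec(stack[top-1]) >= prec(op)) {
                out[len++] = stack[--top];
            }
            stack[top++] = op;
        }
        if (binary) {
            prev_end = 0;
        } else if (close) {
            while (top && strcmp(stack[top-1], "(")) {
                out[len++] = stack[--top];
            }
            if (top) top--;  // discard the "("
            prev_end = 1;
        } else if (!strcmp(t, "!") || !strcmp(t, "(")) {
            stack[top++] = t;
            prev_end = 0;
        } else {
            out[len++] = t;  // an action goes straight to the output
            prev_end = 1;
        }
    }
    while (top) {
        out[len++] = stack[--top];
    }
    return len;
}
```

Fed the infix example from earlier in this section, it produces the same postfix sequence shown there.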

I’ve used the shunting-yard algorithm many times before, so this part was easy. The new part was coming up with an algorithm to convert a series of postfix tokens into bytecode. My solution is the compiler maintains a stack of bytecode fragments. That is, each stack element is a sequence of one or more bytecode instructions. Branches use relative addresses, so they’re position-independent, and I can concatenate code fragments without any branch fix-ups. It takes the following actions from queue tokens:

For an action token, create an action instruction, and push it onto the fragment stack as a new fragment.
For a ! token, pop the top fragment, append a not instruction, and push it back onto the stack.
For a -a token, pop the top two fragments, join them with a braf in the middle which jumps just beyond the second fragment. That is, if the first fragment evaluates to false, skip over the second fragment into whatever follows.
For a -o token, just like -a but use brat. If the first fragment is true, we skip over the second fragment.

If the expression is valid, at the end of this process the stack contains exactly one fragment. Append a halt instruction to this fragment, and that’s our program! If the final fragment contained a branch just beyond its end, this halt is that branch target. A few peephole optimizations and it could probably be an optimal program for this instruction set.
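Here is a sketch of that fragment stack in C. It is an illustration under my own assumptions, not the code from findc. Fragments are stored back to back in the output array, so the stack only needs to remember where each fragment starts, and joining two fragments means shifting the second one over by one slot to make room for the branch. The Insn layout and the name compile are made up, and branch offsets count the instructions to skip after the branch.

```c
#include <stddef.h>
#include <string.h>

typedef enum { OP_HALT, OP_NOT, OP_BRAF, OP_BRAT, OP_ACTION } Op;

typedef struct {
    Op          op;
    int         skip;  // branches: how many instructions to jump over
    const char *name;  // actions: the action's text
} Insn;

// Compile n postfix tokens into prog (capacity cap). Returns the
// program length, or -1 on overflow or a malformed expression.
int compile(const char **toks, int n, Insn *prog, int cap)
{
    int starts[64];  // fragment stack: index where each fragment begins
    int depth = 0, len = 0;

    for (int i = 0; i < n; i++) {
        const char *t = toks[i];
        int is_and = !strcmp(t, "-a");
        int is_or  = !strcmp(t, "-o");
        if (len == cap) return -1;
        if (is_and || is_or) {
            if (depth < 2) return -1;
            int b = starts[--depth];  // second fragment is [b, len)
            memmove(prog+b+1, prog+b, (size_t)(len-b)*sizeof(*prog));
            prog[b] = (Insn){is_and ? OP_BRAF : OP_BRAT, len-b, 0};
            len++;
        } else if (!strcmp(t, "!")) {
            if (depth < 1) return -1;
            prog[len++] = (Insn){OP_NOT, 0, 0};
        } else {
            if (depth == 64) return -1;
            starts[depth++] = len;  // push a new one-instruction fragment
            prog[len++] = (Insn){OP_ACTION, 0, t};
        }
    }
    if (depth != 1 || len == cap) return -1;
    prog[len++] = (Insn){OP_HALT, 0, 0};
    return len;
}
```

Because the branches are relative, shifting a fragment never invalidates the branches inside it. Compiling the postfix for `-type f ! ( -executable -o -name '*.exe' )` followed by `-print -a` yields nine instructions with the same jump structure as the listing earlier, including the brat that lands on the not.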

Closures as Win32 window procedures

2025-12-12T19:52:10Z

Back in 2017 I wrote about a technique for creating closures in C using a JIT-compiled wrapper. It’s neat, though rarely necessary in real programs, so I don’t think about it often. I applied it to qsort, which sadly accepts no context pointer. More practical would be working around insufficient custom allocator interfaces, to create allocation functions at run-time bound to a particular allocation region. I’ve learned a lot since I last wrote about this subject, and a recent article had me thinking about it again, and how I could do better than before. In this article I will enhance Win32 window procedure callbacks with a fifth argument, allowing us to more directly pass extra context. I’m using w64devkit on x64, but everything here should work out-of-the-box with any x64 toolchain that speaks GNU assembly.

A window procedure has this prototype:

LRESULT Wndproc(
  HWND hWnd,
  UINT Msg,
  WPARAM wParam,
  LPARAM lParam
);

To create a window we must first register a class with RegisterClass, which accepts a set of properties describing a window class, including a pointer to one of these functions.

    MyState *state = ...;

    RegisterClassA(&(WNDCLASSA){
        // ...
        .lpfnWndProc   = my_wndproc,
        .lpszClassName = "my_class",
        // ...
    });

    HWND hwnd = CreateWindowExA(0, "my_class", ..., state);

The thread drives a message pump with events from the operating system, dispatching them to this procedure, which then manipulates the program state in response:

    for (MSG msg; GetMessageW(&msg, 0, 0, 0);) {
        TranslateMessage(&msg);
        DispatchMessageW(&msg);  // calls the window procedure
    }

All four WNDPROC parameters are determined by Win32. There is no context pointer argument. So how does this procedure access the program state? We generally have two options:

Global variables. Yucky but easy. Frequently seen in tutorials.
A GWLP_USERDATA pointer attached to the window.

The second option takes some setup. Win32 passes the last CreateWindowEx argument to the window procedure when the window is created, via WM_CREATE. The procedure attaches the pointer to its window as GWLP_USERDATA. This pointer is passed indirectly, through a CREATESTRUCT. So ultimately it looks like this:

    case WM_CREATE:
        CREATESTRUCT *cs = (CREATESTRUCT *)lParam;
        void *arg = (struct state *)cs->lpCreateParams;
        SetWindowLongPtr(hwnd, GWLP_USERDATA, (LONG_PTR)arg);
        // ...

In future messages we can retrieve it with GetWindowLongPtr. Every time I go through this I wish there was a better way. What if there was a fifth window procedure parameter through which we could pass a context?

typedef LRESULT Wndproc5(HWND, UINT, WPARAM, LPARAM, void *);

We’ll build just this as a trampoline. The x64 calling convention passes the first four arguments in registers, and the rest are pushed on the stack, including this new parameter. Our trampoline cannot just stuff the extra parameter in the register, but will actually have to build a stack frame. Slightly more complicated, but barely so.

Allocating executable memory

In previous articles, and in the programs where I’ve applied techniques like this, I’ve allocated executable memory with VirtualAlloc (or mmap elsewhere). This introduces a small challenge for solving the problem generally: Allocations may be arbitrarily far from our code and data, out of reach of relative addressing. If they’re further than 2G apart, we need to encode absolute addresses, and in the simple case would just assume they’re always too far apart.

These days I’ve more experience with executable formats, and allocation, and I immediately see a better solution: Request a block of writable, executable memory from the loader, then allocate our trampolines from it. Other than being executable, this memory isn’t special, and allocation works the usual way, using functions unaware it’s executable. By allocating through the loader, this memory will be part of our loaded image, guaranteed to be close to our other code and data, allowing our JIT compiler to assume a small code model.

There are a number of ways to do this, and here’s one way to do it with GNU-styled toolchains targeting COFF:

        .section .exebuf,"bwx"
        .globl exebuf
exebuf:	.space 1<<21

This assembly program defines a new section named .exebuf containing 2M of writable ("w"), executable ("x") memory, allocated at run time just like .bss ("b"). We’ll treat this like an arena out of which we can allocate all trampolines we’ll probably ever need. With careful use of .pushsection this could be basic inline assembly, but I’ve left it as a separate source. On the C side I retrieve this like so:

typedef struct {
    char *beg;
    char *end;
} Arena;

Arena get_exebuf()
{
    extern char exebuf[1<<21];
    Arena r = {exebuf, exebuf+sizeof(exebuf)};
    return r;
}

Unfortunately I have to repeat myself on the size. There are different ways to deal with this, but this is simple enough for now. I would have loved to define the array in C with the GCC section attribute, but as is usually the case with this attribute, it’s not up to the task, lacking the ability to set section flags. Besides, by not relying on the attribute, any C compiler could compile this source, and we only need a GNU-style toolchain to create the tiny COFF object containing exebuf.

While we’re at it, a reminder of some other basic definitions we’ll need:

#define S(s)            (Str){s, sizeof(s)-1}
#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

Str clone(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, r.len, char);
    memcpy(r.data, s.data, (size_t)r.len);
    return r;
}

Which have been discussed at length in previous articles.
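
For readers without those articles at hand, here is a minimal sketch of an alloc that fits the four-argument call in the new macro above. The parameter types and the null return on exhaustion are assumptions made for illustration; the definition in the earlier articles may differ, for example by aborting instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

// Hypothetical bump allocator: returns zeroed storage for count objects of
// the given size and alignment (a power of two), or null when the arena
// cannot satisfy the request.
void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    // Bytes needed to round a->beg up to the requested alignment.
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg - pad;
    if (avail < 0 || count > avail/size) {
        return 0;  // out of memory (assumption: the caller checks)
    }
    char *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}
```

Because the arena is just a pair of pointers, the same function serves ordinary memory and the executable region alike.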

Trampoline compiler

From here the plan is to create a function that accepts a Wndproc5 and a context pointer to bind, and returns a classic WNDPROC:

WNDPROC make_wndproc(Arena *, Wndproc5, void *arg);

Our window procedure now gets a fifth argument with the program state:

LRESULT my_wndproc(HWND, UINT, WPARAM, LPARAM, void *arg)
{
    MyState *state = arg;
    // ...
}

When registering the class we wrap it in a trampoline compatible with RegisterClass:

    RegisterClassA(&(WNDCLASSA){
        // ...
        .lpfnWndProc   = make_wndproc(a, my_wndproc, state),
        .lpszClassName = "my_class",
        // ...
    });

All windows using this class will readily have access to this state object through their fifth parameter. It turns out setting up exebuf was the more complicated part, and make_wndproc is quite simple!

WNDPROC make_wndproc(Arena *a, Wndproc5 proc, void *arg)
{
    Str thunk = S(
        "\x48\x83\xec\x28"      // sub   $40, %rsp
        "\x48\xb8........"      // movq  $arg, %rax
        "\x48\x89\x44\x24\x20"  // mov   %rax, 32(%rsp)
        "\xe8...."              // call  proc
        "\x48\x83\xc4\x28"      // add   $40, %rsp
        "\xc3"                  // ret
    );
    Str r   = clone(a, thunk);
    int rel = (int)((uintptr_t)proc - (uintptr_t)(r.data + 24));
    memcpy(r.data+ 6, &arg, sizeof(arg));
    memcpy(r.data+20, &rel, sizeof(rel));
    return (WNDPROC)r.data;
}

The assembly allocates a new stack frame, with callee shadow space, and with room for the new argument, which also happens to re-align the stack. It stores the new argument for the Wndproc5 just above the shadow space. Then calls into the Wndproc5 without touching other parameters. There are two “patches” to fill out, which I’ve initially filled with dots: the context pointer itself, and a 32-bit signed relative address for the call. It’s going to be very near the callee. The only thing I don’t like about this function is that I’ve manually worked out the patch offsets.
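
The hand-counted offsets could be derived instead. A minimal sketch, assuming the thunk is split into three literal pieces so that sizeof yields each patch position. The names THUNK_A, THUNK_B, THUNK_C, and thunk_write are hypothetical, and addresses are passed as plain integers so the patching can be checked on any host without executing the bytes:

```c
#include <stdint.h>
#include <string.h>

// Hypothetical split of the thunk: code before the 64-bit immediate,
// code between the immediate and the rel32, and code after the rel32.
#define THUNK_A "\x48\x83\xec\x28" "\x48\xb8"
#define THUNK_B "\x48\x89\x44\x24\x20" "\xe8"
#define THUNK_C "\x48\x83\xc4\x28" "\xc3"

enum {
    ARG_OFF   = sizeof(THUNK_A) - 1,                // start of the 64-bit immediate
    REL_OFF   = ARG_OFF + 8 + sizeof(THUNK_B) - 1,  // start of the rel32
    REL_END   = REL_OFF + 4,                        // address the call is relative to
    THUNK_LEN = REL_END + sizeof(THUNK_C) - 1,
};

// Writes a patched thunk into dst (at least THUNK_LEN bytes). base is the
// address the thunk will occupy, proc the call target, arg the context.
void thunk_write(unsigned char *dst, uint64_t base, uint64_t proc, uint64_t arg)
{
    int32_t rel = (int32_t)(proc - (base + REL_END));
    memcpy(dst, THUNK_A, ARG_OFF);
    memcpy(dst + ARG_OFF, &arg, 8);
    memcpy(dst + ARG_OFF + 8, THUNK_B, sizeof(THUNK_B) - 1);
    memcpy(dst + REL_OFF, &rel, 4);
    memcpy(dst + REL_END, THUNK_C, sizeof(THUNK_C) - 1);
}
```

With this layout ARG_OFF and REL_OFF work out to the 6 and 20 used above, and REL_END to the 24 in the rel computation.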

It’s probably not useful, but it’s easy to update the context pointer at any time if you hold onto the trampoline pointer:

void set_wndproc_arg(WNDPROC p, void *arg)
{
    memcpy((char *)p+6, &arg, sizeof(arg));
}

So, for instance:

    MyState *state[2] = ...;  // multiple states
    WNDPROC proc = make_wndproc(a, my_wndproc, state[0]);
    // ...
    set_wndproc_arg(proc, state[1]);  // switch states

Though I expect the most common case is just creating multiple procedures:

    WNDPROC procs[] = {
        make_wndproc(a, my_wndproc, state[0]),
        make_wndproc(a, my_wndproc, state[1]),
    };

To my slight surprise these trampolines still work with an active Control Flow Guard system policy. Trampolines do not have stack unwind entries, and I thought Windows might refuse to pass control to them.

Here’s a complete, runnable example if you’d like to try it yourself: main.c and exebuf.s

Better cases

This is more work than going through GWLP_USERDATA, and real programs have a small, fixed number of window procedures — typically one — so this isn’t the best example, but I wanted to illustrate with a real interface. Again, perhaps the best real use is a library with a weak custom allocator interface:

typedef struct {
    void *(*malloc)(size_t);   // no context pointer!
    void  (*free)(void *);     // "
} Allocator;

void *arena_malloc(size_t, Arena *);

// ...

    Allocator perm_allocator = {
        .malloc = make_trampoline(exearena, arena_malloc, perm),
        .free   = noop_free,
    };
    Allocator scratch_allocator = {
        .malloc = make_trampoline(exearena, arena_malloc, scratch),
        .free   = noop_free,
    };

Something to keep in my back pocket for the future.

Hierarchical field sort with string interning

2025-09-24T17:11:32Z

In a recent, real world problem I needed to load a heterogeneous sequence of records from a buffer. Record layout is defined in a header before the sequence. Each field is numeric, with a unique name composed of non-empty alphanumeric period-delimited segments, where segments signify nested structure. Field names are a comma-delimited list, in order of the record layout. The catch motivating this article is that nested structures are not necessarily contiguous. In my transformed representation I needed nested structures to be contiguous. For illustrative purposes here, it will be for JSON output. I came up with what I think is an interesting solution, which I’ve implemented in C using techniques previously discussed.

The above description is probably confusing on its own, and an example is worth a thousand words, so here’s a listing naming 7 fields:

timestamp,point.x,point.y,foo.bar.z,point.z,foo.bar.y,foo.bar.x

Where point is a substructure, as are foo and bar, but note they’re interleaved in the record. So if a record contains these values:

{1758158348, 1.23, 4.56, -100, 7.89, -200, -300}

The JSON representation would look like:

{
  "timestamp": 1758158348,
  "point": {
    "x": 1.23,
    "y": 4.56,
    "z": 7.89
  },
  "foo": {
    "bar": {
      "z": -100,
      "y": -200,
      "x": -300
    }
  }
}

Notice point.z moved up and foo.bar.z down, so that substructures are contiguous in this representation as required for JSON. Sorting the field names lexicographically would group them together as a simple solution. However, as an additional constraint I want to retain the original field order as much as possible. For example, timestamp is first in both the original and JSON representations, but sorting would put it last. If all substructures are already contiguous, nothing should change.

Solution with string interning

My solution is to intern the segment strings, assigning each a unique, monotonic integral token in the order they’re observed. In my program, zero is reserved as a special “root” token, and so the first string has the value 1. The concrete values aren’t important, only that they’re assigned monotonically.

The trick is that a string is always interned in the “namespace” of a previous token. That is, we’re building a (token, string) -> token map. For our segments that namespace is the token for the parent structure, and the top-level fields are interned in the reserved “root” namespace. When applied to the example, we get the token sequences:

timestamp  -> 1
point.x    -> 2 3
point.y    -> 2 4
foo.bar.z  -> 5 6 7
point.z    -> 2 8
foo.bar.y  -> 5 6 9
foo.bar.x  -> 5 6 10

And our map looks like:

{0, "timestamp"} -> 1
{0, "point"}     -> 2
{2, "x"}         -> 3
{2, "y"}         -> 4
{0, "foo"}       -> 5
{5, "bar"}       -> 6
{6, "z"}         -> 7
{2, "z"}         -> 8
{6, "y"}         -> 9
{6, "x"}         -> 10

Notice how "x" is assigned 3 and 10 due to different namespaces. That’s important because otherwise the fields of foo.bar would sort in the same order as point. Namespace gives these fields unique identities.

Once we have the token representation, sort lexicographically by token. That pulls point.z up to its siblings.

timestamp  -> 1
point.x    -> 2 3
point.y    -> 2 4
point.z    -> 2 8
foo.bar.z  -> 5 6 7
foo.bar.y  -> 5 6 9
foo.bar.x  -> 5 6 10

Now we have the “output” order with minimal re-ordering. If substructures were already contiguous, nothing changes. Assuming a reasonable map, this is O(n log n), primarily due to sorting.

Alternatives

Before I thought of namespaces, my initial idea was to intern the whole prefix of a segment. The sequence of look-ups would be:

"timestamp"    -> 1  -> {1}
"point"        -> 2
"point.x"      -> 3  -> {2, 3}
"point"        -> 2
"point.y"      -> 4  -> {2, 4}
"foo"          -> 5
"foo.bar"      -> 6
"foo.bar.z"    -> 7  -> {5, 6, 7}
"point"        -> 2
"point.z"      -> 8  -> {2, 8}
"foo"          -> 5
"foo.bar"      -> 6
"foo.bar.y"    -> 9  -> {5, 6, 9}
"foo"          -> 5
"foo.bar"      -> 6
"foo.bar.x"    -> 10 -> {5, 6, 10}

Ultimately it produces the same tokens, and this is a more straightforward string -> token map. The prefixes are acting as namespaces. However, I wrote it this way as a kind of visual proof: Notice the right triangle shape formed by the strings for each field. From the area we can see that processing prefixes as strings is O(n^2) quadratic time on the number of segments! In my real problem the inputs were never large enough for this to matter, but I hate leaving behind avoidable quadratic algorithms. Using a token as a namespace flattens the prefix to a constant size.
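
To put numbers on the triangle, here is a small sketch that counts the bytes a map would have to examine for one field name under each scheme. The function names prefix_bytes and segment_bytes are hypothetical and exist only for this comparison:

```c
#include <stddef.h>
#include <string.h>

// Bytes examined when every dotted prefix of name is interned as a whole
// string: "a", "a.b", "a.b.c", and so on (the triangle).
size_t prefix_bytes(const char *name)
{
    size_t total = 0;
    size_t len   = strlen(name);
    for (size_t i = 0; i <= len; i++) {
        if (i == len || name[i] == '.') {
            total += i;  // the prefix ending here has length i
        }
    }
    return total;
}

// Bytes examined when each segment is interned on its own, with the
// parent token standing in for the prefix.
size_t segment_bytes(const char *name)
{
    size_t total = 0;
    for (const char *p = name; *p; p++) {
        total += *p != '.';
    }
    return total;
}
```

For foo.bar.z the prefix scheme touches 19 bytes against 7 for per-segment interning, and the gap widens with nesting depth.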

Another option is a different map for each namespace. So for foo.bar.z lookup the "foo" map (string -> map) in the root (string -> map), then within that lookup the "bar" table (string -> token) (since this is the penultimate segment), then intern "z" within that to get its token. That wouldn’t have quadratic time complexity, but it seems quite a bit more complicated than a single, flat (token, string) -> token map.

Implementation in C

Because the standard library has little useful for us, I am building on previously-established definitions, so refer to that article for basic definitions like Str. To start off, tokens will be a size-typed integer so we never need to worry about overflowing the token counter. We’d run out of memory first:

typedef ptrdiff_t Token;

We’re building a (token, string) -> token map, so we’ll need a hash function for such keys:

uint64_t hash(Token t, Str s)
{
    uint64_t r = (uint64_t)t << 8;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        r ^= s.data[i];
        r *= 1111111111111111111u;
    }
    return r;
}

The map itself is a forever-useful hash trie.

typedef struct Map Map;
struct Map {
    Map  *child[4];
    Token namespace;
    Str   segment;
    Token token;
};

Token *upsert(Map **m, Token namespace, Str segment, Arena *a)
{
    for (uint64_t h = hash(namespace, segment); *m; h <<= 2) {
        if (namespace==(*m)->namespace && equals(segment, (*m)->segment)) {
            return &(*m)->token;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(a, 1, Map);
    (*m)->namespace = namespace;
    (*m)->segment = segment;
    return &(*m)->token;  // caller will assign
}
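
The upsert above relies on an equals helper from the earlier definitions. A minimal sketch of what it might look like, assuming the Str layout shown in this series; the actual definition in the referenced article may differ:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

// Assumed helper: true when both strings have the same length and bytes.
// The length check guards memcmp against null data in empty strings.
bool equals(Str a, Str b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, (size_t)a.len));
}
```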

We’ll use this map to convert a string naming a field into a sequence of tokens, so we’ll need a slice. A field also has an offset within the record and a type, which I’ll track through its original position using an index field (e.g. into the original header). Also track the original name.

typedef struct {
    Str          name;
    ptrdiff_t    index;
    Slice(Token) tokens;
} Field;

To sort fields we’ll need a comparator:

ptrdiff_t field_compare(Field a, Field b)
{
    ptrdiff_t len = min(a.tokens.len, b.tokens.len);
    for (ptrdiff_t i = 0; i < len; i++) {
        Token d = a.tokens.data[i] - b.tokens.data[i];
        if (d) {
            return d;
        }
    }
    return a.tokens.len - b.tokens.len;
}

Because field names are unique, each token sequence is unique, and so we need not use index in the comparator.

Finally down to business: cut up the list and build the token sequences with the established push macro. The sort function isn’t interesting, and could be as simple as libc qsort with the above comparator (and adapter), so I’m only listing the prototype.

void field_sort(Slice(Field), Arena scratch);

Slice(Field) parse_fields(Str fieldlist, Arena *a)
{
    Slice(Field) fields  = {};
    Map         *strtab  = 0;
    ptrdiff_t    ntokens = 0;

    for (Cut c = {.tail=fieldlist, .ok=true}; c.ok;) {
        c = cut(c.tail, ',');
        Field field = {};
        field.name  = c.head;
        field.index = fields.len;

        Token prev = 0;
        for (Cut f = {.tail=field.name, .ok=true}; f.ok;) {
            f = cut(f.tail, '.');
            Token *token = upsert(&strtab, prev, f.head, a);
            if (!*token) {
                *token = ++ntokens;
            }
            *push(a, &field.tokens) = *token;
            prev = *token;
        }

        *push(a, &fields) = field;
    }

    field_sort(fields, *a);
    return fields;
}
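
For the qsort route mentioned above, the adapter might look like the following sketch. It is not the sort used in the real program: the Tokens struct stands in for Slice(Token), the name field is simplified to a C string, and field_compare_qsort and field_sort_qsort are hypothetical names. The one subtlety is that the comparator returns ptrdiff_t while qsort wants int, so the adapter reduces the result to its sign instead of truncating it.

```c
#include <stddef.h>
#include <stdlib.h>

typedef ptrdiff_t Token;

typedef struct {
    Token    *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} Tokens;  // stand-in for Slice(Token)

typedef struct {
    const char *name;   // simplified: the article uses Str
    ptrdiff_t   index;
    Tokens      tokens;
} Field;

// Same logic as the comparator above: lexicographic by token.
ptrdiff_t field_compare(Field a, Field b)
{
    ptrdiff_t len = a.tokens.len < b.tokens.len ? a.tokens.len : b.tokens.len;
    for (ptrdiff_t i = 0; i < len; i++) {
        Token d = a.tokens.data[i] - b.tokens.data[i];
        if (d) {
            return d;
        }
    }
    return a.tokens.len - b.tokens.len;
}

// Adapter with the signature qsort expects, returning only the sign.
int field_compare_qsort(const void *pa, const void *pb)
{
    ptrdiff_t d = field_compare(*(const Field *)pa, *(const Field *)pb);
    return (d > 0) - (d < 0);
}

void field_sort_qsort(Field *fields, ptrdiff_t len)
{
    if (len > 1) {
        qsort(fields, (size_t)len, sizeof(*fields), field_compare_qsort);
    }
}
```

qsort is not guaranteed stable, which is fine here because every token sequence is unique.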

Usage here suggests Cut::ok should be inverted to Cut::done so that it better zero-initializes. Something I’ll need to consider. Because it’s all allocated from an arena, no need for destructors or anything like that, so this is the complete implementation. Back to the example:

    Str fieldlist = S(
        "timestamp,"
        "point.x,"
        "point.y,"
        "foo.bar.z,"
        "point.z,"
        "foo.bar.y,"
        "foo.bar.x"
    );
    Slice(Field) fields = parse_fields(fieldlist, &scratch);
    for (ptrdiff_t i = 0; i < fields.len; i++) {
        Str name = fields.data[i].name;
        fwrite(name.data, 1, name.len, stdout);
        putchar('\n');
    }

This program will print the proper output field order. In a real program we’d hold onto the string table, define an inverse lookup to translate tokens back into strings, and use it when producing output. I do just that in my exploratory program, rec2json.c, written a little differently than presented above. It uses the sorted tokens to compile a simple bytecode program that, when run against a record, produces its JSON representation. It compiles the example to:

OPEN          # print '{'
KEY     1     # print token 1 as a key, i.e. "timestamp:"
READ    0     # print double at record offset 0
COMMA         # print ','
KEY     2     # print token 2 as a key, i.e. "point:"
OPEN
KEY     3
READ    8     # print double at record offset 8
COMMA
KEY     4
READ    16
COMMA
KEY     8
READ    32
CLOSE         # print '}'
COMMA
KEY     5
OPEN
KEY     6
OPEN
KEY     7
READ    24
COMMA
KEY     9
READ    40
COMMA
KEY     10
READ    48
CLOSE
CLOSE
CLOSE

Seeing it written out, I notice more room for improvement. An optimization pass could coalesce instructions so that, for instance, OPEN then KEY concatenate to a single string at compile time so that it only needs one instruction. This program could be 15 instructions instead of 31. In my real case I didn’t need anything quite this sophisticated, but it was fun to explore.

Parameterized types in C using the new tag compatibility rule

2025-06-26T23:49:53Z

C23 has a new rule for struct, union, and enum compatibility finally appearing in compilers starting with GCC 15, released this past April, and Clang later this year. The same struct defined in different translation units (TU) has always been compatible — essential to how they work. Until this rule change, each such definition within a TU was a distinct, incompatible type. The new rule says that, ackshually, they are compatible! This unlocks some type parameterization using macros.

How can a TU have multiple definitions of a struct? Scope. Prior to C23 this wouldn’t compile because the compound literal type and the return type were distinct types:

struct Example { int x, y, z; };

struct Example example(void)
{
    struct Example { int x, y, z; };
    return (struct Example){1, 2, 3};
}

Otherwise the definition of struct Example within example was fine, if strange. At first this may not seem like a big deal, but let’s revisit my technique for dynamic arrays:

typedef struct {
    T        *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} SliceT;

Where I write out one of these for each T that I might want to put into a slice. With the new rule we can change it slightly, taking note of the introduction of a tag (the name after struct):

#define Slice(T)        \
    struct Slice##T {   \
        T        *data; \
        ptrdiff_t len;  \
        ptrdiff_t cap;  \
    }

This makes the “write it out ahead of time” thing simpler, but with the new rule we can skip the “ahead of time” part and conjure slice types on demand. Each declaration with the same T is compatible with the others due to matching tags and fields. So, for example, with this macro we can declare functions using slices parameterized for different element types.

Slice(int) range(int, Arena *);

float mean(Slice(float));

Slice(Str) split(Str, char delim, Arena *);
Str join(Slice(Str), char delim, Arena *);

Or using it with our model parser:

typedef struct {
    float x, y, z;
} Vec3;

typedef struct {
    int32_t v[3];
    int32_t n[3];
} Face;

typedef struct {
    Slice(Vec3) verts;
    Slice(Vec3) norms;
    Slice(Face) faces;
} Model;

typedef Slice(Vec3) Polygon;

I worried these macros might confuse my tools, particularly Universal Ctags because it’s important to me. Everything handles prototypes better than expected, but ctags doesn’t see fields with slice types. Overall they’re like a very limited form of C++ templates. Though only the types are parameterized, not the functions operating on those types. Outside of unwarranted macro abuse, this new technique does nothing regarding generic functions. On the other hand, my generic slice function complements the new technique, especially with the help of C23’s new typeof to mitigate _Alignof’s limitations:

typedef struct { char *beg, *end; } Arena;
void *alloc(Arena *, ptrdiff_t count, int size, int align);

#define push(a, s)                          \
  ((s)->len == (s)->cap                     \
    ? (s)->data = push_(                    \
        (a),                                \
        (s)->data,                          \
        &(s)->cap,                          \
        sizeof(*(s)->data),                 \
        _Alignof(typeof(*(s)->data))        \
      ),                                    \
      (s)->data + (s)->len++                \
    : (s)->data + (s)->len++)

void *push_(Arena *a, void *data, ptrdiff_t *pcap, int size, int align)
{
    ptrdiff_t cap = *pcap;

    if (a->beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        memcpy(copy, data, cap*size);
        data = copy;
    }

    ptrdiff_t extend = cap ? cap : 4;
    alloc(a, extend, size, align);
    *pcap = cap + extend;
    return data;
}

This exploits the fact that implementations adopting the new tag rule also have the upcoming C2y null pointer rule (note: also requires a cooperating libc). Putting it together, now I can write stuff like this:

Slice(int64_t) generate_primes(int64_t limit, Arena *a)
{
    Slice(int64_t) primes = {};

    if (limit > 2) {
        *push(a, &primes) = 2;
    }

    for (int64_t n = 3; n < limit; n += 2) {
        bool valid = true;
        for (ptrdiff_t i = 0; valid && i<primes.len; i++) {
            valid = n % primes.data[i];
        }
        if (valid) {
            *push(a, &primes) = n;
        }
    }

    return primes;
}

But it doesn’t take long to run into limitations. It makes little sense to define, say, a Map(K, V) without a generic function to manipulate it. This also doesn’t work:

typedef struct {
    Slice(Str)          names;
    Slice(Slice(float)) edges;
} Graph;

Due to Slice##T in the macro, required to establish a unique tag for each element type. The parameter to the macro must be an identifier, so you have to build up to it (or define another macro), which sort of defeats the purpose, which was entirely about convenience.

typedef Slice(float) Edges;

typedef struct {
    Slice(Str)   names;
    Slice(Edges) edges;
} Graph;

The benefits are small enough that perhaps it’s not worth the costs, but it’s been at least worth investigating. I’ve written a small demo of the technique if you’d like to see it in action, or test the abilities of your local C implementation: demo.c

WebAssembly: How to allocate your allocator

2025-04-19T03:18:20Z

An early, small hurdle diving into WebAssembly was allocating my allocator. On a server or desktop with virtual memory, the allocator asks the operating system to map fresh pages into its address space (sbrk, anonymous mmap, VirtualAlloc), which it then dynamically allocates to different purposes. In an embedded context, dynamic allocation memory is typically a fixed, static region chosen at link time. The Wasm execution environment more resembles an embedded system, but both kinds of obtaining raw memory are viable and useful in different situations.

For the purposes of this discussion, the actual allocator isn’t important. It could be a simple arena allocator, or a more general purpose buddy allocator. It could even be garbage collected with Boehm GC. Though WebAssembly’s linear memory is a poor fit for such a conservative garbage collector. In a compact address space starting at zero, and which doesn’t include code, memory addresses will be small numbers, and less distinguishable from common integer values. There’s also the issue that the garbage collector cannot scan the Wasm stack, which is hidden from Wasm programs by design. Only the ABI stack is visible. So a garbage collector requires cooperation from the compiler — essentially as a distinct calling convention — to spill all heap pointers on the ABI stack before function calls. Wasm C and C++ toolchains do not yet support this in a practical capacity.

Exporting a static heap

Let’s start with the embedded case because it’s simpler, and reserve a dynamic memory region at link time. WebAssembly has just reached 8 years old, so it’s early, and as we keep discovering, Wasm tooling is still immature. wasm-ld doesn’t understand linker scripts, and there’s no stable, low-level assembly language on which to build, e.g. to reserve space, define symbols, etc. WAT is too high level and inflexible for this purpose, as we’ll soon see. So our only option is to brute force it in a high-level language:

char heap[16<<20];  // 16MiB

Plugging it into an arena allocator:

typedef struct {
    char *beg;
    char *end;
} Arena;

Arena getarena(void)
{
    Arena a = {0};
    a.beg = heap;
    a.end = heap + sizeof(heap);
    return a;
}

Unfortunately heap isn’t generic memory, but a high-level variable with a specific, fixed type. That’s why it would have been nice to reserve the memory outside the high-level language. In practice this works fine so long as everything is aligned, but strictly speaking, allocating any variable except char from this arena involves incompatible loads and stores on a char array. Clang doesn’t document any inline assembly interface for Wasm, but neither does Clang forbid it. That leaves just enough room to launder the pointer if you’re worried about this technicality:

Arena getarena(void)
{
    Arena a = {0};
    a.beg = heap;
    asm ("" : "+r"(a.beg));  // launder
    a.end = a.beg + sizeof(heap);
    return a;
}

The +r means a.beg is both input and output. The address of the heap goes into the black box, and as far as the compiler is concerned, some mystery address comes out which, critically, has no effective type. The assembly block is empty (""), so it’s just a no-op, and we know (wink wink) it’s really the same address. Because the heap was “used” by the black box, Clang won’t optimize the heap out of existence beneath us. Also note that a.end was derived from the laundered pointer.

Update: The next C standard will improve this situation and so pointer laundering will no longer be necessary. A straight char array could be used as an arena.

This static variable technique works well only in an exported memory configuration, which is what wasm-ld uses by default. When a module exports its memory, it indicates how much linear memory it requires on start, and the Wasm runtime allocates and zero-initializes it at module initialization time. C and C++ toolchains depend on that runtime zeroing to initialize static and global variables, which are defined to be so initialized. Compilers generate code assuming these variables are zero initialized. This same paradigm is used for .bss sections in hosted environments.

In an imported memory configuration, linear memory is uninitialized. The memory may be re-used from, say, a destroyed module without zeroing, and may contain arbitrary data. In that case, C and C++ toolchains must zero the memory explicitly. It could potentially be done with a memory.fill instruction in the start section, but LLVM does not support start sections. Instead it uses an active data segment — a chunk of data copied into linear memory by the Wasm runtime during initialization, before running the start function.

That is, when importing memory, LLVM actually stores all those zeros in the Wasm module so that the runtime can copy it into linear memory. Wasm has no built-in compression, so your Wasm module will be at least as large as your heap! Exporting or importing memory is determined at link-time, so at compile-time the compiler must assume the worst case. If you compile the example above, you get a 16MiB “object” file (in Wasm format):

$ clang --target=wasm32 -c -O example.c
$ du -h example.o
16.0M   example.o

The WAT version of this file is 48MiB — clearly unsuitable as a low-level assembler. If linking with exported memory, wasm-ld discards all-zero active data segments. If using an imported memory configuration, it’s copied into the final image, producing a huge Wasm image, though highly compressible. As a rule, avoid importing memory when using an LLVM toolchain. Regardless, large heaps created this way will have a significant compile-time cost.

$ time echo 'char heap[256<<20];' | clang --target=wasm32 -c -xc -
real    0m0.334s
user    0m0.013s
sys     0m0.262s

(If only Clang had some sort of “noinit” variable attribute in order to allow heap to be uninitialized…)

Growing a dynamic heap

Wasm programs can grow linear memory using an sbrk-like memory.grow instruction. It operates in quantities of pages (64KiB), and returns the old memory size. Because memory starts at zero, the old memory size is also the base address of the new allocation. Clang provides access to this instruction via an undocumented built-in:

size_t __builtin_wasm_memory_grow(int, size_t);

The first parameter selects a memory because someday there might be more than one. From this built-in we can define sbrk:

void *sbrk(ptrdiff_t size)
{
    size_t npages = (size + 0xffffu) >> 16;  // round up
    size_t old    = __builtin_wasm_memory_grow(0, npages);
    if (old == -1ul) {
        return 0;
    }
    return (void *)(old << 16);
}

To which Clang compiles (note the memory.grow):

(func $sbrk (param i32) (result i32)
  (select
    (i32.const 0)
    (i32.shl
      (local.tee 0
        (memory.grow
          (i32.shr_u
            (i32.add
              (local.get 0)
              (i32.const 65535))
            (i32.const 16))))
      (i32.const 16))
    (i32.eq
      (local.get 0)
      (i32.const -1))))

Applying that to create an arena like before:

Arena newarena(ptrdiff_t cap)
{
    Arena a = {0};
    a.beg = sbrk(cap);
    if (a.beg) {
        a.end = a.beg + cap;
    }
    return a;
}

Now we can choose the size of the arena, and we can use this to create multiple arenas (e.g. permanent, scratch, etc.). We could even continue growing the last-created arena in-place when it’s full.
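
The page arithmetic inside sbrk can be checked on any host, without a Wasm toolchain. A minimal sketch, using uint32_t to mirror the 32-bit size_t of wasm32; the names pages_for and page_address are hypothetical:

```c
#include <stdint.h>

enum { WASM_PAGE_SHIFT = 16 };  // pages are 1<<16 bytes

// Pages needed to hold size bytes, rounding up (same arithmetic as sbrk).
uint32_t pages_for(uint32_t size)
{
    return (size + 0xffffu) >> WASM_PAGE_SHIFT;
}

// Byte address where page number old begins. Since linear memory starts
// at zero, this is the base of the region memory.grow just added.
uint32_t page_address(uint32_t old)
{
    return old << WASM_PAGE_SHIFT;
}
```

Note that the rounding always hands out whole pages, so an arena sized by cap may leave up to one page minus a byte unused past its end.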

If there was no memory.grow instruction, it could be implemented as a request through an imported function. The embedder using the Wasm runtime can grow the memory on the module’s behalf in the same manner. But as that documentation indicates, either way growing the memory comes with a downside in the most common Wasm runtimes, browsers: It “detaches” the memory from references, which complicates its use for the embedder. If a Wasm module may grow its memory at any time, the embedder must reacquire the memory handle after every call. It’s not difficult, but it’s easy to forget, and mistakes are likely to go unnoticed until later.

Importing a dynamic heap

There’s a middle ground where a Wasm module imports a dynamic-sized heap. That is, linear memory beyond the module’s base initialization. This might be the case, for instance, in a programming competition, where contestants submit Wasm modules which must complete a task using the supplied memory. In that case we don’t reserve a static heap, so we’re not facing the storing-zeros issue. However, how do we “find” the memory? Linear memory layout will look something like so:

0 <-- stack | data | heap --> ?
|-----------------------------|

This diagram reflects the more sensible wasm-ld --stack-first layout, where the ABI stack overflows off the bottom end of memory. The heap is just excess memory beyond the data. To find the upper bound, Wasm has a memory.size instruction to query linear memory size, which again Clang provides as an undocumented built-in:

size_t __builtin_wasm_memory_size(int);

Like before, this returns the result in number of 64k pages. That’s the high end. How do we find the low end? Similar to __stack_pointer, the linker creates a __heap_base constant, which is the address delineating data and heap in the diagram above. To use it, we need to declare it:

extern char __heap_base[];

Notice how it’s an array, not a pointer. It doesn’t hold an address, it is an address. In an ELF context this would be called an absolute symbol. That’s everything we need to find the bounds of the heap:

Arena getarena(void)
{
    Arena a = {0};
    a.beg = __heap_base;
    a.end = (char *)(__builtin_wasm_memory_size(0) << 16);
    return a;
}

Then we continue forward using whatever memory the embedder deigned to provide. Hopefully it’s enough!
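
To make the arithmetic concrete, here is a minimal sketch that computes how many bytes of heap lie between the two bounds. The helper name heap_capacity and the WASM_PAGE_SIZE constant are made up for illustration. On Wasm the inputs would come from __heap_base and __builtin_wasm_memory_size(0), but they are plain parameters here so that the function also compiles on a native host.

```c
#include <assert.h>
#include <stdint.h>

#define WASM_PAGE_SIZE 65536  // 64k pages, hence the << 16 above

// Bytes between heap_base and the end of linear memory. Returns zero
// if heap_base lies beyond the end of memory. (Hypothetical helper,
// not part of the original program.)
static int64_t heap_capacity(uint32_t heap_base, uint32_t pages)
{
    assert(pages <= 65536);  // 32-bit Wasm tops out at 4GiB
    int64_t end = (int64_t)pages * WASM_PAGE_SIZE;
    return heap_base<=end ? end - heap_base : 0;
}
```

For instance, a heap base of 66560 with two pages of linear memory leaves 131072 - 66560 = 64512 bytes of heap.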

Lessons learned from my first dive into WebAssembly

2025-04-04T04:01:20Z

It began as a water sort puzzle solver, constructed similarly to my British Square solver. It was nearly playable, so I added a user interface with SDL2. My wife enjoyed it on her desktop, but wished to play on her phone. So then I needed to either rewrite it in JavaScript and hope the solver was still fast enough for real-time use, or figure out WebAssembly (Wasm). I succeeded, and now my game runs in browsers (source). Like before, next I ported my pkg-config clone to the Wasm System Interface (WASI), whipped up a proof-of-concept UI, and it too runs in browsers. Neither uses a language runtime, resulting in little 8kB and 28kB Wasm binaries respectively. In this article I share my experiences and techniques.

Wasm is a specification defining an abstract stack machine with a Harvard architecture, and related formats. There are just four types, i32, i64, f32, and f64. It also has “linear” octet-addressable memory starting at zero, with no alignment restrictions on loads and stores. Address zero is a valid, writable address, which resurfaces some old-school, high-level language challenges regarding null pointers. There are 32-bit and 64-bit flavors, though the latter remains experimental. That suits me: I appreciate smaller pointers on 64-bit hosts, and I wish I could opt into it more often (e.g. x32).

As browser tech goes, they chose an apt name: WebAssembly is to the web as JavaScript is to Java.

There are distinct components at play, and much of the online discussion doesn’t do a great job drawing lines between them:

Wasm module: A compiled and linked image — like ELF or PE — containing sections for code, types, globals, import table, export table, and so on. The export table lists the module’s entry points. It has an optional start section indicating which function initializes a loaded image. (In practice almost nobody actually uses the start section.) A Wasm module can only affect the outside world through imported functions. Wasm itself defines no external interfaces for Wasm programs, not even printing or logging.
Wasm runtime: Loads Wasm modules, linking import table entries into the module. Because Wasm modules include types, the runtime can type check this linkage at load time. With imports resolved, it executes the start function, if any, then executes zero or more of its entry points, which hopefully invoke import functions in such a way as to produce useful results, or perhaps simply return useful outputs.
Wasm compiler: Converts a high-level language to low-level Wasm. In order to do so, it requires some kind of Application Binary Interface (ABI) to map the high-level language concepts onto the machine. This typically introduces additional execution elements, and it’s important that we distinguish them from the abstract machine’s execution elements. Clang is the only compiler we’ll be discussing in this article, though there are many. During compilation the function indices are yet unknown and so references will need to be patched in by a linker.
Wasm linker: Settles the shape of the Wasm module and links up the functions emitted by the compiler. LLVM comes with wasm-ld, and it goes hand-in-hand with Clang as a compiler.
Language runtime: Unless you’re hand-writing raw Wasm, your high-level language probably has a standard library with operating system interfaces. C standard library, POSIX interfaces, etc. This runtime likely maps onto some standardized set of imports, most likely the aforementioned WASI, which defines a set of POSIX-like functions that Wasm modules may import. Because I think we could do better, as usual around here, in this article we’re going to eschew the language runtime and code directly against raw WASI. You still have easy access to hash tables and dynamic arrays.

A combination of compiler-linker-runtime is conventionally called a toolchain. However, because almost any Clang installation can target Wasm out-of-the-box, and we’re skipping the language runtime, you can compile any of the programs discussed in this article, including my game, with nothing more than Clang (invoking wasm-ld implicitly). If you have a Wasm runtime, which includes your browser, you can run them, too! Though this article will mostly focus on WASI, and you’ll need a WASI-capable runtime to run those examples, which doesn’t include browsers (short of implementing the API with JavaScript).

I wasn’t particularly happy with the Wasm runtimes I tried, so I cannot enthusiastically recommend one. I’d love if I could point to one and say, “Use the same Clang to compile the runtime that you’re using to compile Wasm!” Alas, I had issues compiling, the runtime was buggy, or WASI was incomplete. However, wazero (Go) was the easiest for me to use and it worked well enough, so I will use it in examples:

$ go install github.com/tetratelabs/wazero/cmd/wazero@latest

The Wasm Binary Toolkit (WABT) is good to have on hand when working with Wasm, particularly wasm2wat to inspect Wasm modules, sort of like objdump or readelf. It converts Wasm to the WebAssembly Text Format (WAT).

Learning Wasm I had quite some difficulty finding information. Outside of the Wasm specification, which, despite its length, is merely a narrow slice of the ecosystem, important technical details are scattered all over the place. Some is only available as source code, some buried in comments on GitHub issues, and some lost behind dead links as repositories have moved. Large parts of LLVM are undocumented beyond a mention of existence. WASI has no documentation in a web-friendly format — so I have nothing to link from here when I mention its system calls — just some IDL sources in a Git repository. An old wasi.h was the most readable, complete source of truth I could find.

Fortunately Wasm is old enough that LLMs are well-versed in it, and simply asking questions, or for usage examples, was more effective than searching online. If you’re stumped on how to achieve something in the Wasm ecosystem, try asking a state-of-the-art LLM for help.

Example programs

Let’s go over concrete examples to lay some foundations. Consider this simple C function:

float norm(float x, float y)
{
    return x*x + y*y;
}

To compile to Wasm (32-bit) with Clang, we use the --target=wasm32 option:

$ clang -c --target=wasm32 -O example.c

The object file example.o is in Wasm format, so WABT can examine it. Here’s the output of wasm2wat -f, where -f produces output in the “folded” format, which is how I prefer to read it.

(module
  (type (;0;) (func (param f32 f32) (result f32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (func $norm (type 0) (param f32 f32) (result f32)
    (f32.add
      (f32.mul
        (local.get 0)
        (local.get 0))
      (f32.mul
        (local.get 1)
        (local.get 1)))))

We can see the ABI taking shape: Clang has predictably mapped float into f32. It similarly maps char, short, int and long onto i32. In 64-bit Wasm, the Clang ABI is LP64 and maps long onto i64. There’s also a $norm function which takes two f32 parameters and returns an f32.

Getting a little more complex:

__attribute((import_name("f")))
void f(int *);

__attribute((export_name("example")))
void example(int x)
{
    f(&x);
}

The import_name function attribute indicates the module will not define it, even in another translation unit, and that it intends to import it. That is, wasm-ld will place it in the import table. The export_name function attribute indicates it’s an entry point, and so wasm-ld will list it in the export table. Linking it will make things a little clearer:

$ clang --target=wasm32 -nostdlib -Wl,--no-entry -O example.c

The -nostdlib is because we won’t be using a language runtime, and --no-entry to tell the linker not to implicitly export a function (default: _start) as an entry point. You might think this is connected with the Wasm start function, but wasm-ld does not support the start section at all! We’ll have use for an entry point later. The folded WAT:

(module $a.out
  (type (;0;) (func (param i32)))
  (import "env" "f" (func $f (type 0)))
  (func $example (type 0) (param i32)
    (local i32)
    (global.set $__stack_pointer
      (local.tee 1
        (i32.sub
          (global.get $__stack_pointer)
          (i32.const 16))))
    (i32.store offset=12
      (local.get 1)
      (local.get 0))
    (call $f
      (i32.add
        (local.get 1)
        (i32.const 12)))
    (global.set $__stack_pointer
      (i32.add
        (local.get 1)
        (i32.const 16))))
  (table (;0;) 1 1 funcref)
  (memory (;0;) 2)
  (global $__stack_pointer (mut i32) (i32.const 66560))
  (export "memory" (memory 0))
  (export "example" (func $example)))

There’s a lot to unfold:

Pointers were mapped onto i32. Pointers are a high-level concept, and linear memory is addressed by an integral offset. This is typical of assembly after all.
There’s now a __stack_pointer, which is part of the Clang ABI, not Wasm. The Wasm abstract machine is a stack machine, but that stack doesn’t exist in linear memory. So you cannot take the address of values on the Wasm stack. There are lots of things C needs from a stack that Wasm doesn’t provide. So, in addition to the Wasm stack, Clang maintains another downward-growing stack in linear memory for these purposes, and the __stack_pointer global is the stack register of its ABI. We can see it’s allocated something like 64kB for the stack. (It’s a little more because program data is placed below the stack.)
It should be mostly readable without knowing Wasm: The function subtracts a 16-byte stack frame, stores a copy of the argument in it, then uses its memory offset for the first parameter to the import f. Why 16 bytes when it only needs 4? Because the stack is kept 16-byte aligned. Before returning, the function restores the stack pointer.
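
The frame size arithmetic in the last point can be sketched in plain C. The helper names align_up and push_frame are made up for this illustration and are not anything Clang exposes. They only mirror what the generated prologue does with __stack_pointer.

```c
#include <assert.h>
#include <stdint.h>

// Round n up to the next multiple of align, which must be a power of
// two (16 for the ABI stack described above).
static uint32_t align_up(uint32_t n, uint32_t align)
{
    assert(align && !(align & (align - 1)));
    return (n + align - 1) & ~(align - 1);
}

// New stack pointer after reserving nbytes on a downward-growing,
// 16-byte aligned stack.
static uint32_t push_frame(uint32_t sp, uint32_t nbytes)
{
    return sp - align_up(nbytes, 16);
}
```

A 4-byte local rounds up to a 16-byte frame, so starting from 66560 the stack pointer becomes 66544, and the argument stored at offset 12 lands at address 66556.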

As mentioned earlier, address zero is valid as far as the Wasm runtime is concerned, though dereferences are still undefined in C. This makes it more difficult to catch bugs. Given a null pointer this function would most likely read a zero at address zero and the program keeps running:

int get(int *p)
{
    return *p;
}

In WAT:

(func $get (type 0) (param i32) (result i32)
  (i32.load
    (local.get 0)))

Since the “hardware” won’t fault for us, ask Clang to do it instead:

$ clang ... -fsanitize=undefined -fsanitize-trap ...

Now in WAT:

(module
  (type (;0;) (func (param i32) (result i32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (func $get (type 0) (param i32) (result i32)
    (block  ;; label = @1
      (block  ;; label = @2
        (br_if 0 (;@2;)
          (i32.eqz
            (local.get 0)))
        (br_if 1 (;@1;)
          (i32.eqz
            (i32.and
              (local.get 0)
              (i32.const 3)))))
      (unreachable))
    (i32.load
      (local.get 0))))

Given a null pointer, get executes the unreachable instruction, causing the runtime to trap. In practice this is unrecoverable. Consider: nothing will restore __stack_pointer, and so the stack will “leak” the existing frames. (This can be worked around by exporting __stack_pointer and __stack_high via the --export linker flag, then restoring the stack pointer in the runtime after traps.)

Wasm was extended with bulk memory operations, and so there are single instructions for memset and memmove, which Clang maps onto the built-ins:

void clear(void *buf, long len)
{
    __builtin_memset(buf, 0, len);
}

(Below LLVM 20 you will need the undocumented -mbulk-memory option.) In WAT we see this as memory.fill:

(module
  (type (;0;) (func (param i32 i32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (func $clear (type 0) (param i32 i32)
    (block  ;; label = @1
      (br_if 0 (;@1;)
        (i32.eqz
          (local.get 1)))
      (memory.fill
        (local.get 0)
        (i32.const 0)
        (local.get 1)))))

That’s great! I wish this worked so well outside of Wasm. It’s one reason w64devkit has -lmemory, after all. Similarly __builtin_trap() maps onto the unreachable instruction, so we can reliably generate those as well.
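
One way to put __builtin_trap() to work is a tiny assertion macro, so that checks do not depend on a language runtime. This is only a sketch: the names affirm, get_checked, and at are hypothetical, and the native behavior noted in the comment is what GCC and Clang typically do rather than a guarantee.

```c
// Trap when the condition does not hold. On Wasm this compiles to the
// unreachable instruction. On native targets it is typically an
// illegal instruction that kills the process.
#define affirm(c)  while (!(c)) __builtin_trap()

// Hypothetical checked variant of the earlier get() example.
static int get_checked(int *p)
{
    affirm(p);
    return *p;
}

// Bounds-checked array read, for illustration only.
static int at(int *vals, int len, int i)
{
    affirm(i >= 0 && i < len);
    return vals[i];
}
```

Unlike the sanitizer, these checks stay in the build regardless of compiler flags, so they suit conditions worth verifying even in release builds.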

What about structures? They’re passed by address. Parameter structures go on the stack, then their addresses are passed. To return a structure, a function accepts an implicit out parameter in which to write the return. This isn’t unusual, except that it’s challenging to manage across module boundaries, i.e. in imports and exports, because caller and callee are in different address spaces. It’s especially tricky to return a structure from an export, as the caller must somehow allocate space in the callee’s address space for the result. The multi-value extension solves this, but using it in C involves an ABI change, which is still experimental.

Water Sort Game

Something you might not have expected: My water sort game imports no functions! It only exports three functions:

void      game_init(i32 seed);
DrawList *game_render(i32 width, i32 height, i32 mousex, i32 mousey);
void      game_update(i32 input, i32 mousex, i32 mousey, i64 now);

The game uses IMGUI-style rendering. The caller passes in the inputs, and the game returns a kind of display list telling it what to draw. In the SDL version these turn into SDL renderer calls. In the web version, these turn into canvas draws, and “mouse” inputs may be touch events. It plays and feels the same on both platforms. Simple!
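
The real DrawList layout is not shown here, so the following is only a guess at the general shape of such an interface. DrawOp, DRAW_MAX, push, and render_sketch are all hypothetical names. The point it illustrates is that returning a pointer into the module’s own static storage sidesteps the structure-return problem: the embedder receives an i32 address and reads the list out of linear memory.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t i32;

enum { DRAW_MAX = 64 };  // hypothetical fixed capacity

typedef struct {
    i32 x, y, w, h;  // rectangle in pixels
    i32 color;       // 0xRRGGBB
} DrawOp;

typedef struct {
    i32    len;
    DrawOp ops[DRAW_MAX];
} DrawList;

static void push(DrawList *dl, i32 x, i32 y, i32 w, i32 h, i32 color)
{
    assert(dl->len < DRAW_MAX);
    dl->ops[dl->len++] = (DrawOp){x, y, w, h, color};
}

// Draw a background, plus a small square under the pointer when it is
// inside the window. The list is static so that its address remains
// valid after the function returns.
DrawList *render_sketch(i32 width, i32 height, i32 mousex, i32 mousey)
{
    static DrawList dl;
    dl.len = 0;
    push(&dl, 0, 0, width, height, 0x202020);
    if (mousex>=0 && mousex<width && mousey>=0 && mousey<height) {
        push(&dl, mousex-8, mousey-8, 16, 16, 0xffffff);
    }
    return &dl;
}
```

The same function can be called from an SDL front end or from JavaScript, and each walks ops[0..len) issuing its own draw calls.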

I didn’t realize it at the time, but building the SDL version first was critical to my productivity. Debugging Wasm programs is really dang hard! Wasm tooling has yet to catch up with 1995, let alone 2025. Source-level debugging is still experimental and impractical. Developing applications on the Wasm platform is about as ergonomic as developing in MS-DOS. Instead, develop on a platform much better suited for it, then port your application to Wasm after you’ve got the issues worked out. The less Wasm-specific code you write, the better, even if it means writing more code overall. Treat it as you would some weird embedded target.

The game comes with 10,000 seeds. I generated ~200 million puzzles, sorted them by difficulty, and skimmed the top 10k most challenging. In the game they’re still sorted by increasing difficulty, so it gets harder as you make progress.

Wasm System Interface

WASI allows us to get a little more hands on. Let’s start with a Hello World program. A WASI application exports a traditional _start entry point which returns nothing and takes no arguments. I’m also going to set up some basic typedefs:

typedef unsigned char       u8;
typedef   signed int        i32;
typedef   signed long long  i64;
typedef   signed long       iz;

void _start(void)
{
}

wasm-ld will automatically export this function, so we don’t need an export_name attribute. This program successfully does nothing:

$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm && echo ok
ok

To write output WASI defines fd_write():

typedef struct {
    u8 *buf;
    iz  len;
} IoVec;

#define WASI(s) __attribute((import_module("wasi_unstable"),import_name(s)))
WASI("fd_write")  i32  fd_write(i32, IoVec *, iz, iz *);

Technically those iz variables are supposed to be size_t, passed through Wasm as i32, but this is a foreign function, I know the ABI, and so I can do as I please. I absolutely love that WASI barely uses null-terminated strings, not even for paths, which is a breath of fresh air, but they still marred the API with unsigned sizes. Which I choose to ignore.

This function is shaped like POSIX writev(). I’ve also set it up for import, including a module name. The oldest, most stable version of WASI is called wasi_unstable. (I suppose it shouldn’t be surprising that finding information in this ecosystem is difficult.)

Every returning WASI function returns an errno value, with zero as success rather than some kind of in-band signaling. Hence the final out parameter unlike POSIX writev().

Armed with this function, let’s use it:

void _start(void)
{
    u8    msg[] = "hello world\n";
    IoVec iov   = {msg, sizeof(msg)-1};
    iz    len   = 0;
    fd_write(1, &iov, 1, &len);
}

Then:

$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm
hello world

Keep going and you’ll have something like printf before long. If the write fails, we should probably communicate the error with at least the exit status. Because _start doesn’t return a status, we need to exit, for which we have proc_exit. It doesn’t return, so no errno return value.

WASI("proc_exit") void proc_exit(i32);

void _start(void)
{
    // ...
    i32 err = fd_write(1, &iov, 1, &len);
    proc_exit(!!err);
}
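
As a rough idea of where “something like printf” might start, here is a sketch of a fixed-capacity output buffer with string and integer printing. The Out type and the print and print_i32 names are assumptions made for this example. Flushing is left out: on WASI it would amount to pointing an IoVec at buf and len and passing it to fd_write.

```c
#include <assert.h>

typedef unsigned char u8;
typedef   signed int  i32;
typedef   signed long iz;

typedef struct {
    u8 *buf;
    iz  len;
    iz  cap;
    i32 err;  // nonzero once output has been truncated
} Out;

// Append len bytes, truncating (and recording an error) when full.
static void print(Out *o, const char *s, iz len)
{
    iz avail = o->cap - o->len;
    iz n     = len<avail ? len : avail;
    for (iz i = 0; i < n; i++) {
        o->buf[o->len++] = (u8)s[i];
    }
    o->err |= n < len;
}

// Append a decimal integer. Digits are generated from the negated
// magnitude so that the most negative value needs no special case.
static void print_i32(Out *o, i32 x)
{
    char  tmp[16];
    char *end = tmp + sizeof(tmp);
    char *beg = end;
    i32   t   = x<0 ? x : -x;
    do {
        *--beg = (char)('0' - t%10);
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    print(o, beg, end-beg);
}
```

Tracking truncation in an err field means callers can print freely and check once at the end, much like checking the errno result of the final fd_write.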

To get the command line arguments, call args_sizes_get to get the size, allocate some memory, then args_get to read the arguments. Same goes for the environment with a similar pair of functions. The sizes do not include a null pointer terminator, which is sensible.

Now that you know how to find and use these functions, you don’t need me to go through each one. However, opening files is a special, complicated case:

WASI("path_open") i32 path_open(i32,i32,u8*,iz,i32,i64,i64,i32,i32*);

That’s 9 parameters — and I had thought Win32 CreateFileW was over the top. It’s even more complex than it looks. It works more like POSIX openat(), except there’s no current working directory and so no AT_FDCWD. Every file and directory is opened relative to another directory, and absolute paths are invalid. If there’s no AT_FDCWD, how does one open the first directory? That’s called a preopen and it’s core to the file system security mechanism of WASI.

The Wasm runtime preopens zero or more directories before starting the program and assigns them the lowest numbered file descriptors starting at file descriptor 3 (after standard input, output, and error). A program intending to use path_open must first traverse the file descriptors, probing for preopens with fd_prestat_get and retrieving their path name with fd_prestat_dir_name. This name may or may not map back onto a real system path, and so this is a kind of virtual file system for the Wasm module. The probe stops on the first error.

To open an absolute path, it must find a matching preopen, then from it construct a path relative to that directory. This part I much dislike, as the module must contain complex path parsing functionality even in the simple case. Opening files is the most complex piece of the whole API.
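
To give a sense of the path handling involved, here is a simplified sketch of the matching step. It assumes the preopen names were already collected by probing, picks the longest preopen that contains the path, and returns the remainder as a relative path. The Path, Preopen, and Resolved types and both function names are hypothetical. It ignores “.” and “..” components, which a real implementation would need to consider.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *data;
    ptrdiff_t   len;
} Path;  // not null-terminated

typedef struct {
    int  fd;
    Path name;  // e.g. "/" or "/usr"
} Preopen;

typedef struct {
    int  fd;   // -1 when no preopen matched
    Path rel;  // remainder relative to fd, possibly empty
} Resolved;

// Whole-component prefix test: "/usr" contains "/usr/include" but
// not "/usrlocal".
static _Bool contains(Path dir, Path path)
{
    if (dir.len > path.len) return 0;
    for (ptrdiff_t i = 0; i < dir.len; i++) {
        if (dir.data[i] != path.data[i]) return 0;
    }
    return dir.len == path.len
        || (dir.len && dir.data[dir.len-1] == '/')
        || path.data[dir.len] == '/';
}

static Resolved resolve_path(Preopen *pre, int npre, Path abs)
{
    Resolved  r    = {-1, {0, 0}};
    ptrdiff_t best = -1;
    for (int i = 0; i < npre; i++) {
        Path dir = pre[i].name;
        if (dir.len > best && contains(dir, abs)) {
            best = dir.len;
            r.fd = pre[i].fd;
        }
    }
    if (r.fd < 0) return r;

    // Strip the matched prefix and any slashes that follow it.
    r.rel.data = abs.data + best;
    r.rel.len  = abs.len  - best;
    while (r.rel.len && *r.rel.data == '/') {
        r.rel.data++;
        r.rel.len--;
    }
    return r;
}
```

The resulting descriptor and relative path are what would be handed to path_open as its directory and path arguments.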

I mentioned before that program data is below the Clang stack. With the stack growing down, this sounds like a bad idea. A stack overflow quietly clobbers your data, and is difficult to recognize. More sensible to put the stack at the bottom so that it overflows off the bottom of memory and causes a fast fault. Fortunately there’s a switch for that:

$ clang --target=wasm32 ... -Wl,--stack-first ...

This is what you want by default. The actual default layout is left over from an early design flaw in wasm-ld, and it’s an oversight that it has not yet been corrected.

u-config

The above is in action in the u-config Wasm port. You can download the Wasm module, pkg-config.wasm, used in the web demo to run it in your favorite WASI-capable Wasm runtime:

$ wazero run pkg-config.wasm --modversion pkg-config
0.33.3

Though there are no preopens, so it cannot read any files. The -mount option maps real file system paths to preopens. This mounts the entire root file system read-only (ro) as /.

$ wazero run -mount /::ro pkg-config.wasm --cflags sdl2
-I/usr/include/SDL2 -D_REENTRANT

I doubt this is useful for anything, but it was a vehicle for learning and trying Wasm, and the results are pretty neat.

In the next article I discuss allocating the allocator.

A more robust raw OpenBSD syscall demo

2025-03-06T02:43:20Z

Ted Unangst published dude, where are your syscalls? on flak yesterday, with a neat demonstration of OpenBSD’s pinsyscall security feature, whereby only pre-registered addresses are allowed to make system calls. Whether it strengthens or weakens security is up for debate, but regardless it’s an interesting, low-level programming challenge. The original demo is fragile for multiple reasons, and requires manually locating and entering addresses for each build. In this article I show how to fix it. To prove that it’s robust, I ported an entire, real application to use raw system calls on OpenBSD.

The original program uses ARM64 assembly. I’m a lot more comfortable with x86-64 assembly, plus that’s the hardware I have readily on hand. So the assembly language will be different, but all the concepts apply to both these architectures. Almost none of these OpenBSD system interfaces are formally documented (or stable for that matter), and I had to dig around the OpenBSD source tree to figure it out (along with a helpful jart nudge). So don’t be afraid to get your hands dirty.

There are lots of subtle problems in the original demo, so let’s go through the program piece by piece, starting with the entry point:

void
start()
{
        w("hello\n", 6);
        x();
}

This function is registered as the entry point in the ELF image, so it has no caller. ~~That means no return address on the stack, so the stack is not aligned for a function.~~(Correction: The stack alignment issue is true for x86, but not ARM, so the original demo is fine.) In toy programs that goes unnoticed, but compilers generate code assuming the stack is aligned. In a real application this is likely to crash deep on the first SIMD register spill.

We could fix this with a force_align_arg_pointer attribute, at least for architectures that support it, but I prefer to write the entry point in assembly. Especially so we can access the command line arguments and environment variables, which is necessary in a real application. That happens to work the same as it does on Linux, so here’s my old, familiar entry point:

asm (
    "        .globl _start\n"
    "_start: mov   %rsp, %rdi\n"
    "        call  start\n"
);

Per the ABI, the first argument passes through rdi, so I pass a copy of the stack pointer, rsp, as it appeared on entry. Entry point arguments argc, argv, and envp are all pushed on the stack at rsp, so the first real function can retrieve it all from just the stack pointer. The original demo won’t use it, though. Using call to pass control pushes a return address, which will never be used, and aligns the stack for the first real function. I name it _start because that’s what the linker expects and so things will go a little smoother, so it’s rather convenient that the original didn’t use this name.
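
For completeness, here is a sketch of how the first real function might decode that stack pointer, assuming the conventional layout: argc, then the argv pointers, a null terminator, then the envp pointers with their own null terminator. The Args type and getargs name are made up for illustration, and the parameter is typed char ** rather than long * purely so the sketch can be exercised against an ordinary array.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    long   argc;
    char **argv;
    char **envp;
} Args;

// Decode the initial process stack as seen at the entry point. The
// first slot holds argc as an integer, the rest are pointers.
static Args getargs(char **sp)
{
    Args r = {0};
    r.argc = (long)(intptr_t)sp[0];
    r.argv = sp + 1;
    r.envp = r.argv + r.argc + 1;
    assert(!r.argv[r.argc]);  // argv is null-terminated
    return r;
}
```

With the assembly entry point above, start would accept this pointer as its only parameter and call something like getargs on it before doing anything else.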

Next up, the “write” function:

int
w(void *what, size_t len) {
        __asm(
"       mov x2, x1;"
"       mov x1, x0;"
"       mov w0, #1;"
"       mov x8, #4;"
"       svc #0;"
        );
        return 0;
}

There are two serious problems with this assembly block. First, the function arguments are not necessarily in those registers by the time control reaches the basic assembly block. The function prologue could move them around. Even more so if this function was inlined. This is exactly the problem extended inline assembly is intended to solve. Second, it clobbers a number of registers. Compilers assume this does not happen when generating their own code. This sort of assembly falls apart the moment it comes into contact with a non-zero optimization level.

Solving this is just a matter of using inline assembly properly:

long w(void *what, long len)
{
    char err;
    long rax = 4;  // SYS_write
    asm volatile (
        "syscall"
        : "+a"(rax), "+d"(len), "=@ccc"(err)
        : "D"(1), "S"(what)
        : "rcx", "r11", "memory"
    );
    return err ? -rax : rax;
}

I’ve enhanced it a bit, returning a Linux-style negative errno on error. In the BSD ecosystem, syscall errors are indicated using the carry flag, which here is output into err via =@ccc. When set, the return value is an errno. Further, the OpenBSD kernel uses both rax and rdx for return values, so I’ve also listed rdx as an input+output despite not consuming the result. Despite all these changes, this function is not yet complete! We’ll get back to it later.

The “exit” function, x, is just fine:

void
x() {
        __asm(
"       mov x8, #1;"
"       svc #0;"
        );
}

It doesn’t set an exit status, so it passes garbage instead, but otherwise this works. No inputs, plus clobbers and outputs don’t matter when control never returns. In a real application I might write it:

__attribute((noreturn))
void x(int status)
{
    asm volatile ("syscall" :: "a"(1), "D"(status));
    __builtin_unreachable();
}

This function will need a little additional work later, too.

The ident section is basically fine as-is:

__asm(" .section \".note.openbsd.ident\", \"a\"\n"
"       .p2align 2\n"
"       .long   8\n"
"       .long   4\n"
"       .long   1\n"
"       .ascii \"OpenBSD\\0\"\n"
"       .long   0\n"
"       .previous\n");

The compiler assumes the current section remains the same at the end of the assembly block, which here is accomplished with .previous. Though it clobbers the assembler’s remembered “other” section and so may interfere with surrounding code using .previous. Better to use .pushsection and .popsection for good stack discipline. There are many such examples in the OpenBSD source tree.

asm (
    ".pushsection .note.openbsd.ident, \"a\"\n"
    ".long  8, 4, 1, 0x6e65704f, 0x00445342, 0\n"
    ".popsection\n"
);
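
The two hexadecimal constants are just the bytes of "OpenBSD\0" packed as little-endian 32-bit words. A quick way to check that, using a hypothetical le32 helper:

```c
#include <assert.h>
#include <stdint.h>

// Pack four bytes the way a little-endian .long lays them out in
// memory: the first character lands in the least significant byte.
static uint32_t le32(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    return (uint32_t)p[0]       | (uint32_t)p[1] <<  8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}
```

Here le32("Open") evaluates to 0x6e65704f and le32("BSD"), which includes the string’s null terminator as its fourth byte, evaluates to 0x00445342.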

Now the trickiest part, the pinsyscall table:

struct whats {
        unsigned int offset;
        unsigned int sysno;
} happening[] __attribute__((section(".openbsd.syscalls"))) = {
        { 0x104f4, 4 },
        { 0x10530, 1 },
};

Those offsets — offsets from the beginning of the ELF image — were entered manually, and it kind of ruins the whole demo. We don’t have a good way to get at those offsets from C, or any high level language. However, we can solve that by tweaking the inline assembly with some labels:

__attribute((noinline))
long w(void *what, long len)
{
    // ...
    asm volatile (
        "_w: syscall"
        // ...
    );
    // ...
}

__attribute((noinline,noreturn))
void x(int status)
{
    asm volatile (
        "_x: syscall"
        // ...
    );
    // ...
}

Very importantly I’ve added noinline to prevent these functions from being inlined into additional copies of the syscall instruction, which of course won’t be registered. This also prevents duplicate labels causing assembler errors. Once we have the labels, we can use them in an assembly block listing the allowed syscall instructions:

asm (
    ".pushsection .openbsd.syscalls\n"
    ".long  _x, 1\n"
    ".long  _w, 4\n"
    ".popsection\n"
);

That lets the linker solve the offsets problem, which is its main job after all. With these changes the demo works reliably, even under high optimization levels. I suggest these flags:

$ cc -static -nostdlib -no-pie -o where where.c

Disabling PIE with -no-pie is necessary in real applications or else strings won’t work. You can apply more flags to strip it down further, but these are the flags generally necessary to compile these sorts of programs on at least OpenBSD 7.6.

So, how do I know this stuff works in general? Because I ported my ultra portable pkg-config clone, u-config, to use raw OpenBSD syscalls: openbsd_main.c. Everything still works at high optimization levels.

$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c libmemory.a
$ ./pkg-config --cflags --libs libcurl
-I/usr/local/include -L/usr/local/lib -lcurl

Because the new syscall wrappers behave just like Linux system calls, it leverages the linux_noarch.c platform, and the whole port is ~70 lines of code. A few more flags (-fno-stack-protector, -Oz, -s, etc.), and it squeezes into a slim 21.6K static binary.

Despite making no libc calls, it’s not possible to stop compilers from fabricating (hallucinating?) string function calls, so the build above depends on external definitions. In the command above, libmemory.a comes from libmemory.c found in w64devkit. Alternatively, and on topic, you could link the OpenBSD libc string functions by omitting libmemory.a from the build.

$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c

Though it pulls in a lot of bloat (~8x size increase), and teasing out the necessary objects isn’t trivial.

Robust Wavefront OBJ model parsing in C

2025-03-02T23:22:58Z

Wavefront OBJ is a line-oriented, text format for 3D geometry. It’s widely supported by modeling software, easy to parse, and trivial to emit, much like Netpbm for 2D image data. Poke around hobby 3D graphics projects and you’re likely to find a bespoke OBJ parser. These typically only load their own model data, so robustness doesn’t much matter, but they usually have hard limitations and don’t stand up to fuzz testing. This article presents a robust, partial OBJ parser in C with no hard-coded limitations, written from scratch. Like similar articles, it’s not really about OBJ but demonstrating some techniques you’ve probably never seen before.

If you’d like to see the ready-to-run full source: objrender.c. All images are screenshots of this program.

First let’s establish the requirements. By robust I mean no undefined behavior for any input, valid or invalid; no out of bounds accesses, no signed overflows. Input is otherwise not validated. Invalid input may load as valid by chance, which will render as either garbage or nothing. The behavior will also not vary by locale.

We’re also only worried about vertices, normals, and triangle faces with normals. In OBJ these are v, vn, and f elements. Normals let us light the model effectively while checking our work. A cube fitting this subset of OBJ might look like:

v  -1.00 -1.00 -1.00
v  -1.00 +1.00 -1.00
v  +1.00 +1.00 -1.00
v  +1.00 -1.00 -1.00
v  -1.00 -1.00 +1.00
v  -1.00 +1.00 +1.00
v  +1.00 +1.00 +1.00
v  +1.00 -1.00 +1.00

vn +1.00  0.00  0.00
vn -1.00  0.00  0.00
vn  0.00 +1.00  0.00
vn  0.00 -1.00  0.00
vn  0.00  0.00 +1.00
vn  0.00  0.00 -1.00

f   3//1  7//1  8//1
f   3//1  8//1  4//1
f   1//2  5//2  6//2
f   1//2  6//2  2//2
f   7//3  3//3  2//3
f   7//3  2//3  6//3
f   4//4  8//4  5//4
f   4//4  5//4  1//4
f   8//5  7//5  6//5
f   8//5  6//5  5//5
f   3//6  4//6  1//6
f   3//6  1//6  2//6

Take note:

Some fields are separated by more than one space.
Vertices and normals are fractional (floating point).
Faces use 1-indexing instead of 0-indexing.
Faces in this model lack a texture index, hence // (empty).

Inputs may have other data, but we’ll skip over it, including face texture indices, or face elements beyond the third. Some of the models I’d like to test have relative indices, so I want to support those, too. A relative index refers backwards from the last vertex, so the order of the lines in an OBJ matters. For example, the cube faces above could have instead been written:

f  -6//-6 -2//-6 -1//-6
f  -6//-6 -1//-6 -5//-6
f  -8//-5 -4//-5 -3//-5
f  -8//-5 -3//-5 -7//-5
f  -2//-4 -6//-4 -7//-4
f  -2//-4 -7//-4 -3//-4
f  -5//-3 -1//-3 -4//-3
f  -5//-3 -4//-3 -8//-3
f  -1//-2 -2//-2 -3//-2
f  -1//-2 -3//-2 -4//-2
f  -6//-1 -5//-1 -8//-1
f  -6//-1 -8//-1 -7//-1

Due to this the parser cannot be blind to line order, and it must handle negative indices. Relative indexing has the nice effect that we can group faces, and those groups are relocatable. We can reorder them without renumbering the faces, or concatenate models just by concatenating their OBJ files.
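
The index rules boil down to a small function. This is a sketch under a hypothetical name, objindex, and not necessarily how the full program does it: given an index from a face element and the number of vertices (or normals) seen so far, it produces a zero-based array index or -1 if the index is invalid.

```c
#include <assert.h>
#include <stddef.h>

// Convert an OBJ index to a zero-based array index. Positive indices
// are 1-based. Negative indices count back from the most recently
// defined element. Zero and out-of-range indices yield -1. The range
// checks come first, so none of the arithmetic can overflow.
static ptrdiff_t objindex(ptrdiff_t index, ptrdiff_t count)
{
    assert(count >= 0);
    if (index > 0 && index <= count) {
        return index - 1;
    } else if (index < 0 && index >= -count) {
        return count + index;
    }
    return -1;
}
```

With the eight cube vertices loaded, both 3 and -6 resolve to array index 2, matching the two equivalent face listings above.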

The fundamentals

To start off, we’ll be using an arena of course, trivializing memory management while swiping aside all hard-coded limits. A quick reminder of the interface:

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

// Always returns an aligned pointer inside the arena. Allocations are
// zeroed. Does not return on OOM (never returns a null pointer).
void *alloc(Arena *, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align);
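
The interface above is only declared here. As a reminder of how little sits behind it, here is a minimal sketch of one way alloc could be written. It assumes the arena is a plain byte range, that align is a power of two, that size is positive, and that aborting is an acceptable out-of-memory policy. The body is an illustration under those assumptions, not necessarily the definition used by this parser.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

// Sketch: bump allocation from the front of the arena. Assumes align is
// a power of two, size > 0, and count >= 0.
void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    // Padding needed to round a->beg up to the requested alignment.
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg - pad;
    // Dividing instead of multiplying avoids overflow in count*size.
    if (avail < 0 || count > avail/size) {
        abort();  // assumed OOM policy: never return a null pointer
    }
    char *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}
```

Because the check divides the remaining space by the element size, a huge count is rejected before any multiplication happens.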

Also, no null terminated strings, perhaps the main source of problems with bespoke parsers.

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

Pointer arithmetic is error prone, so the tricky stuff is relegated to a handful of functions, each of which can be exhaustively validated almost at a glance:

Str span(char *beg, char *end)
{
    Str r = {0};
    r.data = beg;
    r.len  = beg ? end-beg : 0;
    return r;
}

_Bool equals(Str a, Str b)
{
    return a.len==b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}

Str trimleft(Str s)
{
    for (; s.len && *s.data<=' '; s.data++, s.len--) {}
    return s;
}

Str trimright(Str s)
{
    for (; s.len && s.data[s.len-1]<=' '; s.len--) {}
    return s;
}

Str substring(Str s, ptrdiff_t i)
{
    if (i) {
        s.data += i;
        s.len  -= i;
    }
    return s;
}

Each avoids the purposeless special cases around null pointers (i.e. zero-initialized Str objects) that would otherwise work out naturally. The space character and all control characters are treated as whitespace for simplicity. When I started writing this parser, I didn’t define all these functions up front. I defined them as needed. (A good standard library would have provided similar definitions out-of-the-box.) If you’re worried about misuse, add the appropriate assertions.

A powerful and useful string function I’ve discovered, and which I use in every string-heavy program, is cut, a concept I shamelessly stole from the Go standard library:

typedef struct {
    Str   head;
    Str   tail;
    _Bool ok;
} Cut;

Cut cut(Str s, char c)
{
    Cut r = {0};
    if (!s.len) return r;  // null pointer special case
    char *beg = s.data;
    char *end = s.data + s.len;
    char *cut = beg;
    for (; cut<end && *cut!=c; cut++) {}
    r.ok   = cut < end;
    r.head = span(beg, cut);
    r.tail = span(cut+r.ok, end);
    return r;
}

It slices, it dices, it juliennes! Need to iterate over lines? Cut it up:

    Cut c = {0};
    c.tail = input;
    while (c.tail.len) {
        c = cut(c.tail, '\n');
        Str line = c.head;
        // ... process line ...
    }

Need to iterate over the fields in a line? Cut the line on the field separator. Then cut the field on the element separator. No allocation, no mutation (strtok).
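
As a sketch of that nesting, the block below counts the space-separated fields on a line, and the same cut can then split a field such as 3//1 on its slashes. It repeats Str, span, and cut from above so that it stands alone. The name countfields is hypothetical, chosen only for illustration.

```c
#include <stddef.h>

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct {
    Str   head;
    Str   tail;
    _Bool ok;
} Cut;

Str span(char *beg, char *end)
{
    Str r = {0};
    r.data = beg;
    r.len  = beg ? end-beg : 0;
    return r;
}

Cut cut(Str s, char c)
{
    Cut r = {0};
    if (!s.len) return r;  // null pointer special case
    char *beg = s.data;
    char *end = s.data + s.len;
    char *cut = beg;
    for (; cut<end && *cut!=c; cut++) {}
    r.ok   = cut < end;
    r.head = span(beg, cut);
    r.tail = span(cut+r.ok, end);
    return r;
}

// Count the non-empty fields on a line separated by sep.
ptrdiff_t countfields(Str line, char sep)
{
    ptrdiff_t n = 0;
    Cut c = {0};
    c.tail = line;
    while (c.tail.len) {
        c  = cut(c.tail, sep);
        n += c.head.len > 0;  // adjacent separators yield empty fields
    }
    return n;
}
```

Cutting the field 3//1 on '/' three times yields the heads 3, an empty string, and 1, with ok false on the last cut because no separator remained.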

Reading input

Unlike a program designed to process arbitrarily large inputs, the intention here is to load the entire model into memory. We don’t need to fiddle around with loading a line of input at a time (fgets, getline, etc.), the usual approach with OBJ parsers. If the OBJ source cannot fit in memory, then the model won’t fit in memory. This greatly simplifies the parser, not to mention makes it faster, while lifting hard-coded limits like maximum line length.

The simple arena I use makes whole-file loading so easy. Read straight into the arena without checking the file size (ftell, etc.), which means streaming inputs (i.e. pipes) work automatically.

Str loadfile(Arena *a, FILE *f)
{
    Str r  = {0};
    r.data = a->beg;
    r.len  = a->end - a->beg;
    r.len  = fread(r.data, 1, r.len, f);
    a->beg += r.len;  // claim the loaded bytes so later allocations follow them
    return r;
}

Without buffered input, you may need a loop around the read:

Str loadfile(Arena *a, int fd)
{
    Str r = {0};
    r.data = a->beg;
    ptrdiff_t cap = a->end - a->beg;
    for (;;) {
        ptrdiff_t n = read(fd, r.data+r.len, cap-r.len);
        if (n < 1) {
            a->beg += r.len;  // claim the loaded bytes
            return r;  // ignoring read errors
        }
        r.len += n;
    }
    }
}

You might consider triggering an out-of-memory error if the arena was filled to the brim, which almost certainly means the input was truncated. Though that’s likely to happen anyway because the next allocation from that arena will fail.
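
A minimal sketch of that check, assuming the FILE-based loader: if fread fills the arena completely, report the input as probably truncated. The name loadfile_checked and the ok out-parameter are hypothetical, introduced only for this illustration.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

// Like loadfile, but clears *ok when the input filled the arena to the
// brim, which almost certainly means it was truncated. The arena is
// advanced past whatever was loaded.
Str loadfile_checked(Arena *a, FILE *f, _Bool *ok)
{
    Str r  = {0};
    r.data = a->beg;
    r.len  = a->end - a->beg;
    r.len  = (ptrdiff_t)fread(r.data, 1, (size_t)r.len, f);
    *ok    = r.len < a->end - a->beg;
    a->beg += r.len;
    return r;
}
```

An input that exactly matches the free space is reported as truncated too, a false positive that seems an acceptable price for a one-line check.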

Side note: When using a multi GB arena, issuing such huge read requests stress tests the underlying IO system. I’ve found libc bugs this way. In this case I used SDL2 for the demo, and SDL lost the ability to read files after I increased the arena size to 4GB in order to test a gigantic model (“Power Plant”). I’ve run into this before, and I assumed it was another Microsoft CRT bug. After investigating deeper for this article, I learned it’s an ancient SDL bug that’s made it all the way into SDL3. -Wconversion warns about it, but was accidentally squelched in the 64-bit port back in 2009. It seems nobody else loads files this way, so watch out for platform bugs if you use this technique!

Parsing data

In practice, rendering systems limit counts to the 32-bit range, which is reasonable. So in the OBJ parser, vertex and normal indices will be 32-bit integers. Negatives will be needed for at least relative indexing. Parsing from a Str means null-terminated functions like strtol are off limits. So here’s a function to parse a signed integer out of a Str:

int32_t parseint(Str s)
{
    uint32_t r    = 0;
    int32_t  sign = 1;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        switch (s.data[i]) {
        case '+':            break;
        case '-': sign = -1; break;
        default : r = 10*r + s.data[i] - '0';
        }
    }
    return r * sign;
}

The uint32_t means it’s free to overflow. If it overflows, the input was invalid. If it doesn’t hold an integer, the input was invalid. In either case it will produce a harmless, garbage result. Despite being unsigned, it works just fine with negative inputs thanks to two’s complement.

For floats I didn’t intend to parse exponential notation, but some models I wanted to test actually did use it — probably by accident — so I added it anyway. That requires a function to compute the exponent.

float expt10(int32_t e)
{
    float   y = 1.0f;
    float   x = e<0 ? 0.1f : e>0 ? 10.0f : 1.0f;
    int32_t n = e<0 ? e : -e;
    for (; n < -1; n /= 2) {
        y *= n%2 ? x : 1.0f;
        x *= x;
    }
    return x * y;
}

That’s exponentiation by squaring, avoiding signed overflow on the exponent. Traditionally a negative exponent is inverted, but applying unary - to an arbitrary integer might overflow (consider -2147483648). So instead I iterate from the negative end. The negative range is larger than the positive, after all. Finally we can parse floats:

float parsefloat(Str s)
{
    float r    = 0.0f;
    float sign = 1.0f;
    float exp  = 0.0f;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        switch (s.data[i]) {
        case '+':            break;
        case '-': sign = -1; break;
        case '.': exp  =  1; break;
        case 'E':
        case 'e': exp  = exp ? exp : 1.0f;
                  exp *= expt10(parseint(substring(s, i+1)));
                  i    = s.len;
                  break;
        default : r = 10.0f*r + (s.data[i] - '0');
                  exp *= 0.1f;
        }
    }
    return sign * r * (exp ? exp : 1.0f);
}

Probably not as precise as strtof, but good enough for loading a model. It’s also ~30% faster for this purpose than my system’s strtof. If it hits an exponent, it combines parseint and expt10 to augment the result so far. At least for all the models I tried, the exponent only appeared for tiny values. They round to zero with no visible effects, so you can cut the implementation by more than half in one fell swoop if you wish (no more expt10 nor substring either):

        switch (s.data[i]) {
        // ...
        case 'E':
        case 'e': return 0;  // probably small *shrug*
        // ...
        }

Why not strtof? That has the rather annoying requirement that input is null terminated, which is not the case here. Worse, it’s affected by the locale and doesn’t behave consistently nor reliably.

A vertex is three floats separated by whitespace. So combine cut and parsefloat to parse one.

typedef struct {
    float v[3];
} Vert;

Vert parsevert(Str s)
{
    Vert r = {0};
    Cut c = cut(trimleft(s), ' ');
    r.v[0] = parsefloat(c.head);
    c = cut(trimleft(c.tail), ' ');
    r.v[1] = parsefloat(c.head);
    c = cut(trimleft(c.tail), ' ');
    r.v[2] = parsefloat(c.head);
    return r;
}

cut parses a field between every space, including empty fields between adjacent spaces, so trimleft discards extra space before cutting. If the line ends early, this passes empty strings into parsefloat which come out as zeros. No special checks required for invalid input.

Faces are a set of three vertex indices and three normal indices, and parse almost the same way. Relative indices are immediately converted to absolute indices using the number of vertices/normals so far.

typedef struct {
    int32_t v[3];
    int32_t n[3];
} Face;

static Face parseface(Str s, ptrdiff_t nverts, ptrdiff_t nnorms)
{
    Face r      = {0};
    Cut  fields = {0};
    fields.tail = s;
    for (int i = 0; i < 3; i++) {
        fields = cut(trimleft(fields.tail), ' ');
        Cut elem = cut(fields.head, '/');
        r.v[i] = parseint(elem.head);
        elem = cut(elem.tail, '/');  // skip texture
        elem = cut(elem.tail, '/');
        r.n[i] = parseint(elem.head);

        // Process relative subscripts
        if (r.v[i] < 0) {
            r.v[i] = (int32_t)(r.v[i] + 1 + nverts);
        }
        if (r.n[i] < 0) {
            r.n[i] = (int32_t)(r.n[i] + 1 + nnorms);
        }
    }
    return r;
}

Since nverts must be non-negative, and a relative index is negative by definition, adding them together can never overflow. If there are too many vertices, the result might be truncated, as indicated by the cast. That’s fine. Just invalid input.

There’s an interesting interview question here: Consider this alternative to the above, maintaining the explicit cast to dismiss the -Wconversion warning.

            r.v[i] += (int32_t)(1 + nverts);

Is it equivalent? Can this overflow? (Answers: No and yes.) If yes, under what conditions? Unfortunately a fuzz test would never hit it.

Putting it together

For this case, a model is three arrays of vertices, normals, and indices. While faces only support 32-bit indexing, I use ptrdiff_t in order to skip overflow checks. There cannot possibly be more vertices than bytes of source, so these counts cannot overflow.

typedef struct {
    Vert     *verts;
    ptrdiff_t nverts;
    Vert     *norms;
    ptrdiff_t nnorms;
    Face     *faces;
    ptrdiff_t nfaces;
} Model;

Model parseobj(Arena *, Str);

They’d probably look a little nicer as dynamic arrays, but we won’t need that machinery. That’s because the parser makes two passes over the OBJ source, the first time to count:

    Model m     = {0};
    Cut   lines = {0};

    lines.tail = obj;
    while (lines.tail.len) {
        lines = cut(lines.tail, '\n');
        Cut fields = cut(trimright(lines.head), ' ');
        Str kind = fields.head;
        if (equals(S("v"), kind)) {
            m.nverts++;
        } else if (equals(S("vn"), kind)) {
            m.nnorms++;
        } else if (equals(S("f"), kind)) {
            m.nfaces++;
        }
    }

It’s a lightweight pass, skipping over the numeric data. With that information collected, we can allocate the model:

    m.verts  = new(a, m.nverts, Vert);
    m.norms  = new(a, m.nnorms, Vert);
    m.faces  = new(a, m.nfaces, Face);
    m.nverts = m.nnorms = m.nfaces = 0;

On the next pass we call parsevert and parseface to fill it out.

    lines.tail = obj;
    while (lines.tail.len) {
        lines = cut(lines.tail, '\n');
        Cut fields = cut(trimright(lines.head), ' ');
        Str kind = fields.head;
        if (equals(S("v"), kind)) {
            m.verts[m.nverts++] = parsevert(fields.tail);
        } else if (equals(S("vn"), kind)) {
            m.norms[m.nnorms++] = parsevert(fields.tail);
        } else if (equals(S("f"), kind)) {
            m.faces[m.nfaces++] = parseface(fields.tail, m.nverts, m.nnorms);
        }
    }

At this point the model is parsed, though it’s not necessarily consistent. Face indices may still be out of range. The next step is to transform it into a more useful representation.

Transformation

Rendering the model is the easiest way to verify it came out alright, and it’s generally useful for debugging problems. Because it basically does all the hard work for us, and doesn’t require ridiculous contortions to access, I’m going to render with old school OpenGL 1.1. It provides a glInterleavedArrays function with a bunch of predefined formats. The one that interests me is GL_N3F_V3F, where each vertex is a normal and a position. Each face is three such elements. I came up with this:

typedef struct {  // GL_N3F_V3F
    Vert n, v;
} N3FV3F[3];

typedef struct {
    N3FV3F   *data;
    ptrdiff_t len;
} N3FV3Fs;

// Transform a model into a GL_N3F_V3F representation.
N3FV3Fs n3fv3fize(Arena *, Model);

If you’re being precise you’d use GLfloat, but this is good enough for me. By using a different arena for this step, we can discard the OBJ data once it’s in the “local” format. For example:

    Arena perm    = {...};
    Arena scratch = {...};

    N3FV3Fs *scene = new(&perm, nmodels, N3FV3Fs);
    for (int i = 0; i < nmodels; i++) {
        Arena temp  = scratch;  // free OBJ at end of iteration
        Str   obj   = loadfile(&temp, path[i]);
        Model model = parseobj(&temp, obj);
        scene[i]    = n3fv3fize(&perm, model);
    }

The conversion allocates the GL_N3F_V3F array, discards invalid faces, and copies the valid faces into the array:

N3FV3Fs n3fv3fize(Arena *a, Model m)
{
    N3FV3Fs r = {0};
    r.data = new(a, m.nfaces, N3FV3F);
    for (ptrdiff_t f = 0; f < m.nfaces; f++) {
        _Bool valid = 1;
        for (int i = 0; i < 3; i++) {
            valid &= m.faces[f].v[i]>0 && m.faces[f].v[i]<=m.nverts;
            valid &= m.faces[f].n[i]>0 && m.faces[f].n[i]<=m.nnorms;
        }

        if (valid) {
            ptrdiff_t t = r.len++;
            for (int i = 0; i < 3; i++) {
                r.data[t][i].n = m.norms[m.faces[f].n[i]-1];
                r.data[t][i].v = m.verts[m.faces[f].v[i]-1];
            }
        }
    }
    return r;
}

Here’s what that looks like in OpenGL with suzanne.obj and bmw.obj:

This was a fun little project, and perhaps you learned a new technique or two after checking it out.

Meet the new xxd for w64devkit: rexxd

2025-02-17T00:49:49Z

xxd is a versatile hexdump utility with a “reverse” feature, originally written between 1990–1996. The Vim project soon adopted it, and it’s lived there ever since. If you have Vim, you also have xxd. Its primary use cases are (1) the basis for a hex editor due to its -r reverse option that can unhexdump its previous output, and (2) a data embedding tool for C and C++ (-i). The former provides Vim’s rudimentary hex editor functionality. The second case is of special interest to w64devkit: xxd -i appears in many builds that embed arbitrary data. It’s important that w64devkit has a compatible implementation, and a freshly rewritten, improved xxd, rexxd, now replaces the original xxd (as xxd).

For those unfamiliar with xxd, examples are in order. Its default hexdump output looks like this:

$ echo hello world | xxd | tee dump
00000000: 6865 6c6c 6f20 776f 726c 640a            hello world.

Octets display in pairs with an ASCII text listing on the right. All configurable. I can run this in reverse (-r), recovering the original input:

$ xxd -r dump
hello world

The tool reads the offset before the colon, the hexadecimal octets, and ignores the text column. By editing dump with a text editor, I can change the raw octets of the original input. From this point of view, the hexdump is actually a program of two alternating instructions: seek and write. xxd seeks to the offset, writes the octets, then repeats. It also doesn’t truncate the output file, so a hexdump can express binary patches as a seek/write program.

$ echo hello world >hello
$ echo 6: 65766572796f6e650a | xxd -r - hello
$ cat hello
hello everyone

That seeks to offset 0x6, then writes the 9 octets. The xxd parser is flexible, and I did not need to follow the default format. It figured out the format on its own, and rexxd further improves on this. We can use it to create large files out of thin air, too:

$ echo 3fffffff: 00 | xxd -r - >1G

This command creates an all-zero, 1GiB file, 1G, by seeking to just before 1GiB then writing a zero. I used >1G so that the shell would truncate the file before starting xxd — in case it was larger or contained non-zeros.

This is a “smart seek” of course, and it’s not literally seeking on every line. The tool tracks its file position and only seeks when necessary. If seeking fails, it simulates the seek using a write if possible. When would it not be possible? Lines need not be in order, of course, and so it may need to seek backwards. Lines can also overlap in contents. If it weren’t for buffering, or if rexxd had a unified buffer cache, then by using the same file for input and output an “xxd program” could write new instructions for itself and accidentally become Turing-complete.
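
The seek/write model is small enough to sketch. The function below applies one plain "offset: octets" line, like the patch example above, to an in-memory buffer standing in for the output file. It is a hypothetical illustration, not the parser from xxd or rexxd: the names applyline and hexval are mine, and it makes no attempt to recognize the text column of a full hexdump.

```c
#include <stddef.h>

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Apply one "offset: octets" line to buf. Returns the position after the
// last write, or -1 if the line is malformed or a write would fall
// outside buf.
ptrdiff_t applyline(unsigned char *buf, ptrdiff_t cap, const char *line)
{
    // "seek": parse the hexadecimal offset before the colon
    ptrdiff_t pos     = 0;
    int       ndigits = 0;
    for (; hexval(*line) >= 0; line++, ndigits++) {
        if (pos > cap/16) return -1;  // far outside buf, avoids overflow
        pos = pos*16 + hexval(*line);
    }
    if (!ndigits || *line++ != ':') return -1;

    // "write": each pair of hex digits is one octet, spaces are skipped
    for (;;) {
        while (*line == ' ') line++;
        int hi = hexval(line[0]);
        if (hi < 0) return pos;  // end of octets, rest of line ignored
        int lo = hexval(line[1]);
        if (lo < 0 || pos >= cap) return -1;
        buf[pos++] = (unsigned char)(hi<<4 | lo);
        line += 2;
    }
}
```

Applied to a buffer holding "hello world" plus a newline, the line 6: 65766572796f6e650a overwrites from offset 6 onward and leaves "hello everyone" plus a newline, the same result as the shell example.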

The other common mode, -i, looks like this:

$ echo hello world >hello
$ xxd -i hello hello.c

Which produces this hello.c:

unsigned char hello[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
unsigned int hello_len = 12;

Note how it converted the file name into variable names. Characters disallowed in variable names become underscores _. When reading from standard input, xxd only emits the octets, unless the new-ish -n name option is given, in which case that becomes the variable name. This remains popular because, #embed notwithstanding, as of this writing all major toolchains remain stubborn about embedding data on their own.
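
The name conversion rule can be sketched in a few lines. The function below is a hypothetical illustration, not code from xxd or rexxd. It keeps ASCII letters, digits, and underscores, passes bytes at or above 0x80 through untouched on the assumption that they belong to UTF-8 sequences, and turns everything else into an underscore. A leading digit would need separate handling, which this sketch omits.

```c
#include <stddef.h>

// Rewrite a file name in place so it can serve as a C variable name.
// Hypothetical sketch: ASCII letters, digits, and '_' are kept, bytes
// >= 0x80 are assumed to be UTF-8 and kept, the rest become '_'.
void mangle(char *name, ptrdiff_t len)
{
    for (ptrdiff_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        _Bool keep = (c >= 'a' && c <= 'z') ||
                     (c >= 'A' && c <= 'Z') ||
                     (c >= '0' && c <= '9') ||
                     c == '_' || c >= 0x80;
        if (!keep) {
            name[i] = '_';
        }
    }
}
```

Under this rule hello stays hello, while a name like hello.bin becomes hello_bin.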

The case for replacement

The idea of replacing it began with backporting the -n name option to Vim 9.0 xxd. The feature did not appear in a release until a year ago, 28 years after -i, despite its obviousness. I’ve also felt that xxd is slower than it could be, and a momentary examination reveals it’s buggier than it ought to be. As expected, a few seconds of fuzz testing xxd -r reveals bugs, and it doesn’t even require writing a single line of code:

$ afl-gcc -fsanitize=address,undefined xxd.c
$ mkdir inputs
$ echo >inputs/sample
$ afl-fuzz -i inputs/ -o fuzzout/ ./a.out -r

The Windows port is lacking in the usual ways, unable to handle Unicode paths. The new Vim 9.1 xxd -R color feature broke the Windows port, and if w64devkit included Vim 9.1 then I’d need to patch out the new bugs. As demonstrated above, at least it’s trivial to compile! It’s a single source file, xxd.c, and requires no configuration. I love that.

The more I looked, the more problems I found. It’s not doing anything terribly complex, so I expected it wouldn’t be difficult to rewrite it with a better foundation. So I did. Ignoring tests and documentation, my rewrite is about twice as long. In exchange, it’s substantially faster:

$ dd if=/dev/urandom of=bigfile bs=1M count=64

$ time orig-xxd bigfile dump
real    0m 4.40s
user    0m 2.89s
sys     0m 1.46s

$ time rexxd bigfile dump
real    0m 0.31s
user    0m 0.07s
sys     0m 0.21s

Same in reverse:

$ time orig-xxd -r dump nul
real    0m 5.81s
user    0m 5.67s
sys     0m 0.07s

$ time rexxd -r dump nul
real    0m 0.33s
user    0m 0.23s
sys     0m 0.09s

Or embedding data with rexxd:

$ time orig-xxd -i bigfile bigfile.c
real    0m 10.32s
user    0m 9.85s
sys     0m 0.37s

$ time rexxd -i bigfile bigfile.c
real    0m 0.40s
user    0m 0.07s
sys     0m 0.34s

I wanted to keep it portable and simple, so that’s without fancy SIMD processing. Just SWAR parsing, branch avoidance, no division on hot paths, and sound architecture. I also optimized for the typical case at the cost of the atypical case. It’s a little unfair to compare it to a program probably first written on a 16-bit machine, but there was time for it to pick up these techniques over the decades, too.

Unicode support works well:

$ cat π
3.14159265358979323846264338327950288419716939937510582097494
$ rexxd -i π π.c

Producing this source with Unicode variables:

unsigned char π[] = {
  0x33, 0x2e, 0x31, 0x34, 0x31, 0x35, 0x39, 0x32, 0x36, 0x35, 0x33, 0x35,
  // ...
  0x34, 0x0a
};
unsigned int π_len = 62;

Whereas the original xxd on Windows has the usual CRT problems:

$ orig-xxd -i π
orig-xxd: p: No such file or directory

It also struggles with 64-bit offsets, particularly on 32-bit hosts and LLP64 hosts like Windows. In contrast, I designed rexxd to robustly process file offsets as 64-bit on all hosts. Its tests operate on a virtual file system with virtual files at those sizes, so those paths really have been tested, too.

The original xxd only uses static allocation, which places small range limits on the configuration:

$ orig-xxd -c 1000
orig-xxd: invalid number of columns (max. 256).

In rexxd everything is arena allocated of course, and options are limited only by the available memory, so the above, and more, would work. The arena helps make the SWAR tricks possible, too, providing a fast runway to load more data at a time.

While reverse engineering the original, I documented bugs I discovered and noted them with a BUG: comment if you wanted to see more. I’m not aiming for bug compatibility, so these are not present in rexxd.

Platform layer

The xxd man page suggests using strace to examine the execution of -r reverse. That is, to monitor the seeks and writes of a binary patch in order to debug it. That’s so insightful that I decided to build that as a new -x option (think sh -x). That is, rexxd has a built-in strace on all platforms! The trace is expressed in terms of unix system calls, even on Windows:

$ printf '00:41 \n02:42 \n04:43' | rexxd -x -r - data.bin
open("data.bin", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 19
write(1, "A", 1) = 1
lseek(1, 2, SEEK_SET) = 2
read(0, ..., 4096) = 0
write(1, "B", 1) = 1
lseek(1, 4, SEEK_SET) = 4
write(1, "C", 1) = 1
exit(0) = ?

Is this doing some kind of self-ptrace debugger voodoo? Nope. Like u-config, it has a platform layer, and it simply logs the platform layer calls — except for the trace printout itself of course. While the intention is to debug binary patches, it was also quite insightful in examining rexxd itself. It helped me spot that rexxd flushed more often than strictly necessary.

To port rexxd to any system, define Plt as needed, implement these five plt_ functions, then call xxd. The five functions mostly have the expected unix-like semantics:

typedef struct Plt Plt;
b32  plt_open(Plt *, i32 fd, u8 *path, b32 trunc, Arena *);
i64  plt_seek(Plt *, i32 fd, i64 off, i32 whence);
i32  plt_read(Plt *, u8 *buf, i32 len);
b32  plt_write(Plt *, i32 fd, u8 *buf, i32 len);
void plt_exit(Plt *, i32);
i32  xxd(i32 argc, u8 **argv, Plt *, byte *heap, iz heapsize);

If the platform wants these functions to be “virtual” then it can put function pointers in the Plt struct. Otherwise it stores anything it might need in Plt. Global variables are never necessary. The application layer doesn’t use the standard library except (indirectly) memset and memcpy, and it allocates everything it uses from the provided heap parameter.

plt_open is a little unusual in that it picks the file descriptor: 0 to replace standard input, or 1 to replace standard output. All platforms currently use a virtual file descriptor table, and these do not map onto the real process file descriptors. But they could! Calls are straced in the application layer, so they log virtual file descriptors as seen by rexxd. The arena parameter offers scratch space for the Windows platform layer to convert paths from narrow to wide for CreateFileW, so it can handle long path names with ease.

plt_read doesn’t accept a file descriptor because there’s only one from which to read, 0. plt_write on the other hand allows writing to standard error, 2.

plt_exit doesn’t return, of course. In tests it longjmps back to the top level, as though returning from xxd with a status. This lets me skip allocation null pointer checks, with OOM unwinding safely back to the top level. Since rexxd allocates everything from the arena, it’s all automatically deallocated, so it’s a clean exit.
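
A sketch of how such a test harness might look, assuming a Plt that carries a jmp_buf. The names runxxd, top, and status are hypothetical, and fakexxd stands in for the real entry point. Only the longjmp idea comes from the description above.

```c
#include <setjmp.h>

typedef struct Plt Plt;
struct Plt {
    jmp_buf top;     // where plt_exit unwinds to
    int     status;  // exit status passed to plt_exit
};

// Test-only "exit": jump back to the top level instead of terminating.
void plt_exit(Plt *plt, int status)
{
    plt->status = status;
    longjmp(plt->top, 1);
}

// Stand-in for the application: bails out from inside the call stack,
// as an allocator would on OOM, without any error propagation.
static int fakexxd(Plt *plt, int fail)
{
    if (fail) {
        plt_exit(plt, 3);
    }
    return 0;
}

// Run the application as though plt_exit were an ordinary return.
int runxxd(Plt *plt, int fail)
{
    if (setjmp(plt->top)) {
        return plt->status;  // arrived here via plt_exit
    }
    return fakexxd(plt, fail);
}
```

The caller cannot tell whether the status came from a normal return or from plt_exit, which is what lets the application skip null pointer checks after allocating.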

On Windows, plt_seek calls SetFilePointerEx. I learned the hard way that the behavior of calling it on a non-file is undefined, not an error, so at least one GetFileType call is mandatory. I also learned that Windows will successfully seek all the way to INT64_MAX. If the file system doesn’t support that offset, it’s a write failure later. For correct operation, rexxd must take care not to overflow its own internal file position tracking near these offsets with Windows allowing seeks to operate at the edge until the first flush. Tests run on a virtual file system thanks to the platform layer, and some tests permit huge seeks and simulate impossibly enormous files in order to probe behavior at the extremes.

This is in contrast to Linux, where seeking beyond the underlying file system’s supported file size is a seek error. For example, on ext4 with the default configuration:

$ echo ffffffff000: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040320, SEEK_SET) = 17592186040320
read(0, ..., 4096) = 0
write(1, "\0", 1) = -1
exit(3) = ?

We can see the seek succeeded, then the write failed because it went one byte beyond the file system limit. Seeking one byte further causes the seek itself to fail (22 EINVAL), and rexxd falls back on write until it fills the storage and runs out of space:

$ echo ffffffff001: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040321, SEEK_SET) = -1
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
...

Mostly for fun, I wrote a libc-free platform layer using raw Linux system calls, and it maps almost perfectly onto the kernel interface:

struct Plt { int fds[3]; };

b32 plt_open(Plt *plt, i32 fd, u8 *path, b32 trunc, Arena *)
{
    i32 mode = fd ? O_CREAT|O_WRONLY : 0;
    mode |= trunc ? O_TRUNC : 0;
    plt->fds[fd] = (i32)syscall3(SYS_open, (uz)path, mode, 0666);
    return plt->fds[fd] >= 0;
}

i64 plt_seek(Plt *plt, i32 fd, i64 off, i32 whence)
{
    return syscall3(SYS_lseek, plt->fds[fd], off, whence);
}

i32 plt_read(Plt *plt, u8 *buf, i32 len)
{
    return (i32)syscall3(SYS_read, plt->fds[0], (uz)buf, len);
}

b32 plt_write(Plt *plt, i32 fd, u8 *buf, i32 len)
{
    return len == syscall3(SYS_write, plt->fds[fd], (uz)buf, len);
}

void plt_exit(Plt *, i32 r)
{
    syscall3(SYS_exit, r, 0, 0);
}

On Windows I use the artisanal function prototypes of which I’ve grown so fond. It’s also my first time using w64devkit’s -lmemory in a serious application. I’m using -lchkstk in the “xxd as a DLL” platform layer, too, but that one’s just a toy. In that one I use alloca to allocate an arena, which is a rather novel combination, and the large stack frame requires a stack probe. Otherwise none of rexxd requires stack probes.

w64devkit’s new xxd.exe is delightfully tidy as viewed by peports:

$ du -h xxd.exe
28.0K   xxd.exe
$ peports xxd.exe
KERNEL32.dll
        0       CreateFileW
        0       ExitProcess
        0       GetCommandLineW
        0       GetFileType
        0       GetStdHandle
        0       MultiByteToWideChar
        0       ReadFile
        0       SetFilePointerEx
        0       VirtualAlloc
        0       WideCharToMultiByte
        0       WriteFile
SHELL32.dll
        0       CommandLineToArgvW

Other notes

Buffered output and buffered input are custom tailored for rexxd. When parsing line-oriented input, like -r, it attempts to parse from a view of the input buffer, no copying. The view is the usual string representation:

typedef struct {
    u8 *data;
    iz  len;
} Str;

Does it fail if the line is longer than the buffer? If it straddles reads, does that hurt efficiency? The answer to both is “no” due to the spillover arena. Input is the buffered input struct, and here’s the interface to get the next line:

Str nextline(Input *, Arena *);

If the line isn’t entirely contained in the input buffer, the complete line is concatenated in the arena. So it comfortably handles huge lines while no-copy optimizing for typical short, non-straddling lines. With a per-iteration arena, any arena-backed line is automatically freed at the end of the iteration, so it’s all transparent:

    for (;;) {
        Arena scratch = perm;
        Str line = nextline(b, &scratch);
        // ... line may point into an Input or scratch ...
    }

If the line doesn’t fit in the arena, it triggers OOM handling. That is, it calls plt_exit and something platform-appropriate happens without returning. Beats the pants off old getline!

I came up with a maxof macro that evaluates the maximum of any integral type, signed or unsigned. It appears in overflow checks and more, and I really like how it turned out. For example:

    if (pos > maxof(i64) - off) {
        // overflow
    }
    pos += off;

Or:

i32 trunc32(iz n)
{
    return n>maxof(i32) ? maxof(i32) : (i32)n;
}
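
The macro itself isn’t shown above. One way to write such a macro, offered as a sketch and not necessarily the definition rexxd uses, is to test signedness by casting -1, then build the signed maximum from the second-highest bit so that nothing overflows along the way. The addoff helper is a hypothetical wrapper around the overflow check from the first example.

```c
#include <stdint.h>

// Maximum value of integer type t. For signed types, (t)-1 < 1 holds and
// the maximum is assembled from the second-highest bit to avoid overflow.
// For unsigned types, (t)-1 already wraps around to the maximum.
// Assumes 8-bit bytes and no padding bits.
#define maxof(t) \
    ((t)-1 < 1 ? (((t)1 << (sizeof(t)*8 - 2)) - 1)*2 + 1 : (t)-1)

// Saturating add of a non-negative offset to a file position.
int64_t addoff(int64_t pos, int64_t off)
{
    if (pos > maxof(int64_t) - off) {
        return maxof(int64_t);  // would overflow: clamp instead
    }
    return pos + off;
}
```

Since every operand is a constant, the whole expression folds at compile time, so it costs nothing at run time.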

Now that I have -lmemory and generally solved string function issues for myself, I leaned into __builtin_memset and __builtin_memcpy for this project. Despite restrict, it’s surprisingly difficult to get compilers to optimize loops into semantically equivalent string function calls. An explicit built-in solves that. It also produces faster debug builds, which is what I run while I work. At -O0, rexxd is about half the speed of a release build.

Other than -x, I don’t plan on inventing new features. I’d like to maintain compatibility with the xxd found everywhere else, and I don’t expect adoption beyond w64devkit. Overall the project took about twice as long as I anticipated — two weekends instead of one — but it turned out better than I expected and I’m very pleased with the results.

Tips for more effective fuzz testing with AFL++

2025-02-05T18:03:55Z

Fuzz testing is incredibly effective for mechanically discovering software defects, yet remains underused and neglected. Pick any program that must gracefully accept complex input, written in any language, which has not yet been fuzzed, and fuzz testing usually reveals at least one bug. At least one program currently installed on your own computer certainly qualifies. Perhaps even most of them. Everything is broken and low-hanging fruit is everywhere. After fuzz testing ~1,000 projects over the past six years, I’ve accumulated tips for picking that fruit. The checklist format has worked well in the past (1, 2), so I’ll use it again. This article discusses AFL++ on source-available C and C++ targets, running on glibc-based Linux distributions, currently the indisputable best fuzzing platform for C and C++.

My tips complement the official, upstream documentation, so consult them, too:

Performance Tips on the AFL++ website
Technical “whitepaper” for afl-fuzz

Even if a program has been fuzz tested, applying the techniques in this article may reveal defects missed by previous fuzz testing.

(1) Configure sanitizers and assertions

More assertions means more effective fuzzing, and sanitizers are a kind of automatically-inserted assertions. By default, fuzz with both Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan):

$ afl-gcc-fast -g3 -fsanitize=address,undefined ...

ASan’s default configuration is not ideal, and should be adjusted via the ASAN_OPTIONS environment variable. If customized at all, AFL++ requires at least these options:

export ASAN_OPTIONS="abort_on_error=1:halt_on_error=1:symbolize=0"

Except symbolize=0, this ought to be the ASan default. When debugging a discovered crash, you’ll want UBSan set up the same way so that it behaves in a debugger. To improve fuzzing, make ASan even more sensitive to defects by detecting use-after-return bugs. It slows fuzzing slightly, but it’s well worth the cost:

ASAN_OPTIONS+=":detect_stack_use_after_return=1"

By default ASan fills the first 4KiB of fresh allocations with a pattern, to help detect use-after-free bugs. That’s not nearly enough for fuzzing. Crank it up to completely fill virtually all allocations with a pattern:

ASAN_OPTIONS+=":max_malloc_fill_size=$((1<<30))"

In the default configuration, if a program allocates more than 4KiB with malloc then, say, uses strlen on the uninitialized memory, no bug will be detected. There’s almost certainly a zero somewhere after 4KiB. Until I noticed it, the 4KiB limit hid a number of bugs from my fuzz testing. Per (4), fully filling allocations with a pattern better isolates tests when using persistent mode.
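If exporting environment variables for every run is inconvenient, ASan also consults a `__asan_default_options` hook when the target defines one. The following is a minimal sketch that bakes the settings above into the fuzz target. It assumes nothing beyond the options already discussed; anything set in ASAN_OPTIONS at run time still overrides these defaults.

```c
// Default ASan options compiled into the fuzz target. ASan calls this
// hook during start-up if the program defines it. Options given in the
// ASAN_OPTIONS environment variable take precedence over these.
const char *__asan_default_options(void)
{
    return "abort_on_error=1:halt_on_error=1:symbolize=0"
           ":detect_stack_use_after_return=1"
           ":max_malloc_fill_size=1073741824";  // 1<<30
}
```

Drop the function anywhere in the instrumented translation unit. Without ASan it is just an unused function, so it is harmless in other builds.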

When fuzzing C++ and linking GCC’s libstdc++, consider -D_GLIBCXX_DEBUG. ASan cannot “see” out-of-bounds accesses within a container’s capacity, and the extra assertions fill in the gaps. Mind that it changes the ABI, though fuzz testing will instantly highlight such mismatches.

(2) Prefer the persistent mode

While AFL++ can fuzz many programs in-place without writing a single line of code (afl-gcc, afl-clang), prefer AFL++’s persistent mode (afl-gcc-fast, afl-clang-fast). It’s typically an order of magnitude faster and worth the effort. Though it also has pitfalls (see (4), (5)). I keep a file on hand, fuzztmpl.c — the progenitor of all my fuzz testers:

#include <unistd.h>

__AFL_FUZZ_INIT();

int main(void)
{
    __AFL_INIT();
    char *src = 0;
    unsigned char *buf = __AFL_FUZZ_TESTCASE_BUF;
    while (__AFL_LOOP(10000)) {
        int len = __AFL_FUZZ_TESTCASE_LEN;
        src = realloc(src, len);
        memcpy(src, buf, len);
        // ... send src to target ...
    }
}

I :r this into my Vim buffer, then modify as needed. It’s a stripped and improved version of the official template, which itself has a serious flaw (see (5)). There are unstated constraints about the position of buf and len in the code, so if in doubt, refer to the original template.

(3) Include source files, not header files

We’re well into the 21st century. Nobody is compiling software on 16-bit machines anymore. Don’t get hung up on the one translation unit (TU) per source file mindset. When fuzz testing, we need at most two TUs: One TU for instrumented code and one TU for uninstrumented code. In most cases the latter takes the form of a library (libc, libstdc++, etc.) and we don’t need to think about it.

Fuzz testing typically requires only a subset of the program. Including just those sources straight in the template is both effective and simple. In my template I put includes just above unistd.h so that the header isn’t visible to the sources unless they include it themselves.

#include "src/utils.c"
#include "src/parser.c"
#include <unistd.h>

I know, if you’ve never seen this before it looks bonkers. This isn’t what they taught you in college. Trust me, this simple technique will save you a thousand lines of build configuration. Otherwise you’ll need to manage different object files between fuzz testing and otherwise.

Perhaps more importantly, you can now fuzz test any arbitrary function in the program, including static functions! They’re all right there in the same TU. You’re not limited to public-facing interfaces. Perhaps you can skip (7) and test against a better internal interface. It also gives you direct access to static variables so that you can clear/reset them between tests, per (4).

Programs are often not designed for fuzz testing, or testing generally, and it may be difficult to tease apart tightly-coupled components. Many of the programs I’ve fuzz tested look like this. This technique lets you take a hacksaw to the program and substitute troublesome symbols just for fuzz testing without modifying a single original source line. For example, if the source I’m testing contains a main function, I can remove it:

#define main oldmain
#  include "src/utils.c"
#  include "src/parser.c"
#undef main
#include <unistd.h>

Sure, better to improve the program so that such hacks are unnecessary, but in most cases I’m fuzz testing as part of a drive-by review of some open source project. It allows me to quickly discover defects in the original, unmodified program, and produces simpler bug reports like, “Compile with ASan, open this 50-byte file, and then the program will crash.”

(4) Isolate fuzz tests from each other

Tests should be unaffected by previous tests. This is challenging in persistent mode, sometimes even impractical. That means resetting all global state, even something like the internal strtok buffer if that function is used. Add fuzz testing to your list of reasons to eschew global variables.

It’s mitigated by (1), but otherwise uninitialized heap memory may hold contents from previous tests, breaking isolation. Besides interference with fuzzing instrumentation, bugs found this way are wickedly difficult to reproduce.

Don’t pass uninitialized memory into a test, e.g. an output parameter allocated on the stack. Zero-initialize or fill it with a pattern. If it accepts an arena, fill it with a pattern before each test.

Typically you have little control over heap addresses, which likely vary across tests and depend on the behavior of previous tests. If the program depends on address values, this may affect the results and make reproduction difficult, so watch for that.

(5) Do not test directly on the fuzz test buffer

Passing buf and len straight into the target is the most common mistake, especially when fuzzing better-designed C programs, and particularly because the official template encourages it.

    myprogram(buf, len);  // BAD!

While it’s a great sign the program doesn’t depend on null termination, it creates a subtle trap. The underlying buffer allocated by AFL++ is larger than len, and ASan will not detect read overflows on inputs! Instead pass a copy sized to fit, which is the purpose of src in my template. Adjust the type of src as needed.

If the program expects null-terminated input then you’ll need to do this anyway in order to append the null byte. If it accepts an “owning” type like std::string, then it’s also already done on your behalf. With “non-owning” views like std::string_view you’ll still want to make your own size-fit copy.

If you see a program’s checked-in fuzz test using buf directly, make this change and see if anything new pops out. It’s worked for me on a number of occasions.
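As a sketch of the size-fit copy, here are two small helpers. The names fitcopy and fitcopyz are hypothetical, not part of AFL++ or my template. Both allocate exactly as many bytes as the target is entitled to read, so ASan's redzone begins right after the input.

```c
#include <stdlib.h>
#include <string.h>

// Copy the fuzz input into an allocation of exactly len bytes, so a read
// even one byte past the input lands in an ASan redzone.
unsigned char *fitcopy(const unsigned char *buf, int len)
{
    unsigned char *src = malloc(len);
    if (src && len) memcpy(src, buf, len);
    return src;
}

// Variant for targets expecting a null-terminated string: exactly one
// extra byte for the terminator, and nothing more.
char *fitcopyz(const unsigned char *buf, int len)
{
    char *src = malloc((size_t)len + 1);
    if (!src) return 0;
    if (len) memcpy(src, buf, len);
    src[len] = 0;
    return src;
}
```

The realloc in the template accomplishes the same thing while reusing one allocation. These helpers make a fresh allocation per test, which per (6) need not be freed.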

(6) Don’t bother freeing memory

In general, avoid doing work irrelevant to the fuzz test. The official tips say to “use a simpler target” and “instrument just what you need,” and keeping destructors out of the tests helps in both cases. Unless the program is especially memory-hungry, you won’t run out of memory before AFL++ resets the target process.

If not for (1), it also helps with isolation (4), as different tests are less likely contaminated with uninitialized memory from previous tests.

As an exception, if you want your destructor included in the fuzz test, then use it in the test. Also, it’s easy to exhaust non-memory resources, particularly file descriptors, and you may need to clean those up in order to fuzz test reliably.

Of course, if the target uses arena allocation then none of this matters! It also makes for perfect isolation, as even addresses won’t vary between tests.

(7) Use a memory file descriptor to back named paths

Many interfaces are, shall we say, not so well-designed and only accept input from a named file system path, insisting on opening and reading the file themselves. Testing such interfaces presents challenges, especially if you’re interested in parallel fuzzing. Fortunately there’s usually an easy out: Create a memory file descriptor and use its /proc name.

int fd = memfd_create("fuzz", 0);
assert(fd == 3);
while (...) {
    // ...
    ftruncate(fd, 0);
    pwrite(fd, buf, len, 0);
    myprogram("/proc/self/fd/3");
}

With standard input as 0, output as 1, and error as 2, I’ve assumed the memory file descriptor will land on 3, which makes the test code a little simpler. If it’s not 3 then something’s probably gone wrong anyway, and aborting is the best option. If you don’t want to assume, use snprintf or whatever to construct the path name from fd.

Using pwrite (instead of write) leaves the file description offset at the beginning of the file.

Thanks to the memory file descriptor, fuzz test data doesn’t land in permanent storage, so less wear and tear on your SSD from the occasional flush. Because of /proc, the file is unique to the process despite the common path name, so no problems parallel fuzzing. No cleanup needed, either.
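If you would rather not assume descriptor 3, the snprintf approach might look like the following sketch. The helper names fdpath and fdload are hypothetical, and fd is assumed to come from memfd_create as in the snippet above.

```c
#include <stdio.h>
#include <unistd.h>

// Build the /proc name for any descriptor rather than assuming 3.
// Returns 0 on success, -1 if the name does not fit in out.
int fdpath(char *out, size_t cap, int fd)
{
    int n = snprintf(out, cap, "/proc/self/fd/%d", fd);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

// Replace the file contents with one test input. pwrite does not move
// the file offset, so a target reading through the descriptor itself
// still starts at the beginning.
int fdload(int fd, const void *buf, size_t len)
{
    if (ftruncate(fd, 0)) return -1;
    return pwrite(fd, buf, len, 0) == (ssize_t)len ? 0 : -1;
}
```

Call fdpath once before the loop, then fdload at the top of each iteration before handing the path to the target.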

If the program wants a file descriptor — i.e. it wants a socket because you’re fuzzing some internal function — pass the file descriptor directly:

    myprogram(fd);

If it accepts a FILE *, you could fopen the /proc path, but better to use fmemopen to create a FILE * on the object:

    myprogram(fmemopen(buf, len, "rb"));

Note how, per (6), we don’t need to bother with fclose because it’s not associated with a file descriptor.

(8) Configure the target for smaller buffers

A common sight in diseased programs is “generous” fixed buffer sizes:

#define MY_MAX_BUFFER_LENGTH 65536

void example(...)
{
    char path[PATH_MAX];  // typically 4,096
    char buf[MY_MAX_BUFFER_LENGTH];
    // ...
}

These huge buffers tend to hide bugs. Turn those stones over! It takes a lot of fuzzing time to max them out and excite the unhappy paths — or the super-unhappy paths, overflows. Better if the fuzz test can reach worst case conditions quickly and explore the execution paths out of it.

So when you see these, cut them way down, possibly using (3). Change 65536 to, say, 16 and see what happens. If fuzzing finds a crash on the short buffer, typically extending the input to crash on the original buffer size is straightforward, e.g. repeat one of the bytes even more than it already repeats.

Conclusion and samples

Hopefully something here will help you catch a defect that would have otherwise gone unnoticed. Even better, perhaps awareness of these fuzzing techniques will prevent the bug in the first place. Thanks to my template, some solid tooling, and the know-how in this article, I can whip up a fuzz test in a couple of minutes. But that ease means I discard it just as casually, and so I don’t take time to capture and catalog most. If you’d like to see some samples, I do have an old, short list. Perhaps after another kiloproject of fuzz testing I’ll pick up more techniques.

Examples of quick hash tables and dynamic arrays in C

2025-01-19T04:10:33Z

This article durably captures my reddit comment showing techniques for std::unordered_map and std::vector equivalents in C programs. The core, important features of these data structures require only a dozen or so lines of code apiece. They compile quickly, and tend to run faster in debug builds than release builds of their C++ equivalents. What they lack in genericity they compensate in simplicity. Nothing here will be new. Everything has been covered in greater detail previously, which I will reference when appropriate.

For a concrete goal, we will build a data structure representing a process environment, along with related functionality to make it more interesting. That is, we’ll build a string-to-string map.

Allocator

The foundation is our allocator, a simple bump allocator, so we’ll start there:

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    assert(count < (a->end - a->beg - pad)/size);  // TODO: OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, count*size);
}

Allocating through the new macro eliminates several classes of common defects in C programs. If we get our types mixed up we get errors, or at least warnings. Our size calculations cannot overflow. We cannot accidentally use uninitialized memory. We cannot leak memory; deallocating is implicit. The main downside is that it doesn’t fit some less common allocator requirements.
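To make "deallocating is implicit" concrete, here is a small sketch. It repeats new, Arena, and alloc from above so that it stands alone, and adds two hypothetical pieces: newarena, which wraps a block of memory, and sumsquares, which takes a scratch arena by copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    assert(count < (a->end - a->beg - pad)/size);  // TODO: OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, count*size);
}

// Hypothetical helper: wrap any block of memory as an arena.
Arena newarena(void *mem, ptrdiff_t cap)
{
    Arena a = {0};
    a.beg = mem;
    a.end = a.beg + cap;
    return a;
}

// Sum the squares of 0..n-1 using temporary storage. The arena is passed
// by copy, so the caller's arena is unchanged when this returns and the
// temporary array is released without any explicit free.
long sumsquares(int n, Arena scratch)
{
    long *sq = new(&scratch, n, long);
    for (int i = 0; i < n; i++) {
        sq[i] = (long)i * i;
    }
    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += sq[i];
    }
    return sum;
}
```

Passing a pointer to the arena instead would make the allocation outlive the call, which is how the functions below return their results.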

Strings

Next, a string representation. Classic null-terminated strings are an error-prone paradigm, so we’ll use counted strings instead:

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

This is equivalent to a std::string_view in C++. The macro allows us to efficiently convert string literals into Str objects. Because our data structures are backed by arenas, we won’t care whether a particular string is backed by a static string, arena, memory map, etc. We’ll also need a function to compare strings for equality:

_Bool equals(Str a, Str b)
{
    if (a.len != b.len) {
        return 0;
    }
    return !a.len || !memcmp(a.data, b.data, a.len);
}

!a.len appears superfluous, but it’s necessary: memcmp arbitrarily forbids null pointers, and we may be passed a zero-initialized Str. Though this is scheduled to be corrected.

We’ll need a string hash function, too:

uint64_t hash64(Str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

This is an FNV-style hash. The “basis” keeps strings of nulls from getting stuck at zero, and the multiplier is my favorite prime number. Character data is fixed to 0–255 rather than allowing the signedness of char to influence the results. As a multiplicative hash, the high bits are mixed better than the low bits, and our maps will take that into account.
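The point about high bits is easy to check. Multiplying by an odd number preserves the lowest bit, so the lowest bit of this hash is nothing more than the XOR of the lowest bits of the input bytes. The sketch below repeats Str and hash64 so that it stands alone, and adds a hypothetical helper, slot, that indexes a table from the high bits instead.

```c
#include <stddef.h>
#include <stdint.h>

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

uint64_t hash64(Str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

// Hypothetical helper: choose a slot in a table of 1<<exp entries using
// the well-mixed high bits of the hash. Requires 0 < exp <= 32.
uint32_t slot(uint64_t hash, int exp)
{
    return (uint32_t)(hash >> (64 - exp));
}
```

For example, "a" and "c" both end in a set bit, so their hashes are both odd, while the hash of "b" is even. A table indexed by the low bits would inherit that weakness directly.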

Flat hash map

We have a couple string-to-string map options. The more restrictive, but more efficient — in terms of memory use and speed — is a Mask-Step-Index (MSI) hash table. I don’t think it fits our problem as well as the next option, particularly because it puts a hard limit on unique keys, but it’s worth evaluating. Let’s call it FlatEnv:

enum { ENVEXP = 10 };  // support up to 1,000 unique keys
typedef struct {
    Str keys[1<<ENVEXP];
    Str vals[1<<ENVEXP];
} FlatEnv;

It’s nothing more than two fixed-length arrays, storing keys and values separately. Keys with null pointers are empty slots, so a zero-initialized FlatEnv is an empty table. They come out of an arena ready-to-use:

    FlatEnv *env = new(a, 1, FlatEnv);  // new, empty environment

Now we leverage equals and hash64 for a double-hashed, open address search on the keys array:

Str *flatlookup(FlatEnv *env, Str key)
{
    uint64_t hash = hash64(key);
    uint32_t mask = (1<<ENVEXP) - 1;
    uint32_t step = (hash>>(64 - ENVEXP)) | 1;
    for (int32_t i = hash;;) {
        i = (i + step) & mask;
        if (!env->keys[i].data) {
            env->keys[i] = key;
            return env->vals + i;
        } else if (equals(env->keys[i], key)) {
            return env->vals + i;
        }
    }
}

By returning a pointer to the unmodified value slot, this function covers both lookup and insertion. So that’s the entire hash table implementation. To insert, the caller assigns the slot. For mere lookup, check the slot for a null pointer.

    FlatEnv *env = new(a, 1, FlatEnv);

    // insert
    *flatlookup(env, S("hello")) = S("world");

    // lookup
    Str val = *flatlookup(env, key);
    if (val.data) {
        printf("%.*s = %.*s\n", (int)key.len, key.data,
                                (int)val.len, val.data);
    }

To iterate over the map entries, iterate over the arrays, skipping null entries. Per the ENVEXP comment, it’s hard-coded to support up to 1,000 unique keys (1,024 slots, leaving some to spare). The table itself doesn’t enforce this limit and will turn into an infinite loop if you insert too many keys. To support scaling, we could design the map to have dynamic table sizes, track the number of unique keys, and resize the table (allocate new arrays) when the load factor crosses a threshold. Resizing sounds messy and complicated, so fortunately there’s another option.
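Iteration might look like the following sketch. It repeats the definitions above so that it stands alone, and adds a hypothetical flatcount. Note one subtlety: a failed lookup still claims a key slot, so iteration should test the value, not the key, to skip those placeholders.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

_Bool equals(Str a, Str b)
{
    if (a.len != b.len) {
        return 0;
    }
    return !a.len || !memcmp(a.data, b.data, a.len);
}

uint64_t hash64(Str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

enum { ENVEXP = 10 };
typedef struct {
    Str keys[1<<ENVEXP];
    Str vals[1<<ENVEXP];
} FlatEnv;

Str *flatlookup(FlatEnv *env, Str key)
{
    uint64_t hash = hash64(key);
    uint32_t mask = (1<<ENVEXP) - 1;
    uint32_t step = (hash>>(64 - ENVEXP)) | 1;
    for (int32_t i = hash;;) {
        i = (i + step) & mask;
        if (!env->keys[i].data) {
            env->keys[i] = key;
            return env->vals + i;
        } else if (equals(env->keys[i], key)) {
            return env->vals + i;
        }
    }
}

// Hypothetical helper: count entries that hold a value, skipping empty
// slots and keys that were only ever looked up.
int flatcount(FlatEnv *env)
{
    int n = 0;
    for (int i = 0; i < 1<<ENVEXP; i++) {
        n += !!env->vals[i].data;
    }
    return n;
}
```

Those placeholder keys still occupy slots, so they count against the 1,000 key budget even though iteration skips them.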

Hierarchical hash map

If the number of keys is unbounded, hash tries work better. Trees scale well, and we can allocate nodes out of the arena as it grows. We’ll use a 4-ary trie, a good default that balances size and performance:

typedef struct Env Env;
struct Env {
    Env *child[4];
    Str  key;
    Str  value;
};

An empty map is just a null pointer, and so, again, these maps come ready-to-use in their zero state:

    Env *env = 0;  // new, empty environment

The implementation is equally brief:

Str *lookup(Env **env, Str key, Arena *a)
{
    for (uint64_t h = hash64(key); *env; h <<= 2) {
        if (equals(key, (*env)->key)) {
            return &(*env)->value;
        }
        env = &(*env)->child[h>>62];
    }
    if (!a) return 0;
    *env = new(a, 1, Env);
    (*env)->key = key;
    return &(*env)->value;
}

Like before, this covers both lookup and insertion, though the mode is determined explicitly by the arena pointer. Without an arena, it’s a lookup, which doesn’t require allocation. With an arena, it creates an entry if necessary and, like before, returns a pointer into the map so that the caller can assign it. Usage differs only slightly:

    Env *env = 0;

    // insert
    *lookup(&env, S("hello"), &scratch) = S("world");

    // lookup
    Str *val = lookup(&env, key, 0);
    if (val) {
        printf("%.*s = %.*s\n", (int)key.len, key.data,
                                (int)val->len, val->data);
    }

We’ll come back around to iteration later.

String concatenation

Next I’d like a function that takes an Env and produces an envp data structure as expected by execve(2). Then we can use this map as the environment in a child process. We’ll need some string manipulation, particularly string concatenation. The core is a copy function:

Str copy(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, s.len, char);
    if (r.len) memcpy(r.data, s.data, r.len);
    return r;
}

Like with memcmp, because it’s memcpy we need to handle the arbitrary special case around null pointers should the input be a zero Str. Now we can easily concatenate strings, in-place if possible:

Str concat(Arena *a, Str head, Str tail)
{
    if (!head.data || head.data+head.len != a->beg) {
        head = copy(a, head);
    }
    head.len += copy(a, tail).len;
    return head;
}

Yet again, !head.data is a special check because pointer arithmetic on null (i.e. adding zero to null) is arbitrarily disallowed. Worrying about this is exhausting, isn’t it? That language fix can’t come soon enough. This one’s already fixed in C++.
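To see the in-place path in action, here is a sketch that repeats the definitions so far and adds a hypothetical keyvalue helper. Once the head sits at the tip of the arena, each further concat only appends the tail, so the head is copied at most once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
#define S(s)            (Str){s, sizeof(s)-1}

typedef struct {
    char *beg;
    char *end;
} Arena;

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    assert(count < (a->end - a->beg - pad)/size);  // TODO: OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, count*size);
}

Str copy(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, s.len, char);
    if (r.len) memcpy(r.data, s.data, r.len);
    return r;
}

Str concat(Arena *a, Str head, Str tail)
{
    if (!head.data || head.data+head.len != a->beg) {
        head = copy(a, head);
    }
    head.len += copy(a, tail).len;
    return head;
}

// Hypothetical helper: build a null-terminated "key=value" string. The
// key is copied to the arena tip once, then extended in place.
char *keyvalue(Arena *a, Str key, Str value)
{
    Str pair = key;
    pair = concat(a, pair, S("="));
    pair = concat(a, pair, value);
    pair = concat(a, pair, S("\0"));
    return pair.data;
}
```

The in-place extension only holds as long as nothing else allocates from the same arena between the concat calls. Otherwise concat quietly falls back to copying the head again.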

That’s enough to get the ball rolling on FlatEnv:

char **flat_to_envp(FlatEnv *env, Arena *a)
{
    int    cap  = 1<<ENVEXP;
    char **envp = new(a, cap, char *);
    int    len  = 0;
    for (int i = 0; i < cap; i++) {
        if (env->vals[i].data) {
            Str pair = env->keys[i];
            pair = concat(a, pair, S("="));
            pair = concat(a, pair, env->vals[i]);
            pair = concat(a, pair, S("\0"));
            envp[len++] = pair.data;
        }
    }
    return envp;
}

Simple, right? Traditional string handling in C is an error-prone pain, but with a better set of primitives it’s a breeze. Plus we’re doing this all with essentially no runtime. In use this might look like:

void shellexec(char *cmd, FlatEnv *env, Arena scratch)
{
    char  *argv[] = {"sh", "-c", cmd, 0};
    char **envp   = flat_to_envp(env, &scratch);
    execve("/bin/sh", argv, envp);
}

By virtue of the scratch arena, the envp object is automatically freed should execve fail. (If that should even matter.) Considering this, if you’re itching to write the fastest shell ever devised, arena allocation and the techniques in this article would probably get you most of the way there. Nobody writes shells this way.

Dynamic arrays

To implement the envp conversion for the hash trie Env, let’s add one more tool to our toolbox: dynamic arrays. Our std::vector equivalent. We’ll start with a familiar slice header:

typedef struct {
    char    **data;
    ptrdiff_t len;
    ptrdiff_t cap;
} EnvpSlice;

The bad news is that we don’t have templates, and so we’ll need to define one such structure for each type of which we want a dynamic array. This one is set up to create an envp array. The good news is that manipulation occurs through generic code, so everything else is reusable.

I want a push macro that creates an empty slot in which to insert a new value, evaluating to a pointer to this slot. Usually that means incrementing len, but when out of room it will need to expand the underlying storage. It’s clearer to start with example usage. Imagine using it with the previous flat_to_envp:

char **flat_to_envp(FlatEnv *env, Arena *a)
{
    EnvpSlice r = {0};
    for (int i = 0; i < 1<<ENVEXP; i++) {
        if (env->vals[i].data) {
            // ... concat as before ...
            *push(a, &r) = pair.data;
        }
    }
    push(a, &r);  // terminal null pointer
    return r.data;
}

Continuing the theme, a zero-initialized slice is a ready-to-use empty slice, and most begin life this way. The immediate dereference on push is just like those calls to lookup. If expansion is needed, the push macro’s job is to pull fields off the slice, pass them into a helper function which agnostically, strict-aliasing-legally, manipulates the slice header:

void *push_(Arena *, void *data, ptrdiff_t *pcap, ptrdiff_t size);

#define push(a, s) \
  ((s)->len == (s)->cap \
    ? (s)->data = push_((a), (s)->data, &(s)->cap, sizeof(*(s)->data)), \
      (s)->data + (s)->len++ \
    : (s)->data + (s)->len++)

The internals of that helper look an awful lot like concat, with the same in-place-if-possible behavior:

enum { SLICE_INITIAL_CAP = 4 };

void *push_(Arena *a, void *data, ptrdiff_t *pcap, ptrdiff_t size)
{
    ptrdiff_t cap   = *pcap;
    ptrdiff_t align = _Alignof(void *);

    if (!data || a->beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        if (data) memcpy(copy, data, cap*size);
        data = copy;
    }

    ptrdiff_t extend = cap ? cap : SLICE_INITIAL_CAP;
    alloc(a, extend, size, 1);  // already aligned
    *pcap = cap + extend;
    return data;
}

(Update: Aleh pointed out an inefficiency in the original code: applying alignment in the second alloc may introduce unnecessary fragmentation. This has been corrected above.)

For unfathomable reasons, standard C does not permit _Alignof on expressions, so slice data is simply pointer-aligned. (The more shrewd might consider max_align_t.) Like concatenation, we copy the object to the beginning of the arena if necessary, and extend the allocation by allocating the usual way, being careful not to increment the capacity until after it succeeds.

Update: NRK points out we can use __typeof__ (extension) or typeof (C23), to work around this syntactical limitation of _Alignof. Convert the align local variable into a parameter:

void *push_(..., ptrdiff_t align);

Then in the macro pass it via _Alignof(__typeof__(…)):

#define push(a, s) \
  ((s)->len == (s)->cap \
    ? (s)->data = push_((a), (s)->data, &(s)->cap, \
          sizeof(*(s)->data), _Alignof(__typeof__(*(s)->data))), \
      (s)->data + (s)->len++ \
    : (s)->data + (s)->len++)

Spelled as an extension, it already works with all major C compilers from the past decade, and without requiring special compiler flags.

We can now use push on any structure with data, len, and cap fields of the appropriate types.
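As a sketch of that reuse, here is push applied to a slice of int. It repeats Arena, alloc, push_, and the first form of the push macro so that it stands alone. IntSlice and evens are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    assert(count < (a->end - a->beg - pad)/size);  // TODO: OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, count*size);
}

enum { SLICE_INITIAL_CAP = 4 };

void *push_(Arena *a, void *data, ptrdiff_t *pcap, ptrdiff_t size)
{
    ptrdiff_t cap   = *pcap;
    ptrdiff_t align = _Alignof(void *);

    if (!data || a->beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        if (data) memcpy(copy, data, cap*size);
        data = copy;
    }

    ptrdiff_t extend = cap ? cap : SLICE_INITIAL_CAP;
    alloc(a, extend, size, 1);  // already aligned
    *pcap = cap + extend;
    return data;
}

#define push(a, s) \
  ((s)->len == (s)->cap \
    ? (s)->data = push_((a), (s)->data, &(s)->cap, sizeof(*(s)->data)), \
      (s)->data + (s)->len++ \
    : (s)->data + (s)->len++)

// A second slice type: same three field names, different element type.
typedef struct {
    int      *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} IntSlice;

// Hypothetical example: collect the even numbers below limit.
IntSlice evens(Arena *a, int limit)
{
    IntSlice r = {0};
    for (int i = 0; i < limit; i += 2) {
        *push(a, &r) = i;
    }
    return r;
}
```

Since push is a macro that evaluates its slice argument several times, keep that argument free of side effects.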

Putting it all together

With that in place, we can define a simple, recursive version of the envp builder for Env:

#define countof(a)  ((ptrdiff_t)(sizeof(a) / sizeof(*(a))))

EnvpSlice env_to_envp_(EnvpSlice r, Env *env, Arena *a)
{
    if (env) {
        Str pair = env->key;
        pair = concat(a, pair, S("="));
        pair = concat(a, pair, env->value);
        pair = concat(a, pair, S("\0"));
        *push(a, &r) = pair.data;
        for (int i = 0; i < countof(env->child); i++) {
            r = env_to_envp_(r, env->child[i], a);
        }
    }
    return r;
}

char **env_to_envp(Env *env, Arena *a)
{
    EnvpSlice r = {0};
    r = env_to_envp_(r, env, a);
    push(a, &r);  // null pointer terminator
    return r.data;
}

As is often the case, the recursive part doesn’t fit the final interface, so the core is a helper, and the caller-facing part is an adapter. I’m not entirely comfortable with this function, though. When working with huge environments (over ~100k entries), the recursive implementation will non-deterministically blow the stack if the trie winds up lopsided. Or deterministically for chosen pathological inputs, because the hash function isn’t seeded.

Instead we could use a stack data structure backed by the arena to traverse the trie. If passed a secondary scratch arena, we’d use that arena for this stack, but I’m sticking to the original interface. Here’s what that looks like, with an extra trick thrown in just to show off:

char **env_to_envp_safe(Env *env, Arena *a)
{
    EnvpSlice r = {0};

    typedef struct {
        Env *env;
        int  index;
    } Frame;
    Frame init[16];  // small size optimization

    struct {
        Frame    *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } stack = {init, 0, countof(init)};

    *push(a, &stack) = (Frame){env, 0};
    while (stack.len) {
        Frame *top = stack.data + stack.len - 1;

        if (!top->env) {
            stack.len--;

        } else if (top->index == countof(top->env->child)) {
            Str pair = top->env->key;
            pair = concat(a, pair, S("="));
            pair = concat(a, pair, top->env->value);
            pair = concat(a, pair, S("\0"));
            *push(a, &r) = pair.data;
            stack.len--;

        } else {
            int i = top->index++;
            *push(a, &stack) = (Frame){top->env->child[i], 0};
        }
    }

    push(a, &r);
    return r.data;
}

The init array is a form of small-size optimization. It’s used at first, and sufficient for nearly all inputs. So no stack litter in the arena. If it’s not enough, then push will automatically move the stack into the arena. I think that’s a super duper neato trick!

As an alternative, and as discussed in the original hash trie article, we could instead add a next field to Env as an intrusive linked list that chains the nodes together in insertion order. Or another way to look at it, Env is a linked list with an intrusive hash trie for O(log n) searches on the list. That’s a lot simpler, has other useful properties, and only costs one extra pointer per entry. And we wouldn’t need slices, which was my motivation for choosing the non-linked-list approach above.

Hash hardening (bonus)

Okay, I lied, this is something new. Think of it as your special treat for sticking with me so far.

Hash map non-determinism comes with a classic security vulnerability: If populated with untrusted keys, an attacker could choose colliding keys and produce worst case behavior in the hash map. That is, MSI hash tables reduce to linear scans, and hash tries reduce to linked lists. Worse, the recursive envp function blows the stack, though we already solved that issue.

If we want to foil such attacks, we can seed the hash so that an attacker cannot devise collisions. They’d need to discover the seed. We might even call that seed a “key,” but this is a non-cryptographic hash so I’m going to avoid that term. The usual implementation of this concept involves generating a seed, sometimes per table, and storing it somewhere. However, we can leverage an existing security mechanism, gaining this feature at basically no cost: Address Space Layout Randomization (ASLR). First, let’s augment the string hash function:

uint64_t hash64(Str s, uint64_t seed)
{
    uint64_t h = seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

In flatlookup we can use the address of the FlatEnv as our seed:

Str *flatlookup(FlatEnv *env, Str key)
{
    uint64_t hash = hash64(key, (uintptr_t)env);
    // ...
}

Recall it’s allocated out of our arena (via new), and ASLR gives our arena a random offset. On top of that, a FlatEnv seed depends precisely on the amount of memory allocated earlier. An environment variable name or value being slightly longer or shorter will reshuffle the whole table if allocated in the arena before the FlatEnv.
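A small sketch of the effect. It repeats Str and the seeded hash64, and adds a hypothetical firstprobe that mirrors the first iteration of flatlookup for a table at a given address. Each step of the hash is invertible (XOR, then multiplication by an odd constant), so for a fixed key, two different seeds always yield two different hashes.

```c
#include <stddef.h>
#include <stdint.h>

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

uint64_t hash64(Str s, uint64_t seed)
{
    uint64_t h = seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

// Hypothetical helper: the first slot probed for a key in a table of
// 1<<exp entries located at the given address. Requires 0 < exp < 32.
uint32_t firstprobe(const void *table, Str key, int exp)
{
    uint64_t hash = hash64(key, (uintptr_t)table);
    uint32_t mask = ((uint32_t)1<<exp) - 1;
    uint32_t step = (hash>>(64 - exp)) | 1;
    return ((uint32_t)hash + step) & mask;
}
```

Whether two particular keys still land on the same slot depends on the seed, which is the part an attacker can no longer predict without learning the table address.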

It’s slightly trickier with hash tries. The root pointer isn’t required to be fixed. For example:

    Env *env = 0;
    // ... insert keys ...
    Env *myenv = env;
    // ... lookup keys in myenv ...

We could disallow this, but it would be easy to forget (e.g. while you’re refactoring and not thinking about it) and difficult to detect. Difficult-to-detect bugs keep me awake at night. Instead we can use the root node to seed the trie:

Str *lookup(Env **env, Str key, Arena *a)
{
    uint64_t seed = env ? (uintptr_t)*env : 0;
    for (uint64_t h = hash64(key, seed); *env; h <<= 2) {
    // ...
}

At first this seems like it couldn’t work, like a chicken-and-egg problem. There’s no root node at first, so we can’t know the seed yet. Though think about it a little longer and it should be obvious: The hash is unused when inserting the very first element. It simply becomes the root of the trie. The seed is irrelevant until the second insert, at which point we’ve established a seed. This delay establishing the seed means hash tries are even more randomized.

With the proper tools and representations, working in C isn’t difficult even if you need containers and string manipulation. Aside from memcmp and memcpy — each easily replaceable — we did all this without runtime assistance, not even its allocator. What a pleasant way to work!

Source from this article in runnable form, which I used to test my samples: example.c

Rules to avoid common extended inline assembly mistakes

2024-12-20T19:46:48Z

GCC and Clang inline assembly is an interface between high and low level programming languages. It is subtle and treacherous. Many are ensnared in its traps, usually unknowingly. As such, the asm keyword is essentially the unsafe keyword of C and C++. Nearly every inline assembly tutorial, including the awful ibiblio page at the top of search engines for decades, propagates fundamental, serious mistakes, and most examples are incorrect. The dangerous part is that the examples usually produce the expected results! The situation is dire. This article isn’t a tutorial, but basic rules to avoid the most common mistakes, or to spot them in code review.

The focus is entirely extended assembly, and not basic assembly, which has different rules. The former is any inline assembly statement with constraints or clobbers. That is, there’s a colon : token between the asm parentheses. Basic assembly is blunt and has fewer uses, mostly at the top level or in “naked” functions, making misuse less likely.

(1) Avoid inline assembly if possible

Because it’s so treacherous, the first rule is to avoid it if at all possible. Modern compilers are loaded with intrinsics and built-ins that replace nearly all the old inline assembly use cases. They allow access to low level features from the high level language. No need to bridge the gap between low and high yourself when there’s an intrinsic.

Compilers do not have built-ins for system calls, and occasionally lack a useful intrinsic. Other times you might be building foundational infrastructure. These remaining cases are mostly about interacting with external interfaces, not optimization nor performance.

(2) It should nearly always be volatile

Falling right out of rule (1), the remaining inline assembly cases nearly always have side effects beyond output constraints. That includes memory accesses, and it certainly includes system calls. Because of this, inline assembly should usually have the volatile qualifier.

asm volatile ( ... );

This prevents compilers from eliding or re-ordering the assembly. As a special rule, inline assembly lacking output constraints is implicitly volatile. Despite this, please use volatile anyway! When I do not see volatile it’s likely a defect. Stopping to consider if it’s this special case slows understanding and impedes code review.

Tutorials often use __volatile__. Do not do this. It is an ancient alias keyword to support pre-standard compilers lacking the volatile keyword. This is not your situation. When I see __volatile__ it likely means you copy-pasted the inline assembly from somewhere without understanding it. It’s a red flag that draws my attention for even more careful review.

Side note: __asm or __asm__ is fine, and even required in some cases (e.g. -std=cXX). I usually write it asm.

(3) It probably needs a memory clobber

The "memory" clobber is orthogonal to volatile, each serving different purposes. It’s less often needed than volatile, but typical remaining inline assembly cases require it. If memory is accessed in any way while executing the assembly, you need a memory clobber. This includes most system calls, and definitely a generic syscall wrapper.

    asm volatile (... : "memory");

In code review, if you do not see a "memory" clobber, give it extra scrutiny. It’s probably missing. If it’s truly unnecessary, I suggest documenting such in a comment so that reviewers know the omission is considered and intentional.

The clobber prevents compilers from re-ordering loads and stores around the assembly. It would be disastrous, for example, if a write(2) system call occurred before the program populated the output buffer! In this case, volatile would prevent a followup write(2) from being optimized out while "memory" forces memory stores to occur before the system call.
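To make rules (2) and (3) concrete, here is a minimal sketch of a system call wrapper, assuming x86-64 Linux and its documented syscall register convention. The names syscall3 and sys_write are hypothetical, and I spell the keyword __asm__ so that it also builds with -std=c11. Other targets fall back to libc so the sketch still compiles.

```c
#include <stddef.h>

#if defined(__linux__) && defined(__x86_64__)
// Hypothetical three-argument system call wrapper for x86-64 Linux.
static long syscall3(long n, long a, long b, long c)
{
    long r;
    __asm__ volatile (             // volatile: the call itself is a side effect
        "syscall"
        : "=a"(r)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"   // syscall overwrites rcx and r11, and
                                   // the kernel may read or write memory
    );
    return r;
}

static long sys_write(int fd, const void *buf, size_t len)
{
    return syscall3(1, fd, (long)buf, (long)len);  // 1 is SYS_write on x86-64
}
#else
#include <unistd.h>
// Fallback for other targets: defer to libc instead of inline assembly.
static long sys_write(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len);
}
#endif
```

Without "memory" the compiler would be free to delay the stores that fill buf until after the syscall instruction, which is exactly the disaster described above.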

(4) Never modify input constraints

It’s easy not to modify inputs, so this is mostly about ignorance, but this rule is broken with shocking frequency. Most of the time you can get away with it, right up until certain configurations have a heisenbug. In most cases this can be fixed by changing an input into a read-write output constraint with "+":

asm volatile ("..." :: "r"(x) : ...);  // before
asm volatile ("..." : "+r"(x) : ...);  // after

If you hadn’t been using volatile (in violation of rule 2) then now suddenly you’d need it because there’s an output constraint. This happens often.
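A runnable sketch of the “after” form, assuming x86-64 for the assembly path and plain C elsewhere. The function name increment is hypothetical. The instruction both reads and writes its operand, so the operand is declared once, as a read-write output.

```c
#include <stdint.h>

// Increment via inline assembly. Because incq modifies its operand, x is a
// "+r" (read-write) operand rather than an input.
static uint64_t increment(uint64_t x)
{
#if defined(__x86_64__)
    __asm__ volatile (
        "incq %0"
        : "+r"(x)
        :
        : "cc"  // flags change; no "memory" clobber since only a register
                // is touched (documented per rule 3)
    );
#else
    x += 1;     // fallback for other targets so the sketch still builds
#endif
    return x;
}
```

Had x been listed as a plain "r" input, the compiler would assume the register still holds the old value afterwards and could reuse it.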

(5) Never call functions from inline assembly

Many things can go wrong because the semantics cannot be expressed using inline assembly constraints. The stack may not be aligned, and you’ll clobber the redzone. (Yes, there’s a "redzone" constraint, but it’s insufficient to actually make a function call.) Do not do it. Tutorials like to show it because it makes for a simple demonstration, but all those examples are littered with defects.

System calls are fine. Basic assembly may call functions when used outside of non-naked functions. The goto qualifier, used correctly, allows jumps to be safely expressed to the compiler. Just don’t use call in extended assembly.

(6) Do not define absolute assembly labels

That is, if you need to jump within your assembly block, such as for a loop, do not write a named label:

myloop:
    ...
    jz myloop

Your inline assembly is part of a function, and that function may be cloned or inlined, in which case there will be multiple copies of your assembly block in the translation unit. The assembler will see duplicate label names and reject the program. Until that function is inlined, perhaps at a high optimization level, this will likely work as expected. On the plus side it’s a loud compile time error when it doesn’t work.

In inline assembly you can have the compiler generate a unique label with %=, but my preferred solution is the local labels feature of the assembler:

0:
    ...
    jz 0b

In this case the assembler generates unique labels, and the number 0 isn’t the literal label name. 0b (“backward”) refers to the previous 0 label, and 0f (“forward”) would refer to the next 0 label. Perfectly unambiguous.
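Here is a small sketch of local labels in a complete function, assuming x86-64 with the default AT&T syntax and falling back to plain C elsewhere. The function sum_to is hypothetical. Because the labels are numeric, the function can be inlined at several call sites without duplicate-label errors.

```c
// Sum 1 + 2 + ... + n with a loop written in inline assembly.
static inline unsigned long sum_to(unsigned long n)
{
    unsigned long total = 0;
#if defined(__x86_64__)
    __asm__ volatile (
        "    test %1, %1\n"
        "    jz   2f\n"       // n == 0: jump forward to the next "2" label
        "1:\n"
        "    add  %1, %0\n"   // total += n
        "    dec  %1\n"       // n -= 1
        "    jnz  1b\n"       // jump backward to the previous "1" label
        "2:\n"
        : "+r"(total), "+r"(n)  // both are modified, so both are "+r"
        :
        : "cc"                  // registers only, so no "memory" clobber
    );
#else
    for (; n; n--) {            // fallback so the sketch still builds
        total += n;
    }
#endif
    return total;
}
```

Calling sum_to from several places in one translation unit at a high optimization level is a quick way to confirm the labels never collide.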

Naturally occurring practice problems

Now that you’ve made it this far, here’s an exercise for practice: Search online for “inline assembly tutorial” and count the defects you find by applying my 6 rules. You’ll likely find at least one per result that isn’t official compiler documentation. Besides tutorials and reviewing real programs, you could ask an LLM to generate inline assembly, as they’ve been trained to produce these common defects.

Windows dynamic linking depends on the active code page

2024-10-07T19:50:17Z

Windows paths have been WTF-16-encoded for decades, but module names in the import tables of Portable Executable are octets. If a name contains values beyond ASCII — technically out of spec — then the dynamic linker must somehow decode those octets into Unicode in order to construct a lookup path. There are multiple ways this could be done, and the most obvious is the process’s active code page (ACP), which is exactly what happens. As a consequence, the specific DLL loaded by the linker may depend on the system code page. In this article I’ll contrive such a situation.

LoadLibraryA is a similar situation, and potentially applies the code page to a longer portion of the module path. LoadLibraryW is unaffected, at least for the directly-named module, because it’s Unicode all the way through.

For my contrived demonstration I came up with two names that to English-reading eyes appear as two words with extraneous markings:

Ãµral.dll: CP-1252="C3 B5 …"
õral.dll: CP-1252="F5 …"; UTF-8="C3 B5 …"

Both end with ral.dll. I’ve included the CP-1252 encoding for the differing prefixes, and the UTF-8 encoding for the second. I’m using CP-1252 because it’s the most common system code page in the world, especially the Western hemisphere. Due to case insensitivity, the actual DLL may be named ãµral.dll — i.e. to match the second library case — but the module name must be encoded as uppercase when building the import library. Alternatively the second could be Õral.dll, particularly because I won’t use it when constructing an import library.

The plan is to store the octets C3 B5 … in the import table. A process using CP-1252 decodes it to Ãµral.dll. In the UTF-8 code page it decodes to õral.dll. For testing we can use an application manifest to control the code page for a particular PE image — a lot easier than changing the system code page. Otherwise, this trick could dynamically change the behavior of a program in response to the system code page without actually inspecting the active code page.
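To see why the same octets yield two different names, here is an illustrative sketch of both decodings. These are my own simplified decoders, not what the Windows loader runs: the CP-1252 one only covers ASCII and 0xA0..0xFF (where CP-1252 matches Latin-1), and the UTF-8 one only covers one- and two-byte sequences without rejecting overlong forms.

```c
#include <stddef.h>
#include <stdint.h>

// Decode CP-1252 into code points. Only ASCII and 0xA0..0xFF are handled,
// where each byte maps to the code point of the same value. Bytes in
// 0x80..0x9F are rejected with -1 to keep the sketch short.
static ptrdiff_t decode_cp1252(const uint8_t *s, ptrdiff_t len, uint32_t *out)
{
    for (ptrdiff_t i = 0; i < len; i++) {
        if (s[i] >= 0x80 && s[i] < 0xa0) {
            return -1;
        }
        out[i] = s[i];
    }
    return len;
}

// Decode UTF-8 into code points. Only one- and two-byte sequences are
// handled; anything else returns -1.
static ptrdiff_t decode_utf8(const uint8_t *s, ptrdiff_t len, uint32_t *out)
{
    ptrdiff_t n = 0;
    for (ptrdiff_t i = 0; i < len; n++) {
        if (s[i] < 0x80) {
            out[n] = s[i];
            i += 1;
        } else if ((s[i] & 0xe0) == 0xc0 && i+1 < len &&
                   (s[i+1] & 0xc0) == 0x80) {
            out[n] = (uint32_t)(s[i] & 0x1f) << 6 | (s[i+1] & 0x3f);
            i += 2;
        } else {
            return -1;
        }
    }
    return n;
}
```

Fed the nine octets C3 B5 72 61 6C 2E 64 6C 6C, the first decoder produces nine code points beginning U+00C3 U+00B5, while the second produces eight beginning U+00F5.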

The libraries will have a single function get, which returns a string indicating which library was loaded:

#define X(s) #s
#define S(s) X(s)
__declspec(dllexport) char *get(void) { return S(V); }

Constructing the import library can be tricky because you must consider how the toolchain, editors, and shells decode and encode text, which may involve the build system’s code page. It’s shockingly difficult to script! Binutils dlltool cannot process these names and cannot be used at all. With bleeding edge w64devkit I could reliably construct the DLLs and import library like so, even in a script (Windows 10 and later only):

$ gcc -shared -DV=UTF-8 -o Õral.dll  detect.c
$ gcc -shared -DV=ANSI  -o Ãµral.dll detect.c -Wl,--out-implib=detect.lib

That produces two DLLs and one import library, detect.lib, with the desired module name octets. A straightforward MSVC cl invocation also works so long as it’s not from a batch file. It will quite correctly warn about the strange name situation, which I like. My test program, main.c:

#include <stdio.h>

char *get(void);
int main(void) { puts(get()); }

I link detect.lib when I build it:

$ cc -o main.exe main.c detect.lib

I designed peports to print non-ASCII octets unambiguously (\xXX), and it’s the only tool I know that does so:

$ peports main.exe | tail -n 2
\xc3\xb5ral.dll
        1       get

The module name has the C3 B5 … prefix octets. When I run it under my system code page, CP-1252:

$ ./main
ANSI

If I add a UTF-8 manifest, even just a “side-by-side” manifest, it loads the other library despite an identical import table:

$ cc -o main.exe main.c detect.lib libwinsane.o
$ ./main
UTF-8

Again, without the manifest, if I switched my system code page to UTF-8 then UTF-8 would still be the result.

I can’t think of much practical use for this trick outside of malware. In a real program it would be simpler to inspect the code page, and there’s no benefit to avoiding such a check if it’s needed. Malware could use it to trick inspection tools and scanners that decode module names differently than the dynamic linker. Such tools often incorrectly assume UTF-8, which is what motivated this article.

Giving C++ std::regex a C makeover

2024-09-04T17:15:07Z

Suppose you’re working in C using one of the major toolchains — that is, it’s mainly a C++ implementation — and you need regular expressions. You could integrate a library, but there’s a regex implementation in the C++ standard library included with your compiler, just within reach. As a resourceful engineer, using an asset already in hand seems prudent. But it’s a C++ interface, and you’re using C instead of C++ for a reason, perhaps to avoid dealing with C++. Have no worries. This article is about wrapping std::regex in a tidy C interface which not only hides all the C++ machinery, but utterly tames it. It’s not so much practical as a potpourri of interesting techniques.

If you’d like to skip ahead, here’s the full source up front. Tested with w64devkit, MSVC cl, and clang-cl: scratch/regex-wrap

Interface design

The C interface I came up with, regex.h:

#pragma once
#include <stddef.h>

#define S(s) (str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} str;

typedef struct {
    char *beg;
    char *end;
} arena;

typedef struct regex regex;

typedef struct {
    str      *data;
    ptrdiff_t len;
} strlist;

regex  *regex_new(str, arena *);
strlist regex_match(regex *, str, arena *);

Longtime readers will find it familiar: my favorite non-owning, counted strings form in place of null-terminated strings — similar to C++ std::string_view — and arena allocation. Yes, such fundamental types wouldn’t “belong” to a regex library like this, but imagine they’re standardized by the project or whatever. Also, this is purely a C header, not a C/C++ polyglot, and will not be used by the C++ portion.

In particular note the lack of “free” functions. The regex engine allocates everything in the arena, including all temporary working memory used while compiling, matching, etc. So in a sense, it could be called a non-allocating library. This requires a bit of C++ abuse: I will not call some C++ regex destructors. It shouldn’t matter because they only redundantly manage memory in the arena. (If regex objects are holding file handles or something else unnecessary then its implementation is so poor as to not be worth using, and we should just use a better regex library.)

Now’s a good time to mention a caveat: In order to pull this off the regex library lives in its own Dynamic-Link Library with its own copy of the C++ standard library, i.e. statically linked. My demo is Windows-only, but this concept theoretically extends to shared objects on Linux. Since it’s a C interface that doesn’t expose standard library objects, the DLL can be used by programs compiled with different toolchains. Though that wouldn’t apply to my inciting hypothetical.

Example usage:

regex  *re = regex_new(S("(\\w+)"), perm);
str     s  = S("Hello, world! This is a test.");
strlist m  = regex_match(re, s, perm);
for (ptrdiff_t i = 0; i < m.len; i++) {
    printf("%2td = %.*s\n", i, (int)m.data[i].len, m.data[i].data);
}

This program prints:

 0 = Hello
 1 = world
 2 = This
 3 = is
 4 = a
 5 = test

If matching lots of source strings, scope the arena to the loop and then the results, and any regex working memory, are automatically freed in O(1) at the end of each iteration:

for (ptrdiff_t i = 0; i < ninputs; i++) {
    arena   scratch = *perm;
    strlist matches = regex_match(re, inputs[i], &scratch);
    // ... consume matches ...
}

C++ implementation

On the C++ side the first thing I do is replace new and delete, which is how I force it to allocate from the arena. This replaces new/delete globally, but recall that the regex library has its own, private C++ implementation. Replacements apply only to itself even if there’s other C++ present in the process. If this is the only C++ in the process then it doesn’t require such careful isolation.

I can’t tell std::regex about the arena — it calls operator new the usual way, without extra arguments — so I have to smuggle it in through a thread-local variable:

static thread_local arena *perm;

If I’m sure the library is only used by a single thread then I can omit thread_local, but it’s useful here to demonstrate and measure. Using it in my operator replacements:

void *operator new(size_t size, std::align_val_t align)
{
    arena    *a     = perm;
    ptrdiff_t ssize = size;
    ptrdiff_t pad   = (uintptr_t)a->end & ((int)align - 1);
    if (ssize < 0 || ssize > a->end - a->beg - pad) {
        throw std::bad_alloc{};
    }
    return a->end -= size + pad;
}

void *operator new(size_t size)
{
    return operator new(
        size,
        std::align_val_t(__STDCPP_DEFAULT_NEW_ALIGNMENT__)
    );
}

Starting in C++17, replacing the global allocator requires definitions for both plain new/delete and aligned new/delete. The many other variants, including arrays, call these four and so may be skipped. Allocating over-aligned objects isn’t a special case for arenas, so I implemented plain new by calling aligned new. I’d prefer to allocate through a template so that I can “see” the type, but that’s not an option in this case.

After converting to signed sizes because they’re simpler, it’s the usual from-the-end allocation. I prefer -fno-exceptions but std::regex is inherently exceptional — and I mean that in at least two bad ways — so they’re required. The good news is this library gracefully and reliably handles out-of-memory errors. (The arena makes this trivial to test, so try it for yourself!)
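For readers on the C side of the interface, here is a rough C counterpart of that from-the-end allocation, returning null where the C++ version throws. The function name alloc is hypothetical. Like the operator above, it assumes align is a power of two and that size is a multiple of the alignment actually required, as with sizeof/alignof pairs.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Allocate size bytes from the end of the arena, or return null when the
// arena is exhausted. Assumes align is a power of two.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align)
{
    // Bytes to drop so that the current end becomes aligned.
    ptrdiff_t pad = (ptrdiff_t)((uintptr_t)a->end & (uintptr_t)(align - 1));
    if (size < 0 || size > a->end - a->beg - pad) {
        return 0;  // out of memory
    }
    return a->end -= size + pad;
}
```

Copying the arena struct by value, as in the scratch loop shown earlier, means allocations made through the copy vanish when the copy goes out of scope, while the original is untouched.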

I added a little extra flair replacing delete:

void operator delete(void *) noexcept {}
void operator delete(void *, std::align_val_t) noexcept {}

void operator delete(void *p, size_t size) noexcept
{
    arena *a = perm;
    if (a->end == (char *)p) {
        a->end += size;
    }
}

The two mandatory replacements are no-ops because that’s simply how arenas work. We don’t free individual objects, but many at once. It’s completely optional, but I also replaced sized delete for little other reason than sized deallocation is cool. C++ destructs in reverse order, so this is likely to work out. At least with GCC libstdc++, it freed about a third of the workspace memory before returning to C. I’d rather it didn’t try to free anything at all, but since it’s going to call delete anyway I can get some use out of it.

Interesting side note: In a rough benchmark these replacements made MSVC std::regex matching four times faster! I expected a small speedup, but not that. In the typical case it appears to be wasting most of its time on allocation. On the other hand, libstdc++ std::regex is overall quite a bit slower than MSVC, and my replacements had no performance effect. It’s spending its time elsewhere, and the small gains are lost interacting with the thread-local.

Finally the meat:

extern "C" std::regex *regex_new(str re, arena *a)
{
    perm = a;
    try {
        return new std::regex(re.data, re.data+re.len);
    } catch (...) {
        return {};
    }
}

It sets the thread-local to the arena, then constructs with “iterators” at each end of the input. All exceptions are caught and turned into a null return. Depending on need, we may want to indicate why it failed — out of memory, invalid regex, etc. — by returning an error value of some sort. An exercise for the reader.

The matcher is a little more complicated:

extern "C" strlist regex_match(std::regex *re, str s, arena *a)
{
    perm = a;
    try {
        std::cregex_iterator it(s.data, s.data+s.len, *re);
        std::cregex_iterator end;

        strlist r = {};
        r.len  = std::distance(it, end);
        r.data = new str[r.len]();
        for (ptrdiff_t i = 0; it != end; it++, i++) {
            r.data[i].data = s.data + it->position();
            r.data[i].len  = it->length();
        }
        return r;

    } catch (...) {
        return {};
    }
}

I create a char * “cregex” iterator, again giving it each end of the input. I hope it’s not just making a copy (MSVC std::regex does grumble grumble). The result is allocated out of the arena. As before, exceptions convert to a null return. Callers can distinguish errors because no-match results have a non-null pointer. The iterator, being a local variable, is destroyed before returning, uselessly calling delete. I could avoid this by allocating it with new, but in practice it doesn’t matter.

You might have noticed the lack of __declspec(dllexport). DEF files are great, and I’ve come to appreciate and prefer them. GCC and MSVC accept them as another input on the command line, and the source need not be aware of exports. My regex.def:

LIBRARY regex
EXPORTS
regex_new
regex_match

In w64devkit, the command to build the DLL:

$ g++ -shared -std=c++17 -o regex.dll regex.cpp regex.def

The MSVC command almost maps 1:1 to the GCC command:

$ cl /LD /std:c++17 /EHsc regex.cpp regex.def

In either case only the C interface is exported (via peports):

$ peports -e regex.dll
EXPORTS
        1       regex_match
        2       regex_new

Reasons against

Though this library is conveniently on hand, and my minimalist C wrapper interface is nicer than a typical C regex library interface, and even hides some std::regex problems, trade-offs must be considered:

No Unicode support, particularly UTF-8
std::regex implementations are universally poor and slow
libstdc++ std::regex is especially slow to compile
Isolating in a DLL (if needed) is inconvenient
DLL is 200K (MSVC) to 700K (GCC) or so

Depending on what I’m doing, some of these may have me looking elsewhere.

Deep list copy: More than meets the eye

2024-07-31T18:49:57Z

I recently came across a take-home C programming test which had more depth and complexity than I suspect the interviewer intended. While considering it, I also came up with a novel, or at least unconventional, solution. The problem is to deep copy a linked list where each node references a random list element in addition to usual linkage — similar to LeetCode problem 138. This reference is one of identity rather than value, which has murky consequences.

typedef struct node node;
struct node {
    node *next;
    node *ref;   // arbitrary node in the list, or null
};

node *deepcopy(node *);

In the copy, nodes have individual lifetimes allocated using malloc which the caller is responsible for freeing. While thickheaded, this is conventional, and I cannot blame the test’s designer for sticking to familiar textbook concepts. My special solution handles this constraint in stride. (In a well-written program the whole list would have a single lifetime likely shared with yet more objects.)

Ignoring ref, copying the normal list linkage is trivial. Walk the original list, allocate a new node each iteration, and append it to the result. The hard part is resolving ref. Given an arbitrary node pointer, we must determine to which of the original list nodes it points, then find the node at the matching position in the new list. Naively we could scan the old list to search for a match:

node *old = oldlist;
node *new = newlist;
while (old) {
    if (old->ref) {
        node *findold = oldlist;
        node *findnew = newlist;
        for (;;) {
            if (old->ref == findold) {
                new->ref = findnew;
                break;
            }
            findold = findold->next;
            findnew = findnew->next;
        }
    }
    old = old->next;
    new = new->next;
}

The nested loops are obviously quadratic time. That won’t scale well. To do better we need some way to map, by identity, old list nodes onto new list nodes. However, pointers do not necessarily have a value on which we could key a map. Other languages do not even expose such a concept, or at least hide it behind some “unsafe” mechanism. In that case it seems the best we could do is quadratic time.

Solution by temporary mutation

If we’re free to temporarily modify the original list, then we can use memory as a map. After all, memory itself is a kind of pointer-to-object map! Since we only get one such map per process, we’ll need to commandeer the original list during the copy. The trick is to interleave the two lists when constructing the new list:

old1 -> new1 -> old2 -> new2 -> ... -> null

That might look like (note the double-skip per iteration):

for (node *old = oldlist; old; old = old->next->next) {
    node *new = malloc(sizeof(*new));
    new->ref  = 0;
    new->next = old->next;
    old->next = new;
}

When we have a pointer to an old list node, the node itself points to the matching new list node.

for (node *old = oldlist; old; old = old->next->next) {
    if (old->ref) {
        old->next->ref = old->ref->next;
    }
}

Then before returning we’d need to deinterleave the lists, restoring the old list and separating it from the result. This solution is linear time and doesn’t require dealing with the concept of identity. Though modifying the original list isn’t always possible. That won’t work if it’s accessed concurrently — shared with another thread, accessed in a signal handler, or something else reentrant — or if it’s in read-only memory.
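Putting the three passes together, here is a sketch of the whole temporary-mutation approach, including the deinterleaving step that was only described. The function name deepcopy_interleave is hypothetical, and allocation failure simply aborts to keep the sketch short.

```c
#include <stdlib.h>

typedef struct node node;
struct node {
    node *next;
    node *ref;   // arbitrary node in the list, or null
};

static node *deepcopy_interleave(node *oldlist)
{
    // Pass 1: insert a fresh copy directly after each original node.
    for (node *old = oldlist; old; old = old->next->next) {
        node *new = malloc(sizeof(*new));
        if (!new) {
            abort();  // sketch only: no recovery on allocation failure
        }
        new->ref  = 0;
        new->next = old->next;
        old->next = new;
    }

    // Pass 2: the copy of any old node is that node's next.
    for (node *old = oldlist; old; old = old->next->next) {
        if (old->ref) {
            old->next->ref = old->ref->next;
        }
    }

    // Pass 3: deinterleave, restoring the original list.
    node *head = oldlist ? oldlist->next : 0;
    for (node *old = oldlist; old; old = old->next) {
        node *new = old->next;
        old->next = new->next;                        // skip over the copy
        new->next = old->next ? old->next->next : 0;  // link to the next copy
    }
    return head;
}
```

Each pass is a single walk, so the total work is linear in the list length, and the original list is bit-for-bit restored on return.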

Solution by intrusive hash map

If we can obtain a stable value from a pointer, i.e. uintptr_t — in practice virtually always true — then there’s an interesting O(n log n) solution using an intrusive map which doesn’t modify the original list. This is my own novel solution. The result will be simultaneously a linked list and a hash map, and the caller won’t even know it! Because the map is built into the list, with a caller-managed lifetime, we won’t free anything before returning.

To start, linked list nodes are embedded at the front of hash trie nodes. The caller will see this initial field, but not the hash trie fields. Being at the front, the caller can still free them by this “internal” pointer, which allows the hash trie to be invisible.

typedef struct map map;
struct map {
    node  new;
    map  *child[4];
    node *old;
};

The “key” is old and the “value” is new. Lookup and insert use the usual “upsert” construction oriented around zero-initialization:

node *upsert(map **m, node *old)
{
    if (!old) {
        return 0;  // map null to null
    }

    uint64_t hash = (uintptr_t)old * 1111111111111111111ull;
    for (; *m; hash <<= 2) {
        if (old == (*m)->old) {
            return &(*m)->new;
        }
        m = &(*m)->child[hash>>62];
    }

    *m = calloc(1, sizeof(map));
    (*m)->old = old;
    return &(*m)->new;
}

If the matching node doesn’t yet exist, the function creates it. Also note how it returns an internal pointer. With “upsert” semantics, loop copying is trivialized:

node *deepcopy(node *head)
{
    map *m = 0;
    for (node *old = head; old; old = old->next) {
        node *new = upsert(&m, old);
        new->next = upsert(&m, old->next);
        new->ref  = upsert(&m, old->ref);
    }
    return upsert(&m, head);
}

These easy-to-implement hash tries continue to be generally useful and elegant, even with traditional memory management. Cloneable, runnable source with tests is available as a gist if you’d like to play around with it yourself.
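For convenience, here are the pieces above assembled into one compilable sketch, plus a caller-side cleanup to show that the copy can be freed like any ordinary list. The freelist function is a hypothetical addition, and I added an abort on allocation failure that the listing above leaves out.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct node node;
struct node {
    node *next;
    node *ref;
};

typedef struct map map;
struct map {
    node  new;       // first member, so &m->new is also the calloc pointer
    map  *child[4];
    node *old;
};

static node *upsert(map **m, node *old)
{
    if (!old) {
        return 0;  // map null to null
    }
    uint64_t hash = (uintptr_t)old * 1111111111111111111ull;
    for (; *m; hash <<= 2) {
        if (old == (*m)->old) {
            return &(*m)->new;
        }
        m = &(*m)->child[hash>>62];
    }
    *m = calloc(1, sizeof(map));
    if (!*m) {
        abort();  // sketch only: no recovery on allocation failure
    }
    (*m)->old = old;
    return &(*m)->new;
}

static node *deepcopy(node *head)
{
    map *m = 0;
    for (node *old = head; old; old = old->next) {
        node *new = upsert(&m, old);
        new->next = upsert(&m, old->next);
        new->ref  = upsert(&m, old->ref);
    }
    return upsert(&m, head);
}

// Hypothetical caller-side cleanup. The caller only knows about node, yet
// each free() releases the whole hidden map entry around it.
static void freelist(node *head)
{
    while (head) {
        node *next = head->next;
        free(head);
        head = next;
    }
}
```

The caller never learns that each node is larger than it looks. The cost of the trick is the extra trie fields carried by every node for the lifetime of the copy.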

Arenas and the almighty concatenation operator

2024-05-25T00:00:00Z

I continue to streamline an arena-based paradigm, and stumbled upon a concise technique for dynamic growth — an efficient, generic “concatenate anything to anything” within an arena built atop a core of 9-ish lines of code. The key insight originated from a reader suggestion about dynamic arrays. The subject of concatenation can be a string, dynamic array, or even something else. The “system” is extensible, and especially useful for path handling.

Continuing from last time, the examples are in light, C-style C++. I chose it because templates and function overloading express the concepts succinctly. It uses no standard library functionality, so converting to C, or similar, should be straightforward. The core concatenation “operator”:

template<typename T>
T concat(arena *a, T head, T tail)
{
    if ((char *)(head.data+head.len) != a->beg) {
        head = T{a, head};
    }
    head.len += T{a, tail}.len;
    return head;
}

This concatenates two objects of the same type in the arena, and does so in place if possible. That is, we can efficiently build a value piece by piece. The type T must have data and len members, and a “copy” constructor that makes a copy of the given object at the front of the arena. Size integer overflows and out-of-memory errors are, as usual, handled by the arena. In particular, note that the len addition happens after allocation.

Since the front-of-the-arena business is implicit, consider asserting it if you’re worried. I’ve also considered declaring a clone “operator” where that behavior is an explicit part of its interface.

// Make a copy of the object at the front of the arena.
template<typename T> T clone(arena *, T);

// In concat, replace the T{} constructors with clone:
    head = clone(a, head);
    head.len += clone(a, tail).len;

Strings are perhaps the most interesting subject of concatenation. Here’s a compatible string, str, definition from my previous article:

struct str {
    union {
        uint8_t    *data = 0;
        char const *cdata;
    };
    ptrdiff_t len = 0;

    str() = default;

    str(uint8_t *beg, uint8_t *end) : data{beg}, len{end-beg} {}

    template<ptrdiff_t N>
    constexpr str(char const (&s)[N]) : cdata{s}, len{N-1} {}

    str(arena *, str);  // TODO

    uint8_t &operator[](ptrdiff_t i) { return data[i]; }
};

This has data, len, and the necessary constructor declaration. Before showing the constructor definition, here’s an arena following the usual formula, which should be familiar to those who’ve been following along:

struct arena {
    char *beg;
    char *end;
};

template<typename T, typename ...A>
T *makefront(ptrdiff_t count, arena *a, A ...args)
{
    ptrdiff_t size  = sizeof(T);
    ptrdiff_t align = -(uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - align)/size);  // OOM
    T *r = (T *)(a->beg + align);
    a->beg += align + size*count;
    for (ptrdiff_t i = 0; i < count; i++) {
        new (r+i) T(args...);
    }
    return r;
}

Note how it bumps beg, not end, because it’s allocated at the front. That opens the end of the object for concatenation. When it returns, beg points just past the end of the new object, aligned to it. Later, concat inspects beg to see if it can extend in place. That will be true if nothing else has been allocated at the front in the meantime. That is, we can allocate objects at the end — such as hash map nodes — while efficiently growing an object at the front through concatenation. If it’s not true for whatever reason, concatenation still works, just with reduced efficiency.

With that out of the way, the “copy” constructor is simple:

str::str(arena *a, str s)
{
    data = makefront<uint8_t>(s.len, a);
    len = s.len;
    for (ptrdiff_t i = 0; i < len; i++) {
        data[i] = s[i];
    }
}

That’s everything we need to put it into action. For example, a function that deletes a file at a path following a path template.

char *tocstr(arena *a, str s)
{
    return (char *)concat(a, s, str{"\0"}).data;
}

bool removeconfig(str home, str program, arena scratch)
{
    str path = {};
    path = concat(&scratch, path, home);
    path = concat(&scratch, path, str{"/.config/"});
    path = concat(&scratch, path, program);
    path = concat(&scratch, path, str{"/rc"});
    return !unlink(tocstr(&scratch, path));
}

First, concat does all the heavy lifting in a null-terminated “C string” conversion function that operates in place if possible. In removeconfig I construct a path from path components, starting from a zero-initialized null string. In the first concat, this null string is “copied” into the arena, laying a foundation for additional concatenations. Each path component is copied in place, so unlike a dumb strcat, it’s not quadratic.

Even more, notice it supports arbitrary path lengths. No PATH_MAX, MAX_PATH, etc., it grows into the arena as needed. No huge stack variables necessary, and the scratch arena automatically frees the path on return. Fancier yet, imagine a variadic function that glues path components together with the proper path delimiter, and it wouldn’t involve a single, error-prone size calculation.

The str{} business is unfortunate. The char array constructor normally kicks in in these situations, but compilers can’t resolve the template without an explicit str object. Perhaps there’s a workaround, but I’m not yet savvy enough with C++ to figure it out. In the C version you’d always need to wrap those literals in the string macro.
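For comparison, here is a rough sketch of how the string case might look in plain C, with the S macro wrapping literals. This is my own assumption of a C rendition, not code from the article's source: clone and concat are specialized to one string type, bytes are char rather than uint8_t, and running out of arena space simply aborts.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

typedef struct {
    char     *data;
    ptrdiff_t len;
} str;

#define S(s) (str){s, (ptrdiff_t)sizeof(s)-1}

// Copy s to the front of the arena. Bytes need no alignment padding.
static str clone(arena *a, str s)
{
    if (s.len > a->end - a->beg) {
        abort();  // sketch only: out of memory
    }
    str r = {a->beg, s.len};
    if (s.len) {
        memcpy(r.data, s.data, (size_t)s.len);
    }
    a->beg += s.len;
    return r;
}

// Append tail to head, extending in place when head already ends exactly
// at the front of the arena.
static str concat(arena *a, str head, str tail)
{
    if (!head.data || head.data + head.len != a->beg) {
        head = clone(a, head);
    }
    head.len += clone(a, tail).len;
    return head;
}
```

The null check on head.data exists because null pointer arithmetic is not something I want to rely on in C. Otherwise the shape matches the template, with the length added only after the allocation succeeds.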

Extending concatenation

The “operator” can be extended by defining more overloads. For example, to concatenate 32-bit integers to a string:

str concat(arena *a, str s, int32_t x)
{
    uint8_t  buf[16];
    uint8_t *end = buf + countof(buf);
    uint8_t *beg = end;
    int32_t  neg = x<0 ? x : -x;
    do {
        *--beg = '0' - neg%10;
    } while (neg /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    return concat(a, s, {beg, end});
}

Now we can, say, construct a randomly-generated temporary path:

str path = {};
path = concat(&scratch, path, tempdir);
path = concat(&scratch, path, str{"/temp"});
int32_t id = rand32(&rng);
path = concat(&scratch, path, id);

Keep adding more definitions like this and you’ll have something like, or complementing, buffered output. It doesn’t stop there. Code points concatenated as UTF-8:

str concat(arena *a, str s, char32_t rune)
{
    enum { REPLACEMENT_CHARACTER = 0xfffd };
    if (rune>=0xd800 && rune<=0xdfff) {
        rune = REPLACEMENT_CHARACTER;
    }

    uint8_t  buf[4];
    uint8_t *end = 0;
    if (rune < 0x80) {
        buf[0] = rune;
        end = buf + 1;
    } else if (rune < 0x800) {
        buf[0] =  (rune >>  6)         | 0xc0;
        buf[1] = ((rune >>  0) & 0x3f) | 0x80;
        end = buf + 2;
    } else if (rune < 0x10000) {
        buf[0] =  (rune >> 12)         | 0xe0;
        buf[1] = ((rune >>  6) & 0x3f) | 0x80;
        buf[2] = ((rune >>  0) & 0x3f) | 0x80;
        end = buf + 3;
    } else {
        buf[0] =  (rune >> 18)         | 0xf0;
        buf[1] = ((rune >> 12) & 0x3f) | 0x80;
        buf[2] = ((rune >>  6) & 0x3f) | 0x80;
        buf[3] = ((rune >>  0) & 0x3f) | 0x80;
        end = buf + 4;
    }
    return concat(a, s, {buf, end});
}

That composes well for general UTF-8 handling. For example, to ingest Win32 strings (arguments, paths, etc.):

str convert(arena *perm, char16_t *s)
{
    str r = {};
    while (*s) {
        char32_t rune = decode(&s);
        r = concat(perm, r, rune);
    }
    return r;
}

Beyond strings

One of my most useful C++ templates has been a span structure:

template<typename T>
struct span {
    T        *data = 0;
    ptrdiff_t len  = 0;

    span() = default;

    span(T *beg, T *end) : data{beg}, len{end-beg} {}

    span(arena *, span);  // for concat

    T &operator[](ptrdiff_t i) { return data[i]; }
};

The span::span definition looks exactly like str::str. In fact, we could nearly define strings as uint8_t spans:

typedef span<uint8_t> str;  // hypothetical

Though I’ve found strings to be just special enough not to be worth it.

This span definition is now fleshed out sufficiently to use concat with no additional definitions! However, outside of strings, concatenating spans is unusual. More often we want to append individual elements. Again, we can build on that core concat template:

template<typename T>
span<T> concat(arena *a, span<T> s, T v)
{
    return concat(a, s, span{&v, &v+1});
}

Now span is ready for 99% of its use cases. For example:

    span<int32_t> squares;
    for (int32_t i = 1; i <= 1000; i++) {
        squares = concat(&scratch, squares, i*i);
    }

It’s often good enough, but it’s not ideal as a general purpose dynamic array. Each append makes a trip through arena allocation, and this span cannot efficiently shrink and then grow again. Sometimes we’d like to track capacity, covering both those cases.

template<typename T>
struct list {
    T        *data = 0;
    ptrdiff_t len  = 0;
    ptrdiff_t cap  = 0;

    list() = default;

    list(arena *, list);  // for concat

    T &operator[](ptrdiff_t i) { return data[i]; }
};

Unfortunately cap is a curve ball that the core template can’t handle, requiring a slightly more complex definition. Since concatenating whole list objects is unusual, a definition for appending single elements:

template<typename T>
list<T> concat(arena *a, list<T> s, T v)
{
    if (s.len == s.cap) {
        if ((char *)(s.data+s.len) != a->beg) {
            s = list<T>{a, s};
        }
        ptrdiff_t extend = s.cap ? s.cap : 4;
        makefront<T>(extend, a);
        s.cap += extend;
    }
    s[s.len++] = v;
    return s;
}

Note how inside the if it’s basically the same core definition. As before, this definition extends in place if possible, but otherwise handles it correctly anyway. In addition to the above concerns, this list is more suited to having multiple “open” dynamic arrays at once.
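
For readers following along in C, here is a minimal sketch of the same idea, specialized to int32_t since C has no templates. The names i32list and push, this particular makefront, and the abort-on-exhaustion policy are all my stand-ins for illustration, not the definitions used in the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg, *end; } arena;

typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} i32list;

// Claim room for count elements at the front of the arena.
// Assumed policy: abort when the arena is exhausted.
static int32_t *makefront(arena *a, ptrdiff_t count)
{
    ptrdiff_t size = sizeof(int32_t);
    ptrdiff_t pad  = (ptrdiff_t)(-(uintptr_t)a->beg & (_Alignof(int32_t) - 1));
    if (count >= (a->end - a->beg - pad)/size) {
        abort();
    }
    int32_t *r = (int32_t *)(a->beg + pad);
    a->beg += pad + count*size;
    return r;
}

static i32list push(arena *a, i32list s, int32_t v)
{
    if (s.len == s.cap) {
        if (!s.data || (char *)(s.data + s.cap) != a->beg) {
            // Not at the arena tip: move the elements there first.
            int32_t *copy = makefront(a, s.len);
            if (s.len) {
                memcpy(copy, s.data, s.len*sizeof(int32_t));
            }
            s.data = copy;
            s.cap  = s.len;
        }
        // The list now ends exactly at the tip, so the next claim is
        // contiguous with it: the list grows in place.
        ptrdiff_t extend = s.cap ? s.cap : 4;
        makefront(a, extend);
        s.cap += extend;
    }
    s.data[s.len++] = v;
    return s;
}
```

Two lists can be interleaved on the same arena. Whichever one is not at the tip when it fills up gets copied there, and the other keeps working from its old storage.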

This concatenative concept has been a useful way to think about a variety of situations in order to solve them effectively with arena allocation.

Update: NRK sharply points out that “extend in place” as expressed in concat is incompatible with the alloc_size and malloc GCC function attributes, which I’ve suggested in the past. While considering how to mitigate this, we’ve also discovered that alloc_size has always been fundamentally broken in GCC. Correct use is impossible, and so it must not be used.

Guidelines for computing sizes and subscripts

2024-05-24T22:25:10Z

Occasionally we need to compute the size of an object that does not yet exist, or a subscript that may fall out of bounds. It’s easy to miss the edge cases where results overflow, creating a nasty, subtle bug, even in the presence of type safety. Ideally such computations happen in specialized code, such as inside an allocator (calloc, reallocarray) and not outside by the allocatee (i.e. malloc). Mitigations exist with different trade-offs: arbitrary precision, or using a wider fixed integer — i.e. 128-bit integers on 64-bit hosts. In the typical case, working only with fixed size-type integers, I’ve come up with a set of guidelines to avoid overflows in the edge cases.

1. Range check before computing a result. No exceptions.
2. Do not cast unless you know a priori the operand is in range.
3. Never mix unsigned and signed operands. Prefer signed. If you need to convert an operand, see (2).
4. Do not add unless you know a priori the result is in range.
5. Do not multiply unless you know a priori the result is in range.
6. Do not subtract unless you know a priori both signed operands are non-negative. For unsigned, that the second operand is not larger than the first (treat it like (4)).
7. Do not divide unless you know a priori the denominator is positive.
8. Make it correct first. Make it fast later, if needed.

These guidelines are also useful when reviewing code, tracking in your mind whether the invariants are held at each step. If not, you’ve likely found a bug. If in doubt, use assertions to document and check invariants. I compiled this list during code review, so for me that’s where it’s most useful.

Range check, then compute

Not strictly necessary when overflow is well-defined, i.e. wraparound, but it’s like defensive driving. It’s simpler and clearer to check with basic arithmetic rather than reason from a wraparound, i.e. a negative result. Checked math functions are fine, too, if you check the overflow boolean before accessing the result.

// bad
len++;
if (len <= 0) error();

// good
if (len == MAX) error();
len++;
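
To make the checked-math remark concrete, here is a small sketch of both styles as complete functions. The function names are made up for illustration, and __builtin_add_overflow is a GCC and Clang extension rather than standard C (C23 offers ckd_add in <stdckdint.h> for the same job).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

// Range check first, then compute.
static bool increment(int *len)
{
    if (*len == INT_MAX) {
        return false;
    }
    ++*len;
    return true;
}

// Checked math: inspect the overflow flag before touching the result.
static bool increment_checked(int *len)
{
    int r;
    if (__builtin_add_overflow(*len, 1, &r)) {
        return false;  // overflowed, so r must not be used
    }
    *len = r;
    return true;
}
```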

Casting

Casting from signed to unsigned, it’s as simple as knowing the value is non-negative, which is likely if you’re following (1). If a negative size has appeared, there’s already been a bug earlier in the program, and the only reasonable course of action is to abort, not handle it like an error.
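
A sketch of what that looks like at a boundary where a signed size meets an unsigned interface. The copy wrapper is hypothetical, and the assertion is the abort.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical wrapper: signed length in, unsigned length out to memcpy.
static void copy(void *dst, const void *src, ptrdiff_t len)
{
    assert(len >= 0);  // a negative size is a bug upstream, not an error
    if (len) {
        memcpy(dst, src, (size_t)len);  // cast is in range: len > 0
    }
}
```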

Addition

To check if addition will overflow, subtract one of the operands from the maximum value.

if (b > MAX - a) error();
r = a + b;

In pointer arithmetic addition, it’s a common mistake to compute the result pointer then compare it to the bounds. If the check failed, then the pointer already overflowed, i.e. undefined behavior. Major pieces of software, like glibc, are riddled with such pointer overflows. (Now that you’re aware of it, you’ll start noticing it everywhere. Sorry.)

// bad: never do this
beg += size;
if (beg > end) error();

To do this correctly, check integers not pointers. Like before, subtract before adding.

available = end - beg;
if (size > available) error();
beg += size;

Mind mixing signed and unsigned operands for the comparison operator (3), e.g. an unsigned size on the left and signed difference on the right.
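
Putting those pieces together, here is a minimal sketch of a bump over a region when the caller insists on an unsigned size. The region type and take function are hypothetical. The signed difference is converted per (2) before the comparison, so the operands never mix.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical bump region over [beg, end), with invariant beg <= end.
typedef struct {
    char *beg;
    char *end;
} region;

// Returns null if fewer than size bytes remain.
static char *take(region *r, size_t size)
{
    ptrdiff_t available = r->end - r->beg;
    assert(available >= 0);           // the invariant makes the cast safe
    if (size > (size_t)available) {   // compare integers, not pointers
        return 0;
    }
    char *p = r->beg;
    r->beg += size;                   // known to be in bounds
    return p;
}
```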

Multiplication and division

If you’re working this out on your own, multiplication seems tricky until you’ve internalized a simple pattern. Just as we subtracted before adding, we need to divide before multiplying. Divide the maximum value by one of the operands:

if (a>0 && b>MAX/a) error();
r = a * b;

It’s often permitted for one or both to be zero, so mind divide-by-zero, which is handled above by the first condition. Sometimes size must be positive, e.g. the result of the sizeof operator in C, in which case we should prefer it as the denominator.

assert(size  >  0);
assert(count >= 0);
if (count > MAX/size) error();
total = count * size;
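
Wrapped up as a function, that might look like the sketch below. The name total_size and the choice of ptrdiff_t with PTRDIFF_MAX as the maximum are assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Computes count*size into *total, or returns false if it would overflow.
static bool total_size(ptrdiff_t count, ptrdiff_t size, ptrdiff_t *total)
{
    assert(size  >  0);  // safe to use as the denominator
    assert(count >= 0);
    if (count > PTRDIFF_MAX/size) {
        return false;
    }
    *total = count * size;
    return true;
}
```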

With arena allocation there are usually two concerns. First, will it overflow when computing the total size, i.e. count * size? Second, is the total size within the arena capacity? Naively that’s two checks, but we can kill two birds with one stone: Check both at once by using the current arena capacity as the maximum value when considering overflow.

if (count > (end - beg)/size) error();
total = count * size;

One condition pulling double duty.

Subtraction

With signed sizes, the negative range is a long “runway” allowing a single unchecked subtraction before overflow might occur. In essence, we were exploiting this in order to check addition. The most common mistake with unsigned subtraction is not accounting for overflow when going below zero.

// note: signed "i" only
for (i = end - stride; i >= beg; i -= stride) ...

This loop will go awry if i is unsigned and beg <= stride.
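
If the index really must be unsigned, one way to restructure the loop is to check before each subtraction, per (6). This is only a sketch: the count_steps function is hypothetical and merely counts the indices the signed loop would have visited.

```c
#include <assert.h>
#include <stddef.h>

// Visits end-stride, end-2*stride, ... for as long as the index stays
// at or above beg, and returns how many indices were visited.
static size_t count_steps(size_t beg, size_t end, size_t stride)
{
    assert(stride > 0);
    size_t n = 0;
    size_t i = end;
    // i - beg cannot wrap because i >= beg is checked first, and
    // i -= stride cannot wrap because i - beg >= stride.
    while (i >= beg && i - beg >= stride) {
        i -= stride;
        n++;  // "visit" index i here
    }
    return n;
}
```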

In special cases we can get away with a second subtraction without an overflow check if we know some properties of our operands. For example, my arena allocators look like this:

padding = -beg & (align - 1);
if (count >= (end - beg - padding)/size) error();

That’s two subtractions in a row. However, end - beg describes the size of a realized object, and align is a small constant (e.g. 2^(0–6)). It could only overflow if the entirety of memory was occupied by the arena.

Bonus, advanced note: This check is actually pulling triple duty. Notice that I used >= instead of >. The arena can’t fill exactly to the brim, but it handles the extreme edge case where count is zero, the arena is nearly full, but the bump pointer is unaligned. The result of subtracting padding is negative, which rounds to zero by integer division, and would pass a > check. That wouldn’t be a problem except that aligning the bump pointer would break the invariant beg <= end.
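
To see the check in context, here is a minimal sketch of a complete front-bumping allocator built around it. The parameter order and the null-on-failure policy are assumptions for the example, and it skips zeroing the allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Returns null when out of memory (an assumed policy).
static void *alloc(arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    assert(count >= 0);
    assert(size  >  0);
    assert(align >  0 && !(align & (align - 1)));  // power of two

    ptrdiff_t padding = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    // One condition, three jobs: multiplication overflow, arena
    // capacity, and the zero-count, unaligned, nearly-full edge case.
    if (count >= (a->end - a->beg - padding)/size) {
        return 0;
    }
    void *r = a->beg + padding;
    a->beg += padding + count*size;
    return r;
}
```

In the edge case, the failed zero-count request leaves beg untouched, so beg <= end still holds afterward.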

Try it for yourself

Next time you’re reviewing code that computes sizes or subscripts, bring the list up and see how well it follows the guidelines. If it misses one, try to contrive an input that causes an overflow. If it follows guidelines and you can still contrive such an input, then perhaps the list could use another item!

Speculations on arenas and custom strings in C++

2024-04-14T00:39:18Z

Update September 2025: This article has a followup with corrections.

My techniques with arena allocation and strings are oriented around C. I’m always looking for a better way, and lately I’ve been experimenting with building them using C++ features. What are the trade-offs? Are the benefits worth the costs? In this article I lay out my goals, review implementation possibilities, and discuss my findings. Following along will require familiarity with those previous two articles.

Some of C++ is beyond my mental capabilities, and so I cannot wield those parts effectively. Other parts I can wrap my head around, but it requires substantial effort and the inevitable mistakes are difficult to debug. So a general goal is to minimize contact with that complexity, only touching a few higher-value features that I can use confidently.

Existing practice is unimportant. I’ve seen where that goes. Like the C standard library, the C++ standard library offers me little. Its concepts regarding ownership and memory management are irreconcilable (move semantics, smart pointers, etc.), so I have to build from scratch anyway. So absolutely no including C++ headers. The most valuable features are built right into the language, so I won’t need to include library definitions.

No public or private. Still no const beyond what is required to access certain features. This means I can toss out a bunch of keywords like class, friend, etc. It eliminates noisy, repetitive code and interfaces — getters, setters, separate const and non-const — which in my experience means fewer defects.

No references beyond mandatory cases. References hide addresses being taken (or merely imply it, when it’s actually an expensive copy), which is an annoying experience when reading unfamiliar C++. After all, for arenas the explicit address-taking (permanent) or copying (scratch) is a critical part of communicating the interfaces.

In theory constexpr could be useful, but it keeps falling short when I try it out, so I’m ignoring it. I’ll elaborate in a moment.

Minimal template use. They blow up compile times and code size, they’re noisy, and in practice they make debug builds (i.e. -O0) much slower (typically ~10x) because there’s no optimization to clean up the mess. I’ll only use them for a few foundational purposes, such as allocation. (Though this article is about the fundamental stuff.)

No methods aside from limited use of operator overloads. I want to keep a C style, plus methods just look ugly without references: obj->func() vs. func(obj). (Why are we still writing -> in the 21st century?) Function overloading can instead differentiate “methods.” Overloads are acceptable in moderation, especially because I’m paying for it (symbol decoration) whether or not I take advantage.

Finally, no exceptions of course. I assume -fno-exceptions, or the local equivalent, is active.

Allocation

Let’s start with allocation. Since writing that previous article, I’ve streamlined arena allocation in C:

#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)

typedef struct {
    byte *beg;
    byte *end;
} arena;

static byte *alloc(arena *a, size objsize, size align, size count)
{
    assert(count >= 0);
    size pad = (uptr)a->end & (align - 1);
    assert(count < (a->end - a->beg - pad)/objsize);  // oom
    return memset(a->end -= objsize*count + pad, 0, objsize*count);
}

(As needed, replace the second assert with whatever out of memory policy is appropriate.) Then allocating, say, a 10k-element hash table (i.e. to keep it off the stack):

    i16 *seen = new(&scratch, i16, 1<<14);

With C++, I initially tried placement new with the arena as the “place” for the allocation:

void *operator new(size_t, arena *);  // avoid this

Then to create a single object:

    object *o = new (&scratch) object{};

This exposes the constructor, but everything else about it is poor. It relies on complex, finicky rules governing new overloads, especially for alignment handling. It’s difficult to tell what’s happening, and it’s too easy to make mistakes that compile. That doesn’t even count the mess that is array new[].

I soon learned it’s better to replace the new macro with a template, which can actually see what it’s doing. I can’t call it new in C++, so I settled on make instead:

template<typename T>
static T *make(arena *a, size count = 1)
{
    assert(count >= 0);
    size objsize = sizeof(T);
    size align   = alignof(T);
    size pad     = (uptr)a->end & (align - 1);
    assert(count < (a->end - a->beg - pad)/objsize);  // oom
    a->end -= objsize*count + pad;
    T *r = (T *)a->end;
    for (size i = 0; i < count; i++) {
        new ((void *)&r[i]) T{};
    }
    return r;
}

Then allocating that hash table becomes:

    i16 *seen = make<i16>(&scratch, 10000);

Or a single object, relying on the default argument:

    object *o = make<object>(&scratch);

Due to placement new, merely for invoking the constructor, these objects aren’t just zero-initialized, but value-initialized. It can only construct objects that define an empty initializer, but in exchange unlocks some interesting possibilities:

struct mat3 {
    f32 data[9] = {
        1, 0, 0,
        0, 1, 0,
        0, 0, 1,
    };
};

struct list {
    node  *head = 0;
    node **tail = &head;
};

When a zero-initialized state isn’t ideal, objects can still initialize to a more useful state straight out of the arena. The second case is even self-referencing, which is specifically supported through placement new. Otherwise you’d need a specially written copy or move constructor.

make could accept constructor arguments and perfect forward them to a constructor. However, that’s too far into the dark arts for my comfort, plus it requires a correct definition of std::forward. In practice that means #include-ing it, and whatever comes in with it. Or ask an expert capable of writing such a definition from scratch, though both are probably too busy.

Update 1: One of those experts, Jonathan Müller, kindly reached out to say that a static cast is sufficient. This is easy to do:

template<typename T, typename ...A>
static T *make(arena *a, size count = 1, A &&...args)
{
    // ...
        new ((void *)&r[i]) T{(A &&)args...};
    // ...
}

Update 2: I later realized that because I do not care about copy or move semantics, I also don’t care about perfect forwarding. I can simply expand the parameter pack without casting or &&. I also don’t want the extra restrictions on braced initializer conversions, so better to use parentheses with new.

template<typename T, typename ...A>
static T *make(arena *a, size count = 1, A ...args)
{
    // ...
        new ((void *)&r[i]) T(args...);
    // ...
}

One small gotcha: placement new doesn’t work out of the box, and you need to provide a definition. That means including or writing one out. Fortunately it’s trivial, but the prototype must exactly match, including size_t:

void *operator new(size_t, void *p) { return p; }

Overall I feel the template is a small improvement over the macro.

Strings

Recall my basic C string type, with a macro to wrap literals:

#define countof(a)  (size)(sizeof(a) / sizeof(*(a)))
#define s8(s)       (s8){(u8 *)s, countof(s)-1}

typedef struct {
    u8  *data;
    size len;
} s8;

Since it doesn’t own the underlying buffer (region-based allocation has already solved the ownership problem), this is what C++ long-windedly calls a std::string_view. In C++ we won’t need the countof macro for strings, but it’s still generally useful. Converting it to a template is theoretically more robust (it rejects pointers), but comes with a non-zero cost:

template<typename T, size N>
size countof(T (&)[N])
{
    return N;
}

The reference — here a reference to an array — is unavoidable, so it’s one of the rare cases. The same concept applies as an s8 constructor to replace the macro:

struct s8 {
    u8  *data = 0;
    size len  = 0;

    s8() = default;

    template<size N>
    s8(const char (&s)[N]) : data{(u8 *)s}, len{N-1} {}
};

I’ve explicitly asked to keep a default zero-initialized (empty) string since it’s useful — and necessary to directly allocate strings using make, e.g. an array of strings. const is required because string literals are const in C++, but it’s immediately stripped off for the sake of simplicity. The new constructor allows:

    s8 version = "1.2.3";

Or even more usefully:

    void print(bufout *, s8);
    // ...
    print(stdout, "hello world\n");

Define operator== and it’s more useful yet:

    b32 operator==(s8 s)
    {
        return len==s.len && (!len || !memcmp(data, s.data, len));
    }

Now this works, and it’s cheap and fast even in debug builds:

    s8 key = ...;
    if (key == "HOME") {
        // ...
    }

That’s more ergonomic than the macro and comparison function. operator[] also improves ergonomics, to subscript a string without going through the data member:

    u8 &operator[](size i)
    {
        assert(i >= 0);
        assert(i < len);
        return data[i];
    }

The reference is again necessary to make subscripts assignable. Since s8span — make a string spanning two pointers — so often appears in my programs, a constructor seems appropriate, too:

    s8(u8 *beg, u8 *end)
    {
        assert(beg <= end);
        data = beg;
        len = end - beg;
    }

By the way, these assertions I’ve been using are great for catching mistakes quickly and early, and they complement fuzz testing.

I’m not sold on it, but an idea for the future: C++23’s multi-index operator[] as a slice operator:

    s8 operator[](size beg, size end)
    {
        assert(beg >= 0);
        assert(beg <= end);
        assert(end <= len);
        return {data+beg, data+end};
    }

Then:

    s8 msg = "foo bar baz";
    msg = msg[4,7];  // msg = "bar"

I could keep going with, say, iterators and such, but each will be more specialized and less useful. (I don’t care about range-based for loops.)

Downside: static initialization

The new string stuff is neat, but I hit a wall trying it out: These fancy constructors do not reliably construct at compile time, not even with a constexpr qualifier in two of the three major C++ implementations. A static lookup table that contains a string is likely constructed at run time in at least some builds. For example, this table:

static s8 keys[] = {"foo", "bar", "baz"};

Requires run-time construction in real world cases I care about, requiring C++ magic and linking runtime gunk. The constructor is therefore a strict downgrade from the macro, which works perfectly in these lookup tables. Once a non-default constructor is defined, I’ve been unable to find an escape hatch back to the original, dumb, reliable behavior.

Update: Jonathan Müller points out the reinterpret cast is forbidden in a constexpr function, so it’s not required to happen at compile time. After some thought, I’ve figured out a workaround using a union:

struct s8 {
    union {
        u8         *data = 0;
        const char *cdata;
    };
    size len = 0;

    template<size N>
    constexpr s8(const char (&s)[N]) : cdata{s}, len{N-1} {}

    // ...
};

In all three C++ implementations, in all configurations, this reliably constructs strings at compile time. The other semantics are unchanged.

Other features

Having a generic dynamic array would be handy, and more ergonomic than my dynamic array macro:

template<typename T>
struct slice {
    T   *data = 0;
    size len  = 0;
    size cap  = 0;

    slice() = default;

    template<size N>
    slice(T (&a)[N]) : data{a}, len{N}, cap{N} {}

    T &operator[](size i) { ... }
};

template<typename T>
slice<T> append(arena *, slice<T>, T);

On the other hand, hash maps are mostly solved, so I wouldn’t bother with a generic map.

Function overloads would simplify naming. For example, this in C:

void prints8(bufout *, s8);
void printi32(bufout *, i32);
void printf64(bufout *, f64);
void printvec3(bufout *, vec3);

Would hide that stuff behind the scenes in the symbol decoration:

void print(bufout *, s8);
void print(bufout *, i32);
void print(bufout *, f64);
void print(bufout *, vec3);

Same goes for a hash() function on different types.

C++ has better null pointer semantics than C. Addition or subtraction of zero with a null pointer produces a null pointer, and subtracting null pointers results in zero. This eliminates some boneheaded special case checks required in C, though not all: memcpy, for instance, arbitrarily still does not accept null pointers even in C++.

Ultimately worth it?

The static data problem is a real bummer, but perhaps it’s worth it for the other features. I still need to put it all to the test in a real, sizable project.

Protecting paths in macro expansions by extending UTF-8

2024-03-05T03:15:12Z

After a year I’ve finally come up with an elegant solution to a vexing u-config problem. The pkg-config format uses macros to generate build flags through recursive expansion. Some flags embed file system paths, but to the macro system it’s all strings. The output is also ultimately just one big string, which the receiving shell splits into fields. If a path contains spaces, or shell metacharacters, u-config must escape them so that shells treat them as part of a token. But how can u-config itself distinguish incidental spaces in paths from deliberate spaces between flags? What about other shell metacharacters in paths? My solution is to extend UTF-8 to encode metadata that survives macro expansion.

As usual, it helps to begin with a concrete example of the problem. The following is a conventional .pc file much like you’d find on your own system:

prefix=/usr
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: Example
Version: 1.0
Description: An example .pc file
Cflags: -I${includedir}
Libs: -L${libdir} -lexample

It begins by defining the library’s installation prefix from which it derives additional paths, which are finally used in the package fields that generate build flags (Cflags, Libs). If I run u-config against this configuration:

$ pkg-config --cflags --libs example
-I/usr/include -L/usr/lib -lexample

Typically prefix is populated by the library’s build system, which knows where the library is to be installed. In some situations that’s not possible, and there is no opportunity to set prefix to a meaningful path. In that case, pkg-config can automatically override it (--define-prefix) with a path relative to the .pc file, making the installation relocatable. This works quite well on Windows, where it’s the default:

$ pkg-config --cflags --libs example
-IC:/Users/me/example/include -LC:/Users/me/example/lib -lexample

This just works… so long as the path does not contain spaces. If so, it risks splitting into separate fields. The .pc format supports quoting to control how such output is escaped. Regions between quotes are escaped in the output so that they retain their spaces when field split in the shell. If a .pc file author is careful, they’d write it with quotes:

Cflags: -I"${includedir}"
Libs: -L"${libdir}" -lexample

The paths are carefully placed within quoted regions so that they come out properly:

$ pkg-config --cflags example
-IC:/Program\ Files/example/include

Almost nobody writes their .pc files this way! The convention is not to quote. My original solution was to implicitly wrap prefix in quotes on assignment, which fixes the vast majority of .pc files. That effectively looks like this in the “virtual” .pc file:

prefix="C:/Program Files/example"
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

So the important region is quoted, its spaces preserved. However, the occasional library author actively supporting Windows inevitably runs into this problem, and their system’s pkg-config implementation does not quote prefix. They soon figure out explicit quoting and apply it, which then undermines u-config’s implicit quoting. The quotes essentially cancel out:

"${includedir}" -> ""C:/Program Files/example"/include"

The quoted regions are inverted and nothing happens. Though this is a small minority, the libraries that do this and the ones you’re likely to use on Windows are correlated. I was stumped: How to support quoted and unquoted .pc files simultaneously?

Extending UTF-8

I recently had the thought: What if u-config somehow tracked which spans of a string were paths? prefix is initially a path span, and u-config would then track it through macro expansion and concatenation. Soon after that I realized it’s even simpler: Encode the spaces in a path as a value other than space, but also a value that cannot appear in the input. Recall that certain octets can never appear in UTF-8 text: the 8 values whose highest 5 bits are set. That would be the first octet of a 5-octet, or longer, code point, but those are forbidden.

11111xxx

When paths enter the macro system, special characters are encoded as one of these 8 values. They’re converted back to their original ASCII values during output encoding, escaped. It doesn’t interact with the pkg-config quoting mechanism, so there’s no quote cancellation. Both quoting cases are supported equally.
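
Recognizing one of these octets takes a single mask. A tiny sketch, with a made-up function name:

```c
#include <assert.h>
#include <stdbool.h>

// True for the 8 octets of the form 11111xxx (0xf8 through 0xff).
static bool is_reserved(unsigned char c)
{
    return (c & 0xf8) == 0xf8;
}
```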

For example, if space is mapped onto \xff (255), then:

in:  C:/Program Files/foo    -> C:/Program\xffFiles/foo
out: C:/Program\xffFiles/foo -> C:/Program\ Files/foo

Which prints the same regardless of ${includedir} or "${includedir}". Problem solved!

More metacharacters

That’s not the only complication. Outputs may deliberately include shell metacharacters, though typically these are Makefile fragments. For example, the default value of ${pc_top_builddir} is $(top_builddir), which make will later expand. While these characters are special to a shell, and certainly special to make, they must not be escaped.

What if a path contains these characters? The pkg-config quoting mechanism won’t help. It’s only concerned with spaces, and $(...) prints the same quoted or not. As before, u-config must track provenance: whether or not such characters originated from a path.

If $PKG_CONFIG_TOP_BUILD_DIR is set, then pc_top_builddir is set to this environment variable, useful when the result isn’t processed by make. In this case it’s a path, and $(...) ought to be escaped. Even without $ it must be quoted, because the parentheses would still invoke a subshell. But who would put parentheses in a path? Lo and behold!

C:/Program Files (x86)/example

Again, extending UTF-8 solves this as well: Encode $, (, and ) in paths using three of those forbidden octets, and escape them on the way out, allowing unencoded instances to go straight through.

in:  C:/Program\xffFiles\xff\xfdx86\xfe/example
out: C:/Program\ Files\ \(x86\)/example
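
The round trip can be sketched in a few lines of C. This is not u-config's actual code: the function names and the in-place interface are invented for the example, and while the values for space and the parentheses follow the listing above, the 0xfc assignment for $ is purely my assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    ENC_DOLLAR = 0xfc,  // assumed value, not given in the text above
    ENC_LPAREN = 0xfd,
    ENC_RPAREN = 0xfe,
    ENC_SPACE  = 0xff,
};

// Encode a path in place as it enters the macro system.
static void pathencode(unsigned char *s, ptrdiff_t len)
{
    for (ptrdiff_t i = 0; i < len; i++) {
        switch (s[i]) {
        case ' ': s[i] = ENC_SPACE;  break;
        case '$': s[i] = ENC_DOLLAR; break;
        case '(': s[i] = ENC_LPAREN; break;
        case ')': s[i] = ENC_RPAREN; break;
        }
    }
}

// Output encoding: encoded octets come out backslash-escaped, and
// everything else passes straight through. Returns the output length,
// or -1 if dst is too small.
static ptrdiff_t escape(unsigned char *dst, ptrdiff_t cap,
                        const unsigned char *src, ptrdiff_t len)
{
    ptrdiff_t n = 0;
    for (ptrdiff_t i = 0; i < len; i++) {
        unsigned char plain = 0;
        switch (src[i]) {
        case ENC_SPACE:  plain = ' '; break;
        case ENC_DOLLAR: plain = '$'; break;
        case ENC_LPAREN: plain = '('; break;
        case ENC_RPAREN: plain = ')'; break;
        }
        if (cap - n < 1 + !!plain) {
            return -1;
        }
        if (plain) {
            dst[n++] = '\\';
            dst[n++] = plain;
        } else {
            dst[n++] = src[i];
        }
    }
    return n;
}
```

Text that never went through pathencode, such as a literal $(top_builddir), contains none of the reserved octets and so comes out of escape unchanged.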

This makes pc_top_builddir straightforward: default to a raw string, otherwise a path-encoded environment variable (note: s8 is a string type and upsert is a hash map):

    s8 top_builddir = s8("$(top_builddir)");
    if (envvar_set) {
        top_builddir = s8pathencode(envvar, perm);
    }
    *upsert(&global, s8("pc_top_builddir"), perm) = top_builddir;

For a particularly wild case, consider deliberately using a uname -m command substitution to construct a path, i.e. the path contains the target machine architecture (i686, x86_64, etc.):

Cflags: -I${prefix}/$(uname -m)/include

(Not that I condone such nonsense. This is merely a reality of real world .pc files.) With prefix automatically set as above, this will print:

-IC:/Program\ Files\ \(x86\)/example/$(uname -m)/include

Path parentheses are escaped because they came from a path, but command substitution passes through because it came from the .pc source. Quite cool!

An improved chkstk function on Windows

2024-02-05T17:56:05Z

If you’ve spent much time developing with Mingw-w64 you’ve likely seen the symbol ___chkstk_ms, perhaps in an error message. It’s a little piece of runtime provided by GCC via libgcc which ensures enough of the stack is committed for the caller’s stack frame. The “function” uses a custom ABI and is implemented in assembly. So is the subject of this article, a slightly improved implementation soon to be included in w64devkit as libchkstk (-lchkstk).

The MSVC toolchain has an identical (x64) or similar (x86) function named __chkstk. We’ll discuss that as well, and w64devkit will include x86 and x64 implementations, useful when linking with MSVC object files. The new x86 __chkstk in particular is also better than the MSVC definition.

A note on spelling: ___chkstk_ms is spelled with three underscores, and __chkstk is spelled with two. On x86, cdecl functions are decorated with a leading underscore, and so may be rendered, e.g. in error messages, with one fewer underscore. The true name is undecorated, and the raw symbol name is identical on x86 and x64. Further complicating matters, libgcc defines a ___chkstk with three underscores. As far as I can tell, this spelling arose from confusion regarding name decoration, but nobody’s noticed for the past 28 years. libgcc’s x64 ___chkstk is obviously and badly broken, so I’m sure nobody has ever used it anyway, not even by accident thanks to the misspelling. I’ll touch on that below.

When referring to a particular instance, I will use a specific spelling. Otherwise the term “chkstk” refers to the family. If you’d like to skip ahead to the source for libchkstk: libchkstk.S.

A gradually committed stack

The header of a Windows executable lists two stack sizes: a reserve size and an initial commit size. The first is the largest the main thread stack can grow, and the second is the amount committed when the program starts. A program gradually commits stack pages as needed up to the reserve size. Binutils objdump option -p lists the sizes. Typical output for a Mingw-w64 program:

$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve      0000000000200000
SizeOfStackCommit       0000000000001000

The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB initially committed. With the Binutils linker, ld, you can set them at link time using --stack. Via gcc, use -Xlinker. For example, to reserve an 8MiB stack and commit half of it:

$ gcc -Xlinker --stack=$((8<<20)),$((4<<20)) ...

MSVC link.exe similarly has /stack.

The purpose of this mechanism is to avoid paying the commit charge for unused stack. It made sense 30 years ago when stacks were a potentially large portion of physical memory. These days it’s a rounding error and silly we’re still dealing with it. Using the above options you can choose to commit the entire stack up front, at which point a chkstk helper is no longer needed (-mno-stack-arg-probe, /Gs2147483647). This requires link-time control of the main module, which isn’t always an option, like when supplying a DLL for someone else to run.

The program grows the stack by touching the singular guard page mapped between the committed and uncommitted portions of the stack. This action triggers a page fault, and the default fault handler commits the guard page and maps a new guard page just below. In other words, the stack grows one page at a time, in order.

In most cases nothing special needs to happen. The guard page mechanism is transparent and in the background. However, if a function stack frame exceeds the page size then there’s a chance that it might leap over the guard page, crashing the program. To prevent this, compilers insert a chkstk call in the function prologue. Before local variable allocation, chkstk walks down the stack — that is, towards lower addresses — nudging the guard page with each step. (As a side effect it provides stack clash protection — the only security aspect of chkstk.) For example:

void callee(char *);

void example(void)
{
    char large[1<<20];
    callee(large);
}

Compiled with 64-bit gcc -O:

example:
    movl    $1048616, %eax
    call    ___chkstk_ms
    subq    %rax, %rsp
    leaq    32(%rsp), %rcx
    call    callee
    addq    $1048616, %rsp
    ret

I used GCC, but this is practically identical to the code generated by MSVC and Clang. Note the call to ___chkstk_ms in the function prologue before allocating the stack frame (subq). Also note that it sets eax. As a volatile register, this would normally accomplish nothing because it’s done just before a function call, but recall that ___chkstk_ms has a custom ABI. That’s the argument to chkstk. Further note that it uses rax on the return. That’s not the value returned by chkstk, but rather that x64 chkstk preserves all registers.

Well, maybe. The official documentation says that registers r10 and r11 are volatile, but that information conflicts with Microsoft’s own implementation. Just in case, I choose a conservative interpretation that all registers are preserved.

Implementing chkstk

In a high level language, chkstk might look something like so:

// NOTE: hypothetical implementation
void ___chkstk_ms(ptrdiff_t frame_size)
{
    volatile char frame[frame_size];  // NOTE: variable-length array
    for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
        frame[i] = 0;  // touch the guard page
    }
}

This wouldn’t work for a number of reasons, but if it did, volatile would serve two purposes. First, forcing the side effect to occur. The second is more subtle: The loop must happen in exactly this order, from high to low. Without volatile, loop iterations would be independent — as there are no dependencies between iterations — and so a compiler could reverse the loop direction.

The store can happen anywhere within the guard page, so it’s not necessary to align frame to the page. Simply touching at least one byte per page is enough. This is essentially the definition of libgcc ___chkstk_ms.

How many iterations occur? In the example above, the stack frame will be around 1MiB (2²⁰). With pages of 4KiB (2¹²) that’s 256 iterations. The loop happens unconditionally, meaning every function call requires 256 iterations of this loop. Wouldn’t it be better if the loop ran only as needed, i.e. the first time? MSVC x64 __chkstk skips iterations if possible, and the same goes for my new ___chkstk_ms. Much like the command line string, the low address of the current thread’s guard page is accessible through the Thread Information Block (TIB). A chkstk can cheaply query this address, only looping during initialization or so. (In contrast to Linux, a thread’s stack is fundamentally managed by the operating system.)
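To make the iteration count concrete, here is the loop from the hypothetical implementation above reduced to a counter. The name probe_iterations and the fixed 4KiB PAGE_SIZE are assumptions of this sketch, not part of libgcc or any other runtime.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096  // assumption: 4KiB pages, as in the text

// Count the pages the unconditional loop would touch for a frame of the
// given size. The loop bounds match the hypothetical chkstk above.
ptrdiff_t probe_iterations(ptrdiff_t frame_size)
{
    assert(frame_size >= 0);
    ptrdiff_t count = 0;
    for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
        count++;  // the real loop would store to frame[i] here
    }
    return count;
}
```

A 1MiB frame gives 256 and a 64KiB frame gives 16, while a frame smaller than one page gives zero, which is why small functions never need chkstk.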

Taking that into account, an improved algorithm:

1. Push registers that will be used
2. Compute the low address of the new stack frame (F)
3. Retrieve the low address of the committed stack (C)
4. Go to 7
5. Subtract the page size from C
6. Touch memory at C
7. If C > F, go to 5
8. Pop registers to restore them and return

An unconditional forward jump is a little unusual in pseudo-code, but this closely matches my assembly. The loop causes page faults, and it’s the slow, uncommon path. The common, fast path never executes 5–6. I also chose smaller instructions in order to keep the function small and reduce instruction cache pressure. My x64 implementation as of this writing:

___chkstk_ms:
    push %rax              // 1.
    push %rcx              // 1.
    neg  %rax              // 2. rax = frame low address
    add  %rsp, %rax        // 2. "
    mov  %gs:(0x10), %rcx  // 3. rcx = stack low address
    jmp  1f                // 4.
0:  sub  $0x1000, %rcx     // 5.
    test %eax, (%rcx)      // 6. page fault (very slow!)
1:  cmp  %rax, %rcx        // 7.
    ja   0b                // 7.
    pop  %rcx              // 8.
    pop  %rax              // 8.
    ret                    // 8.

I’ve labeled each instruction with its corresponding pseudo-code. Step 6 is unusual among chkstk implementations: It’s not a store, but a load, still sufficient to fault the page. That test instruction is just two bytes, and unlike other two-byte options, doesn’t write garbage onto the stack (which would be allowed) nor use an extra register. I searched through single byte instructions that can page fault, all of which involve implicit addressing through rdi or rsi, but they increment rdi or rsi, and would require another instruction to correct it.

Because of the return address and two push operations, the low stack frame address is technically too low by 24 bytes. That’s fine. If this exhausts the stack, the program is really cutting it close and the stack is too small anyway. I could be more precise — which, as we’ll soon see, is required for x86 __chkstk — but it would cost an extra instruction byte.
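The split between fast and slow path can be modelled in portable C. This is only a simulation under stated assumptions: model_thread stands in for the TIB, the fault counter stands in for real page faults, and the model updates the committed address itself, where on Windows the fault handler does that. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE 0x1000u

// Simulated thread state standing in for the TIB.
typedef struct {
    uintptr_t committed_low;  // C: lowest committed stack address
    int       faults;         // simulated page faults taken so far
} model_thread;

// Model of the improved algorithm: probe only pages not yet committed.
void model_chkstk(model_thread *t, uintptr_t sp, uintptr_t frame_size)
{
    assert(frame_size <= sp);        // this model ignores wraparound
    uintptr_t f = sp - frame_size;   // frame low address (F)
    uintptr_t c = t->committed_low;  // committed low address (C)
    while (c > f) {                  // fast path: loop body never runs
        c -= PAGE;
        t->faults++;                 // touching C would fault here
    }
    t->committed_low = c;            // done by the fault handler in reality
}

// Call a function with a 1MiB frame several times on a fresh thread that
// has a single committed page, and report the total simulated faults.
int faults_after_calls(int calls)
{
    model_thread t = {0x7ff000, 0};
    for (int i = 0; i < calls; i++) {
        model_chkstk(&t, 0x800000, 1 << 20);
    }
    return t.faults;
}
```

In this model the first call takes 255 simulated faults and every later call takes none, since C already sits at or below F.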

On x64, ___chkstk_ms and __chkstk have identical semantics, so name it __chkstk — which I’ve done in libchkstk — and it works with MSVC. The only practical difference between my chkstk and MSVC __chkstk is that mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking the optimization, is libgcc ___chkstk_ms, weighing 50 bytes, or in practice, due to an unfortunate Binutils default of padding sections, 64 bytes.

I’m no assembly guru, and I bet this can be even smaller without hurting the fast path, but this is the best I could come up with at this time.

Update: Stefan Kanthak, who has extensively explored this topic, points out that large stack frame requests might overflow my low frame address calculation at (2), effectively disabling the probe. Such requests might occur from alloca calls or variable-length arrays (VLAs) with untrusted sizes. As far as I’m concerned, such programs are already broken, but it only cost a two-byte instruction to deal with it. I have not changed this article, but the source in w64devkit has been updated.
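In C terms the failure mode looks like this: the low frame address is the stack pointer minus the request in unsigned arithmetic, so a request larger than the stack pointer wraps around to a high address and the C > F comparison never fires. The sketch below shows the wrap alongside one possible guard, clamping to zero. That guard is an assumption chosen for illustration, not the fix applied in w64devkit, which this article does not show. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// Naive computation: wraps around when size > sp.
uintptr_t frame_low_naive(uintptr_t sp, uintptr_t size)
{
    return sp - size;
}

// Guarded computation: clamp to zero so that the probe loop still runs
// and eventually faults below the reserved stack.
uintptr_t frame_low_clamped(uintptr_t sp, uintptr_t size)
{
    assert(sp != 0);
    return size > sp ? 0 : sp - size;
}

// Would the probe loop run at all? This is the C > F comparison.
int needs_probe(uintptr_t committed_low, uintptr_t frame_low)
{
    return committed_low > frame_low;
}
```

With a stack pointer of 0x800000 and a request just 4KiB short of the full address space, the naive result lands at 0x801000, above the committed stack, so no probe happens. The clamped result is zero and the probe runs.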

32-bit chkstk

On x86 ___chkstk_ms has identical semantics to x64. Mine is a copy-paste of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC was ahead of the curve on this design.

However, x86 __chkstk is bonkers. It not only commits the stack, but also allocates the stack frame. That is, it returns with a different stack pointer. The return pointer is initially inside the new stack frame, so chkstk must retrieve it and return by other means. It must also precisely compute the low frame address.

__chkstk:
    push %ecx               // 1.
    neg  %eax               // 2.
    lea  8(%esp,%eax), %eax // 2.
    mov  %fs:(0x08), %ecx   // 3.
    jmp  1f                 // 4.
0:  sub  $0x1000, %ecx      // 5.
    test %eax, (%ecx)       // 6. page fault (very slow!)
1:  cmp  %eax, %ecx         // 7.
    ja   0b                 // 7.
    pop  %ecx               // 8.
    xchg %eax, %esp         // ?. allocate frame
    jmp  *(%eax)            // 8. return

The main differences are:

eax is treated as volatile, so it is not saved
The low frame address is precisely computed with lea (2)
The frame is allocated at step (?) by swapping F and the stack pointer
Post-swap F now points at the return address, so jump through it

MSVC x86 __chkstk does not query the TIB (3), and so unconditionally runs the loop. So there’s an advantage to my implementation besides size.

libgcc x86 ___chkstk has this behavior, and so it’s also a suitable __chkstk aside from the misspelling. Strangely, libgcc x64 ___chkstk also allocates the stack frame, which is never how chkstk was supposed to work on x64. I can only conclude it’s never been used.

Optimization in practice

Does the skip-the-loop optimization matter in practice? Consider a function using a large-ish, stack-allocated array, perhaps to process environment variables or long paths, each of which max out around 64KiB.

_Bool path_contains(wchar_t *name, wchar_t *path)
{
    wchar_t var[1<<15];
    GetEnvironmentVariableW(name, var, countof(var));
    // ... search for path in var ...
}

int64_t getfilesize(char *path)
{
    wchar_t wide[1<<15];
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wide, countof(wide));
    // ... look up file size via wide path ...
}

void example(void)
{
    if (path_contains(L"PATH", L"c:\\windows\\system32")) {
        // ...
    }

    int64_t size = getfilesize("π.txt");
    // ...
}

Each call to these functions with such large local arrays is also a call to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely detectable in a benchmark. If the function touches the file system, which is likely when processing paths, then chkstk doesn’t matter at all. My starting example had a 1MiB array, or 256 chkstk iterations. That starts to become measurable, though it’s also pushing the limits. At that point you ought to be using a scratch arena.

So ultimately after writing an improved ___chkstk_ms I could only measure a tiny difference in contrived programs, and none in any real application. Though there’s still one more benefit I haven’t yet mentioned…

“The first thing we do, let’s kill all the lawyers”.

My original motivation for this project wasn’t the optimization — which I didn’t even discover until after I had started — but licensing. I hate software licenses, and the tools I’ve written for w64devkit are dedicated to the public domain. Both source and binaries (as distributed). I can do so because I don’t link runtime components, not even libgcc. Not even header files. Every byte of code in those binaries is my work or the work of my collaborators.

Every once in a while ___chkstk_ms rears its ugly head, and I have to make a decision. Do I re-work my code to avoid it? Do I take the reins of the linker and disable stack probes? I haven’t necessarily allocated a large local array: A bit of luck with function inlining can combine several smaller stack frames into one that’s just large enough to require chkstk.

Since libgcc falls under the GCC Runtime Library Exception, if it’s linked into my program through an “Eligible Compilation Process” — which I believe includes w64devkit — then the GPL-licensed functions embedded in my binary are legally siloed and the GPL doesn’t infect the rest of the program. These bits are still GPL in isolation, and if someone were to copy them out of the program then they’d be normal GPL code again. In other words, it’s not a 100% public domain binary if libgcc was linked!

(If some FSF lawyer says I’m wrong, then this is an escape hatch through which anyone can scrub the GPL from GCC runtime code, and then ignore the runtime exception entirely.)

MSVC is worse. Hardly anyone follows its license, but fortunately for most the license is practically unenforced. Its chkstk, which currently resides in a loose chkstk.obj, falls into what Microsoft calls “Distributable Code.” Its license requires “external end users to agree to terms that protect the Distributable Code.” In other words, if you compile a program with MSVC, you’re required to have a EULA including the relevant terms from the Visual Studio license. You’re not legally permitted to distribute software in the manner of w64devkit — no installer, just a portable zip distribution — if that software has been built with MSVC. At least not without special care which nobody does. (Don’t worry, I won’t tell.)

How to use libchkstk

To avoid libgcc entirely you need -nostdlib. Otherwise it’s implicitly offered to the linker, and you’d need to manually check if it picked up code from libgcc. If ld complains about a missing chkstk, use -lchkstk to get a definition. If you use -lchkstk when it’s not needed, nothing happens, so it’s safe to always include.

I also recently added a libmemory to w64devkit, providing tiny, public domain definitions of memset, memcpy, memmove, memcmp, and strlen. All compilers fabricate calls to these five functions even if you don’t call them yourself, which is how they were selected. (Not because I like them. I really don’t.) If a -nostdlib build complains about these, too, then add -lmemory.

$ gcc -nostdlib ... -lchkstk -lmemory

In MSVC the equivalent option is /nodefaultlib, after which you may see missing chkstk errors, and perhaps more. libchkstk.a is compatible with MSVC, and link.exe doesn’t care that the extension is .a rather than .lib, so supply it at link time. Same goes for libmemory.a if you need any of those, too.

$ cl ... /link /nodefaultlib libchkstk.a libmemory.a

While I despise licenses, I still take them seriously in the software I distribute. With libchkstk I have another tool to get it under control.

Big thanks to Felipe Garcia for reviewing and correcting mistakes in this article before it was published!

Two handy GDB breakpoint tricks

2024-01-28T21:56:07Z

Over the past couple months I’ve discovered a couple of handy tricks for working with GDB breakpoints. I figured these out on my own, and I’ve not seen either discussed elsewhere, so I really ought to share them.

Continuable assertions

The assert macro in typical C implementations leaves a lot to be desired, as does raise and abort, so I’ve suggested alternative definitions that behave better under debuggers:

#define assert(c)  while (!(c)) __builtin_trap()
#define assert(c)  while (!(c)) __builtin_unreachable()
#define assert(c)  while (!(c)) *(volatile int *)0 = 0

Each serves a slightly different purpose but still has the most important property: Immediately halt the program directly on the defect. None have an occasionally useful secondary property: Optionally allow the program to continue through the defect. If the program reaches the body of any of these macros then there is no reliable continuation. Even manually nudging the instruction pointer over the assertion isn’t enough. Compilers assume that the program cannot continue through the condition and generate code accordingly.

The MSVC ecosystem has a solution for this on x86: int3. The portable name is __debugbreak, a name I’ve borrowed elsewhere.

#define assert(c)  do if (!(c)) __debugbreak(); while (0)

On x86 it inserts an int3 instruction, which fires an interrupt, trapping in the attached debugger, or otherwise abnormally terminating the program. Because it’s an interrupt, it’s expected that the program might continue. It even leaves the instruction pointer on the next instruction. As of this writing, GCC has no matching intrinsic, but Clang recently added __builtin_debugtrap. In GCC you need some less portable inline assembly: asm ("int3").

However, regardless of how you get an int3 in your program, GDB does not currently understand it. The problem is that feature I mentioned: The instruction pointer does not point at the int3 but the next instruction. This confuses GDB, causing it to break in the wrong places, possibly even in the wrong scope. For example:

for (int i = 0; i < n; i++) {
    // ...
    int3_assert(...);
}

With int3 at the very end of the loop, GDB will break at the top of the next loop iteration, because that’s where the instruction pointer lands by the time GDB is involved. It’s a similar story when placed at the end of a function, leaving GDB to break in the caller. To resolve this, we need the instruction pointer to still be “inside” the breakpoint after the interrupt fires. Easy! Add a nop:

#define breakpoint()  asm ("int3; nop")

This behaves beautifully, eliminating all the problems GDB has with a plain int3. Not only is this a solid basis for a continuable assertion, it’s also useful as a fast conditional breakpoint, where conventional conditional breakpoints are far too slow.

for (int i = 0; i < 1000000000; i++) {
    if (/* rare condition */) breakpoint();
    // ...
}

Could GDB handle int3 better? Yes! Visual Studio, for instance, does not require the nop instruction. As far as I know there is no ARM equivalent compatible with GDB (or even LLDB). The closest instruction, brk #0x1, does not behave as needed.
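Putting the pieces together, a continuable assertion might be packaged as in the following sketch. It assumes GCC or Clang. The non-x86 fallback to __builtin_trap is my own assumption, and it halts without being continuable. The names check and sum_below are hypothetical, with check chosen so it doesn’t collide with the standard assert used for the precondition.

```c
#include <assert.h>

// On x86 use the int3+nop idiom. Elsewhere fall back to a plain trap,
// which stops in the debugger but cannot be continued.
#if defined(__i386__) || defined(__x86_64__)
#  define breakpoint()  __asm__ volatile ("int3; nop")
#else
#  define breakpoint()  __builtin_trap()
#endif

// Continuable assertion: breaks on the defect, then lets the program go on.
#define check(c)  do if (!(c)) breakpoint(); while (0)

// Hypothetical workload: sum 0..n-1, stopping in the debugger if the
// running total ever goes negative, which would indicate overflow.
long sum_below(int n)
{
    assert(n >= 0);  // ordinary precondition, never meant to be continued
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += i;
        check(total >= 0);
    }
    return total;
}
```

When the condition holds, the cost is one compare and an untaken branch per iteration, which is what makes the same macro usable as a fast conditional breakpoint.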

Named positions

GDB’s built-in user interface understands three classes of breakpoint positions: symbols, context-free line numbers, and absolute addresses. When you set some breakpoints and (re)start a program under GDB, each kind of breakpoint is handled differently:

Resolve each symbol, placing a breakpoint on its run-time address.
Map each file+lineno tuple to a run-time address, and place a breakpoint on that address. If the line does not exist (i.e. the file is shorter), skip it.
Place breakpoints exactly on each absolute address. If it’s not a mapped address, don’t start the program.

The first is the best case because it adapts to program changes. Modify the code, recompile, and the breakpoint generally remains where you want it.

The third is the least useful. These breakpoints rarely survive across rebuilds, and sometimes not even across reruns.

The second is in the middle between useful and useless. If you edit the source file which has the breakpoint — likely, because you placed the breakpoint there for a reason — chances are high that the line number is no longer correct. Instead it drifts, requiring manual replacement. This is tedious and GDB ought to do better. Think that’s unreasonable? The Visual Studio debugger does exactly that quite effectively through external code edits! GDB front ends tend to handle it better, especially when they’re also the code editor and so directly observe all edits.

As a workaround we can get the first kind by temporarily naming a line number. This requires editing the source, but remember, the very reason we need it is because the source in question is actively changing. How to name a line? C and C++ labels give a name to a program position:

void example(double *nums, int n, ...)
{
    for (int i = 0; i < n; i++) {
        loop:  // named position at the start of the loop
        // ...
    }
}

The name loop is local to example, but the qualified example:loop is a global name, as suitable as any other symbol. I could, say, reliably trace the progress of this loop despite changes to its position in the source.

(gdb) dprintf example:loop,"nums[%d] = %g\n",i,nums[i]

One downside is dealing with -Wunused-label (enabled by -Wall), and so I’ve considered disabling the warning in my defaults. Update: Matthew Fernandez pointed out that the unused label attribute eliminates the warning, solving my problem:

    for (int i = 0; i < n; i++) {
        loop: __attribute((unused))
        // ...
    }
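For reference, a complete function using the attribute form might look like the sketch below. The function sum is hypothetical. The empty statement after the attribute is there so the same spelling is also accepted in GNU C++, which only permits a label attribute when a semicolon immediately follows it.

```c
#include <assert.h>

// The label names the top of the loop body, so a debugger can refer to
// it as sum:loop wherever the line ends up after edits.
double sum(double *nums, int n)
{
    assert(n >= 0);
    double total = 0;
    for (int i = 0; i < n; i++) {
        loop: __attribute__((unused));  // attribute, then empty statement
        total += nums[i];
    }
    return total;
}
```

Built with -g, the position is then available to GDB as sum:loop, for example with break sum:loop or the dprintf form shown earlier.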

More often I use an assembly label, usually named b for convenience:

    for (int i = 0; i < n; i++) {
        asm ("b:");
        // ...
    }

Like int3, sometimes it’s necessary to give it a nop so that GDB has something on which to break. “Enabling” it at any time is quick:

(gdb) b b

Because it’s not .globl, it’s a local symbol, and I can place up to one per translation unit, all covered by the same GDB breakpoint item (less useful than it sounds). I haven’t actually checked, but I probably more often use dprintf with such named lines than actual breakpoints.

If you have similar tips and tricks of your own, I’d like to learn about them!

So you want custom allocator support in your C library

2023-12-17T17:52:26Z

This article was discussed on Hacker News and on reddit.

Users of mature C libraries conventionally get to choose how memory is allocated — that is, when it cannot be avoided entirely. The C standard never laid down a convention — perhaps for the better — so each library re-invents an allocator interface. Not all are created equal, and most repeat a few fundamental mistakes. Often the interface is merely a token effort, to check off that it’s “supported” without actual consideration to its use. This article describes the critical features of a practical allocator interface, and demonstrates why they’re important.

Before diving into the details, here’s the checklist for library authors:

All allocation functions accept a user-defined context pointer.
The “free” function accepts the original allocation size.
The “realloc” function accepts both old and new size.

Context pointer

The standard library allocator keeps its state in global variables. This makes for a simple interface, but comes with significant performance and complexity costs. These costs likely motivate custom allocator use in the first place, in which case slavishly duplicating the standard interface is essentially the worst possible option. Unfortunately this is typical:

#define LIB_MALLOC  malloc
#define LIB_FREE    free

I could observe the library’s allocations, and I could swap in a library functionally equivalent to the standard library allocator (jemalloc, mimalloc, etc.), but that’s about it. Better than nothing, I suppose, but only just so. Function pointer callbacks are slightly better:

typedef struct {
    void *(*malloc)(size_t);
    void  (*free)(void *);
} allocator;

session *session_new(..., allocator);

At least I could use different allocators at different times, and there are even tricks to bind a context pointer to the callback. It also works when the library is dynamically linked.

Either case barely qualifies as custom allocator support, and they’re useless when it matters most. Only a small ingredient is needed to make these interfaces useful: a context pointer.

// NOTE: Better, but still not great
typedef struct {
    void *(*malloc)(size_t, void *ctx);
    void  (*free)(void *, void *ctx);
    void   *ctx;
} allocator;

Users can choose from where the library will allocate at a given time. It liberates the allocator from global variables (or janky workarounds), and multithreading woes. The default can still hook up to the standard library through stubs that fit these interfaces.

static void *lib_malloc(size_t size, void *ctx)
{
    (void)ctx;
    return malloc(size);
}

static void lib_free(void *ptr, void *ctx)
{
    (void)ctx;
    free(ptr);
}

static allocator lib_allocator = {lib_malloc, lib_free, 0};

Note that the context pointer came after the “standard” arguments. All things being equal, “extra” arguments should go after standard ones. But don’t sweat it! In the most common calling conventions this allows stub implementations to be merely an unconditional jump. It’s as though the stubs are a kind of subtype of the original functions.

lib_malloc:
        jmp malloc
lib_free:
        jmp free

Typically the decision is completely arbitrary, and so this minutia tips the balance.

Context pointer example

So what’s the big deal? It means we can trivially plug in, say, a tiny arena allocator. To demonstrate, consider this fictional string set and partial JSON API, each of which supports a custom allocator. For simplicity — I’m attempting to balance substance and brevity — they share an allocator interface. (Note: Because subscripts and sizes should be signed, and we’re now breaking away from the standard library allocator, I will use ptrdiff_t for the rest of the examples.)

typedef struct {
    void *(*malloc)(ptrdiff_t, void *ctx);
    void  (*free)(void *, void *ctx);
    void   *ctx;
} allocator;

typedef struct set set;
set  *set_new(allocator *);
set  *set_free(set *);
bool  set_add(set *, char *);

typedef struct json json;
json     *json_load(char *buf, ptrdiff_t len, allocator *);
json     *json_free(json *);
ptrdiff_t json_length(json *);
json     *json_subscript(json *, ptrdiff_t i);
json     *json_getfield(json *, char *field);
double    json_getnumber(json *);
char     *json_getstring(json *);

set and json objects retain a copy of the allocator object for all allocations made through that object. Given nothing, they default to the standard library using the pass-through definitions above. Used together with the standard library allocator:

typedef struct {
    double sum;
    bool   ok;
} sum_result;

sum_result sum_unique(char *buf, ptrdiff_t len)
{
    sum_result r = {0};
    json *namevals = json_load(buf, len, 0);
    if (!namevals) {
        return r;  // parse error
    }

    ptrdiff_t arraylen = json_length(namevals);
    if (arraylen < 0) {
        json_free(namevals);
        return r;  // not an array
    }

    set *seen = set_new(0);
    for (ptrdiff_t i = 0; i < arraylen; i++) {
        json *element = json_subscript(namevals, i);
        json *name    = json_getfield(element, "name");
        json *value   = json_getfield(element, "value");
        if (!name || !value) {
            set_free(seen);
            json_free(namevals);
            return r;  // invalid element
        } else if (set_add(seen, json_getstring(name))) {
            r.sum += json_getnumber(value);
        }
    }

    set_free(seen);
    json_free(namevals);
    r.ok = 1;
    return r;
}

Which given as JSON input:

[
    {"name": "foo", "value":  123},
    {"name": "bar", "value":  456},
    {"name": "foo", "value": 1000}
]

Would return 579.0. Because it’s using standard library allocation, it must carefully clean up before returning. There’s also no out-of-memory handling because, in practice, programs typically do not get to observe and respond to the standard allocator running out of memory.

We can improve and simplify it with an arena allocator:

typedef struct {
    char    *beg;
    char    *end;
    jmp_buf *oom;
} arena;

void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

void arena_free(void *ptr, void *ctx)
{
    // nothing to do (yet!)
}

I’m allocating from the end rather than the beginning because it will make a later change simpler. Applying that to the function:

sum_result sum_unique(char *buf, ptrdiff_t len, arena scratch)
{
    sum_result r = {0};

    allocator a = {0};
    a.malloc = arena_malloc;
    a.free = arena_free;
    a.ctx = &scratch;

    json *namevals = json_load(buf, len, &a);
    if (!namevals) {
        return r;  // parse error
    }

    ptrdiff_t arraylen = json_length(namevals);
    if (arraylen < 0) {
        return r;  // not an array
    }

    set *seen = set_new(&a);
    for (ptrdiff_t i = 0; i < arraylen; i++) {
        json *element = json_subscript(namevals, i);
        json *name    = json_getfield(element, "name");
        json *value   = json_getfield(element, "value");
        if (!name || !value) {
            return r;  // invalid element
        } else if (set_add(seen, json_getstring(name))) {
            r.sum += json_getnumber(value);
        }
    }
    r.ok = 1;
    return r;
}

Calls to set_free and json_free are no longer necessary because the arena automatically frees these on any return, in O(1). I almost feel bad the library authors bothered to write them! It also handles allocation failure without introducing it to sum_unique. We may even deliberately restrict the memory available to this function — perhaps because the input is untrusted, and we want to quickly abort denial-of-service attacks — by giving it a small arena, relying on out-of-memory to reject pathological inputs.
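The listing above leaves out the caller: somebody has to build the arena and set up the jmp_buf that arena_malloc jumps to. Since the JSON and set libraries are fictional, the sketch below substitutes a made-up lib_repeat as the library function, and try_repeat as the caller that restricts it to a small scratch arena. Both names are hypothetical, and the arena is the one from this article with the two-argument longjmp.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    void *(*malloc)(ptrdiff_t, void *ctx);
    void  (*free)(void *, void *ctx);
    void   *ctx;
} allocator;

typedef struct {
    char    *beg;
    char    *end;
    jmp_buf *oom;
} arena;

void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    assert(size >= 0);
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available - alignment) {
        longjmp(*a->oom, 1);  // out of memory: unwind to the caller
    }
    return a->end -= size + alignment;
}

void arena_free(void *ptr, void *ctx)
{
    (void)ptr;
    (void)ctx;
}

// Hypothetical library function: returns n copies of s joined together,
// allocating only through the supplied allocator.
char *lib_repeat(const char *s, ptrdiff_t n, allocator *al)
{
    ptrdiff_t len = (ptrdiff_t)strlen(s);
    char *r = al->malloc(len*n + 1, al->ctx);
    for (ptrdiff_t i = 0; i < n; i++) {
        memcpy(r + i*len, s, (size_t)len);
    }
    r[len*n] = 0;
    return r;
}

// Caller: runs lib_repeat inside a scratch arena of cap bytes. Returns
// the result length, or -1 if the arena ran out of memory.
ptrdiff_t try_repeat(const char *s, ptrdiff_t n, ptrdiff_t cap)
{
    static char mem[1<<12];  // only holds char data in this sketch
    assert(cap >= 0 && cap <= (ptrdiff_t)sizeof(mem));
    jmp_buf oom;
    arena scratch = {mem, mem + cap, &oom};
    if (setjmp(oom)) {
        return -1;  // arrived here via longjmp from arena_malloc
    }
    allocator al = {arena_malloc, arena_free, &scratch};
    return (ptrdiff_t)strlen(lib_repeat(s, n, &al));
}
```

The library code has no error path of its own. Handing try_repeat a 64-byte arena and asking for a thousand copies simply lands back at the setjmp, and nothing needs to be freed on either path.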

There are so many possibilities unlocked by the context pointer.

Provide the original allocation size when freeing

When an application frees an object it always has the original, requested allocation size on hand. After all, it’s a necessary condition to use the object correctly. In the simplest case it’s the size of the freed object’s type: a static quantity. If it’s an array, then it’s a multiple of the tracked capacity: a dynamic quantity. In any case the size is either known statically or tracked dynamically by the application.

Yet free() does not accept a size, meaning that the allocator must track the information redundantly! That’s a needless burden on custom allocators, and with a bit of care a library can lift it.

This was noticed in C++, and WG21 added sized deallocation in C++14. It’s now the default on two of the three major implementations (and probably not the two you’d guess). In other words, object size is so readily available that it can mostly be automated away. Notable exception: operator new[] and operator delete[] with trivial destructors. With non-trivial destructors, operator new[] must track the array length for its own purposes on top of libc bookkeeping. In other words, array allocations have their size stored in at least three different places! C23 later gained a similar free_sized.

That means the “free” interface should look like this:

void lib_free(void *ptr, ptrdiff_t len, void *ctx);

And calls inside the library might look like:

lib_free(p, sizeof(*p), ctx);
lib_free(a, sizeof(*a)*len, ctx);

Now that arena_free has size information, it can free an allocation if it was the most recent:

void arena_free(void *ptr, ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    if (ptr == a->end) {
        ptrdiff_t alignment = -size & 15;
        a->end += size + alignment;
    }
}

If the library allocates short-lived objects to compute some value, then discards in reverse order, the memory can be reused. The arena doesn’t have to do anything special. The library merely needs to share its knowledge with the allocator.

Beyond arena allocation, an allocator could use the size to locate the allocation’s size class and, say, push it onto a freelist of its size class. Size-class freelists compose well with arenas, and an implementation is short and simple when the caller of “free” communicates object size.
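A minimal sketch of that idea, assuming power-of-two size classes from 16 to 2048 bytes carved out of an arena-style region. The pool type, the class layout, and all function names are my own choices for illustration. The sketch returns null on failure rather than using longjmp, and expects the region to be aligned for pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { NCLASSES = 8, MINSHIFT = 4 };  // block sizes: 16, 32, ..., 2048

typedef struct freenode { struct freenode *next; } freenode;

typedef struct {
    char     *beg;
    char     *end;
    freenode *freelist[NCLASSES];
} pool;

// Smallest class whose block holds size bytes, or -1 if none does.
int size_class(ptrdiff_t size)
{
    assert(size > 0);
    for (int c = 0; c < NCLASSES; c++) {
        if (size <= ((ptrdiff_t)1 << (c + MINSHIFT))) return c;
    }
    return -1;
}

void *pool_malloc(ptrdiff_t size, void *ctx)
{
    pool *p = ctx;
    int c = size_class(size);
    if (c < 0) return 0;
    if (p->freelist[c]) {             // reuse a previously freed block
        freenode *n = p->freelist[c];
        p->freelist[c] = n->next;
        return n;
    }
    ptrdiff_t block = (ptrdiff_t)1 << (c + MINSHIFT);
    if (block > p->end - p->beg) return 0;
    return p->end -= block;           // otherwise carve a fresh block
}

// The caller-supplied size is all that is needed to find the freelist.
void pool_free(void *ptr, ptrdiff_t size, void *ctx)
{
    pool *p = ctx;
    int c = size_class(size);
    assert(ptr && c >= 0);
    freenode *n = ptr;
    n->next = p->freelist[c];
    p->freelist[c] = n;
}

// Usage: returns 1 if a freed block is handed out again for its class.
int pool_reuse_demo(void)
{
    char *mem = malloc(1024);  // malloc memory is suitably aligned
    if (!mem) return -1;
    pool p = {mem, mem + 1024, {0}};
    void *a = pool_malloc(24, &p);  // lands in the 32-byte class
    pool_free(a, 24, &p);
    void *b = pool_malloc(32, &p);  // same class: comes off the freelist
    void *c = pool_malloc(8, &p);   // 16-byte class: fresh block
    int ok = a && a == b && c && c != a;
    free(mem);
    return ok;
}
```

There are no per-allocation headers anywhere. The only bookkeeping is one list head per class, which is possible because the size arrives with the call to "free".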

Another idea: During testing, use a debug allocator that tracks object size and validates the reported size against its own bookkeeping. This can help catch mistakes sooner.
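Such a debug allocator can be quite small. The sketch below wraps the standard allocator and stores the requested size in a header just before each object. The header layout and the names debug_malloc, debug_free, and debug_size_ok are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Header placed in front of every allocation. The union keeps the user
// pointer aligned for any object type.
typedef union {
    ptrdiff_t   size;
    max_align_t align;
} header;

void *debug_malloc(ptrdiff_t size, void *ctx)
{
    (void)ctx;
    assert(size >= 0);
    header *h = malloc(sizeof(*h) + (size_t)size);
    if (!h) return 0;
    h->size = size;  // remember what the caller asked for
    return h + 1;
}

// Does the size reported by the caller match our own bookkeeping?
int debug_size_ok(void *ptr, ptrdiff_t size)
{
    header *h = (header *)ptr - 1;
    return h->size == size;
}

void debug_free(void *ptr, ptrdiff_t size, void *ctx)
{
    (void)ctx;
    if (!ptr) return;
    assert(debug_size_ok(ptr, size));  // catch a wrong size at the free
    free((header *)ptr - 1);
}

// Usage: returns 1 if the right size is accepted and a wrong one is not.
int debug_demo(void)
{
    void *p = debug_malloc(40, 0);
    if (!p) return -1;
    int ok = debug_size_ok(p, 40) && !debug_size_ok(p, 41);
    debug_free(p, 40, 0);
    return ok;
}
```

Swapped in during testing, it turns a library that reports the wrong size into an immediate assertion failure at the offending free.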

Provide the old size when resizing an allocation

Resizing an allocation requires a lot from an allocator, and it should be avoided if possible. At the very least it cannot be done at all without knowing the original allocation size. An allocator can’t simply no-op it like it can with “free.” With the standard library interface, allocators have no choice but to redundantly track object sizes when “realloc” is required.

So, just as with “free,” the allocator should be given the old object size!

void *lib_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx);

At the very least, an allocator could implement “realloc” with “malloc” and memcpy:

void *arena_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx)
{
    assert(new > old);
    void *r = arena_malloc(new, ctx);
    return memcpy(r, ptr, old);
}

Of the three checklist items, this is the most neglected. Exercise for the reader: The last-allocated object can be resized in place, instead using memmove. If this is frequently expected, allocate from the front, adjust arena_free as needed, and extend the allocation in place as discussed in a previous addendum, without any copying.
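One possible answer to the first half of that exercise, for the arena that allocates from the end: if the object being resized is the most recent allocation, hand its space back, take a larger span that overlaps it, and slide the contents down with memmove. As a simplification for this sketch, arena_malloc returns null on failure instead of using longjmp, and realloc_demo is a hypothetical test driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Simplified: returns null when out of memory.
void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    assert(size >= 0);
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available - alignment) {
        return 0;
    }
    return a->end -= size + alignment;
}

void *arena_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx)
{
    arena *a = ctx;
    assert(ptr && new > old);
    char *saved = a->end;
    if (ptr == a->end) {
        a->end += old + (-old & 15);  // most recent: hand it back first
    }
    void *r = arena_malloc(new, ctx);
    if (!r) {
        a->end = saved;               // leave the arena as it was
        return 0;
    }
    return memmove(r, ptr, (size_t)old);  // regions may overlap
}

// Usage: grow a 16-byte object to 48 bytes. Returns the bytes consumed
// from the arena, or -1 on failure.
ptrdiff_t realloc_demo(void)
{
    char mem[256];
    arena a = {mem, mem + sizeof(mem)};
    char *s = arena_malloc(16, &a);
    if (!s) return -1;
    memcpy(s, "hello", 6);
    s = arena_realloc(s, 16, 48, &a);
    if (!s || strcmp(s, "hello")) return -1;
    return (mem + sizeof(mem)) - a.end;
}
```

The demo consumes 48 bytes in total, where the malloc-and-memcpy version would have consumed 64, because the old 16 bytes are absorbed into the new allocation.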

Real world examples

Let’s examine real world examples to see how well they fit the checklist. First up is uthash, a popular, easy-to-use, intrusive hash table:

#define uthash_malloc(sz) my_malloc(sz)
#define uthash_free(ptr, sz) my_free(ptr)

No “realloc” so it trivially checks (3). It optionally provides the old size to “free” which checks (2). However it misses (1) which is the most important, greatly limiting its usefulness.

Next is the venerable zlib. It has function pointers with these prototypes on its z_stream object.

void *zlib_malloc(void *ctx, unsigned items, unsigned size);
void  zlib_free(void *ctx, void *ptr);

The context pointer checks (1), and I can confirm from experience that it’s genuinely useful with a custom allocator. No “realloc” so it passes (3) automatically. It misses (2), but in practice this hardly matters: It allocates everything up front, and frees at the very end, meaning a no-op “free” is quite sufficient.

Finally there’s the Lua programming language with this economical, single-function interface:

void *lua_Alloc(void *ctx, void *ptr, size_t old, size_t new);

It packs all three allocator functions into one function. It includes a context pointer (1), a free size (2), and two realloc sizes (3). It’s a simple allocator’s best friend!
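To show how little a simple allocator has to do behind such an interface, here is a sketch of an arena-backed function with the same shape as lua_Alloc. It is an illustration under my own assumptions, not code from Lua: arena_alloc and alloc_demo are hypothetical names, "free" is a no-op, shrinking keeps the block in place, and growing always copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// One function for malloc, realloc, and free, in the shape of lua_Alloc.
void *arena_alloc(void *ctx, void *ptr, size_t old, size_t new)
{
    arena *a = ctx;
    assert(a->beg <= a->end);
    if (new == 0) {
        return 0;                  // "free": nothing to do for an arena
    }
    if (ptr && new <= old) {
        return ptr;                // shrinking: keep the block as-is
    }
    size_t available = (size_t)(a->end - a->beg);
    size_t padding   = -new & 15;
    if (new > available || padding > available - new) {
        return 0;                  // out of memory
    }
    a->end -= new + padding;
    if (ptr) {
        memcpy(a->end, ptr, old);  // "realloc": copy into the new block
    }
    return a->end;
}

// Usage: exercise all three roles. Returns 1 on success.
int alloc_demo(void)
{
    char mem[128];
    arena a = {mem, mem + sizeof(mem)};
    char *s = arena_alloc(&a, 0, 0, 4);           // malloc
    if (!s) return -1;
    memcpy(s, "abc", 4);
    s = arena_alloc(&a, s, 4, 32);                // realloc
    if (!s || strcmp(s, "abc")) return -2;
    if (arena_alloc(&a, s, 32, 0)) return -3;     // free
    if (arena_alloc(&a, 0, 0, 1000)) return -4;   // too large
    return 1;
}
```

Note that the old size is ignored when ptr is null, so the function never depends on what a caller passes there for a fresh allocation.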

My personal C coding style as of late 2023

2023-10-08T23:30:57Z

This article was discussed on Hacker News and on reddit.

This has been a ground-breaking year for my C skills, and paradigm shifts in my technique have provoked me to reconsider my habits and coding style. It’s been my largest personal style change in years, so I’ve decided to take a snapshot of its current state and my reasoning. These changes have produced significant productivity and organizational benefits, so while most is certainly subjective, it likely includes a few objective improvements. I’m not saying everyone should write C this way, and when I contribute code to a project I follow their local style. This is about what works well for me.

Primitive types

Starting with the fundamentals, I’ve been using short names for primitive types. The resulting clarity was more than I had expected, and it’s made my code more enjoyable to review. These names appear frequently throughout a program, so conciseness pays. Also, now that I’ve gone without, _t suffixes are more visually distracting than I had realized.

typedef uint8_t   u8;
typedef char16_t  c16;
typedef int32_t   b32;
typedef int32_t   i32;
typedef uint32_t  u32;
typedef uint64_t  u64;
typedef float     f32;
typedef double    f64;
typedef uintptr_t uptr;
typedef char      byte;
typedef ptrdiff_t size;
typedef size_t    usize;

Some people prefer an s prefix for signed types. I prefer i, plus as you’ll see, I have other designs for s. For sizes, isize would be more consistent, and wouldn’t hog the identifier, but signed sizes are the way and so I want them in a place of privilege. usize is niche, mainly for interacting with external interfaces where it might matter.

b32 is a “32-bit boolean” and communicates intent. I could use _Bool, but I’d rather stick to a natural word size and stay away from its weird semantics. To beginners it might seem like “wasting memory” by using a 32-bit boolean, but in practice that’s never the case. It’s either in a register (return value, local variable) or would be padded anyway (struct field). When it actually matters, I pack booleans into a flags variable, and a 1-byte boolean is rarely important.
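As a small illustration of packing booleans into a flags variable, here is a sketch. The widget type, the flag names, and the two helpers are all hypothetical.

```c
#include <stdint.h>

typedef int32_t  b32;
typedef uint32_t u32;

// Hypothetical flag bits for illustration
enum {
    FLAG_VISIBLE = 1 << 0,
    FLAG_DIRTY   = 1 << 1,
    FLAG_LOCKED  = 1 << 2,
};

typedef struct {
    u32 flags;  // up to 32 booleans in one word
} widget;

static void setflag(widget *w, u32 flag, b32 on)
{
    if (on) {
        w->flags |= flag;
    } else {
        w->flags &= ~flag;
    }
}

static b32 hasflag(widget *w, u32 flag)
{
    return (w->flags & flag) != 0;
}
```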

While UTF-16 might seem niche, it’s a necessary evil when dealing with Win32, so c16 (“16-bit character”) has made a frequent appearance. I could have based it on uint16_t, but putting the name char16_t in its “type hierarchy” communicates to debuggers, particularly GDB, that for display purposes these variables hold character data. Officially Win32 uses a type named wchar_t, but I like being explicit about UTF-16.

u8 is for octets, usually UTF-8 data. It’s distinct from byte, which represents raw memory and is a special aliasing type. In theory these can be distinct types with differing semantics, though I’m not aware of any implementation that does so (yet?). For now it’s about intent.

What about systems that don’t support fixed width types? That’s academic, and far too much time has been wasted worrying about it. That includes time wasted on typing out int_fast32_t and similar nonsense. Virtually no existing software would actually work correctly on such systems — I’m certain nobody’s testing it after all — so it seems nobody else cares either.

I don’t intend to use these names in isolation, such as in code snippets (outside of this article). If I did, examples would require the typedefs to give readers the complete context. That’s not worth extra explanation. Even in the most recent articles I’ve used ptrdiff_t instead of size.

Macros

Next, some “standard” macros:

#define countof(a)    (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s)   (countof(s) - 1)
#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)

While I still prefer ALL_CAPS for constants, I’ve adopted lowercase for function-like macros because it’s nicer to read. They don’t have the same namespace problems as other macro definitions: I can have a macro named new() and also variables and fields named new because they don’t look like function calls.
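A short sketch of countof and lengthof in use, including a variable that shares its name with a function-like macro. The primes table and the two functions are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t   i32;
typedef ptrdiff_t size;

#define countof(a)    (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s)   (countof(s) - 1)

static i32 primes[] = {2, 3, 5, 7, 11};

// Sum a table without hard-coding its element count.
static i32 sumprimes(void)
{
    i32 sum = 0;
    for (size i = 0; i < countof(primes); i++) {
        sum += primes[i];
    }
    return sum;
}

static size tablelen(void)
{
    // Only the use followed by "(" expands as a macro, so the
    // variable and the macro coexist.
    size countof = countof(primes);
    return countof;
}
```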

For GCC and Clang, my favorite assert macro now looks like this:

#define assert(c)  while (!(c)) __builtin_unreachable()

It has useful properties beyond the usual benefits:

It does not require separate definitions for debug and release builds. Instead it’s controlled by the presence of Undefined Behavior Sanitizer (UBSan), which is already present/absent in these circumstances. That includes fuzz testing.
libubsan provides a diagnostic printout with a file and line number.
In release builds it turns into a practical optimization hint.

To enable assertions in release builds, put UBSan in trap mode with -fsanitize-trap and then enable at least -fsanitize=unreachable. In theory this can also be done with -funreachable-traps, but as of this writing it’s been broken for the past few GCC releases.

Parameters and functions

No const. It serves no practical role in optimization, and I cannot recall an instance where it caught, or would have caught, a mistake. I held out for a while as prototype documentation, but on reflection I found that good parameter names were sufficient. Dropping const has made me noticeably more productive by reducing cognitive load and eliminating visual clutter. I now believe its inclusion in C was a costly mistake.

(One small exception: I still like it as a hint to place static tables in read-only memory closer to the code. I’ll cast away the const if needed. This is only of minor importance.)

Literal 0 for null pointers. Short and sweet. This is not new, but a style I’ve used for about 7 years now, and has appeared all over my writing since. There are some theoretical edge cases where it may cause defects, and lots of ink has been spilled on the subject, but after a couple 100K lines of code I’ve yet to see it happen.

restrict when necessary, but better to organize code so that it’s not, e.g. don’t write to “out” parameters in loops, or don’t use out parameters at all (more on that momentarily). I don’t bother with inline because I compile everything as one translation unit anyway.

typedef all structures. I used to shy away from it, but eliminating the struct keyword makes code easier to read. If it’s a recursive structure, use a forward declaration immediately above so that such fields can use the short name:

typedef struct map map;
struct map {
    map *child[4];
    // ...
};

Declare all functions static except for entry points. Again, with everything compiled as a single translation unit there’s no reason to do otherwise. It was probably a mistake for C not to default to static, though I don’t have a strong opinion on the matter. With the clutter eliminated through short types, no const, no struct, etc. functions fit comfortably on the same line as their return type. I used to break them apart so that the function name began on its own line, but that’s no longer necessary.

In my writing I sometimes omit static to simplify, and because outside the context of a complete program it’s mostly irrelevant. However, I will use it below to emphasize this style.

For a while I capitalized type names as that effectively put them in a kind of namespace apart from variables and functions, but I eventually stopped. I may try this idea in a different way in the future.

Strings

One of my most productive changes this year has been the total rejection of null terminated strings — another of those terrible mistakes — and the embrace of this basic string type:

#define s8(s) (s8){(u8 *)s, lengthof(s)}
typedef struct {
    u8  *data;
    size len;
} s8;

I’ve used a few names for it, but this is my favorite. The s is for string, and the 8 is for UTF-8 or u8. The s8 macro (sometimes just spelled S) wraps a C string literal, making an s8 string out of it. An s8 is handled like a fat pointer, passed and returned by copy. s8 also makes for a great function prefix, unlike str, whose prefixed names are all reserved. Some examples:

static s8   s8span(u8 *, u8 *);
static b32  s8equals(s8, s8);
static size s8compare(s8, s8);
static u64  s8hash(s8);
static s8   s8trim(s8);
static s8   s8clone(s8, arena *);

Then when combined with the macro:

    if (s8equals(tagname, s8("body"))) {
        // ...
    }
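A minimal sketch of how a few of the helpers declared above might be implemented. The bodies are assumptions for illustration: s8trim only strips ASCII spaces, and the s8hash constants are arbitrary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t   u8;
typedef int32_t   b32;
typedef uint64_t  u64;
typedef ptrdiff_t size;

#define countof(a)   (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s)  (countof(s) - 1)

#define s8(s) (s8){(u8 *)s, lengthof(s)}
typedef struct {
    u8  *data;
    size len;
} s8;

// Wrap the half-open range [beg, end) as a string.
static s8 s8span(u8 *beg, u8 *end)
{
    s8 r = {0};
    r.data = beg;
    r.len  = end - beg;
    return r;
}

static b32 s8equals(s8 a, s8 b)
{
    // Skip memcmp on empty strings, whose data may be null
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}

// Strip leading and trailing spaces (ASCII space only, for brevity).
static s8 s8trim(s8 s)
{
    if (!s.len) {
        return s;
    }
    u8 *beg = s.data;
    u8 *end = s.data + s.len;
    for (; beg < end && *beg == ' '; beg++) {}
    for (; beg < end && end[-1] == ' '; end--) {}
    return s8span(beg, end);
}

// FNV-style hash; the constants are arbitrary choices for this sketch.
static u64 s8hash(s8 s)
{
    u64 h = 0x100;
    for (size i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}
```

Since every function takes and returns s8 by copy, trimming never allocates. The result is just a narrower view of the same bytes.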

You might be tempted to use a flexible array member to pack the size and array together as one allocation. Tried it. Its inflexibility is totally not worth whatever benefits it might have. Consider, for instance, how you’d create such a string out of a literal, and how it would be used.

A few times I’ve thought, “This program is simple enough that I don’t need a string type for this data.” That thought is nearly always wrong. Having it available helps me think more clearly, and makes for simpler programs. (C++ got it only a few years ago with std::string_view and std::span.)

It has a natural UTF-16 counterpart, s16:

#define s16(s) (s16){u##s, lengthof(u##s)}
typedef struct {
    c16 *data;
    size len;
} s16;

I’m not entirely sold on gluing u to the literal in the macro, versus writing it out on the string literal.

More structures

Another change has been preferring structure returns instead of out parameters. It’s effectively a multiple value return, though without destructuring. A great organizational change. For example, this function returns two values, a parse result and a status:

typedef struct {
    i32 value;
    b32 ok;
} i32parsed;

static i32parsed i32parse(s8);

Worried about the “extra copying?” Have no fear, because in practice calling conventions turn this into a hidden, restrict-qualified out parameter — if it’s not inlined such that any return value overhead would be irrelevant anyway. With this return style I’m less tempted to use in-band signals like special null returns to indicate errors, which is less clear.

It’s also led to a style of defining a zero-initialized return value at the top of the function, i.e. ok is false, and then using it for all return statements. On error, it can bail out with an immediate return. The success path sets ok to true before the return.

static i32parsed i32parse(s8 s)
{
    i32parsed r = {0};
    for (size i = 0; i < s.len; i++) {
        u8 digit = s.data[i] - '0';
        // ...
        if (overflow) {
            return r;
        }
        r.value = r.value*10 + digit;
    }
    r.ok = 1;
    return r;
}
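One way the elided checks could be filled in, as a sketch. It assumes unsigned decimal input only (no sign, no whitespace) and treats empty input as an error. Those choices are mine for illustration, not part of the outline above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t   u8;
typedef int32_t   i32;
typedef int32_t   b32;
typedef ptrdiff_t size;

typedef struct {
    u8  *data;
    size len;
} s8;

typedef struct {
    i32 value;
    b32 ok;
} i32parsed;

// Parses an unsigned decimal string; sign handling is left out.
static i32parsed i32parse(s8 s)
{
    i32parsed r = {0};
    if (!s.len) {
        return r;  // empty input is an error
    }
    for (size i = 0; i < s.len; i++) {
        u8 digit = s.data[i] - '0';
        if (digit > 9) {
            return r;  // not a digit
        }
        if (r.value > (INT32_MAX - digit)/10) {
            return r;  // the next step would overflow
        }
        r.value = r.value*10 + digit;
    }
    r.ok = 1;
    return r;
}
```

On the error paths value holds whatever was accumulated so far, so callers should look at ok before touching it.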

Aside from static data, I’ve also moved away from initializers except the conventional zero initializer. (Notable exception: s8 and s16 macros.) This includes designated initializers. Instead I’ve been initializing with assignments. For example, this buffered output “constructor”:

typedef struct {
    u8 *buf;
    i32 len;
    i32 cap;
    i32 fd;
    b32 err;
} u8buf;

static u8buf newu8buf(arena *perm, i32 cap, i32 fd)
{
    u8buf r = {0};
    r.buf = new(perm, u8, cap);
    r.cap = cap;
    r.fd  = fd;
    return r;
}

I like how this reads, but it also eliminates a cognitive burden: The assignments are separated by sequence points, giving them an explicit order. It doesn’t matter here, but in other cases it does:

    example e = {
        .name = randname(&rng),
        .age  = randage(&rng),
        .seat = randseat(&rng),
    };

There are 6 possible values for e from the same seed. I like no longer thinking about these possibilities.

Odds and ends

Prefer __attribute to __attribute__. The __ suffix is excessive and unnecessary.

__attribute((malloc, warn_unused_result))

For Win32 systems programming, which typically only requires a modest number of declarations and definitions, rather than include windows.h, write the prototypes out by hand using custom types. It reduces build times, declutters namespaces, and interfaces more cleanly with the program (no more DWORD/BOOL/ULONG_PTR, but u32/b32/uptr).

#define W32(r) __declspec(dllimport) r __stdcall
W32(void)   ExitProcess(u32);
W32(uptr)   GetStdHandle(u32);
W32(byte *) VirtualAlloc(byte *, usize, u32, u32);
W32(b32)    WriteConsoleA(uptr, u8 *, u32, u32 *, void *);
W32(b32)    WriteConsoleW(uptr, c16 *, u32, u32 *, void *);

For inline assembly, treat the outer parentheses like braces, put a space before the opening parenthesis, just like if, and start each constraint line with its colon.

static u64 rdtscp(void)
{
    u32 hi, lo;
    asm volatile (
        "rdtscp"
        : "=d"(hi), "=a"(lo)
        :
        : "cx", "memory"
    );
    return (u64)hi<<32 | lo;
}

There’s surely a lot more to my style than this, but unlike the above, those details haven’t changed this year. To see most of the mentioned items in action in a small program, see wordhist.c, one of my testing grounds for hash-tries, or for a slightly larger program, asmint.c, a mini programming language implementation.

A simple, arena-backed, generic dynamic array for C

2023-10-05T23:05:57Z

Previously I presented an arena-friendly hash map applicable to any programming language where one might use arena allocation. In this third article I present a generic, arena-backed dynamic array. The details are specific to C, as the most appropriate mechanism depends on the language (e.g. templates, generics). Just as in the previous two articles, the goal is to demonstrate an idea so simple that a full implementation fits on one terminal pager screen — a concept rather than a library.

Unlike a hash map or linked list, a dynamic array — a data buffer with a size that varies during run time — is more difficult to square with arena allocation. They’re contiguous by definition, and we cannot resize objects in the middle of an arena, i.e. realloc. So while convenient, they come with trade-offs. At least until they stop growing, dynamic arrays are more appropriate for shorter-lived, temporary contexts, where you would use a scratch arena. On average they consume about twice the memory of a fixed array of the same size.

As before, I begin with a motivating example of its use. The guts of the generic dynamic array implementation are tucked away in a push() macro, which is essentially the entire interface.

typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} int32s;

int32s fibonacci(int32_t max, arena *perm)
{
    static int32_t init[] = {0, 1};
    int32s fib = {0};
    fib.data = init;
    fib.len = fib.cap = countof(init);

    for (;;) {
        int32_t a = fib.data[fib.len-2];
        int32_t b = fib.data[fib.len-1];
        if (a+b > max) {
            return fib;
        }
        *push(&fib, perm) = a + b;
    }
}

Anyone familiar with Go will quickly notice a pattern: int32s looks an awful lot like a Go slice. That was indeed my inspiration, and there is enough context that you could infer similar semantics. I will even call these “slice headers.” Initially I tried a design based on stretchy buffers, but I didn’t like the macros nor the ergonomics.

I wouldn’t write a fibonacci this way in practice, but it’s useful for highlighting certain features. Of particular note:

The dynamic array initially wraps a static array, yet I can append to it as though it were a dynamic allocation. If I don’t append at all, it still works. (Though of course the caller then shouldn’t modify the elements.)
push() operates on any object which is slice-shaped. That is, it has a pointer field named data, a ptrdiff_t length field named len, and a ptrdiff_t capacity field named cap, all in that order.
push() evaluates to a pointer to the newly-pushed element. In my example I immediately dereference and assign a value.
An element is zero-initialized the first time it’s pushed. I say “first time” because you can truncate an array by reducing len, and “pushing” afterward will simply reveal the original elements.
The name int32s is intended to evoke plurality. I’ll use this convention again in a moment.
The arena passed to push() is only used if the array needs to grow. The new backing array will be allocated out of this arena regardless of the original backing array.
Resizes always change the backing array address, and the old array remains valid. This is also just like slices in Go.
Despite the name perm, I expect it points to the caller’s scratch arena. It’s “permanent” only relative to the fibonacci call. Otherwise I might build the array in a scratch arena, then create a final copy in a permanent arena.

For a slightly more realistic example: rendering triangles. Suppose we need data in array format for OpenGL, but we don’t know the number of vertices ahead of time. A dynamic array is convenient, especially if we discard the array as soon as OpenGL is done with it. We could build up entire scenes like this for each display frame.

typedef struct {
     GLfloat x, y, z;
} GLvert;

typedef struct {
    GLvert   *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} GLverts;

void renderobj(char *buf, ptrdiff_t len, arena scratch)
{
    GLverts vs = {0};
    objparser parser = newobjparser(buf, len);
    for (...) {
        *push(&vs, &scratch) = nextvert(&parser);
    }
    glVertexPointer(3, GL_FLOAT, 0, vs.data);
    glDrawArrays(GL_TRIANGLES, 0, vs.len);
}

As before, GLverts is slice-shaped. This time it’s zero-initialized, which is a valid empty dynamic array. As with maps, that means any object with such a field comes with a ready-to-use empty dynamic array. Putting it together, here’s an example that gradually appends vertices to named dynamic arrays, randomly accessed by string name:

typedef struct map map;
struct map {
    map    *child[4];
    str     name;
    GLverts verts;
};

GLverts *upsert(map **, str, arena *);  // from the last article

map *example(..., arena *perm)
{
    map *m = 0;
    for (...) {
        str name = ...;
        GLvert v = ...;
        GLverts *vs = upsert(&m, name, perm);
        *push(vs, perm) = v;
    }
    return m;
}

That’s what Go would call map[str][]GLvert, but allocated entirely out of an arena. Ever thought C could do this so simply and conveniently? The memory allocator (~15 lines), map (~30 lines), dynamic array (~30 lines), constructors (0 lines), and destructors (0 lines) that power this total to ~75 lines of zero-dependency code!

Implementation details

I despise macro abuse, and programs substantially implemented in macros are annoying. They’re difficult to understand and debug. A good dynamic array implementation will require a macro, and one of my goals was to keep it as simple and minimal as possible. The macro’s job is to:

Check the capacity and maybe grow the array via function call.
Smuggle type information (i.e. sizeof) to that function.
Compute a pointer of the proper type to the new element.

Here’s what I came up with:

#define push(s, arena) \
    ((s)->len >= (s)->cap \
        ? grow(s, sizeof(*(s)->data), arena), \
          (s)->data + (s)->len++ \
        : (s)->data + (s)->len++)

The macro will be used as an expression, so it cannot use statements like if. The condition is therefore a ternary operator. If it’s full, it calls the supporting grow function. In either case, it computes the result from data. In particular, note that the grow branch uses a comma operator to sequence growth before pointer derivation, as grow will change the value of data as a side effect.

To be generic, the grow function uses memcpy-based type punning:

static void grow(void *slice, ptrdiff_t size, arena *a)
{
    struct {
        void     *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } replica;
    memcpy(&replica, slice, sizeof(replica));

    replica.cap = replica.cap ? replica.cap : 1;
    ptrdiff_t align = 16;
    void *data = alloc(a, 2*size, align, replica.cap);
    replica.cap *= 2;
    if (replica.len) {
        memcpy(data, replica.data, size*replica.len);
    }
    replica.data = data;

    memcpy(slice, &replica, sizeof(replica));
}

The slice header is copied over a local replica, avoiding conflicts with strict aliasing. This is the archetype slice header. It still requires that different pointers have identical memory representation. That’s virtually always true, and certainly true anywhere I’d use an arena.

If the capacity was zero, it behaves as though it was one, and so, through doubling, zero-capacity arrays become capacity-2 arrays on the first push. It’s better to let alloc — whose definition, you may recall, included an overflow check — handle size overflow so that it can invoke the out of memory policy, so instead of doubling cap, which would first require an overflow check, it doubles the object size. This is a small constant (i.e. from sizeof), so doubling it is always safe.

Copying over old data includes a special check for zero-length inputs, because, quite frustratingly, memcpy does not accept null even when the length is zero. I check for zero length instead of null so that it’s more sensitive to defects. If the pointer is null with a non-zero length, it will trip Undefined Behavior Sanitizer, or at least crash the program, rather than silently skip copying.

Finally the updated replica is copied over the original slice header, updating it with the new data pointer and capacity. The original backing array is untouched but is no longer referenced through this slice header. Old slice headers will continue to function with the old backing array, such as when the arena is reset to a point where the dynamic array was smaller.

    int32s vals = {0};
    *push(&vals, &scratch) = 1;  // resize: cap=2
    *push(&vals, &scratch) = 2;
    *push(&vals, &scratch) = 3;  // resize: cap=4
    {
        arena tmp = scratch;  // scoped arena
        int32s extended = vals;
        *push(&extended, &tmp) = 4;
        *push(&extended, &tmp) = 5;  // resize: cap=8
        example(extended);
    }
    // vals still works, cap=4, extension freed

In practice, a dynamic array comes with old backing arrays whose total size adds up to just shy of the current array capacity. For example, if the current capacity is 16, old arrays are size 2+4+8 = 14.
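Putting the pieces together, here is a self-contained sketch that compiles on its own. The arena struct and alloc are assumptions modeled on the allocator from earlier in this series, with abort() standing in for the out-of-memory policy. grow, push, and fibonacci follow the listings above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define countof(a) (ptrdiff_t)(sizeof(a) / sizeof(*(a)))

// Assumed arena, modeled on the allocator from earlier in this series.
typedef struct {
    char *beg;
    char *end;
} arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();  // out-of-memory policy assumed for this sketch
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

static void grow(void *slice, ptrdiff_t size, arena *a)
{
    struct {
        void     *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } replica;
    memcpy(&replica, slice, sizeof(replica));

    replica.cap = replica.cap ? replica.cap : 1;
    void *data = alloc(a, 2*size, 16, replica.cap);  // doubles via size
    replica.cap *= 2;
    if (replica.len) {
        memcpy(data, replica.data, size*replica.len);
    }
    replica.data = data;

    memcpy(slice, &replica, sizeof(replica));
}

#define push(s, arena) \
    ((s)->len >= (s)->cap \
        ? grow(s, sizeof(*(s)->data), arena), \
          (s)->data + (s)->len++ \
        : (s)->data + (s)->len++)

typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} int32s;

static int32s fibonacci(int32_t max, arena *perm)
{
    static int32_t init[] = {0, 1};
    int32s fib = {0};
    fib.data = init;
    fib.len = fib.cap = countof(init);
    for (;;) {
        int32_t a = fib.data[fib.len-2];
        int32_t b = fib.data[fib.len-1];
        if (a+b > max) {
            return fib;
        }
        *push(&fib, perm) = a + b;
    }
}
```

Calling fibonacci(100, &scratch) on an arena of a few kilobytes grows the array from the static capacity of 2 to 4, 8, and then 16, leaving those smaller backing arrays behind in the arena.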

If you’re worried about misuse, such as slice header fields being in the wrong order, a couple of assertions can quickly catch such mistakes at run time, typically under the lightest of testing. In fact, I planned for this by using the more-sensitive len>=cap instead of just len==cap, so that it would direct execution towards assertions in grow:

    assert(replica.len >= 0);
    assert(replica.cap >= 0);
    assert(replica.len <= replica.cap);

This also demonstrates another benefit of signed sizes: Exactly half the range is invalid and so defects tend to quickly trip these assertions.

Alignment

Alignment is unfortunately fixed, and I picked a “safe” value of 16. In my new() macro I used _Alignof to pass type information to alloc. Due to an oversight, unlike sizeof, _Alignof cannot be applied to expressions, and so it cannot be used in dynamic arrays. GCC and Clang support _Alignof on expressions just like sizeof, as it’s such an obvious idea, but Microsoft chose to strictly follow the oversight in the standard. To support MSVC, I’ve deliberately limited the capabilities of push. If that doesn’t matter, fixing it is easy:

--- a/example.c
+++ b/example.c
@@ -2,3 +2,3 @@
     ((s)->len >= (s)->cap \
-        ? grow(s, sizeof(*(s)->data), arena), \
+        ? grow(s, sizeof(*(s)->data), _Alignof(*(s)->data), arena), \
           (s)->data + (s)->len++ \
@@ -6,3 +6,3 @@
 
-static void grow(void *slice, ptrdiff_t size, arena *a)
+static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
 {
@@ -16,3 +16,2 @@
     replica.cap = replica.cap ? replica.cap : 1;
-    ptrdiff_t align = 16;
     void *data = alloc(a, 2*size, align, replica.cap);

Though while you’re at it, if you’re already using extensions you might want to switch push to a statement expression so that the slice header s does not get evaluated more than once, i.e. so that upsert() in my example above could be used inside the push() expression.

#define push(s, a) ({ \
    typeof(s) s_ = (s); \
    typeof(a) a_ = (a); \
    if (s_->len >= s_->cap) { \
        grow(s_, sizeof(*s_->data), _Alignof(*s_->data), a_); \
    } \
    s_->data + s_->len++; \
})

So far this approach to dynamic arrays has been useful on a number of occasions, and I’m quite happy with the results. As with arena-friendly hash maps, I’ve no doubt they’ll become a staple in my C programs.

Addendum: extend the last allocation

Dennis Schön suggests checking whether the array ends at the next arena allocation and, if so, extending the array into the arena in place. grow() already has the necessary information on hand, so it needs only the additional check:

static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
    struct {
        char     *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } replica;
    memcpy(&replica, slice, sizeof(replica));

    if (!replica.data) {
        replica.cap = 1;
        replica.data = alloc(a, 2*size, align, replica.cap);
    } else if (a->beg == replica.data + size*replica.cap) {
        alloc(a, size, 1, replica.cap);
    } else {
        void *data = alloc(a, 2*size, align, replica.cap);
        memcpy(data, replica.data, size*replica.len);
        replica.data = data;
    }

    replica.cap *= 2;
    memcpy(slice, &replica, sizeof(replica));
}

Because that’s yet another check for null, I’ve split it out into an independent third case:

If the data pointer is null, make an initial allocation.
If the array ends at the next arena allocation, extend it.
Otherwise allocate a fresh array and copy.

Not quite as simple, but it improves the most common case.
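To see the in-place case at work, here is a self-contained sketch built around the grow above. The arena and alloc are assumptions modeled on the earlier article, and push passes a fixed alignment of 16 so the macro stays within standard C.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Assumed arena and allocator, modeled on the earlier article.
typedef struct {
    char *beg;
    char *end;
} arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();  // out-of-memory policy assumed for this sketch
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
    struct {
        char     *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } replica;
    memcpy(&replica, slice, sizeof(replica));

    if (!replica.data) {
        replica.cap = 1;
        replica.data = alloc(a, 2*size, align, replica.cap);
    } else if (a->beg == replica.data + size*replica.cap) {
        alloc(a, size, 1, replica.cap);  // extend in place, no copy
    } else {
        void *data = alloc(a, 2*size, align, replica.cap);
        memcpy(data, replica.data, size*replica.len);
        replica.data = data;
    }

    replica.cap *= 2;
    memcpy(slice, &replica, sizeof(replica));
}

// Fixed alignment of 16 keeps this macro within standard C.
#define push(s, arena) \
    ((s)->len >= (s)->cap \
        ? grow(s, sizeof(*(s)->data), 16, arena), \
          (s)->data + (s)->len++ \
        : (s)->data + (s)->len++)

typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} int32s;
```

As long as nothing else is allocated from the arena between pushes, the data pointer stays put while the capacity doubles. The first unrelated allocation after the array forces the next growth back onto the copy path.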

An easy-to-implement, arena-friendly hash map

2023-09-30T23:18:40Z

My last article had tips for arena allocation. This next article demonstrates a technique for building bespoke hash maps that compose nicely with arena allocation. In addition, they’re fast, simple, and automatically scale to any problem that could reasonably be solved with an in-memory hash map. To avoid resizing, both to better support arenas and to simplify implementation, they have slightly above average memory requirements. The design, which we’re calling a hash-trie, is the result of fruitful collaboration with NRK, whose sibling article includes benchmarks. It’s my new favorite data structure, and has proven incredibly useful. With a couple well-placed acquire/release atomics, we can even turn it into a lock-free concurrent hash map.

I’ve written before about MSI hash tables, a simple, very fast map that can be quickly implemented from scratch as needed, tailored to the problem at hand. The trade-off is that one must know the upper bound a priori in order to size the base array. Scaling up requires resizing the array, an impedance mismatch with arena allocation. Search trees scale better, as there’s no underlying array, but tree balancing tends to be finicky and complex, unsuitable to rapid, on-demand implementation. We want the ease of an MSI hash table with the scaling of a tree.

I’ll motivate the discussion with example usage. Suppose we have an array of pointer+length strings, as defined last time:

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

And we need a function that removes duplicates in place, but (for the moment) we’re not worried about preserving order. This could be done naively in quadratic time. Smarter is to sort, then look for runs. Instead, I’ve used a hash map to track seen strings. It maps str to bool, and it is represented as type strmap and one insert+lookup function, upsert.

// Insert/get bool value for given str key.
bool *upsert(strmap **, str key, arena *);

ptrdiff_t unique(str *strings, ptrdiff_t len, arena scratch)
{
    ptrdiff_t count = 0;
    strmap *seen = 0;
    while (count < len) {
        bool *b = upsert(&seen, strings[count], &scratch);
        if (*b) {
            // previously seen (discard)
            strings[count] = strings[--len];
        } else {
            // newly-seen (keep)
            count++;
            *b = 1;
        }
    }
    return count;
}

In particular, note:

A null pointer is an empty hash map and initialization is trivial. As discussed in the last article, one of my arena allocation principles is default zero-initialization. Put together, that means any data structure containing a map comes with a ready-to-use, empty map.
The map is allocated out of the scratch arena so it’s automatically freed upon any return. It’s as care-free as garbage collection.
The map directly uses strings in the input array as keys, without making copies nor worrying about ownership. Arenas own objects, not references. If I wanted to carve out some fixed keys ahead of time, I could even insert static strings.
upsert returns a pointer to a value. That is, a pointer into the map. This is not strictly required, but usually makes for a simple interface. When an entry is new, this value will be false (zero-initialized).

So, what is this wonderful data structure? Here’s the basic shape:

typedef struct hashmap hashmap;
struct hashmap {
    hashmap *child[4];
    keytype  key;
    valtype  value;
};

The child and key fields are essential to the map. Adding a child to any data structure turns it into a hash map over whatever field you choose as the key. In other words, a hash-trie can serve as an intrusive hash map. In several programs I’ve combined intrusive lists and hash maps to create an insert-ordered hash map. Going the other direction, omitting value turns it into a hash set. (Which is what unique really needs!)

As you probably guessed, this hash-trie is a 4-ary tree. It can easily be 2-ary (leaner but slower) or 8-ary (bigger and usually no faster), but 4-ary strikes a good balance, if a bit bulky. In the example above, keytype would be str and valtype would be bool. The most general form of upsert looks like this:

valtype *upsert(hashmap **m, keytype key, arena *perm)
{
    for (uint64_t h = hash(key); *m; h <<= 2) {
        if (equals(key, (*m)->key)) {
            return &(*m)->value;
        }
        m = &(*m)->child[h>>62];
    }
    if (!perm) {
        return 0;
    }
    *m = new(perm, hashmap);
    (*m)->key = key;
    return &(*m)->value;
}

This will take some unpacking. The first argument is a pointer to a pointer. That’s the destination for any newly-allocated element. As it travels down the tree, this points into the parent’s child array. If it points to null, then it’s an empty tree which, by definition, does not contain the key.

We need two “methods” for keys: hash and equals. The hash function should return a uniformly distributed integer. As is usually the case, less uniform fast hashes generally do better than highly-uniform slow hashes. For hash maps under ~100K elements a 32-bit hash is fine, but larger maps should use a 64-bit hash state and result. Hash collisions revert to linear, linked list performance and, per the birthday paradox, that will happen often with 32-bit hashes on large hash maps.

If you’re worried about pathological inputs, add a seed parameter to upsert and hash. Or maybe even use the address m as a seed. The specifics depend on your security model. It’s not an issue for most hash maps, so I don’t demonstrate it here.
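A minimal sketch of what a seeded hash might look like, assuming an FNV-style loop where the seed simply replaces the fixed initial state. How the seed is chosen (random at startup, per map, derived from an address) is left open. This only perturbs the tree layout and is not a keyed cryptographic hash.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// FNV-style loop with a caller-supplied initial state. The multiplier
// is an arbitrary odd constant chosen for this sketch.
static uint64_t hash(str s, uint64_t seed)
{
    uint64_t h = seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}
```

upsert would then take the same seed parameter and pass it through to hash, so that each map, or each run of the program, lays out its tree differently.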

The top two bits of the hash are used to select a branch. These tend to be higher quality for multiplicative hash functions. At each level two bits are shifted out. This is what gives it its name: a trie of the hash bits. Though it’s un-trie-like in the way it deposits elements at the first empty spot. To make it 2-ary or 8-ary, use 1 or 3 bits at a time.

I initially tried a Multiplicative Congruential Generator (MCG) to select the next branch at each trie level, instead of bit shifting, but NRK noticed it was consistently slower than shifting.

While “delete” could be handled using gravestones, many deletes would not work well. After all, the underlying allocator is an arena. A combination of uniformly distributed branching and no deletion means that rebalancing is unnecessary. This is what grants it its simplicity!

If no arena is provided, it reverts to a lookup and returns null when the key is not found. It allows one function to flexibly serve both modes. In unique, pure lookups are unneeded, so this condition could be skipped in its strmap.

Sometimes it’s useful to return the entire hashmap object itself rather than an internal pointer, particularly when it’s intrusive. Use whichever works best for the situation. Regardless, exploit zero-initialization to detect newly-allocated elements when possible.

In some cases we may deep copy the key in its arena before inserting it into the map. The provided key may be a temporary (e.g. sprintf) which the map outlives, and the caller doesn’t want to allocate a longer-lived key unless it’s needed. It’s all part of tailoring the map to the problem, which we can do because it’s so short and simple!
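A sketch of the deep-copy variation, assuming the arena, str, and string-set definitions used elsewhere in this article. The names clone and intern are hypothetical, and alloc here uses the simple abort policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg; char *end; } arena;
typedef struct { uint8_t *data; ptrdiff_t len; } str;

typedef struct strset strset;
struct strset {
    strset *child[4];
    str     key;
};

void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    assert(size > 0 && count >= 0);
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)

uint64_t hash(str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}

bool equals(str a, str b)
{
    return a.len==b.len && !memcmp(a.data, b.data, a.len);
}

// Hypothetical helper: copy the bytes of s into the arena.
str clone(str s, arena *perm)
{
    str r = s;
    r.data = new(perm, uint8_t, s.len);
    if (s.len) {
        memcpy(r.data, s.data, s.len);
    }
    return r;
}

// Like ismember, but the key is copied only when actually inserted,
// so the caller may reuse or discard its buffer afterwards.
bool intern(strset **m, str key, arena *perm)
{
    for (uint64_t h = hash(key); *m; h <<= 2) {
        if (equals(key, (*m)->key)) {
            return 1;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(perm, strset, 1);
    (*m)->key = clone(key, perm);
    return 0;
}
```

Lookups that hit an existing key allocate nothing, so the temporary key costs only its own storage.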

Fleshing it out

Putting it all together, unique could look like the following, with strmap/upsert renamed to strset/ismember:

uint64_t hash(str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}

bool equals(str a, str b)
{
    return a.len==b.len && !memcmp(a.data, b.data, a.len);
}

typedef struct strset strset;
struct strset {
    strset *child[4];
    str     key;
};

bool ismember(strset **m, str key, arena *perm)
{
    for (uint64_t h = hash(key); *m; h <<= 2) {
        if (equals(key, (*m)->key)) {
            return 1;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(perm, strset);
    (*m)->key = key;
    return 0;
}

ptrdiff_t unique(str *strings, ptrdiff_t len, arena scratch)
{
    ptrdiff_t count = 0;
    for (strset *seen = 0; count < len;) {
        if (ismember(&seen, strings[count], &scratch)) {
            strings[count] = strings[--len];
        } else {
            count++;
        }
    }
    return count;
}

The FNV hash multiplier is 19 ones, my favorite prime. I don’t bother with an xorshift finalizer because the bits are used most-significant first. Exercise for the reader: Support retaining the original input order using an intrusive linked list on strset.

Relative pointers?

As mentioned, four pointers per entry — 32 bytes on 64-bit hosts — makes these hash-tries a bit heavier than average. It’s not an issue for smaller hash maps, but has practical consequences for huge hash maps.

In an attempt to address this, I experimented with relative pointers (example: markov.c). That is, instead of pointers I use signed integers whose value indicates an offset relative to itself. Because relative pointers can only refer to nearby memory, a custom allocator is imperative, and arenas fit the bill perfectly. Range can be extended by exploiting memory alignment. In particular, 32-bit relative pointers can reference up to 8GiB in either direction. Zero is reserved to represent a null pointer, and relative pointers cannot refer to themselves.

As a bonus, data structures built out of relative pointers are position independent. A collection of them — perhaps even a whole arena — can be dumped out to, say, a file, loaded back at a different position, then continue to operate as-is. Very cool stuff.

Using 32-bit relative pointers on 64-bit hosts cuts the hash-trie overhead in half, to 16 bytes. With an arena no larger than 8GiB, such pointers are guaranteed to work. No object is ever too far away. It’s a compounding effect, too. Smaller map nodes means a larger number of them are in reach of a relative pointer. Also very cool.
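To make the encoding concrete, here’s a rough sketch of a 32-bit relative pointer. It assumes every target sits a multiple of 4 bytes away from the pointer itself, which is where the 8GiB reach comes from: 2^31 offsets times 4 bytes. The names relptr, relset, and relget are hypothetical, and this is not the code from markov.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical 32-bit relative pointer: a signed offset from its own
// address, counted in 4-byte units. Zero means null.
typedef int32_t relptr;

// Point r at target. The target must lie a nonzero multiple of 4 bytes
// from r and within range. A null target stores zero.
void relset(relptr *r, void *target)
{
    if (!target) {
        *r = 0;
        return;
    }
    ptrdiff_t delta = (char *)target - (char *)r;
    assert(delta != 0 && delta % 4 == 0);
    assert(delta / 4 >= INT32_MIN && delta / 4 <= INT32_MAX);
    *r = (relptr)(delta / 4);
}

// Recover the absolute pointer, or null if unset.
void *relget(relptr *r)
{
    return *r ? (char *)r + (ptrdiff_t)*r * 4 : 0;
}
```

Since only offsets are stored, copying the whole region elsewhere leaves every relptr valid, which is the position independence described above.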

However, as far as I know, no generally available programming language implementation supports this concept well enough to put into practice. You could implement relative pointers with language extension facilities, such as C++ operator overloads, but no tools will understand them — a major bummer. You can no longer use a debugger to examine such structures, and it’s just not worth that cost. If only arena allocation was more popular…

As a concurrent hash map

For the finale, let’s convert upsert into a concurrent, lock-free hash map. That is, multiple threads can call upsert concurrently on the same map. Each must still have its own arena, probably per-thread arenas, and so no implicit locking for allocation.

The structure itself requires no changes! Instead we need two atomic operations: atomic load (acquire), and atomic compare-and-exchange (acquire/release). They operate only on child array elements and the tree root. To illustrate I will use GCC atomics, also supported by Clang.

valtype *upsert(map **m, keytype key, arena *perm)
{
    for (uint64_t h = hash(key);; h <<= 2) {
        map *n = __atomic_load_n(m, __ATOMIC_ACQUIRE);
        if (!n) {
            if (!perm) {
                return 0;
            }
            arena rollback = *perm;
            map *new = new(perm, map, 1);
            new->key = key;
            int pass = __ATOMIC_RELEASE;
            int fail = __ATOMIC_ACQUIRE;
            if (__atomic_compare_exchange_n(m, &n, new, 0, pass, fail)) {
                return &new->value;
            }
            *perm = rollback;
        }
        if (equals(n->key, key)) {
            return &n->value;
        }
        m = n->child + (h>>62);
    }
}

First an atomic load retrieves the current node. If there is no such node, then attempt to insert one using atomic compare-and-exchange. The ABA problem is not an issue thanks again to lack of deletion: Once set, a pointer never changes. Before allocating a node, take a snapshot of the arena so that the allocation can be reverted on failure. If another thread got there first, continue tumbling down the tree as though a null was never observed.

On compare-and-swap failure, it turns into an acquire load, just as it began. On success, it’s a release store, synchronizing with acquire loads on other threads.

The key field does not require atomics because it’s synchronized by the compare-and-swap. That is, the assignment will happen before the node is inserted, and keys do not change after insertion. The same goes for any zeroing done by the arena.

Loads and stores through the returned pointer are the caller’s responsibility. These likely require further synchronization. If valtype is a shared counter then an atomic increment is sufficient. In other cases, upsert should probably be modified to accept an initial value to be assigned alongside the key so that the entire key/value pair is inserted atomically. Alternatively, break it into two steps. The details depend on the needs of the program.

On small trees there will be much contention near the root of the tree during inserts. Fortunately, a contentious tree will not stay small for long! The hash function will spread threads around a large tree, generally keeping them off each other’s toes.

A complete demo you can try yourself: concurrent-hash-trie.c. It returns a value pointer like above, and store/load is synchronized by the thread join. Each thread is given a per-thread subarena allocated out of the main arena, and the final tree is built from these subarenas.

For a practical example: a multithreaded rainbow table to find hash function collisions. Threads are synchronized solely through atomics in the shared hash-trie.

A complete fast, concurrent, lock-free hash map in under 30 lines of C sounds like a sweet deal to me!

Arena allocator tips and tricks

2023-09-27T03:58:59Z

This article was discussed on Hacker News.

Over the past year I’ve refined my approach to arena allocation. With practice, it’s effective, simple, and fast; typically as easy to use as garbage collection but without the costs. Depending on need, an allocator can weigh just 7–25 lines of code — perfect when lacking a runtime. With the core details of my own technique settled, now is a good time to document and share lessons learned. This is certainly not the only way to approach arena allocation, but these are practices I’ve worked out to simplify programs and reduce mistakes.

An arena is a memory buffer and an offset into that buffer, initially zero. To allocate an object, grab a pointer at the offset, advance the offset by the size of the object, and return the pointer. There’s a little more to it, such as ensuring alignment and availability. We’ll get to that. Objects are not freed individually. Instead, groups of allocations are freed at once by restoring the offset to an earlier value. Without individual lifetimes, you don’t need to write destructors, nor do your programs need to walk data structures at run time to take them apart. You also no longer need to worry about memory leaks.

A minority of programs inherently require general purpose allocation, at least in part, that linear allocation cannot fulfill. This includes, for example, most programming language runtimes. If you like arenas, avoid accidentally creating such a situation through an over-flexible API that allows callers to assume you have general purpose allocation underneath.

To get warmed up, here’s my style of arena allocation in action that shows off multiple features:

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

typedef struct strlist strlist;
struct strlist {
    strlist *next;
    str      item;
};

typedef struct {
    str head;
    str tail;
} strpair;

// Defined elsewhere
void    towidechar(wchar_t *, ptrdiff_t, str);
str     loadfile(wchar_t *, arena *);
strpair cut(str, uint8_t);

strlist *getlines(str path, arena *perm, arena scratch)
{
    int max_path = 1<<15;
    wchar_t *wpath = new(&scratch, wchar_t, max_path);
    towidechar(wpath, max_path, path);

    strpair pair = {0};
    pair.tail = loadfile(wpath, perm);

    strlist *head = 0;
    strlist **tail = &head;
    while (pair.tail.len) {
        pair = cut(pair.tail, '\n');
        *tail = new(perm, strlist, 1);
        (*tail)->item = pair.head;
        tail = &(*tail)->next;
    }
    return head;
}

Take note of these details, each to be later discussed in detail:

getlines takes two arenas, “permanent” and “scratch”. The former is for objects that will be returned to the caller. The latter is for temporary objects whose lifetime ends when the function returns. They have stack lifetimes just like local variables.
Objects are not explicitly freed. Instead, all allocations from a scratch arena are implicitly freed upon return. This would include error return paths automatically.
The scratch arena is passed by copy — i.e. a copy of the “header” not the memory region itself. Allocating only changes the local copy, and so cannot survive the return. The semantics are obvious to callers, so they’re less likely to get mixed up.
While wpath could be an automatic local variable, it’s relatively large for the stack, so it’s allocated out of the scratch arena. A scratch arena safely permits large, dynamic allocations that would never be safe on the stack. In other words, a sane alloca! Same for variable-length arrays (VLAs). A scratch arena means you’ll never be tempted to use either of these terrible ideas.
The second parameter to new is a type, so it’s obviously a macro. As you will see momentarily, this is not some complex macro magic, just a convenience one-liner. There is no implicit cast, and you will get a compiler diagnostic if the type is incorrect.
Despite all the allocation, there is not a single sizeof operator nor size computation. That’s because size computations are a major source of defects. That job is handled by specialized code.
Allocation failures are not communicated by a null return. Lifting this burden greatly simplifies programs. Instead such errors are handled non-locally by the arena.
All allocations are zero-initialized by default. This makes for simpler, less error-prone programs. When that’s too expensive, this can become an opt-out without changing the default.
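getlines leaves cut to be defined elsewhere. One plausible definition, assuming it splits at the first delimiter and returns an empty tail when there is none, might look like this. The behavior is inferred from how getlines uses it, so treat it as a sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

typedef struct {
    str head;
    str tail;
} strpair;

// One possible cut(): split s at the first occurrence of delim. head is
// everything before it, tail everything after. If delim is absent, head
// is all of s and tail is empty, which ends the loop in getlines.
strpair cut(str s, uint8_t delim)
{
    assert(s.len >= 0);
    strpair r = {0};
    ptrdiff_t i = 0;
    while (i < s.len && s.data[i] != delim) {
        i++;
    }
    r.head.data = s.data;
    r.head.len  = i;
    if (i < s.len) {
        r.tail.data = s.data + i + 1;
        r.tail.len  = s.len - i - 1;
    }
    return r;
}
```

Note that it allocates nothing: both halves point into the original buffer, which is why the lines returned by getlines live as long as the loaded file in perm.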

An arena implementation

An arena suitable for most cases can be this simple:

typedef struct {
    char *beg;
    char *end;
} arena;

void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();  // one possible out-of-memory policy
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

Yup, just a pair of pointers! When allocating, all sizes are signed just as they ought to be. Unsigned sizes are another historically common source of defects, and offer no practical advantages in return.

The align parameter allows the arena to handle any unusual alignments, something that’s surprisingly difficult to do with libc. It’s difficult to appreciate its usefulness until it’s convenient.

The uintptr_t business may look unusual if you’ve never come across it before. To align beg, we need to compute the number of bytes to advance the address (padding) until the alignment evenly divides the address. The modulo with align computes the number of bytes since the last alignment:

extra = addr % align

We can’t operate numerically on an address like this, so in the code we first convert to uintptr_t. Alignment is always a power of two, which notably excludes zero, so no worrying about division by zero. That also means we can compute modulo by subtracting one and masking with AND:

extra = addr & (align - 1)

However, we want the number of bytes to advance to the next alignment, which is the inverse:

padding = -addr & (align - 1)

Add the uintptr_t cast and you have the code in alloc.
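To check the arithmetic with concrete numbers: an address of 13 with an alignment of 8 has extra = 5 and padding = 3, landing on 16. The helper name padding_for below is hypothetical, pulled out of alloc only so the computation can be tried in isolation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Bytes to add to addr to reach the next multiple of align.
// align must be a power of two.
ptrdiff_t padding_for(uintptr_t addr, ptrdiff_t align)
{
    assert(align > 0 && (align & (align - 1)) == 0);
    return (ptrdiff_t)(-addr & (uintptr_t)(align - 1));
}
```

An already-aligned address yields zero padding, so aligned allocations waste nothing.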

The if tests whether there’s enough memory and simultaneously checks for overflow on size*count. If either fails, it invokes the out-of-memory policy, which in this case is abort. I strongly recommend, at least when testing, always having something in place to, at minimum, abort when allocation fails, even when you think it cannot happen. It’s easy to use more memory than you anticipate, and you want a reliable signal when it happens.

An alternative policy is to longjmp to a “handler”, which with GCC and Clang doesn’t even require runtime support. In that case add a jmp_buf to the arena:

typedef struct {
    char  *beg;
    char  *end;
    void **jmp_buf;
} arena;

void *alloc(...)
{
    // ...
    if (/* out of memory */) {
        __builtin_longjmp(a->jmp_buf, 1);
    }
    // ...
}

bool example(..., arena scratch)
{
    void *jmp_buf[5];
    if (__builtin_setjmp(jmp_buf)) {
        return 0;
    }
    scratch.jmp_buf = jmp_buf;
    // ...
    return 1;
}

example returns failure to the caller if it runs out of memory, without needing to check individual allocations and, thanks to the implicit free of scratch arenas, without needing to clean up. If callees receiving the scratch arena don’t set their own jmp_buf, they’ll return here, too. In a real program you’d probably wrap the setjmp setup in a macro.
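Here’s the same policy as a complete, compilable sketch. To keep it portable I’ve swapped the GCC builtins for standard setjmp/longjmp from setjmp.h, so the arena holds a jmp_buf pointer rather than a void pointer array. The field name oom and the function fill are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char    *beg;
    char    *end;
    jmp_buf *oom;  // where to jump when out of memory
} arena;

void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    assert(size > 0 && count >= 0);
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        longjmp(*a->oom, 1);  // out-of-memory policy: non-local return
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

// Hypothetical example: returns 1 on success, 0 if it ran out of memory.
int fill(ptrdiff_t n, arena scratch)
{
    jmp_buf oom;
    if (setjmp(oom)) {
        return 0;  // arrived here from alloc via longjmp
    }
    scratch.oom = &oom;
    int32_t *nums = alloc(&scratch, sizeof(int32_t), _Alignof(int32_t), n);
    for (ptrdiff_t i = 0; i < n; i++) {
        nums[i] = (int32_t)i;
    }
    return 1;
}
```

Because scratch was passed by copy, the caller’s arena is untouched whether fill succeeds or fails.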

Suppose zeroing is too expensive or unnecessary in some cases. Add a flag to opt out:

void *alloc(..., int flags)
{
    // ...
    return flags&NOZERO ? p : memset(p, 0, total);
}

Similarly, perhaps there’s a critical moment where you’re holding a non-memory resource (lock, file handle), or you don’t want allocation failure to be fatal. In either case, it’s important that the out-of-memory policy isn’t invoked. You could request a “soft” failure with another flag, and then do the usual null pointer check:

void *alloc(..., int flags)
{
    // ...
    if (/* out of memory */) {
        if (flags & SOFTFAIL) {
            return 0;
        }
        abort();
    }
    // ...
}

Most non-trivial programs will probably have at least one of these flags.
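Putting both flags together, a full alloc might look like the following sketch. The flag values are my assumption, chosen only so that they can be combined with bitwise OR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { NOZERO = 1<<0, SOFTFAIL = 1<<1 };  // assumed flag values

typedef struct {
    char *beg;
    char *end;
} arena;

void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count,
            int flags)
{
    assert(size > 0 && count >= 0);
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        if (flags & SOFTFAIL) {
            return 0;  // caller does the usual null pointer check
        }
        abort();
    }
    ptrdiff_t total = count * size;
    void *p = a->beg + padding;
    a->beg += padding + total;
    return flags & NOZERO ? p : memset(p, 0, total);
}
```

A soft failure leaves the arena unchanged, so the caller can retry with a smaller request or fall back to another strategy.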

In case it wasn’t obvious, allocating an arena is simple:

arena newarena(ptrdiff_t cap)
{
    arena a = {0};
    a.beg = malloc(cap);
    a.end = a.beg ? a.beg+cap : 0;
    return a;
}

Or make a direct allocation from the operating system, e.g. mmap, VirtualAlloc. Typically arena lifetime is the whole program, so you don’t need to worry about freeing it. (Since you’re using arenas, you can also turn off any memory leak checkers while you’re at it.)

If you need more arenas then you can always allocate smaller ones out of the first! In multi-threaded applications, each thread may have at least its own scratch arena.

The `new` macro

I’ve shown alloc, but few parts of the program should be calling it directly. Instead they have a macro to automatically handle the details. I call mine new, though of course if you’re writing C++ you’ll need to pick another name (make? PushStruct?):

#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)

The cast is an extra compile-time check, especially useful for avoiding mistakes in levels of indirection. It also keeps normal code from directly using the sizeof operator, which is easy to misuse. If you added a flags parameter, pass in zero for this common case. Keep in mind that the goal of this macro is to make common allocation simple and robust.

Often you’ll allocate single objects, and so the count is 1. If you think that’s ugly, you could make a variadic version of new that fills in common defaults. In fact, that’s partly why I put count last!

#define new(...)            newx(__VA_ARGS__,new4,new3,new2)(__VA_ARGS__)
#define newx(a,b,c,d,e,...) e
#define new2(a, t)          (t *)alloc(a, sizeof(t), _Alignof(t), 1, 0)
#define new3(a, t, n)       (t *)alloc(a, sizeof(t), _Alignof(t), n, 0)
#define new4(a, t, n, f)    (t *)alloc(a, sizeof(t), _Alignof(t), n, f)

Not quite so simple, but it optionally makes for more streamlined code:

thing *t   = new(perm, thing);
thing *ts  = new(perm, thing, 1000);
char  *buf = new(perm, char, len, NOZERO);

Side note: If sizeof should be avoided, what about array lengths? That’s part of the problem! Hardly ever do you want the size of an array, but rather the number of elements. That includes char arrays where this happens to be the same number. So instead, define a countof macro that uses sizeof to compute the value you actually want. I like to have this whole collection:

#define sizeof(x)    (ptrdiff_t)sizeof(x)
#define countof(a)   (sizeof(a) / sizeof(*(a)))
#define lengthof(s)  (countof(s) - 1)

Yes, you can convert sizeof into a macro like this! It won’t expand recursively and bottoms out as an operator. countof also, of course, produces a less error-prone signed count so users don’t fumble around with size_t. lengthof statically produces null-terminated string length.

char msg[] = "hello world";
write(fd, msg, lengthof(msg));

#define MSG "hello world"
write(fd, MSG, lengthof(MSG));

Enhance `alloc` with attributes

At least for GCC and Clang, we can further improve alloc with three function attributes:

__attribute((malloc, alloc_size(2, 4), alloc_align(3)))
void *alloc(...);

malloc indicates that the pointer returned by alloc does not alias any existing object. Enables some significant optimizations that are otherwise blocked, most often by breaking potential loop-carried dependencies.

alloc_size tracks the allocation size for compile-time diagnostics and run-time assertions (__builtin_object_size). This generally requires a non-zero optimization level. In other words, you will get compiler warnings about some out-of-bounds accesses of arena objects, and with Undefined Behavior Sanitizer you’ll get run-time bounds checking. It’s a great complement to fuzzing.

Update June 2024: I’ve learned that alloc_size is fundamentally broken since its introduction in GCC 4.3.0 (March 2008). Correct use is impossible, and existing instances all rely on luck. In certain cases, such as function inlining, the pointer information is lost, and GCC may generate invalid code based on stale data.

In theory alloc_align may also allow better code generation, but I’ve yet to observe a case. Consider it optional and low-priority. I mention it only for completeness.

Arena size and growth

How large an arena should you allocate? The simple answer: As much as is necessary for the program to successfully complete. Usually the cost of untouched arena memory is low or even zero. Most programs should probably have an upper limit, at which point they assume something has gone wrong. Arenas allow this case to be handled gracefully, simplifying recovery and paving the way for continued operation.

While a sufficient answer for most cases, it’s unsatisfying. There’s a common assumption that programs should increase their memory usage as much as needed and let the operating system respond if it’s too much. However, if you’ve ever tried this yourself, you probably noticed that mainstream operating systems don’t handle it well. The typical results are system instability — thrashing, drivers crashing — possibly necessitating a reboot.

If you insist on this route, on 64-bit hosts you can reserve a gigantic virtual address space and gradually commit memory as needed. On Linux that means leaning on overcommit by allocating the largest arena possible at startup, which will automatically commit through use. Use MADV_FREE to decommit.

On Windows, VirtualAlloc handles reserve and commit separately. In addition to the allocation offset, you need a commit offset. Then expand the committed region ahead of the allocation offset as it grows. If you ever manually reset the allocation offset, you could decommit as well, or at least MEM_RESET. At some point commit may fail, which should then trigger the out-of-memory policy, but the system is probably in poor shape by that point — i.e. use an abort policy to release it all quickly.

Pointer laundering (filthy hack)

While allocations out of an arena don’t require individual error checks, allocating the arena itself at startup requires error handling. It would be nice if the arena could be allocated out of .bss and punt that job to the loader. While you could make a big, global char[] array to back your arena, it’s technically not permitted (strict aliasing). A “clean” .bss region could be obtained with a bit of assembly — .comm plus assembly to get the address into C without involving an array. I wanted a more portable solution, so I came up with this:

arena getarena(void)
{
    static char mem[1<<28];
    arena r = {0};
    r.beg = mem;
    asm ("" : "+r"(r.beg));  // launder the pointer
    r.end = r.beg + countof(mem);
    return r;
}

The asm accepts a pointer and returns a pointer ("+r"). The compiler cannot “see” that it’s actually empty, and so returns the same pointer. The arena will be backed by mem, but by laundering the address through asm, I’ve disconnected the pointer from its origin. As far as the compiler is concerned, this is some foreign, assembly-provided pointer, not a pointer into mem. It can’t optimize away mem because it’s been given to a mysterious assembly black box.

While inappropriate for a real project, I think it’s a neat trick.

Arena-friendly container data structures

In my initial example I used a linked list to store lines. This data structure is great with arenas. It only takes a few lines of code to implement a linked list on top of an arena, and no “destroy” code is needed. Simple.

What about arena-backed associative arrays? Or arena-backed dynamic arrays? See these follow-up articles for details!

How to link identical function names from different DLLs

2023-08-27T01:46:31Z

For the typical DLL function call you declare the function prototype (via header file), you inform the link editor (ld, link) that the DLL exports a symbol with that name (import library), it matches the declared name with this export, and it becomes an import in your program’s import table. What happens when two different DLLs export the same symbol? The link editor will pick the first found. But what if you want to use both exports? If they have the same name, how could the program or link editor distinguish them? In this article I’ll demonstrate a technique to resolve this by creating a program which links with and directly uses two different C runtimes (CRTs) simultaneously.

In PE executable images, an import isn’t just a symbol, but a tuple of DLL name and symbol. For human display, a tuple is typically formatted with an exclamation point delimiter, as in msvcrt.dll!malloc, though sometimes without the .dll suffix. You’ve likely seen this in stack traces. Because it’s a tuple and not just a symbol, it’s possible to refer to, and import, the same symbol from different DLLs. Contrast that with ELF, which has a list of shared objects, and a separate list of symbols, with the dynamic linker pairing them up at load time. That permits cool tricks like LD_PRELOAD, but for the same reason loading is less predictable.

Windows comes with several CRTs, and various libraries and applications use one or another (or none) depending on how they were built. As C standard library implementations they export mostly the same symbols, malloc, printf, etc. With imports as tuples, it’s not so unusual for an application to load multiple CRTs at once. Typically coexistence is transitive. That is, a module does not directly access both CRTs but depends on modules that use different CRTs. One module calls, say, msvcrt.dll!malloc, and another module calls ucrtbase.dll!malloc. With DLL-qualified symbols, this is sound so long as modules don’t cross the streams, e.g. an allocation in one module must not be freed in the other. Libraries in this ecosystem must avoid exposing their CRT through their interfaces, such as expecting the library’s caller to free() objects: The caller might not have access to the right free!

Contrast again with the unix ecosystem generally, where a process can only load one libc and everyone is expected to share. Libraries commonly expect callers to free() their objects (e.g. libreadline, xcb), blending their interface with libc.

Suppose you’re in such a situation where, due to unix-oriented libraries, your application must use functions from two different CRTs at once. One might have been compiled with Mingw-w64 and linked with MSVCRT, and the other compiled with MSVC and linked with UCRT. We need to call malloc and free in each, but they have the same name. What a pickle!

There’s an obvious, and probably most common, solution: run-time dynamic linking. Use load-time linking on one CRT, and LoadLibrary on the other CRT with GetProcAddress to obtain function pointers. However, it’s possible to do this entirely with load-time linking!

A malloc by any other name would allocate as well

Think about it a moment and you might wonder: If the names are the same, how can I pick which I’m calling? The tuple representation won’t work because ! cannot appear in an identifier, which is, after all, why it was chosen. The trick is that we’re going to rename one of them! To demonstrate, I’ll use my Windows development kit, w64devkit, a Mingw-w64 distribution that links MSVCRT. I’m going to use UCRT as the second CRT to access ucrtbase.dll!malloc.

I can choose whatever valid identifier I’d like, so I’m going to pick ucrt_malloc. This will require a declaration:

__declspec(dllimport) void *ucrt_malloc(size_t);

If I stop here and try to use it, of course it won’t work:

ld: undefined reference to `__imp_ucrt_malloc'

The linker hasn’t yet been informed of the change in management. For that we’ll need an import library. I’ll define one using a .def file, which I’ll name ucrtbase.def:

LIBRARY ucrtbase.dll
EXPORTS
ucrt_malloc == malloc

The last line says that this library has the symbol ucrt_malloc, but that it should be imported as malloc. This line is the lynchpin to the whole scheme. Note: The double equals is important, as a single equals sign means something different. Next, use dlltool to build the import library:

$ dlltool -d ucrtbase.def -l ucrtbase.lib

The equivalent MSVC tool is lib, but as far as I know it cannot quite do this sort of renaming. However, MSVC link will work just fine with this dlltool-created import library. The name ucrtbase.lib, while obvious, is irrelevant. It’s that LIBRARY line that ties it to the DLL. My test source file looks like this:

#include <stdlib.h>

__declspec(dllimport) void *ucrt_malloc(size_t);

int main(void)
{
    void *msvcrt[] = {malloc(1), malloc(1), malloc(1)};
    void *ucrt[] = {ucrt_malloc(1), ucrt_malloc(1), ucrt_malloc(1)};
    return 0;
}

It compiles successfully:

$ cc -g3 -o main.exe main.c ucrtbase.lib

I can see the two malloc imports with objdump:

$ objdump -p main.exe
...
DLL Name: msvcrt.dll
...
844a	 1021  malloc
...
DLL Name: ucrtbase.dll
847e	    1  malloc

It loads and runs successfully, too:

$ gdb main.exe
Reading symbols from main.exe...
(gdb) break 9
Breakpoint 1 at 0x1400013cd: file main.c, line 9.
(gdb) run
Thread 1 hit Breakpoint 1, main () at main.c:9
9           return 0;
(gdb) p msvcrt
$1 = {0xd06a30, 0xd06a70, 0xd06ab0}
(gdb) p ucrt
$2 = {0x6e9490, 0x6eb7c0, 0x6eb800}

The pointer addresses confirm that these are two, distinct allocators. Perhaps you’re wondering what happens if I cross the streams?

int main(void)
{
    free(ucrt_malloc(1));
}

The MSVCRT allocator justifiably panics over the bad pointer:

$ cc -g3 -o chaos.exe chaos.c ucrtbase.lib
$ gdb -ex run chaos.exe
Starting program: chaos.exe
warning: HEAP[chaos.exe]:
warning: Invalid address specified to RtlFreeHeap
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffc42c369af in ntdll!RtlRegisterSecureMemoryCacheCallback ()
(gdb)

While you’re probably not supposed to meddle with ucrtbase.dll like this, the general principle of export renames is reasonable. I don’t expect I’ll ever need to do it, but I like that I have the option.

Everything you never wanted to know about Win32 environment blocks

2023-08-23T21:51:10Z

In an effort to avoid programming by superstition, I did a deep dive into the Win32 “environment block,” the data structure holding a process’s environment variables, in order to better understand it. Along the way I discovered implied and undocumented behaviors. (The environment block must not be confused with the Process Environment Block (PEB), which is different.) Because I cannot possibly retain all the quirky details in my head for long, I’m writing them down for future reference. I ran my tests on different Windows versions as far back as Windows XP SP3 in order to fill in gaps where documentation is ambiguous, incomplete, or wrong. Overall conclusion: Correct, direct manipulation of an environment block is impossible in the general case due to under-specified and incorrect documentation. This has important consequences mainly for programming language runtimes.

Win32 has two interfaces for interacting with environment variables:

The first, which I’ll call get/set, is the easy interface, with Windows doing all the searching and sorting on your behalf. It’s also the only supported interface through which a process can manipulate its own variables. It has no function for enumerating variables.

The second, which I’ll call get/free, allocates a copy of the environment block. Calls to get/set do not modify existing copies. Similarly, manipulating this block has no effect on the environment as viewed through get/set. In other words, it’s read only. We can enumerate our environment variables by walking the environment block. As I will discuss below, enumeration is its only consistently useful purpose!

Technically it’s possible to access the actual environment block through undocumented fields in the PEB. It’s the same content as returned by get/free except that it’s not a copy. It cannot be accessed safely, so I’m ignoring this route.

The environment block format is a null-terminated block of null-terminated strings:

keyA=a\0keyBB=bb\0keyCCC=ccc\0\0

Each string ~~begins with a character other than = and~~ contains at least one =. In my tests this rule was strictly enforced by Windows, and I could not construct an environment block that broke this rule. This list is usually, but not always, sorted. It may contain repeated variables, but they’re always assigned the same value, which is also strictly enforced by Windows.
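Walking the block is a short loop. Here’s a portable sketch that counts the variables in a block. The name envcount is hypothetical, and on Windows the block would come from the get/free interface rather than a string literal.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

// Count the strings in a null-terminated block of null-terminated
// strings. Stops at the first empty string, the block terminator.
ptrdiff_t envcount(const wchar_t *block)
{
    assert(block);
    ptrdiff_t n = 0;
    while (*block) {
        block += wcslen(block) + 1;  // skip the string and its terminator
        n++;
    }
    return n;
}
```

For example, envcount(L"keyA=a\0keyBB=bb\0keyCCC=ccc\0") returns 3, with the literal’s implicit terminator serving as the final null. An enumerator would do its per-variable work where this loop increments n.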

~~The get/free interface has no “set” function, and a process cannot set its own environment block to a custom buffer.~~ (Update: Stefan Kanthak points out SetEnvironmentStringsW. I missed it because it was only officially documented a few months before this article was written.) There is one interface where a process gets to provide a raw environment block: CreateProcess. That is, a parent can construct one for its children.

    wchar_t env[] = L"HOME=C:\\Users\\me\0PATH=C:\\bin;C:\\Windows\0";
    CreateProcessW(L"example.exe", ..., env, ...);

Windows imposes some rules upon this environment block:

~~If an element begins with = or does not contain =, CreateProcess fails.~~
Repeated variables are modified to match the first instance. If you’re potentially overriding using a duplicate, put the override first.
Some cases of bad formatting become memory access violations.
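Since the first instance of a repeated variable wins, a parent assembling a block by hand might list its overrides ahead of whatever it inherits. A rough sketch, assuming a hypothetical helper named env_build and a caller-supplied buffer. It does no sorting or validation, and it only assembles the buffer that would be handed to CreateProcess:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

// Hypothetical helper: concatenate n "NAME=value" strings into dst as an
// environment block. Entries are copied in the order given, so list
// overrides before the variables they replace. Returns the number of
// wchar_t written, terminators included, or 0 if dst is too small.
static size_t env_build(wchar_t *dst, size_t cap,
                        const wchar_t *const *vars, size_t n)
{
    assert(dst && cap >= 2);
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = wcslen(vars[i]) + 1;  // string plus its terminator
        if (len > cap - used - 1) {        // keep room for the final null
            return 0;
        }
        wmemcpy(dst + used, vars[i], len);
        used += len;
    }
    if (used == 0) {
        dst[used++] = 0;  // no entries: write one empty string first
    }
    dst[used++] = 0;      // terminate the block itself
    return used;
}
```

Given {L"A=2", L"B=1", L"A=1"}, the block starts with A=2, which is the value a child would see for A under the first-instance rule described above.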

As usual for Win32, there are no rules against ill-formed UTF-16, and I could always pass such “UTF-16” through into the child environment block. Keep that in mind even when using the get/set interface.

The SetEnvironmentVariable documentation gives a maximum variable size:

The maximum size of a user-defined environment variable is 32,767 characters. There is no technical limitation on the size of the environment block.

At least on more recent versions of Windows, my experiments proved exactly the opposite. There is no limit on user-defined environment variables, but environment blocks are limited to 2GiB, for both 32-bit and 64-bit processes. I could even create such huge environments in large address aware 32-bit processes, though the interfaces are prone to error due to allocation problems.

There’s one special case where CreateProcess is illogical, and it’s certainly a case of confusion within its implementation. An environment block is not allowed to be empty. An empty environment is represented as a block containing one empty (zero length) element. That is, two null terminators in a row. It’s the one case where an environment block may contain an element without a =. The logical empty environment block would be just one null terminator, to terminate the block itself, because it contains no variables. You can safely pretend that’s the case when parsing an environment block, as this special case is superfluous.

However, CreateProcess partially enforces this silly, unnecessary special case! If an environment block begins with a null terminator, the next character must be in a mapped memory region because it will read this character. If it’s not mapped, the result is a memory access violation. Its actual value doesn’t matter, and CreateProcess will treat it as though it was another null terminator. Surely someone at Microsoft would have noticed by now that this behavior makes no sense, but I guess it’s kept for backwards compatibility?

The CreateProcess documentation says that “the system uses a sorted environment” but this made no difference in my tests. The word “must” appears in this sentence, but it’s unclear if it applies to sorting, or even outside the special case being discussed. GetEnvironmentVariable works fine on an unsorted environment block. SetEnvironmentVariable maintains sorting, but given an unsorted block it goes somewhere in the middle, probably wherever a bisection happens to land. Perhaps look-ups in sorted blocks are faster, but environment blocks are so small — ~~a maximum of 32K characters~~ (Update: only true for ANSI) — that, in practice, it really does not matter.

Suppose you’re meticulous and want to sort your environment block before spawning a process. How do you go about it? There’s the rub: The official documentation is incomplete! The Changing Environment Variables page says:

All strings in the environment block must be sorted alphabetically by name. The sort is case-insensitive, Unicode order, without regard to locale.

What do they mean by “case-insensitive” sort? Does “Unicode order” mean case folding? A reasonable guess, but no, that’s not how get/set works. Besides, how does “Unicode order” apply to ill-formed UTF-16? Worse, get/set sorting is certainly not “Unicode order” even outside of case-insensitivity! For example, U+1F31E (SUN WITH FACE) sorts ahead of U+FF01 (FULLWIDTH EXCLAMATION MARK) because the former encodes in UTF-16 as U+D83C U+DF1E. Maybe it’s case-insensitive only in ASCII? Nope, π (U+03C0) and Π (U+03A0) are considered identical. Windows uses some kind of case-insensitive, but not case-folded, undocumented early 1990s UCS-2 sorting logic for environment variables.

Update: John Doty suspects the RtlCompareUnicodeString function for sorting. It lines up perfectly with get/set for all possible inputs.
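The SUN WITH FACE example can be checked by hand. The sketch below compares two code points first by value and then by their UTF-16 code units, which is how a UCS-2-era comparison would see them. The helper names are hypothetical, and it deliberately ignores case-insensitivity, so it illustrates only the ordering quirk and is not a reimplementation of the Windows comparison.

```c
#include <assert.h>
#include <stdint.h>

// Encode one code point as UTF-16. Returns the number of code units
// written to out (1 or 2). Assumes cp is a valid scalar value.
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    // high surrogate
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  // low surrogate
    return 2;
}

// Order two code points by numeric value ("Unicode order").
static int cmp_code_points(uint32_t a, uint32_t b)
{
    return (a > b) - (a < b);
}

// Order two code points by their UTF-16 code units instead.
static int cmp_code_units(uint32_t a, uint32_t b)
{
    uint16_t ua[2] = {0}, ub[2] = {0};
    int na = utf16_encode(a, ua);
    int nb = utf16_encode(b, ub);
    for (int i = 0; i < na && i < nb; i++) {
        if (ua[i] != ub[i]) {
            return ua[i] < ub[i] ? -1 : +1;
        }
    }
    return (na > nb) - (na < nb);
}
```

For U+1F31E and U+FF01 the two comparisons disagree: by code point U+1F31E is larger, but its leading code unit 0xD83C is smaller than 0xFF01.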

Without better guidance, the only reliable way to “correctly” sort an environment block is to build it with get/set, then retrieve the result with get/free. The algorithm looks like:

Get a copy of the environment with GetEnvironmentStrings.
Walk the environment and call SetEnvironmentVariable on each name with a null pointer as the value. This clears out the environment.
Call SetEnvironmentVariable for each variable in the new environment.
Get a sorted copy of the new environment with GetEnvironmentStrings.

Unfortunately that’s all global state, so you can only construct one new environment block at a time.

If you know all your variable names ahead of time, then none of this is a problem. Determine what Windows thinks the order should be, then use that in your program when constructing the environment block. It’s the general case where this is a challenge, such as a language runtime designed to operate on arbitrary environment variables with behavior congruent to the rest of the system.

There are similar issues with looking up variables in an environment block. How does case-insensitivity work? Sorting is “without regard to locale” but what about when comparing variable names? The documentation doesn’t say. When enumerating variables using get/free, you might read what get/set considers to be duplicates, though at least values will always agree with get/set, i.e. they’re aliases of one variable. Windows maintains that invariant in my tests. The above algorithm would also delete these duplicates.

For example, if someone passed you a “dirty” environment with duplicates, or that was unsorted, this would clean it up in a way that allows get/free to be traversed in order without duplicates.

    wchar_t *env = GetEnvironmentStringsW();

    // Clear out the environment
    for (wchar_t *var = env; *var;) {
        size_t len = wcslen(var);
        size_t split = wcscspn(var, L"=");
        var[split] = 0;
        SetEnvironmentVariableW(var, 0);
        var[split] = '=';
        var += len + 1;
    }

    // Restore the original variables
    for (wchar_t *var = env; *var;) {
        size_t len = wcslen(var);
        size_t split = wcscspn(var, L"=");
        var[split] = 0;
        SetEnvironmentVariableW(var, var+split+1);
        var += len + 1;
    }

    FreeEnvironmentStringsW(env);

On the second pass, SetEnvironmentVariableW will gobble up all the duplicates.

As a final note, the CreateProcess page had said this up until February 2023 about the environment block parameter:

If this parameter is NULL and the environment block of the parent process contains Unicode characters, you must also ensure that dwCreationFlags includes CREATE_UNICODE_ENVIRONMENT.

That seems to indicate it’s virtually always wrong to call CreateProcess without that flag — that is, Windows will trash the child’s environment unless this flag is passed — which is a bonkers default. Fortunately this appears to be wrong, which is probably why the documentation was finally corrected (after several decades). Omitting this flag was fine under all my tests, and I was unable to produce surprising behavior on any system.

In summary:

Prefer get/set for all operations except enumeration
Environment blocks are not necessarily sorted
Repeat variables are forced to the value of the first instance
Variables may contain ill-formed UTF-16
Empty environment blocks have a superfluous special case
~~Entries cannot begin with =~~
Entries must contain at least one =
Sort order is ambiguous, so you cannot reliably do it yourself
Case-insensitivity of names is ambiguous, so rely on get/set
CREATE_UNICODE_ENVIRONMENT necessary only for non-null environment

Update September 2024: Correction from Kasper Brandt regarding variables beginning with =. I misunderstood how it was parsed and came to the wrong conclusion.

"Once" one-time concurrent initialization with an integer

2023-07-31T23:00:41Z

We’ve previously discussed integer barriers, integer queues, and integer wait groups as tiny concurrency utilities. Next let’s tackle “once” initialization, i.e. pthread_once, using an integer. We’ll need only three basic atomic operations — store, load, and increment — and futex wait/wake. It will be zero-initialized and the entire source small enough to fit on an old-fashioned terminal display. The interface will also get an overhaul, more to my own tastes.

If you’d like to skip ahead: once.c

What’s the purpose? Suppose a concurrent program requires initialization, but has no definite moment to do so. Threads are already in motion, and it’s unpredictable which will arrive first, and when. It might be because this part of the program is loaded lazily, or initialization is expensive and only done lazily as needed. A “once” object is a control allowing the first arrival to initialize, and later arrivals to wait until initialization is done.

The pthread version has this interface:

pthread_once_t once = PTHREAD_ONCE_INIT;
int pthread_once(pthread_once_t *, void (*init)(void));

It’s deliberately quite limited, and the specification refers to it merely as “dynamic package initialization.” That is, it’s strictly for initializing global package data, not individual objects, and a “once” object must be a static variable, not dynamically allocated. Also note the lack of context pointer for the callback. No pthread implementation I examined was actually so restricted, but the specification is written for the least common denominator, and the interface is clearly designed against more general use.

An example of lazy static table initialization for a cipher:

// Blowfish subkey tables (constants)
static uint32_t blowfish_p[20];
static uint32_t blowfish_s[256];
static pthread_once_t once = PTHREAD_ONCE_INIT;

static void init(void)
{
    // ... populate blowfish_p and blowfish_s with pi ...
}

void blowfish_encrypt(struct blowfish *ctx, void *buf, size_t len)
{
    pthread_once(&once, init);
    // ... lookups into blowfish_p and blowfish_s ...
}

The pthread_once allows blowfish_encrypt to be called concurrently (on different context objects). The first call populates lookup tables and others wait as needed. A good pthread_once will speculate initialization has already completed and make that the fast path. The tables do not require locks or atomics because pthread_once establishes a synchronization edge: initialization happens-before the return from pthread_once.

Go’s sync.Once has a similar interface:

func (o *Once) Do(f func())

It’s more flexible and not restricted to global data, but retains the callback interface.

A new “once” interface

Callbacks are clunky, especially without closures, so in my re-imagining I wanted to remove them from the interface. Instead I broke out entry and exit. The in-between takes the place of the callback and it runs in its original context.

_Bool do_once(int *);
void once_done(int *);

This is similar to breaking “push” and “pop” each into two steps in my concurrent queue. do_once returns true if initialization is required, otherwise it returns false after initialization has completed, i.e. it blocks. The initializing thread signals that initialization is complete by calling once_done. As mentioned, the “once” object would be zero-initialized. Reworking the above example:

// Blowfish subkey tables (constants)
static uint32_t blowfish_p[20];
static uint32_t blowfish_s[256];
static int once = 0;

void blowfish_encrypt(struct blowfish *ctx, void *buf, size_t len)
{
    if (do_once(&once)) {
        // ... populate blowfish_p and blowfish_s with pi ...
        once_done(&once);
    }
    // ... lookups into blowfish_p and blowfish_s ...
}

It gets more interesting when taken beyond global initialization. Here each object is lazily initialized by the first thread to use it:

typedef struct {
    int once;
    // ...
} Thing;

static void expensive_init(Thing *, ptrdiff_t);

static double compute(Thing *t, ptrdiff_t index)
{
    if (do_once(&t->once)) {
        expensive_init(t, index);
        once_done(&t->once);
    }
    // ...
}

int main(void)
{
    // ...
    Thing *things = calloc(1000000, sizeof(Thing));
    #pragma omp parallel for
    for (int i = 0; i < iterations; i++) {
        ptrdiff_t which = random_access(i);
        double r = compute(&things[which], which);
        // ...
    }
    // ...
}

Implementation details

A “once” object must express at least these three states:

1. Uninitialized
2. Undergoing initialization
3. Initialized

To support zero-initialization, (1) must map into zero. A thread observing (1) must successfully transition to (2) before attempting to initialize. A thread observing (2) must wait for a transition to (3). Observing (3) is the fast path, and the implementation should optimize for it.

The trickiest part is the state transition from (1) to (2). If multiple threads are attempting the transition concurrently, only one should “win”. The obvious choice is a compare-and-swap atomic, which will fail if another thread has already made the transition. However, with a more careful selection of state representation, we can do this with just an atomic increment!

The secret sauce: (2) will be any positive value and (3) will be any negative value. The “winner” is the thread that increments from zero to one. Other threads that also observed zero will increment to a different value, after which they behave as though they did not observe (1) in the first place.

I chose shorthand names for the three atomic and two futex operations. Each can be defined with a single line of code — the atomics with compiler intrinsics and the futex with system calls, as they interact with the system scheduler. (See the “four elements” of the wait group article.) Technically it will still work correctly if the futex calls are no-ops, though it would waste time spinning on the slow path. In a real program you’d probably use less pithy names.

static int  load(int *);
static void store(int *, int);
static int  incr(int *);
static void wait(int *, int);
static void wake(int *);
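As a rough idea of what those one-liners might look like, here is a sketch assuming GCC or Clang `__atomic` builtins and, on Linux, the raw futex system call. The memory orders are my own conservative picks, and on other systems the futex calls fall back to no-ops, which costs spinning but not correctness.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#if defined(__linux__)
#  include <linux/futex.h>
#  include <sys/syscall.h>
#  include <unistd.h>
#endif

static_assert(sizeof(int) == 4, "futex words are 32 bits");

static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

static void store(int *p, int v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

// Returns the new value, so the thread that sees 1 is the winner.
static int incr(int *p)
{
    return __atomic_add_fetch(p, 1, __ATOMIC_SEQ_CST);
}

// Sleep only if *p still equals expect; otherwise return immediately.
static void wait(int *p, int expect)
{
#if defined(__linux__)
    syscall(SYS_futex, p, FUTEX_WAIT, expect, NULL, NULL, 0);
#else
    (void)p; (void)expect;  // no-op fallback: callers spin instead
#endif
}

// Wake every thread currently waiting on p.
static void wake(int *p)
{
#if defined(__linux__)
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
#else
    (void)p;
#endif
}
```

On Windows the analogous calls would be WaitOnAddress and WakeByAddressAll, which this sketch leaves out.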

From here it’s useful to work backwards, starting with once_done, because there’s an important detail, another secret sauce ingredient:

void once_done(int *once)
{
    store(once, INT_MIN);
    wake(once);
}

Recall that the “initialized” state (3) is negative. We don’t just pick any arbitrary negative, especially not the obvious -1, but the most negative value. Keep that in mind. Once set, wake up any waiters. Since this is the slow path, we don’t care to avoid the system call if there are no waiters. Now do_once:

_Bool do_once(int *once)
{
    int r = load(once);
    if (r < 0) {
        return 0;
    } else if (r == 0) {
        r = incr(once);
        if (r == 1) {
            return 1;
        }
    }
    while (r > 0) {
        wait(once, r);
        r = load(once);
    }
    return 0;
}

First, check for the fast path. If we’re already in state (3), return immediately. If do_once will be placed in a separate translation unit from the caller, we might extract this check such that it can be inlined at the call site. Once initialization has settled, nobody will be mutating *once, so this will be a fast, uncontended atomic load, though mind your cache lines for false sharing.

If we’re in state (1), try to transition to state (2). If we incremented to 1, we won so tell the caller to initialize. Otherwise continue as though we never saw state (1). There’s an important subtlety easy to miss: Initialization may have already completed before the increment. That is, *once may have been negative for the increment! Fortunately since we chose INT_MIN in once_done, it will stay negative. (Assuming you have fewer than 2 billion threads contending *once. Ha!) So it’s vital to check r again for negative after the increment, hence while instead of do while.

Losers continuing to increment *once may interfere with the futex wait, but, again, this is the slow path so that’s fine. Eventually we will wake up and observe (3), then give control back to the caller.
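The late-increment subtlety is easier to see when replayed deterministically. This sketch walks one unlucky interleaving in a single thread, using GCC or Clang `__atomic` builtins directly. The function late_increment is a hypothetical illustration and not part of once.c.

```c
#include <assert.h>
#include <limits.h>

// Replay one unlucky interleaving in a single thread. A loser loads zero,
// stalls while the winner initializes and stores `done`, then performs
// its increment late. Returns what that late increment observes.
static int late_increment(int done)
{
    assert(done < 0);  // state (3) is any negative value
    int once = 0;

    int seen = __atomic_load_n(&once, __ATOMIC_ACQUIRE);  // loser: load
    assert(seen == 0);                                    // saw state (1)
    (void)seen;

    __atomic_add_fetch(&once, 1, __ATOMIC_SEQ_CST);       // winner: 0 -> 1
    __atomic_store_n(&once, done, __ATOMIC_RELEASE);      // winner: once_done

    return __atomic_add_fetch(&once, 1, __ATOMIC_SEQ_CST);  // loser: incr
}
```

With INT_MIN the late increment yields INT_MIN + 1, still negative, so the loser skips the wait loop and returns. With -1 it yields zero, which puts the object back in state (1), and the next arrival would initialize a second time.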

That’s all there is to it. If you haven’t already, check out the source including tests for Windows and Linux: once.c. Suggested experiments to try, particularly under a debugger:

Change INT_MIN to -1.
Change while (r > 0) { ... } to do { ... } while (r > 0);.
Comment out the futex system calls. (Note: will be very slow without also reducing NTHREADS.)

Solving "Two Sum" in C with a tiny hash table

2023-06-26T19:38:18Z

I came across a question: How does one efficiently solve Two Sum in C? There’s a naive quadratic time solution, but also an amortized linear time solution using a hash table. Without a built-in or standard library hash table, the latter sounds onerous. However, a mask-step-index table, a hash table construction suitable for many problems, requires only a few lines of code. This approach is useful even when a standard hash table is available, because by exploiting the known problem constraints, it beats typical generic hash table performance by an order of magnitude (demo).

The Two Sum exercise, restated:

Given an integer array and target, return the distinct indices of two elements that sum to the target.

In particular, the solution doesn’t find elements, but their indices. The exercise also constrains input ranges — important but easy to overlook:

2 <= count <= 10⁴
-10⁹ <= nums[i] <= 10⁹
-10⁹ <= target <= 10⁹

Notably, indices fit in a 16-bit integer with lots of room to spare. In fact, it will fit in a 14-bit address space (16,384) with still plenty of overhead. Elements fit in a signed 32-bit integer, and we can add and subtract elements without overflow, if just barely. The last constraint isn’t redundant, but it’s not readily exploitable either.

The naive solution is to linearly search the array for the complement. With nested loops, it’s obviously quadratic time. At 10k elements, we expect an abysmal 25M comparisons on average.

int16_t count = ...;
int32_t *nums = ...;

for (int16_t i = 0; i < count-1; i++) {
    for (int16_t j = i+1; j < count; j++) {
        if (nums[i]+nums[j] == target) {
            // found
        }
    }
}

The nums array is “keyed” by index. It would be better to also have the inverse mapping: key on elements to obtain the nums index. Then for each element we could compute the complement and find its index, if any, using this second mapping.

The input range is finite, so an inverse map is simple. Allocate an array, one element per integer in range, and store the index there. However, the input range is 2 billion, and even with 16-bit indices that’s a 4GB array. Feasible on 64-bit hosts, but wasteful. The exercise is certainly designed to make it so. This array would be very sparse, with far less than half a percent of its elements populated. That’s a hint: Associative arrays are far more appropriate for representing such sparse mappings. That is, a hash table.

Using Go’s built-in hash table:

func TwoSumWithMap(nums []int32, target int32) (int, int, bool) {
    seen := make(map[int32]int16)
    for i, num := range nums {
        complement := target - num
        if j, ok := seen[complement]; ok {
            return int(j), i, true
        }
        seen[num] = int16(i)
    }
    return 0, 0, false
}

In essence, the hash table folds the sparse 2 billion element array onto a smaller array, with collision resolution when elements inevitably land in the same slot. For this exercise, that small array could be as small as 10,000 elements because that’s the most we’d ever need to track. For folding the large key space onto the smaller, we could use modulo. For collision resolution, we could keep walking the table.

int16_t seen[10000] = {0};

// Find or insert nums[index].
int16_t lookup(int32_t *nums, int16_t index)
{
    int i = nums[index] % 10000;
    if (i < 0) i += 10000;  // C's % is negative for negative elements
    for (;;) {
        int16_t j = seen[i] - 1;  // unbias
        if (j < 0) {  // empty slot
            seen[i] = index + 1;  // insert biased index
            return -1;
        } else if (nums[j] == nums[index]) {
            return j;  // match found
        }
        i = (i + 1) % 10000;  // keep looking
    }
}

Take note of a few details:

1. An empty slot is zero, and an empty table is a zero-initialized array. Since zero is a valid value, and all values are non-negative, it biases values by 1 in the table.
2. The nums array is part of the table structure, necessary for lookups. The two mappings — element-by-index and index-by-element — share structure.
3. It uses open addressing with linear probing, and so walks the table until it either finds the element or hits an empty slot.
4. The “hash” function is modulo. If inputs are not random, they’ll tend to bunch up in the table. Combined with linear probing, that makes for lots of collisions. For the worst case, imagine sequentially ordered inputs.
5. Sometimes the table will almost completely fill, and lookups will be no better than the linear scans of the naive solution.
6. Most subtle of all: This hash table is not enough for the exercise. The keyed-on element may not even be in nums, and when lookup fails, that element is not inserted in the table. Instead, a different element is inserted. The conventional solution has at least two hash table lookups. In the Go code, it’s seen[complement] for lookups and seen[num] for inserts.

To solve (4) we’ll use a hash function to more uniformly distribute elements in the table. We’ll also probe the table in a random-ish order that depends on the key. In practice there will be little bunching even for non-random inputs.

To solve (5) we’ll use a larger table: 2¹⁴ or 16,384 elements. This has breathing room, and with a power of two we can use a fast mask instead of a slow division (though in practice, compilers usually implement division by a constant denominator with modular multiplication).

To solve (6) we’ll key complements together under the same key. It looks for the complement, but on failure it inserts the current element in the empty slot. In other words, this solution will only need a single hash table lookup per element!

Laying down some groundwork:

typedef struct {
    int16_t i, j;
    _Bool ok;
} TwoSum;

TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
    TwoSum r = {0};
    int16_t seen[1<<14] = {0};
    for (int16_t n = 0; n < count; n++) {
        // ...
    }
    return r;
}

The seen array is a 32KiB hash table large enough for all inputs, small enough that it can be a local variable. In the loop:

        int32_t complement = target - nums[n];
        int32_t key = complement>nums[n] ? complement : nums[n];
        uint32_t hash = key * 489183053u;
        unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
        unsigned step = hash>>13 | 1;

Compute the complement, then apply a “max” operation to derive a key. Any commutative operation works, though obviously addition would be a poor choice. XOR is similar enough to cause many collisions. Multiplication works well, and is probably better if the ternary produces a branch.
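A quick way to see why “max” works as the key: an element and its complement produce the same key no matter which of the two arrives first. A small sketch, where pair_key is a hypothetical name for the two lines above pulled out into a function:

```c
#include <assert.h>
#include <stdint.h>

// Derive the shared key for an element and its complement. The exercise
// constraints keep target - num within int32_t range.
static int32_t pair_key(int32_t num, int32_t target)
{
    assert(num >= -1000000000 && num <= 1000000000);
    assert(target >= -1000000000 && target <= 1000000000);
    int32_t complement = target - num;
    return complement > num ? complement : num;
}
```

For a target of 10, both 3 and 7 map to key 7, so they hash to the same start point and follow the same probe sequence.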

The hash function is multiplication with a randomly-chosen prime. As we’ll see in a moment, step will also add-shift the hash before use. The initial index will be the bottom 14 bits of this hash. For step, recall from the MSI article that it must be odd so that every slot is eventually probed. I shift out 13 bits and then override the 14th bit, so step effectively skips over the 14 bits used for the initial table index.
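The odd-step requirement is easy to check empirically. This sketch counts how many distinct slots a probe sequence reaches in a power-of-two table. The names msi_step and probe_coverage are hypothetical test helpers, with the step derived the same way as in the code above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Derive the probe step as in the loop body: discard 13 bits, then
// force the result odd.
static unsigned msi_step(uint32_t hash)
{
    return hash >> 13 | 1;
}

// Count the distinct slots reached by 2^exp probes through a table of
// 2^exp slots, advancing by step each time.
static int probe_coverage(unsigned start, unsigned step, int exp)
{
    assert(exp >= 1 && exp <= 14);
    static unsigned char visited[1 << 14];
    memset(visited, 0, sizeof(visited));
    unsigned mask = (1u << exp) - 1;
    int distinct = 0;
    unsigned i = start;
    for (int n = 0; n < 1 << exp; n++) {
        i = (i + step) & mask;
        distinct += !visited[i];  // count each slot the first time only
        visited[i] = 1;
    }
    return distinct;
}
```

Any odd step reaches all 16,384 slots because it shares no factor with the table size. A step of 2 reaches only half of them, and a step of 4 only a quarter.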

I used unsigned because I don’t really care about the width of the hash table index, but more importantly, I want defined overflow from all the bit twiddling, even in the face of implicit promotion. As a bonus, it can help in reasoning about indirection: seen indices are unsigned, nums indices are int16_t.

        for (unsigned i = hash;;) {
            i = (i + step) & mask;
            int16_t j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = n + 1;  // bias and insert
                break;
            } else if (nums[j] == complement) {
                r.i = j;
                r.j = n;
                r.ok = 1;
                return r;
            }
        }

The step is added before using the index the first time, helping to scatter the start point and reduce collisions. If it’s an empty slot, insert the current element, not the complement — which wouldn’t be possible anyway. Unlike conventional solutions, this doesn’t require another hash and lookup. If it finds the complement, problem solved, otherwise keep going.

Putting it all together, it’s only slightly longer than solutions using a generic hash table:

TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
    TwoSum r = {0};
    int16_t seen[1<<14] = {0};
    for (int16_t n = 0; n < count; n++) {
        int32_t complement = target - nums[n];
        int32_t key = complement>nums[n] ? complement : nums[n];
        uint32_t hash = key * 489183053u;
        unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
        unsigned step = hash>>13 | 1;
        for (unsigned i = hash;;) {
            i = (i + step) & mask;
            int16_t j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = n + 1;  // bias and insert
                break;
            } else if (nums[j] == complement) {
                r.i = j;
                r.j = n;
                r.ok = 1;
                return r;
            }
        }
    }
    return r;
}

Applying this technique to Go:

func TwoSumWithBespoke(nums []int32, target int32) (int, int, bool) {
    var seen [1 << 14]int16
    for n, num := range nums {
        complement := target - num
        hash := int(num * complement * 489183053)
        mask := len(seen) - 1
        step := hash>>13 | 1
        for i := hash; ; {
            i = (i + step) & mask
            j := int(seen[i] - 1) // unbias
            if j < 0 {
                seen[i] = int16(n) + 1 // bias
                break
            } else if nums[j] == complement {
                return j, n, true
            }
        }
    }
    return 0, 0, false
}

With Go 1.20 this is an order of magnitude faster than map[int32]int16, which isn’t surprising. I used multiplication as the key operator because, in my first take, Go produced a branch for the “max” operation — at a 25% performance penalty on random inputs.

A full-featured, generic hash table may be overkill for your problem, and a bit of hashed indexing with collision resolution over a small array might be sufficient. The problem constraints might open up such shortcuts.

Hand-written Windows API prototypes: fast, flexible, and tedious

2023-05-31T01:38:31Z

I love fast builds, and for years I’ve been bothered by the build penalty for translation units including windows.h. This header has an enormous number of definitions and declarations and so, for C programs, it tends to dominate the build time of those translation units. Most programs, especially systems software, only need a tiny portion of it. For example, when compiling u-config with GCC, two thirds of the debug build was spent processing windows.h just for 4 types, 16 definitions, and 16 prototypes.

To give a sense of the numbers, here’s empty.c, which does nothing but include windows.h.

#include <windows.h>

With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):

$ gcc -E empty.c | grep -vc '^$'
82041

With w64devkit this takes my system ~450ms to compile with GCC:

$ time gcc -c empty.c
real    0m 0.45s
user    0m 0.00s
sys     0m 0.00s

Compiling an actually empty source file takes ~10ms, so it really is spending practically all that time processing headers. MSVC is a faster compiler, and this extends to processing an even larger windows.h that crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:

$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real    0m 0.12s
user    0m 0.09s
sys     0m 0.01s

That’s just low enough to be tolerable, but I’d like the situation with GCC to be better. Defining WIN32_LEAN_AND_MEAN reduces the number of included headers, which has a significant effect:

$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real    0m 0.30s
user    0m 0.00s
sys     0m 0.00s

$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real    0m 0.07s
user    0m 0.01s
sys     0m 0.01s

Precompiled headers

The official solution is precompiled headers. Put all the system header includes, or similar, into a dedicated header, then compile that header into a special format. For example, headers.h:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

Then main.c includes windows.h through this header:

#include "headers.h"

int mainCRTStartup(void)
{
    return 0;
}

If I ask GCC to compile headers.h:

$ gcc headers.h

It produces headers.h.gch. When a source includes headers.h, GCC first searches for an appropriate .gch. Not only must the name match, but so must all the definitions at the moment of inclusion: headers.h should always be the first included header, otherwise it may not work. Now when I compile main.c:

$ time gcc -c main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

Much better! MSVC has a conventional name for this header recognizable to every Visual Studio user: stdafx.h. It works a bit differently, and I’ve never used it myself, but I trust it has similar results.

Precompiled headers require some extra steps that vary by toolchain. Can we do better? That depends on your definition of “better!”

Artisan, handcrafted prototypes

As mentioned, systems software tends to need only a few declarations: open, read, write, stat, etc. What if I wrote these out manually? A bit tedious, but it doesn’t require special precompiled header handling. It also creates some new possibilities. To illustrate, a CRT-free “hello world” program:

#include <windows.h>

int mainCRTStartup(void)
{
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    char message[] = "Hello, world!\n";
    DWORD len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}

This takes my system half a second to compile — quite long to produce just 26 assembly instructions:

$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.50s
user    0m 0.00s
sys     0m 0.00s
$ ./hello.exe
Hello, world!

The program requires prototypes only for GetStdHandle and WriteFile, a definition for STD_OUTPUT_HANDLE, and some typedefs. Starting with the easy stuff, the definition and types look like this:

#define STD_OUTPUT_HANDLE ((DWORD)-11)

typedef int BOOL;
typedef void *HANDLE;
typedef unsigned long DWORD;

By the way, here’s a cheat code for quickly finding preprocessor definitions, faster than looking them up elsewhere:

$ echo '#include <windows.h>' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)

Did you catch the pattern? It’s -10 - fd, where fd is the conventional unix file descriptor number: a kind of mnemonic.
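The mnemonic is small enough to write down as a function. A sketch, where std_handle_id is a hypothetical name and nothing here calls into Win32:

```c
#include <assert.h>

// Map a conventional unix file descriptor (0 = stdin, 1 = stdout,
// 2 = stderr) to the matching STD_*_HANDLE constant, as a plain int.
static int std_handle_id(int fd)
{
    assert(fd >= 0 && fd <= 2);
    return -10 - fd;
}
```

So descriptor 1 gives -11, the value of STD_OUTPUT_HANDLE before its cast to DWORD.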

Prototypes are a little trickier, especially if you care about 32-bit. The Windows API uses the “stdcall” calling convention, which is distinct from the “cdecl” calling convention on x86, though the same on x64. Of course, you must already be aware of this merely using the API, as your own callbacks must usually be stdcall themselves. Further, API functions are DLL imports and should be declared as such. Putting it together, here’s GetStdHandle:

__declspec(dllimport)
HANDLE __stdcall GetStdHandle(DWORD);

This works with both Mingw-w64 and MSVC. MSVC requires __stdcall between the return type and function name, so don’t get clever about it. If you only care about GCC then you can declare both at once using attributes:

HANDLE GetStdHandle(DWORD)
    __attribute__((dllimport,stdcall));

I like to hide all this behind a macro, with a “table” of all my imports listed just below:

#define W32(r) __declspec(dllimport) r __stdcall
W32(HANDLE) GetStdHandle(DWORD);
W32(BOOL)   WriteFile(HANDLE, const void *, DWORD, DWORD *, void *);

In WriteFile you may have noticed I’m taking shortcuts. The “official” definition uses an ugly pointer typedef, LPCVOID, instead of pointer syntax, but I skipped that type definition. I also replaced the last argument, an OVERLAPPED pointer, with a generic pointer. I only need to pass null. I can keep sanding it down to something more ergonomic:

W32(int)    WriteFile(void *, void *, int, int *, void *);

That’s how I typically write these prototypes. I dropped the const because it doesn’t help me. I used signed sizes because I like them better and it’s what I’m usually holding at the call site. But doesn’t changing the signedness potentially break compatibility? It makes no difference to any practical ABI: It’s passed the same way. In general, signedness is a matter for operators, and only some of them — mainly comparisons (<, >, etc.) and division. It’s a similar story for pointers starting with the 32-bit era, so I can choose whatever pointer types are convenient.
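To see why signedness is invisible at the call boundary, compare the bits. In the sketch below, raw_bits is a hypothetical helper, and it assumes a target that provides int32_t. The signed value -11 and the official (DWORD)-11 turn out to be the same 32-bit pattern, so the callee cannot tell which type the caller's prototype used.

```c
#include <stdint.h>
#include <string.h>

// Return the raw 32-bit pattern of a signed value, i.e. what actually
// lands in the argument register or stack slot.
static uint32_t raw_bits(int32_t x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof(u));  // copy the bits without converting the value
    return u;
}
```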

In general, I can do anything I want so long as I know my compiler will produce an appropriate function call. These are not standard functions, like printf or memcpy, which are implemented in part by the compiler itself, but foreign functions. It’s no different than teaching an FFI how to make a call. This is also, in essence, how OpenGL and Vulkan work, with applications defining the API for themselves.

Considering all this, my new hello world:

#define W32(r) __declspec(dllimport) r __stdcall
W32(void *) GetStdHandle(int);
W32(int)    WriteFile(void *, void *, int, int *, void *);

int mainCRTStartup(void)
{
    void *stdout = GetStdHandle(-10 - 1);
    char message[] = "Hello, world!\n";
    int len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}

You know, there’s a kind of beauty to a program that requires no external definitions. It builds quickly and produces a binary bit-for-bit identical to the original:

$ time cc -nostartfiles -o hello.exe main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real    0m 0.03s
user    0m 0.00s
sys     0m 0.00s

I’ve also been using this to patch over API rough edges. For example, WSARecvFrom takes WSAOVERLAPPED, but GetQueuedCompletionStatus takes OVERLAPPED. These types are explicitly compatible, and only defined separately for annoying technical reasons. I must use the same overlapped object with both APIs at once, meaning I would normally need ugly pointer casts on my Winsock calls, or vice versa with I/O completion ports. But because I’m writing all these definitions myself, I can define a common overlapped structure for both!
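A rough sketch of what such a common structure might look like follows. The type and field names are my own inventions, the layout is my reading of the documented OVERLAPPED structure and should be checked against the SDK headers, and the static assertions only confirm the sizes I expect on 32-bit and 64-bit targets.

```c
#include <stddef.h>
#include <stdint.h>

// One overlapped type intended to stand in for both OVERLAPPED and
// WSAOVERLAPPED in hand-written prototypes. Names are hypothetical.
struct overlapped {
    uintptr_t internal;
    uintptr_t internal_high;
    union {
        struct {
            uint32_t offset;
            uint32_t offset_high;
        } pos;
        void *pointer;
    } u;
    void *event;
};

// Two pointer-sized fields, an 8-byte union, then a handle.
_Static_assert(offsetof(struct overlapped, u) == 2*sizeof(uintptr_t), "layout");
_Static_assert(sizeof(struct overlapped) == 3*sizeof(void *) + 8, "size");
```

With both the Winsock and completion port prototypes declared to take a pointer to this one type, the same object can be passed to either call without a cast.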

Perhaps you’re worried that this would be too fragile. Well, as a legacy software aficionado, I enjoy building and running my programs on old platforms. So far these programs still work properly going back 30 years to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag, it’s always been a bug (now long fixed) in the old operating system, not in my programs or these prototypes. So, in effect, this technique has worked well for the past 30 years!

Writing out these definitions is a bit of a chore, but after paying that price I’ve been quite happy with the results. I will likely continue doing it in the future, at least for non-graphical applications.

My favorite C compiler flags during development

2023-04-29T22:55:25Z

This article was discussed on Hacker News and on reddit.

The major compilers have an enormous number of knobs. Most are highly specialized, but others are generally useful even if uncommon. For warnings, the venerable -Wall -Wextra is a good start, but circumstances improve by tweaking this warning set. This article covers high-hitting development-time options in GCC, Clang, and MSVC that ought to get more consideration.

There’s an irony that the more you use these options, the less useful they become. Given a reasonable workflow, they are a harsh mistress in a fast, tight feedback loop quickly breaking the habits that cause warnings and errors. It’s a kind of self-improvement, where eventually most findings will be false positives. With heuristics internalized, you will be able to spot the same issues just reading code, a handy skill during code review.

Static warnings

Traditionally, C and C++ compilers are by default conservative with warnings. Unless configured otherwise, they only warn about the most egregious issues where they’re highly confident. That’s too conservative. For gcc and clang, the first order of business is turning on more warnings with -Wall. Despite the name, this doesn’t actually enable all warnings. (clang has -Weverything which does literally this, but trust me, you don’t want it.) However, that still falls short, and you’re better served enabling extra warnings with -Wextra.

$ cc -Wall -Wextra ...

That should be the baseline on any new project, and closer to what these compilers should do by default. Not using these means leaving value on the table. If you come across such a project, there’s a good chance you can find bugs statically just by using this baseline. Some warnings only occur at higher optimization levels, so leave these on for your release builds, too.

For MSVC, including clang-cl, a similar baseline is /W4, though it goes a bit far, warning about use of unary minus on unsigned types (C4146) and sign conversions (C4245), so I disable both. If you’re using a CRT, also disable the bogus and irresponsible “security” warnings. Putting it together, the warning baseline becomes:

$ cl /W4 /wd4146 /wd4245 /D_CRT_SECURE_NO_WARNINGS ...

As for gcc and clang, I dislike unused parameter warnings, so I often turn them off, at least while I’m working: -Wno-unused-parameter. Rarely is it a defect to not use a parameter. It’s common for a function to fit a fixed prototype but not need all its parameters (e.g. WinMain). Were it up to me, this would not be part of -Wextra.

I also dislike unused function warnings: -Wno-unused-function. I can’t say this is wrong for the baseline since, in most cases, ultimately I do want to know if there are unused functions, e.g. to be deleted. But while I’m working it’s usually noise.

If I’m working with OpenMP, I may also disable warnings about unknown pragmas: -Wno-unknown-pragmas. One cool feature of OpenMP is that the typical case gracefully degrades to single-threaded behavior when not enabled. That is, compiling without -fopenmp. I’ll test both ways to ensure I get deterministic results, or just to ease debugging, and I don’t want warnings when it’s disabled. It’s fine for the baseline to have this warning, but sometimes it’s a poor match.

When working with single-precision floats, perhaps on games or graphics, it’s easy to accidentally introduce promotion to double precision, which can hurt performance. It could be neglecting an f suffix on a constant or using sin instead of sinf. Use -Wdouble-promotion to catch such mistakes. Honestly, this is important enough that it should go into the baseline.

#define PI 3.141592653589793
float degs = ...;
float rads = degs * PI / 180;  // warns about promotion
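One way to address the finding is to keep the whole expression in single precision. A minimal sketch, where PI_F and deg2rad are names I made up for illustration:

```c
// Single-precision constant: the f suffix keeps the expression in float,
// so -Wdouble-promotion has nothing to report.
#define PI_F 3.14159265f

static float deg2rad(float degs)
{
    return degs * PI_F / 180;  // 180 is converted to float, not double
}
```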

It can be awkward around variadic functions, particularly printf, which cannot receive float arguments, and so implicitly converts. You’ll need an explicit cast to disable the warning. I imagine this is the main reason the warning is not part of -Wextra.

float x = ...;
printf("%.17g\n", (double)x);

Finally, an advanced option: -Wconversion -Wno-sign-conversion. It warns about implicit conversions that may result in data loss. Sign conversions do not have data loss, the implicit conversions are useful, and in my experience they’re not a source of defects, so I disable that part using the second flag (like MSVC /wd4245). The important warning here is truncation of size values, warning about unsound uses of sizes and subscripts. For example:

// NOTE: would be declared/defined via windows.h
typedef uint32_t DWORD;
BOOL WriteFile(HANDLE, const void *, DWORD, DWORD *, OVERLAPPED *);

void logmsg(char *msg, size_t len)
{
    HANDLE err = GetStdHandle(STD_ERROR_HANDLE);
    DWORD out;
    WriteFile(err, msg, len, &out, 0);  // len truncation warning
}

On 64-bit targets, it will warn about truncating the 64-bit len for the 32-bit parameter. To dismiss the warning, you must either address it by using a loop to call WriteFile multiple times, or acknowledge the truncation with an explicit cast and accept the consequences. In this case I may know from context it’s impossible for the program to even construct such a large message, so I’d use an assertion and truncate.

void logmsg(char *msg, size_t len)
{
    HANDLE err = GetStdHandle(STD_ERROR_HANDLE);
    DWORD out;
    assert(len <= 0xffffffff);
    WriteFile(err, msg, (DWORD)len, &out, 0);
}
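The other way to address the warning, looping over the buffer, might be sketched as follows. To keep the example portable it writes through a hypothetical sink callback with a WriteFile-like shape rather than calling WriteFile itself. The names sink_fn, write_all, and count_sink are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical sink shaped like WriteFile: writes up to len bytes, stores
// the count actually written, and returns nonzero on success.
typedef int (*sink_fn)(void *ctx, const void *buf, uint32_t len, uint32_t *out);

// Write the whole buffer in chunks small enough for a 32-bit length.
// Returns nonzero on success, zero on error or lack of progress.
static int write_all(sink_fn sink, void *ctx, const char *buf, size_t len)
{
    while (len > 0) {
        uint32_t chunk = len > 0xffffffff ? 0xffffffff : (uint32_t)len;
        uint32_t out = 0;
        if (!sink(ctx, buf, chunk, &out) || out == 0 || out > chunk) {
            return 0;
        }
        buf += out;
        len -= out;
    }
    return 1;
}

// Toy sink for demonstration: only counts the bytes it was handed.
static int count_sink(void *ctx, const void *buf, uint32_t len, uint32_t *out)
{
    (void)buf;
    *(size_t *)ctx += len;
    *out = len;
    return 1;
}
```

The only narrowing conversion left is the explicit cast, which is guarded by the comparison on the same line.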

You might consider changing the interface instead:

void logmsg(char *msg, uint32_t len);

That probably passes the buck and doesn’t solve the underlying problem. The caller may be holding a size_t length, so the truncation happens there instead. Or maybe you keep propagating this change backwards until it, say, dissipates on a known constant. -Wconversion leads to these ripple effects that improve the overall program, which is why I like it.

The catch is that the above warning only happens for 64-bit targets. So you might miss it. The inverse is true in other cases. This is one area where cross-architecture testing can pay off.

Unfortunately since this warning is off the beaten path, it seems like it doesn’t quite get the attention it could use. It warns about simple cases where truncation has been explicitly handled/avoided. For example:

int x = ...;
char digit = '0' + x%10;  // false warning

The '0' is a known constant. The operation x%10 has a known range (-9 to 9). Therefore the addition result has a known range, and all results can be represented in a char. Yet it still warns. This often comes up dealing with character data like this.
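The usual way out is an explicit cast acknowledging the narrowing. A small sketch, with last_digit as a hypothetical helper that assumes a non-negative input:

```c
// Convert the last decimal digit of a non-negative integer to a character.
// The cast states that the narrowing is intended, which satisfies
// -Wconversion.
static char last_digit(int x)
{
    return (char)('0' + x%10);
}
```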

In my logmsg fix I had used an assertion to check that no truncation actually occurred. But wouldn’t it be nice if the compiler could generate that for us somehow? That brings us to dynamic checks.

Dynamic run-time checks

Sanitizers have been around for nearly a decade but are still criminally underused. They insert run-time assertions into programs at the flip of a switch typically at a modest performance cost — less than the cost of a debug build. All three major compilers support at least one sanitizer on all targets. In most cases, failing to use them is practically the same as not even trying to find defects. Every beginner tutorial ought to be using sanitizers from page 1 where they teach how to compile a program with gcc. (That this is universally not the case, and that these same tutorials also do not begin with teaching a debugger, is a major, on-going education failure.)

There are multiple different sanitizers with lots of overlap, but Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan) are the most general. They are compatible with each other and form a solid, general baseline. To use address sanitizer, at both compile and link time do:

$ cc ... -fsanitize=address ...

It’s even spelled the same way in MSVC. It’s needed at link time because it includes a runtime component. When working properly it’s aware of all allocations and checks all memory accesses that might be out of bounds, producing a run-time error if that occurs. It’s not always appropriate, but most projects that can use it probably should.

UBSan is enabled similarly:

$ cc ... -fsanitize=undefined ...

It adds checks around operations that might be undefined, emitting a run-time error if it occurs. It has an optional runtime component to produce a helpful diagnostic. You can instead insert a trap instruction, which is how I prefer to use it: -fsanitize-trap=undefined. (Until recently it was -fsanitize-undefined-trap-on-error.) This works on platforms where the UBSan runtime is unsupported. Some instrumentation is only inserted at higher optimization levels.

For me, the most useful UBSan check is signed overflow — e.g. computing the wrong result — and it’s instrumentation I miss when not working in C. In programs where this might be an issue, combine it with a fuzzer to search for inputs that cause overflows. This is yet another argument in favor of signed sizes, as UBSan can detect such overflows. (Yes, UBSan optionally instruments unsigned overflow, too, but then you must somehow distinguish intentional from unintentional overflow.)
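Once a fuzzer shows that an overflow is reachable from real input, the arithmetic has to be checked in code rather than merely trapped. One possible sketch uses a GCC and Clang builtin. Here checked_add is a hypothetical helper, and this is a hand-written check in the same spirit as the sanitizer, not a reproduction of what UBSan emits.

```c
#include <limits.h>

// Add two ints, reporting overflow instead of invoking undefined behavior.
// Returns 1 and stores the sum on success, 0 if the sum does not fit.
// __builtin_add_overflow is a GCC/Clang extension, not standard C.
static int checked_add(int a, int b, int *sum)
{
    return !__builtin_add_overflow(a, b, sum);
}
```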

On Linux, ASan and UBSan strangely do not have debugger-oriented defaults. Fortunately that’s easy to address with a couple of environment variables, which cause them to break on error instead of uselessly exiting:

export ASAN_OPTIONS=abort_on_error=1:halt_on_error=1
export UBSAN_OPTIONS=abort_on_error=1:halt_on_error=1

Also, when compiling you can combine sanitizers like so:

$ cc ... -fsanitize=address,undefined ...

As of this writing, MSVC does not have UBSan, but it does have a similar feature, run-time error checks. Three sub-flags (c, s, u) enable different checks, and /RTCcsu turns them all on. The c flag generates the assertion I had manually written with -Wconversion, and traps any truncation at run time. There’s nothing quite like this in UBSan! It’s so extreme that it’s compatible with neither standard runtime libraries (fortunately not a big deal) nor with ASan.

Caveat: Explicit casts aren’t enough. You must actually truncate variables using a mask in order to pass the check. For example, to accept truncation in the logmsg function:

    WriteFile(err, msg, len&0xffffffff, &out, 0);

Thread Sanitizer (TSan) is occasionally useful for finding — or, more often, proving the presence of — data races. It has a runtime component and so must be used at compile time and link time.

$ cc ... -fsanitize=thread ...

Unfortunately it only works in a narrow context. The target must use pthreads, not C11 threads, OpenMP, nor direct cloning. It must only synchronize through code that was compiled with TSan. That means no synchronization through system calls, especially no futexes. Most non-trivial programs do not meet the criteria.

Debug information

Another common mistake in tutorials is using plain old -g instead of -g3 (read: “debug level 3”). That’s like using -O instead of -O3. It adds a lot more debug information to the output, particularly enums and macros. The extra information is useful and you’re better off having it!

$ cc ... -g3 ...

All the major build systems — CMake, Autotools, Meson, etc. — get this wrong in their standard debug configurations. Producing a fully-featured debug build from these systems is a constant battle for me. Often it’s easier to ignore the build system entirely and cc -g3 **/*.c (plus sanitizers, etc.).

(Short term note: GCC 11, released in April 2021, switched to DWARF5 by default. However, GDB could not access the extra -g3 debug information in DWARF5 until GDB 13, released December 2022. If you have a toolchain from that roughly two year window, other than mine since I patched it, then you may also need -gdwarf-4 to switch back to DWARF4.)

What about -Og? In theory it enables optimizations that do not interfere with debugging, and potentially some additional warnings. In practice I still get far too many “optimized out” messages from GDB when I use it, so I don’t bother. Fortunately C is such a simple language that debug builds are nearly as fast as release builds anyway.

On MSVC I like having debug information embedded in binaries, as GCC does, which is done using /Z7.

$ cl ... /Z7 ...

Though I certainly understand the value of separate debug information, /Zi, in some cases. Sometimes I wish the GNU toolchain made this easier.

Summary

My personal rigorous baseline for development using gcc and clang looks like this (all platforms):

$ cc -g3 -Wall -Wextra -Wconversion -Wdouble-promotion
     -Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion
     -fsanitize=undefined -fsanitize-trap ...

While ASan is great for quickly reviewing and evaluating other people’s projects, I don’t find it useful for my own programs. I avoid that class of defects through smarter paradigms (region-based allocation, no null terminated strings, etc.). I also prefer the behavior of trap instruction UBSan versus a diagnostic, as it behaves better under debuggers.

For cl and clang-cl, my personal baseline looks like this:

$ cl /Z7 /W4 /wd4146 /wd4245 /RTCcsu ...

I don’t normally need /D_CRT_SECURE_NO_WARNINGS since I don’t use a CRT anyway.

Update: Peter0x44 points out -D_GLIBCXX_DEBUG if you’re working in C++ with libstdc++, including on Windows with Mingw-w64. I agree, this is an excellent option! ASan does not “see” C++ containers, and it fills in some of those gaps.

Practical libc-free threading on Linux

2023-03-23T05:32:41Z

Suppose you’re not using a C runtime on Linux, and instead you’re programming against its system call API. It’s long-term and stable after all. Memory management and buffered I/O are easily solved, but a lot of software benefits from concurrency. It would be nice to also have thread spawning capability. This article will demonstrate a simple, practical, and robust approach to spawning and managing threads using only raw system calls. It only takes about a dozen lines of C, including a few inline assembly instructions.

The catch is that there’s no way to avoid using a bit of assembly. Neither the clone nor clone3 system calls have threading semantics compatible with C, so you’ll need to paper over it with a bit of inline assembly per architecture. This article will focus on x86-64, but the basic concept should work on all architectures supported by Linux. The glibc clone(2) wrapper fits a C-compatible interface on top of the raw system call, but we won’t be using it here.

Before diving in, the complete, working demo: stack_head.c

The clone system call

On Linux, threads are spawned using the clone system call with semantics like the classic unix fork(2). One process goes in, two processes come out in nearly the same state. For threads, those processes share almost everything and differ only by two registers: the return value — zero in the new thread — and stack pointer. Unlike typical thread spawning APIs, the application does not supply an entry point. It only provides a stack for the new thread. The simple form of the raw clone API looks something like this:

long clone(long flags, void *stack);

Sounds kind of elegant, but it has an annoying problem: The new thread begins life in the middle of a function without any established stack frame. Its stack is a blank slate. It’s not ready to do anything except jump to a function prologue that will set up a stack frame. So besides the assembly for the system call itself, it also needs more assembly to get the thread into a C-compatible state. In other words, a generic system call wrapper cannot reliably spawn threads.

void brokenclone(void (*threadentry)(void *), void *arg)
{
    // ...
    long r = syscall(SYS_clone, flags, stack);
    // DANGER: new thread may access non-existent stack frame here
    if (!r) {
        threadentry(arg);
    }
}

For odd historical reasons, each architecture’s clone has a slightly different interface. The newer clone3 unifies these differences, but it suffers from the same thread spawning issue above, so it’s not helpful here.

The stack “header”

I figured out a neat trick eight years ago which I continue to use today. The parent and child threads are in nearly identical states when the new thread starts, but the immediate goal is to diverge. As noted, one difference is their stack pointers. To diverge their execution, we could make their execution depend on the stack. An obvious choice is to push different return pointers on their stacks, then let the ret instruction do the work.

Carefully preparing the new stack ahead of time is the key to everything, and there’s a straightforward technique that I like to call the stack_head, a structure placed at the high end of the new stack. Its first element must be the entry point pointer, and this entry point will receive a pointer to its own stack_head.

struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    // ...
};

The structure must have 16-byte alignment on all architectures. I used an attribute to help keep this straight, and it can help when using sizeof to place the structure, as I’ll demonstrate later.

Now for the cool part: The ... can be anything you want! Use that area to seed the new stack with whatever thread-local data is necessary. It’s a neat feature you don’t get from standard thread spawning interfaces. If I plan to “join” a thread later — wait until it’s done with its work — I’ll put a join futex in this space:

struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    int join_futex;
    // ...
};

More details on that futex shortly.

The clone wrapper

I call the clone wrapper newthread. It has the inline assembly for the system call, and since it includes a ret to diverge the threads, it’s a “naked” function just like with setjmp. The compiler will generate no prologue or epilogue, and the function body is limited to inline assembly without input/output operands. It cannot even reliably reference its parameters by name. Like clone, it doesn’t accept a thread entry point. Instead it accepts a stack_head seeded with the entry point. The whole wrapper is just six instructions:

__attribute((naked))
static long newthread(struct stack_head *stack)
{
    __asm volatile (
        "mov  %%rdi, %%rsi\n"     // arg2 = stack
        "mov  $0x50f00, %%edi\n"  // arg1 = clone flags
        "mov  $56, %%eax\n"       // SYS_clone
        "syscall\n"
        "mov  %%rsp, %%rdi\n"     // entry point argument
        "ret\n"
        : : : "rax", "rcx", "rsi", "rdi", "r11", "memory"
    );
}

On x86-64, both function calls and system calls use rdi and rsi for their first two parameters. Per the reference clone(2) prototype above, the first system call argument is flags and the second argument is the new stack, which will point directly at the stack_head. However, the stack argument arrives in rdi. So I copy stack into the second argument register, rsi, then load the flags (0x50f00) into the first argument register, rdi. The system call number goes in rax.

Where does that 0x50f00 come from? That’s the bare minimum thread spawn flag set in hexadecimal. If any flag is missing then threads will not spawn reliably — as discovered the hard way by trial and error across different system configurations, not from documentation. It’s computed normally like so:

    long flags = 0;
    flags |= CLONE_FILES;
    flags |= CLONE_FS;
    flags |= CLONE_SIGHAND;
    flags |= CLONE_SYSVSEM;
    flags |= CLONE_THREAD;
    flags |= CLONE_VM;
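To keep the hardcoded constant in the assembly honest, the computation can be checked at compile time. In this sketch the flag values are repeated from the Linux UAPI header linux/sched.h so that it has no header dependencies, and thread_clone_flags is a hypothetical name.

```c
// Flag values as found in <linux/sched.h>, repeated here so the example
// stands alone.
#define CLONE_VM      0x00000100
#define CLONE_FS      0x00000200
#define CLONE_FILES   0x00000400
#define CLONE_SIGHAND 0x00000800
#define CLONE_THREAD  0x00010000
#define CLONE_SYSVSEM 0x00040000

static long thread_clone_flags(void)
{
    return CLONE_FILES | CLONE_FS | CLONE_SIGHAND |
           CLONE_SYSVSEM | CLONE_THREAD | CLONE_VM;
}

// Fails to compile if the flag set ever disagrees with the asm constant.
_Static_assert((CLONE_FILES | CLONE_FS | CLONE_SIGHAND |
                CLONE_SYSVSEM | CLONE_THREAD | CLONE_VM) == 0x50f00,
               "clone flags must match the hardcoded 0x50f00");
```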

When the system call returns, it copies the stack pointer into rdi, the first argument for the entry point. In the new thread the stack pointer will be the same value as stack, of course. In the old thread this is a harmless no-op because rdi is a volatile register in this ABI. Finally, ret pops the address at the top of the stack and jumps. In the old thread this returns to the caller with the system call result, either an error (negative errno) or the new thread ID. In the new thread it pops the first element of stack_head which, of course, is the entry point. That’s why it must be first!
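Since newthread hands back the raw system call result, the caller has to tell a thread ID from an error. A small sketch, with syscall_failed as a hypothetical helper name, using the same range test that newstack applies to mmap later in this article:

```c
// Raw Linux system calls report failure as a negative errno in the range
// -4095..-1. Anything else, including large addresses from mmap that look
// negative when viewed as a long, is a successful result.
static int syscall_failed(long r)
{
    return (unsigned long)r > -4096UL;
}
```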

The thread has nowhere to return from the entry point, so when it’s done it must either block indefinitely or use the exit (not exit_group) system call to terminate itself.

Caller point of view

The caller side looks something like this:

static void threadentry(struct stack_head *stack)
{
    // ... do work ...
    __atomic_store_n(&stack->join_futex, 1, __ATOMIC_SEQ_CST);
    futex_wake(&stack->join_futex);
    exit(0);
}

__attribute((force_align_arg_pointer))
void _start(void)
{
    struct stack_head *stack = newstack(1<<16);
    stack->entry = threadentry;
    // ... assign other thread data ...
    stack->join_futex = 0;
    newthread(stack);

    // ... do work ...

    futex_wait(&stack->join_futex, 0);
    exit_group(0);
}

Despite the minimalist, 6-instruction clone wrapper, this is taking the shape of a conventional threading API. It would only take a bit more to hide the futex, too. Speaking of which, what’s going on there? The same principle as a WaitGroup. The futex, an integer, is zero-initialized, indicating the thread is running (“not done”). The joiner tells the kernel to wait until the integer is non-zero, which it may already be since I don’t bother to check first. When the child thread is done, it atomically sets the futex to non-zero and wakes all waiters, which might be nobody.

Caveat: It’s not safe to free/reuse the stack after a successful join. It only indicates the thread is done with its work, not that it exited. You’d need to wait for its SIGCHLD (or use CLONE_CHILD_CLEARTID). If this sounds like a problem, consider your context more carefully: Why do you feel the need to free the stack? It will be freed when the process exits. Worried about leaking stacks? Why are you starting and exiting an unbounded number of threads? In the worst case park the thread in a thread pool until you need it again. Only worry about this sort of thing if you’re building a general purpose threading API like pthreads. I know it’s tempting, but avoid doing that unless you absolutely must.

What’s with the force_align_arg_pointer? Linux doesn’t align the stack for the process entry point like a System V ABI function call. Processes begin life with an unaligned stack. This attribute tells GCC to fix up the stack alignment in the entry point prologue, just like on Windows. If you want to access argc, argv, and envp you’ll need more assembly. (I wish doing really basic things without libc on Linux didn’t require so much assembly.)

__asm (
    ".global _start\n"
    "_start:\n"
    "   movl  (%rsp), %edi\n"
    "   lea   8(%rsp), %rsi\n"
    "   lea   8(%rsi,%rdi,8), %rdx\n"
    "   call  main\n"
    "   movl  %eax, %edi\n"
    "   movl  $60, %eax\n"
    "   syscall\n"
);

int main(int argc, char **argv, char **envp)
{
    // ...
}

Getting back to the example usage, it has some regular-looking system call wrappers. Where do those come from? Start with this 6-argument generic system call wrapper.

long syscall6(long n, long a, long b, long c, long d, long e, long f)
{
    register long ret;
    register long r10 asm("r10") = d;
    register long r8  asm("r8")  = e;
    register long r9  asm("r9")  = f;
    __asm volatile (
        "syscall"
        : "=a"(ret)
        : "a"(n), "D"(a), "S"(b), "d"(c), "r"(r10), "r"(r8), "r"(r9)
        : "rcx", "r11", "memory"
    );
    return ret;
}

I could define syscall5, syscall4, etc. but instead I’ll just wrap it in macros. The former would be more efficient since the latter wastes instructions zeroing registers for no reason, but for now I’m focused on compacting the implementation source.

#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))

Now we can have some exits:

__attribute((noreturn))
static void exit(int status)
{
    SYSCALL1(SYS_exit, status);
    __builtin_unreachable();
}

__attribute((noreturn))
static void exit_group(int status)
{
    SYSCALL1(SYS_exit_group, status);
    __builtin_unreachable();
}

Simplified futex wrappers:

static void futex_wait(int *futex, int expect)
{
    SYSCALL4(SYS_futex, futex, FUTEX_WAIT, expect, 0);
}

static void futex_wake(int *futex)
{
    SYSCALL3(SYS_futex, futex, FUTEX_WAKE, 0x7fffffff);
}

And so on.

Finally I can talk about that newstack function. It’s just a wrapper around an anonymous memory map allocating pages from the kernel. I’ve hardcoded the constants for the standard mmap allocation since they’re nothing special or unusual. The return value check is a little tricky since a large portion of the negative range is valid, so I only want to check for a small range of negative errnos. (Allocating an arena looks basically the same.)

static struct stack_head *newstack(long size)
{
    unsigned long p = SYSCALL6(SYS_mmap, 0, size, 3, 0x22, -1, 0);
    if (p > -4096UL) {
        return 0;
    }
    long count = size / sizeof(struct stack_head);
    return (struct stack_head *)p + count - 1;
}

The aligned attribute comes into play here: I treat the result like an array of stack_head and return the last element. The attribute ensures each individual element is aligned.
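The placement arithmetic can be tried in isolation, without any system calls. In this sketch place_head is a hypothetical helper, the region is assumed to start on a 16-byte boundary (true of page-aligned mmap results), and the attribute is spelled in the GCC/Clang style.

```c
#include <stdint.h>

struct __attribute__((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    int join_futex;
};

// Given the base of a 16-byte-aligned region of `size` bytes, return the
// address of the last whole stack_head that fits. No memory is touched.
static struct stack_head *place_head(void *base, long size)
{
    long count = size / (long)sizeof(struct stack_head);
    return (struct stack_head *)base + count - 1;
}
```

Because the attribute rounds sizeof up to a multiple of 16, every array element, including the last, lands on a 16-byte boundary.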

That’s it! There’s not much to it other than a few thoughtful assembly instructions. It took doing this a few times in a few different programs before I noticed how simple it can be.

CRT-free in 2023: tips and tricks

2023-02-15T02:12:00Z

Seven years ago I wrote about “freestanding” Windows executables. After an additional seven years of practical experience both writing and distributing such programs, half using a custom-built toolchain, it’s time to revisit these cabalistic incantations and otherwise scant details. I’ve tweaked my older article over the years as I’ve learned, but this is a full replacement and does not assume you’ve read it. The “why” has been covered and the focus will be on the “how”. Both the GNU and MSVC toolchains will be considered.

I no longer call these “freestanding” programs since that term is, at best, inaccurate. In fact, we will be actively avoiding GCC features associated with that label. Instead I call these CRT-free programs, where CRT stands for the C runtime, the Windows-oriented term for libc. This term communicates both intent and scope.

Entry point

You should already know that main is not the program’s entry point, but a C application’s entry point. The CRT provides the entry point, where it initializes the CRT, including parsing command line options, then calls the application’s main. The real entry point doesn’t have a name. It’s just the address of the function to be called by the loader without arguments.

You might naively assume you could continue using the name main and tell the linker to use it as the entry point. You would be wrong. Avoid the name main! It has a special meaning in C and gets special treatment. Using it without a conventional CRT will confuse your tools and may cause build issues.

While you can use almost any other name you like, the conventional names are mainCRTStartup (console subsystem) and WinMainCRTStartup (windows subsystem). It’s easy to remember: Append CRTStartup to the name you’d use in a normal CRT-linking application. I strongly recommend using these names because it reduces friction. Your tools are already familiar with them, so you won’t need to do anything special.

int mainCRTStartup(void);     // console subsystem
int WinMainCRTStartup(void);  // windows subsystem

The MSVC linker documentation says the entry point uses the __stdcall calling convention. Ignore this and do not use __stdcall for your entry point! Since entry points may take no arguments, there is no practical difference from the __cdecl calling convention, so it matters little. Rather, the goal is to avoid __stdcall function decorations. In particular, the GNU linker --entry option does not understand them, nor can it find decorated entry points on its own. If you use __stdcall, then the 32-bit GNU linker will silently (!) choose the beginning of your .text section as the entry point. (This bug was fixed in Binutils 2.42, released January 2024. __stdcall entry points now link correctly.)

If you’re using C++, then of course you will also need to use extern "C" so that it’s not name-mangled. Otherwise the results are similarly bad.

If using -fwhole-program, you will need to mark your entry point as externally visible for GCC so that it knows it’s an entry point. While linkers are familiar with conventional entry point names, GCC the compiler is not. Normally you do not need to worry about this.

__attribute((externally_visible))  // for -fwhole-program
int mainCRTStartup(void)
{
    return 0;
}

The entry point returns int. If there are no other threads then the process will exit with the returned value as its exit status. In practice this is only useful for console programs. Windows subsystem programs have threads started automatically, without warning, and it’s almost certain your main thread is not the last thread. You probably want to use ExitProcess or even TerminateProcess instead of returning. The latter exits more abruptly and can avoid issues with certain subsystems, like DirectSound, not shutting down gracefully: It doesn’t even let them try.

int WinMainCRTStartup(void)
{
    // ...
    TerminateProcess(GetCurrentProcess(), 0);
}

Compilation

Starting with the GNU toolchain, you have two ways to get into “CRT-free mode”: -nostartfiles and -nostdlib. The former is more dummy-proof, and it’s what I use in build documentation. The latter can be more complicated, but when it succeeds you get guarantees about the result. I use it in build scripts I intend to run myself, which I want to fail if they don’t do exactly what I expect. To illustrate, consider this trivial program:

#include <windows.h>

int mainCRTStartup(void)
{
    ExitProcess(0);
}

This program uses ExitProcess from kernel32.dll. Compiling is easy:

$ cc -nostartfiles example.c

The -nostartfiles prevents it from linking the CRT entry point, but it still implicitly passes other “standard” linker flags, including libraries -lmingw32 and -lkernel32. Programs can use kernel32.dll functions without explicitly linking that DLL. But, hey, isn’t -lmingw32 the CRT, the thing we’re avoiding? It is, but it wasn’t actually linked because the program didn’t reference it.

$ objdump -p a.exe | grep -Fi .dll
        DLL Name: KERNEL32.dll

However, -nostdlib does not pass any of these libraries, so you need to do so explicitly.

$ cc -nostdlib example.c -lkernel32

The MSVC toolchain behaves a little like -nostartfiles, not linking a CRT unless you need it, semi-automatically. However, you’ll need to list kernel32.dll and tell it which subsystem you’re using.

$ cl example.c /link /subsystem:console kernel32.lib

However, MSVC has a handy little feature to list these arguments in the source file.

#ifdef _MSC_VER
  #pragma comment(linker, "/subsystem:console")
  #pragma comment(lib, "kernel32.lib")
#endif

This information must go somewhere, and I prefer the source file rather than a build script. Then anyone can point MSVC at the source without worrying about options.

$ cl example.c

I try to make all my Windows programs so simply built.

Stack probes

On Windows, it’s expected that stacks will commit dynamically. That is, the stack is merely reserved address space, and it’s only committed when the stack actually grows into it. This made sense 30 years ago as a memory saving technique, but today it no longer makes sense. However, programs are still built to use this mechanism.

To function properly, programs must touch each stack page for the first time in order. Normally that’s not an issue, but if your stack frame exceeds the page size, there’s a chance it might step over a page. When a function has a large stack frame, GCC inserts a call to a “stack probe” in libgcc that touches its pages in the prologue. It’s not unlike stack clash protection.

For example, if I have a 4kiB local variable:

int mainCRTStartup(void)
{
    char buf[1<<12] = {0};
    return 0;
}

When I compile with -nostdlib:

$ cc -nostdlib example.c
ld: ... undefined reference to `___chkstk_ms'

It’s trying to link the CRT stack probe. You can disable this behavior with -mno-stack-arg-probe.

$ cc -mno-stack-arg-probe -nostdlib example.c

Or you can just link -lgcc to provide a definition:

$ cc -nostdlib example.c -lgcc

Had you used -nostartfiles, you wouldn’t have noticed because it passes -lgcc automatically. It’s “dummy-proof” because this sort of issue goes away before it comes up, though for the same reason it’s harder to tell exactly what went into a program.

If you disable the probe altogether — my preference — you’ve only solved the linker problem, but the underlying stack commit problem remains and your program may crash. You can solve that by telling the linker to ask the loader to commit a larger stack up front rather than grow it at run time. Say, 2MiB:

$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000 example.c

Of course, I wish that this was simply the default behavior because it’s far more sensible! A much better option is to avoid large stack frames in the first place. Allocate locals larger than, say, 1KiB in a scratch arena instead of on the stack.
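As a rough sketch of that last suggestion, here is a large temporary taken from a caller-provided block instead of the stack frame. The names Scratch, scratch_alloc, and process are hypothetical, and alignment is ignored, so treat it as an illustration rather than a finished allocator:

```c
#include <stddef.h>

// Hypothetical bump allocator over a caller-provided block of memory.
typedef struct {
    char *beg;
    char *end;
} Scratch;

// Returns size zeroed bytes from the arena, or a null pointer if the
// request does not fit. No alignment handling: byte buffers only.
static void *scratch_alloc(Scratch *s, ptrdiff_t size)
{
    if (size < 0 || size > s->end - s->beg) {
        return 0;
    }
    char *p = s->beg;
    s->beg += size;
    for (ptrdiff_t i = 0; i < size; i++) {
        p[i] = 0;
    }
    return p;
}

// The arena is passed by value, so anything allocated here is released
// automatically when the function returns.
static int process(Scratch scratch)
{
    char *buf = scratch_alloc(&scratch, 1<<12);  // not: char buf[1<<12]
    if (!buf) {
        return 0;  // out of scratch memory
    }
    buf[0] = 1;  // ... use buf ...
    return 1;
}
```

The stack frame of process stays tiny no matter how large the temporary is, so no stack probe is generated for it.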

MSVC doesn’t have libgcc of course, but it still generates stack probes both for growing the stack and for security checks. The latter requires kernel32.dll, so if I compile the same program with MSVC, I get a bunch of linker failures:

$ cl example.c /link /subsystem:console
... unresolved external symbol __imp_RtlCaptureContext ...
... and 7 more ...

Using /Gs1000000000 turns off the stack probes, /GS- turns off the checks, /stack commits a larger stack:

$ cl /GS- /Gs1000000000 example.c /link
     /subsystem:console /stack:0x200000,0x200000

Though, as before, better to avoid large stack frames in the first place.

Built-in functions… ugh

The three major C and C++ compilers — GCC, MSVC, Clang — share a common, evil weakness: “built-in” functions. No matter what, they each assume you will supply definitions for standard string functions at link time, particularly memset and memcpy. They do this no matter how many “seriously now, do not use standard C functions” options you pass. When you don’t link a CRT, you may need to define them yourself.

With GCC there’s a catch: it will transform your memset definition — that is, in a function named memset — into a call to itself. After all, it looks an awful lot like memset! This typically manifests as an infinite loop. Use -fno-builtin to prevent GCC from mis-compiling built-in functions.

Even with -fno-builtin, both GCC and Clang will continue inserting calls to built-in functions elsewhere. For example, making an especially large local variable (and using volatile to prevent it from being optimized out):

int mainCRTStartup(void)
{
    volatile char buf[1<<14] = {0};
    return 0;
}

As of this writing, the latest GCC and Clang will generate a memset call despite -fno-builtin:

$ cc -mno-stack-arg-probe -fno-builtin -nostdlib example.c
ld: ... undefined reference to `memset' ...

To be absolutely pure, you will need to address this in just about any non-trivial program. On the other hand, -nostartfiles will grab a definition from msvcrt.dll for you:

$ cc -nostartfiles example.c
$ objdump -p a.exe | grep -Fi .dll
        DLL Name: msvcrt.dll

To be clear, this is a completely legitimate and pragmatic route! You get the benefits of both worlds: the CRT is still out of the way, but there’s also no hassle from misbehaving compilers. If this sounds like a good deal, then do it! (For on-lookers feeling smug: there is no such easy, general solution for this problem on Linux.)

When you write your own definitions, I suggest putting each definition in its own section so that they can be discarded via -Wl,--gc-sections when unused:

__attribute((section(".text.memset")))
void *memset(void *d, int c, size_t n)
{
    // ...
}

So far, for all three compilers, I’ve only needed to provide definitions for memset and memcpy.
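For reference, a minimal byte-at-a-time sketch of both. I’ve used the hypothetical names xmemset and xmemcpy, plus a hypothetical SECTION macro, so that the block can be tried next to an ordinary libc. In a CRT-free build they would carry the real names and be compiled with -fno-builtin as described above.

```c
#include <stddef.h>

// Section names are toolchain-specific, so only apply them where the
// ".text.name" form is known to work (assumption: GCC-like, not Mach-O).
#if defined(__GNUC__) && !defined(__APPLE__)
#  define SECTION(name) __attribute((section(name)))
#else
#  define SECTION(name)
#endif

// Fill n bytes at d with the byte value c, returning d.
SECTION(".text.xmemset")
void *xmemset(void *d, int c, size_t n)
{
    unsigned char *p = d;
    for (size_t i = 0; i < n; i++) {
        p[i] = (unsigned char)c;
    }
    return d;
}

// Copy n bytes from s to d (regions must not overlap), returning d.
SECTION(".text.xmemcpy")
void *xmemcpy(void *restrict d, const void *restrict s, size_t n)
{
    unsigned char *dst = d;
    const unsigned char *src = s;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
    return d;
}
```

These are deliberately simple. They are slower than a tuned libc implementation, which rarely matters for the occasional compiler-inserted call.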

Stack alignment on 32-bit x86

GCC expects a 16-byte aligned stack and generates code accordingly. Such is dictated by the x64 ABI, so that’s a given on 64-bit Windows. However, the x86 ABIs only guarantee 4-byte alignment. If no care is taken to deal with it, there will likely be unaligned loads. Some may not be valid (e.g. SIMD) leading to a crash. UBSan disapproves, too. Fortunately there’s a function attribute for this:

__attribute((force_align_arg_pointer))
int mainCRTStartup(void)
{
    // ...
}

GCC will now align the stack in this function’s prologue. Adjustment is only necessary at entry points, as GCC will maintain alignment through its own frames. This includes all entry points, not just the program entry point, particularly thread start functions. Rule of thumb for i686 GCC: If WINAPI or __stdcall appears in a definition, the stack likely requires alignment.

__attribute((force_align_arg_pointer))
DWORD WINAPI mythread(void *arg)
{
    // ...
}

It’s harmless to use this attribute on x64. The prologue will just be a smidge larger. If you’re worried about it, use #ifdef __i686__ to limit it to 32-bit builds.

Putting it all together

If I’ve written a graphical application with WinMainCRTStartup, used large stack frames, marked my entry point as externally visible, plan to support 32-bit builds, and defined a couple of needed string functions, my optimal entry point may look something like:

#ifdef __GNUC__
__attribute((externally_visible))
#endif
#ifdef __i686__
__attribute((force_align_arg_pointer))
#endif
int WinMainCRTStartup(void)
{
    // ...
}

Then my “optimize all the things” release build may look something like:

$ cc -O3 -fno-builtin -Wl,--gc-sections -s -nostdlib -mwindows
     -fno-asynchronous-unwind-tables -o app.exe app.c -lkernel32

Or with MSVC:

$ cl /O2 /GS- app.c /link kernel32.lib /subsystem:windows

Or if I’m taking it easy maybe just:

$ cc -O3 -fno-builtin -s -nostartfiles -mwindows -o app.exe app.c

Or with MSVC (linker flags in source):

$ cl /O2 app.c

Let's implement buffered, formatted output

2023-02-13T00:00:00Z

This article was discussed on reddit.

When not using the C standard library, how does one deal with formatted output? Re-implementing the entirety of printf from scratch seems like a lot of work, and indeed it would be. Fortunately it’s rarely necessary. With the right mindset, and considering your program’s actual formatting needs, it’s not as difficult as it might appear. Since it goes hand-in-hand with buffering, I’ll cover both topics at once, including sprintf-like capabilities, which is where we’ll start.

The print-is-append mindset

Buffering amortizes the costs of write (and read) system calls. Many small writes are queued via the buffer into a few large writes. This isn’t just an implementation detail. It’s key in the mindset to tackle formatted output: Printing is appending.

The mindset includes the reverse: Appending is like printing. Consider this next time you reach for strcat or similar. Is this the appropriate destination for this data, or am I just going to print it — i.e. append it to another, different buffer — afterward?

This concept may sound obvious, but consider that there are major, popular programming paradigms where the norm is otherwise. I’ll pick on Python to illustrate, but it’s not alone.

print(f"found {count} items")

This line of code allocates a buffer; formats the value of the variable count into it; allocates a second buffer; copies into it the prefix ("found "), the first buffer, and the suffix (" items"); copies the contents of this second buffer into the standard output buffer; then discards the two temporary buffers. To see for yourself, use the CPython bytecode disassembler on it. (It is pretty neat that string formatting is partially implemented in the compiler and partially parsed at compile time.)

With the print-is-append mindset, you know it’s ultimately being copied into the standard output buffer, and that you can skip the intermediate appending and copying. Avoiding that pessimization isn’t just about the computer’s time, it’s even more about saving your own time implementing formatted output.

In C that line looks like:

printf("found %d items\n", count);

The format string is a domain-specific language (DSL) that is (usually) parsed and evaluated at run time. In essence it’s a little program that says:

Append "found " to the output buffer
Format the given integer into the output buffer
Append " items\n" to the output buffer

For sprintf the output buffer is caller-supplied instead of a buffered stream.

In this implementation we’re going to skip the DSL and express such “format programs” in C itself. It’s more verbose at the call site, but it simplifies the implementation. As a bonus, it’s also faster since the format program is itself compiled by the C compiler. In your own formatted output implementation you could write a printf that, following the format string, calls the append primitives we’ll build below.

Buffer implementation

Let’s begin by defining an output buffer. An output buffer tracks the total capacity and how much has been written. I’ll include a sticky error flag to simplify error checks. For a first pass we’ll start with a sprintf rather than full-blown printf because there’s nowhere yet for the data to go.

#define MEMBUF(buf, cap) {buf, cap, 0, 0}
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

I’m using unsigned char since these are bytes, best understood as unsigned (0–255), particularly important when dealing with encodings. I also wrote a “constructor” macro, MEMBUF, to help with initialization. Next we need a function to append bytes — the core operation:

void append(struct buf *b, unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

If there wasn’t room, it copies as much as possible and sets the error flag to indicate truncation. It doesn’t return the error. Rather than check after each append, the caller will check after multiple appends, effectively batching the checks into one check. The typical, expected case is that there is no error, so make that path fast.

Since it’s an easy point to miss: append is the only place in the entire implementation where bounds checking comes into play. Everything else can confidently throw bytes at the buffer without worrying if it fits. If it doesn’t, the sticky error flag will indicate such at a more appropriate time.
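To make the batching concrete, here is a small self-contained sketch that repeats the struct and append from above and adds a hypothetical greet function. It performs three appends and checks the flag once at the end:

```c
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

// Same as the article's append: copy what fits, flag truncation.
static void append(struct buf *b, unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

// Hypothetical caller: returns 1 if the whole message fit, 0 if truncated.
static int greet(struct buf *b)
{
    append(b, (unsigned char *)"hello, ", 7);
    append(b, (unsigned char *)"world", 5);
    append(b, (unsigned char *)"\n", 1);
    return !b->error;  // one check covers all three appends
}
```

With an 8-byte buffer the second append is cut short after one byte and the third appends nothing, yet the caller only learns about it at the single check.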

I could have used memcpy for the loop, but the goal is not to use libc. Besides, not using memcpy means we can pass a null pointer without making it a special exception.

append(b, 0, 0);  // append nothing (no-op)

I expect that static strings are common sources for append, so I’ll add a helper macro which gets the length as a compile-time constant. The null terminator will not be used.

#define APPEND_STR(b, s) append(b, s, sizeof(s)-1)

If that’s not clear yet, it will be once you see an example. It’s also useful to append single bytes:

void append_byte(struct buf *b, unsigned char c)
{
    append(b, &c, 1);
}

With primitive appends done, we can build ever “higher-level” appends. For example, to append a formatted long to the buffer:

void append_long(struct buf *b, long x)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;
    do {
        *--beg = '0' - t%10;
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    append(b, beg, end-beg);
}

By working from the negative end — recall that the negative range is larger than the positive — it supports the full range of signed long, whatever it happens to be on this host. With less than 50 lines of code we now have enough to format the example:

char message[256];
struct buf b = MEMBUF(message, sizeof(message));

APPEND_STR(&b, "found ");
append_long(&b, count);
APPEND_STR(&b, " items\n");
if (b.error) {
    // truncated
}

We can continue defining append functions for whatever types we need.

void append_ptr(struct buf *b, void *p)
{
    APPEND_STR(b, "0x");
    uintptr_t u = (uintptr_t)p;
    for (int i = 2*sizeof(u) - 1; i >= 0; i--) {
        append_byte(b, "0123456789abcdef"[(u>>(4*i))&15]);
    }
}

struct vec2 { int x, y; };

void append_vec2(struct buf *b, struct vec2 v)
{
    APPEND_STR(b, "vec2{");
    append_long(b, v.x);
    APPEND_STR(b, ", ");
    append_long(b, v.y);
    append_byte(b, '}');
}

Perhaps you want features like field width? Add a parameter for it… but only if you need it!
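As one possible shape for that, here is a sketch of a hypothetical append_long_width that left-pads with spaces. The struct, append, and append_byte are repeated from above so the block stands alone. The padding goes straight into the output buffer, so the temporary never needs to grow.

```c
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

static void append(struct buf *b, unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

static void append_byte(struct buf *b, unsigned char c)
{
    append(b, &c, 1);
}

// Like append_long, but left-padded with spaces to at least width bytes.
// Wider values are never truncated.
void append_long_width(struct buf *b, long x, int width)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;
    do {
        *--beg = '0' - t%10;
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    for (int pad = width - (int)(end - beg); pad > 0; pad--) {
        append_byte(b, ' ');
    }
    append(b, beg, end-beg);
}
```

Other options, such as zero padding or left justification, could be further parameters, though each one is more code to carry around.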

Float formatting

As mentioned before, precise float formatting is challenging because it’s full of edge cases. However, if you only need to output a simple format at reduced precision, it’s not difficult. To illustrate, this nearly matches %f, built atop append_long:

void append_double(struct buf *b, double x)
{
    long prec = 1000000;  // i.e. 6 decimals

    if (x < 0) {
        append_byte(b, '-');
        x = -x;
    }

    x += 0.5 / prec;  // round last decimal
    if (x >= (double)(-1UL>>1)) {  // out of long range?
        APPEND_STR(b, "inf");
    } else {
        long integral = x;
        long fractional = (x - integral)*prec;
        append_long(b, integral);
        append_byte(b, '.');
        for (long i = prec/10; i > 1; i /= 10) {
            if (i > fractional) {
                append_byte(b, '0');
            }
        }
        append_long(b, fractional);
    }
}

Output to a handle

So far this writes output to a buffer and truncates when it runs out of space. Usually we want this going to a sink, like a kernel object whether that be a file, pipe, socket, etc. to which we have a handle like a file descriptor. Instead of truncating, we flush the buffer to this sink, at which point there’s room for more output. The error flag is set if the flush fails, but this is essentially the same concept as before.

In these examples I will use a file descriptor int, but you can use whatever sort of handle is appropriate. I’ll add an fd field to the buffer and a new constructor macro:

#define MEMBUF(buf, cap) {buf, cap, 0, -1, 0}
#define FDBUF(fd, buf, cap) {buf, cap, 0, fd, 0}

struct buf {
    unsigned char *buf;
    int cap;
    int len;
    int fd;
    _Bool error;
};

The buffered stream will be polymorphic: Output can go to a memory buffer or to an operating system handle using the same append interface. This is a handy feature standard C doesn’t even have, though POSIX does in the form of fmemopen. Nothing else changes except append, which, if given a valid handle, will flush when full. Attempting to flush a memory buffer sets the error flag.

_Bool os_write(int fd, void *, int);

void flush(struct buf *b)
{
    b->error |= b->fd < 0;
    if (!b->error && b->len) {
        b->error |= !os_write(b->fd, b->buf, b->len);
        b->len = 0;
    }
}

I’ve arranged so that output stops when there’s an error. Also I’m using a hypothetical os_write in the platform layer as a full, unbuffered write. Note that unix write(2) experiences partial writes and so must be used in a loop. Win32 WriteFile doesn’t have partial writes, so on Windows an os_write could pass its arguments directly to the operating system.
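For the unix side, an os_write might look roughly like this sketch. It assumes a POSIX platform layer that links libc for write(2), retries on partial writes and on EINTR, and treats anything else as failure:

```c
#include <errno.h>
#include <unistd.h>

// Write all len bytes of buf to fd, looping over partial writes.
// Returns 1 on success, 0 on error.
_Bool os_write(int fd, void *buf, int len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t r = write(fd, p, (size_t)len);
        if (r < 0 && errno == EINTR) {
            continue;  // interrupted before writing anything, so retry
        }
        if (r <= 0) {
            return 0;  // real error, or no progress
        }
        p += r;
        len -= (int)r;
    }
    return 1;
}
```

The rest of the program never sees a partial write: flush either gets everything out or sets the sticky error flag.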

The program will need to call flush directly when it’s done writing output, or to display output early, e.g. line buffering. In append we’ll use a loop to continue appending and flushing until the input is consumed or an error occurs.

void append(struct buf *b, unsigned char *src, int len)
{
    unsigned char *end = src + len;
    while (!b->error && src<end) {
        int left = end - src;
        int avail = b->cap - b->len;
        int amount = avail<left ? avail : left;

        for (int i = 0; i < amount; i++) {
            b->buf[b->len+i] = src[i];
        }
        b->len += amount;
        src += amount;

        if (amount < left) {
            flush(b);
        }
    }
}

That completes formatted output! We can now do stuff like:

int main(void)
{
    unsigned char mem[1<<10];  // arbitrarily-chosen 1kB buffer
    struct buf stdout = FDBUF(1, mem, sizeof(mem));
    for (long i = 0; i < 1000000; i++) {
        APPEND_STR(&stdout, "iteration ");
        append_long(&stdout, i);
        append_byte(&stdout, '\n');
        // ...
    }
    flush(&stdout);
    return stdout.error;
}

Except for the lack of format DSL, this should feel familiar.
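If you do miss the DSL, it layers on top without much trouble. Below is a sketch of a hypothetical append_fmt that understands only %d, %s, and %%, walking the format string and dispatching to the append primitives. For brevity it uses the memory-only buffer and repeats the primitives so that it stands alone. Note that stdarg.h comes from the compiler, not libc.

```c
#include <stdarg.h>

struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

static void append(struct buf *b, unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

static void append_byte(struct buf *b, unsigned char c)
{
    append(b, &c, 1);
}

static void append_long(struct buf *b, long x)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;
    do {
        *--beg = '0' - t%10;
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    append(b, beg, end-beg);
}

// Hypothetical mini printf. Unknown directives set the error flag.
void append_fmt(struct buf *b, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    for (; *fmt; fmt++) {
        if (*fmt != '%') {
            append_byte(b, *fmt);
            continue;
        }
        switch (*++fmt) {
        case 'd':
            append_long(b, va_arg(ap, int));
            break;
        case 's':
            for (char *s = va_arg(ap, char *); *s; s++) {
                append_byte(b, *s);
            }
            break;
        case '%':
            append_byte(b, '%');
            break;
        default:  // unknown directive, or format ended after '%'
            b->error = 1;
            va_end(ap);
            return;
        }
    }
    va_end(ap);
}
```

The trade-off is the usual one for printf: the call site is shorter, but the compiler can no longer see the “format program” and type errors move to run time.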

Let's write a setjmp

2023-02-12T02:23:11Z

This article was discussed on Hacker News.

Yesterday I wrote that setjmp is handy and that it would be nice to have without linking the C standard library. It’s conceptually simple, after all. Today let’s explore some differently-portable implementation possibilities with distinct trade-offs. At the very least it should illuminate why setjmp sometimes requires the use of volatile.

First, a quick review: setjmp and longjmp are a form of non-local goto.

typedef void *jmp_buf[N];
int setjmp(jmp_buf);
void longjmp(jmp_buf, int);

Calling setjmp saves the execution context in a jmp_buf, and longjmp restores this context, returning the thread to this previous point of execution. This means setjmp returns twice: (1) after saving the context, and (2) from longjmp. To distinguish these cases, the first time it returns zero and the second time it returns the value passed to longjmp.

jmp_buf is an array of some platform-specific type and length. I’ll be using void pointers in this article because it’s a register-sized type that isn’t behind a typedef. Plus they print nicely in GDB as hexadecimal addresses, which made working this out easier.

Using GCC intrinsics

Let’s start with the easiest option. GCC has two intrinsics doing all the hard work for us: __builtin_setjmp and __builtin_longjmp. Its worst case jmp_buf is length 5, but the most popular architectures only use the first 3 elements. Clang supports these intrinsics as well for GCC compatibility.

Be mindful that the semantics are slightly different from the standard C definition, namely that you cannot use longjmp from the same function as setjmp. It also doesn’t touch the signal mask. However, it’s easier to use and you don’t need to worry about volatile.

// NOTE to copy-pasters: semantics differ slightly from standard C
typedef void *jmp_buf[5];
#define setjmp __builtin_setjmp
#define longjmp __builtin_longjmp

If you only care about GCC and/or Clang, then that’s it! It works as-is on every supported target and nothing more is needed. As a bonus, it will be more efficient than the libc version, though I should hope that won’t matter in practice. These are so awesome and convenient that I’m already second-guessing myself: “Do I really need to support other compilers…?”
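A minimal usage sketch, for GCC and Clang only. The names recover, fail, parse, and try_parse are hypothetical. Two details from the GCC documentation show up here: the jump must come from a different function than the one that called __builtin_setjmp, and the second argument to __builtin_longjmp must be the constant 1. I’ve marked the helpers noinline as a precaution so they cannot be merged into the caller.

```c
// Shared jump buffer: five pointer-sized slots, per the GCC documentation.
static void *recover[5];

__attribute__((noinline, noreturn))
static void fail(void)
{
    __builtin_longjmp(recover, 1);  // second argument must be 1
}

__attribute__((noinline))
static void parse(int input)
{
    if (input < 0) {
        fail();  // abandons parse and lands back in try_parse
    }
    // ... normal processing ...
}

// Returns 1 on success, 0 if parse bailed out via fail.
int try_parse(int input)
{
    if (__builtin_setjmp(recover)) {
        return 0;  // second return, via __builtin_longjmp
    }
    parse(input);
    return 1;
}
```

Since the value is always 1, the macro trick above only works for callers that pass 1 to longjmp.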

Using assembly

If I want to support more compilers I’ll need to write it myself. It’s also an excuse to dig into the details. The execution context is no more than an array of saved registers, and longjmp is merely restoring those registers. One of the registers is the instruction pointer, and setting the instruction pointer is called a jump.

Since we’re talking about registers, that means assembly. We’ll also need to know the target’s calling convention, so this really narrows things down. This implementation will target x86-64, a.k.a x64, Windows, but it will support MSVC as an additional compiler. So it’s a different kind of portability. I’ll start with GCC via w64devkit then massage it into something MSVC can use.

I mentioned before that setjmp returns twice. So to return a second time we just need to simulate a normal function return. Obviously that includes restoring the stack pointer like the ret instruction, but it also means preserving all the non-volatile registers a callee is supposed to preserve. These will all go in the execution context.

The x64 calling convention specifies 9 non-volatile registers: rsp, rbp, rbx, rdi, rsi, r12, r13, r14, and r15. We’ll also need the instruction pointer, rip, making it 10 total.

typedef void *jmp_buf[10];

setjmp assembly

The tricky issue is that we need to save the registers immediately inside setjmp before the compiler has manipulated them in a function prologue. That will take more than mere inline assembly. We’ll start with a naked function, which means that GCC will not create a prologue or epilogue. However, that means no local variables, and the function body will be limited to inline assembly, including a ret instruction for the epilogue.

__attribute__((naked))
int setjmp(jmp_buf buf)
{
    __asm(
        // ...
    );
}

The x64 calling convention uses rcx for the first pointer argument, so that’s where we’ll find buf. I’ve arbitrarily decided to store rip first, then the other registers in order. However, the current value of rip isn’t the one we need. The rip we need was just pushed on top of the stack by the caller. I’ll read that off the stack into a scratch register, rax, and then store it in the first element of buf.

    mov (%rsp), %rax
    mov %rax,  0(%rcx)

The stack pointer, rsp, is also indirect since I want the pointer just before rip was pushed, as it would be just after a ret. I use a lea, load effective address, to add 8 bytes (recall: stack grows down), placing the result in a scratch register, then write it into the second element of buf (i.e. 8 bytes into %rcx).

    lea 8(%rsp), %rax
    mov %rax,  8(%rcx)

Everything else is a matter of elbow grease.

    mov %rbp, 16(%rcx)
    mov %rbx, 24(%rcx)
    mov %rdi, 32(%rcx)
    mov %rsi, 40(%rcx)
    mov %r12, 48(%rcx)
    mov %r13, 56(%rcx)
    mov %r14, 64(%rcx)
    mov %r15, 72(%rcx)

With all work complete, return zero to the caller.

    xor %eax, %eax
    ret

Putting it all together, and avoiding a -Wunused-variable:

__attribute__((naked,returns_twice))
int setjmp(jmp_buf buf)
{
    (void)buf;
    __asm(
        "mov (%rsp), %rax\n"
        "mov %rax,  0(%rcx)\n"
        "lea 8(%rsp), %rax\n"
        "mov %rax,  8(%rcx)\n"
        "mov %rbp, 16(%rcx)\n"
        "mov %rbx, 24(%rcx)\n"
        "mov %rdi, 32(%rcx)\n"
        "mov %rsi, 40(%rcx)\n"
        "mov %r12, 48(%rcx)\n"
        "mov %r13, 56(%rcx)\n"
        "mov %r14, 64(%rcx)\n"
        "mov %r15, 72(%rcx)\n"
        "xor %eax, %eax\n"
        "ret\n"
    );
}

Also take note of the returns_twice attribute. It informs GCC of this function’s unusual nature, saying the function doesn’t preserve most non-volatile registers, and induces -Wclobbered diagnostics. Technically this means we could get away with saving only rip, rsp, and rbp — exactly as __builtin_setjmp does — but we’ll need the others for MSVC anyway.

longjmp assembly

In longjmp we need to restore all those registers. For purely aesthetic reasons I’ve decided to do it in reverse order. Everything but rip is easy.

    mov 72(%rcx), %r15
    mov 64(%rcx), %r14
    mov 56(%rcx), %r13
    mov 48(%rcx), %r12
    mov 40(%rcx), %rsi
    mov 32(%rcx), %rdi
    mov 24(%rcx), %rbx
    mov 16(%rcx), %rbp
    mov  8(%rcx), %rsp

The instruction set doesn’t have direct access to rip. It will be a jmp instead of mov, but before jumping we’ll need to prepare the return value. The x64 calling convention says the second argument is passed in rdx, so move that to rax, then jmp to the caller. It’s only a 32-bit operand, C int, so edx instead of rdx.

    mov %edx, %eax
    jmp *0(%rcx)

Putting it all together, and adding the noreturn attribute:

__attribute__((naked,noreturn))
void longjmp(jmp_buf buf, int ret)
{
    (void)buf;
    (void)ret;
    __asm(
        "mov 72(%rcx), %r15\n"
        "mov 64(%rcx), %r14\n"
        "mov 56(%rcx), %r13\n"
        "mov 48(%rcx), %r12\n"
        "mov 40(%rcx), %rsi\n"
        "mov 32(%rcx), %rdi\n"
        "mov 24(%rcx), %rbx\n"
        "mov 16(%rcx), %rbp\n"
        "mov  8(%rcx), %rsp\n"
        "mov %edx, %eax\n"
        "jmp *0(%rcx)\n"
    );
}

The C standard says that if ret is zero then longjmp will return 1 from setjmp instead. I leave that detail as a reader exercise. Otherwise this is a complete, working setjmp. It works perfectly when I swap it in for setjmp.h in my u-config test suite.

Considering volatile

Now that you’ve seen the guts, let’s talk about volatile and why it’s necessary. Consider this function, example, which calls a work function that may return through setjmp (e.g. on failure).

void work(jmp_buf);

int example(void)
{
    int r = 0;
    jmp_buf buf;
    if (!setjmp(buf)) {
        // first return
        r = 1;
        work(buf);
    } else {
        // second return
    }
    return r;
}

It stores to r after the first setjmp return, then loads r after the second setjmp return. However, r may have been stored in the execution context. Since it’s used across function calls, it would be reasonable to store this variable in a non-volatile register like ebx. If so, it will be restored to its value at the moment of the first call to setjmp, in which case the old r would be read after restoration by longjmp. If it’s not stored in a register, but on the stack, then on the second return the function will read the latest value out of the stack. In practice, if work returns through longjmp, this function may return either 0 or 1, probably determined by the optimization level.

The solution is to qualify r with volatile, which forces the compiler to store the variable on the stack and never cache it in a register.

    volatile int r = 0;

Though since our setjmp is marked returns_twice, GCC will never store r in a register across setjmp calls. This potentially hides a bug in the program that would occur under some other compilers, but GCC will (usually) warn about it.
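For completeness, here is the corrected example as a self-contained sketch. It uses the standard setjmp.h rather than the hand-written version so that it builds anywhere, and the body of work is a stand-in that always fails:

```c
#include <setjmp.h>

// Stand-in for real work: always bails out through longjmp.
static void work(jmp_buf buf)
{
    longjmp(buf, 1);
}

int example(void)
{
    volatile int r = 0;  // volatile: the store below survives the longjmp
    jmp_buf buf;
    if (!setjmp(buf)) {
        // first return
        r = 1;
        work(buf);
    } else {
        // second return
    }
    return r;
}
```

With the qualifier in place the function returns 1 regardless of optimization level, since the C standard only leaves non-volatile locals indeterminate after the jump.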

Pure assembly and MSVC

MSVC doesn’t understand __attribute__ nor the inline assembly, so it cannot compile these functions. I could compile my setjmp with GCC and the rest of the program with MSVC, which means I need two compilers. Instead, I’ll move to pure assembly, assemble with GNU as (TODO: port to MASM?) so we’ll only need a tiny piece of the GNU toolchain.

	.globl setjmp
setjmp:
	mov (%rsp), %rax
	mov %rax,  0(%rcx)
	lea 8(%rsp), %rax
	mov %rax,  8(%rcx)
	mov %rbp, 16(%rcx)
	mov %rbx, 24(%rcx)
	mov %rdi, 32(%rcx)
	mov %rsi, 40(%rcx)
	mov %r12, 48(%rcx)
	mov %r13, 56(%rcx)
	mov %r14, 64(%rcx)
	mov %r15, 72(%rcx)
	xor %eax, %eax
	ret

	.globl longjmp
longjmp:
	mov 72(%rcx), %r15
	mov 64(%rcx), %r14
	mov 56(%rcx), %r13
	mov 48(%rcx), %r12
	mov 40(%rcx), %rsi
	mov 32(%rcx), %rdi
	mov 24(%rcx), %rbx
	mov 16(%rcx), %rbp
	mov  8(%rcx), %rsp
	mov %edx, %eax
	jmp *0(%rcx)

Then some declarations in C:

typedef void *jmp_buf[10];
int setjmp(jmp_buf);
_Noreturn void longjmp(jmp_buf, int);

I’ll need to enable C11 for that _Noreturn in MSVC. Assemble, compile, and link:

$ as -o setjmp.obj setjmp.s
$ cl /std:c11 program.c setjmp.obj

That generally works! If I rename to xsetjmp and xlongjmp to avoid conflicting with the CRT definitions, drop them into the u-config test suite in place of setjmp.h, then compile with MSVC, it passes all tests using my alternate implementation in MSVC as well as GCC. Pretty cool!

Takeaway

I’m not sure if I’ll ever use the assembly, but writing this article led me to try the GCC intrinsics, and I’m so impressed I’m still thinking about ways I can use them. My main thought is out-of-memory situations in arena allocators, using a non-local exit to roll back to a savepoint, even if just to return an error. This is nicer than either terminating the program or handling OOM errors on every allocation. Very roughly:

typedef struct {
    size_t cap;
    size_t off;
    void *jmp_buf[5];
} Arena;

// Place an arena and savepoint an out-of-memory jump.
#define OOM(a, m, n) __builtin_setjmp((a = place(m, n))->jmp_buf)

// Place a new arena at the front of the buffer.
Arena *place(void *mem, size_t size)
{
    assert(size >= sizeof(Arena));
    Arena *a = mem;
    a->cap = size;
    a->off = sizeof(Arena);
    return a;
}

void *alloc(Arena *a, size_t size)
{
    size_t avail = a->cap - a->off;
    if (avail < size) {
        __builtin_longjmp(a->jmp_buf, 1);
    }
    void *p = (char *)a + a->off;
    a->off += size;
    return p;
}

Usage would look like:

int compute(void *workmem, size_t memsize)
{
    Arena *arena;
    if (OOM(arena, workmem, memsize)) {
        // jumps here when out of memory
        return COMPUTE_OOM;
    }

    Thing *t = PUSHSTRUCT(arena, Thing);
    // ...

    return COMPUTE_OK;
}

More granular snapshots can be made further down the stack by allocating subarenas out of the main arena. I have yet to try this out in a practical program.
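
For completeness, here is a minimal, self-contained sketch of the same idea using the standard setjmp and longjmp rather than the GCC built-ins, so the savepoint is a jmp_buf instead of a void *[5]. The chunked workload, the n parameter, and the COMPUTE_* values are hypothetical, made up for illustration. The buffer passed in is assumed to be suitably aligned for the Arena struct.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef struct {
    size_t  cap;
    size_t  off;
    jmp_buf oom;  // standard jmp_buf standing in for the void *[5] above
} Arena;

// Place a new arena at the front of the buffer (assumed suitably aligned).
static Arena *place(void *mem, size_t size)
{
    assert(size >= sizeof(Arena));
    Arena *a = mem;
    a->cap = size;
    a->off = sizeof(Arena);
    return a;
}

// Allocate size bytes, or jump to the arena's savepoint when out of memory.
static void *alloc(Arena *a, size_t size)
{
    size_t avail = a->cap - a->off;
    if (avail < size) {
        longjmp(a->oom, 1);
    }
    void *p = (char *)a + a->off;
    a->off += size;
    return p;
}

enum { COMPUTE_OK, COMPUTE_OOM };  // hypothetical status codes

// Hypothetical workload: allocate n chunks of 64 bytes from the arena.
static int compute(void *workmem, size_t memsize, int n)
{
    Arena *arena = place(workmem, memsize);
    if (setjmp(arena->oom)) {
        // jumps here when out of memory
        return COMPUTE_OOM;
    }
    for (int i = 0; i < n; i++) {
        char *chunk = alloc(arena, 64);
        chunk[0] = (char)i;
    }
    return COMPUTE_OK;
}
```

Note that no allocation site checks for failure: the only error path is the one at the savepoint.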

My review of the C standard library in practice

2023-02-11T03:04:11Z

This article was discussed on Hacker News and critiqued on Wandering Thoughts.

In general, when working in C I avoid the standard library, libc, as much as possible. If possible I won’t even link it. For people not used to working and thinking this way, the typical response is confusion. Isn’t that like re-inventing the wheel? For me, libc is a wheel barely worth using — too many deficiencies in both interface and implementation. Fortunately, it’s easy to build a better, simpler wheel when you know the terrain ahead of time. In this article I’ll review the functions and function-like macros of the C standard library and discuss practical issues I’ve faced with them.

Fortunately the flexibility of C-in-practice makes up for the standard library. I already have all the tools at hand to do what I need — not beholden to any runtime.

How does one write portable software while relying little on libc? Implement the bulk of the program as platform-agnostic, libc-free code then write platform-specific code per target — a platform layer — each in its own source file. The platform code is small in comparison: mostly unportable code, perhaps raw system calls, graphics functions, or even assembly. It’s where you get access to all the coolest toys. On some platforms it will still link libc anyway because it’s got useful platform-specific features, or because it’s mandatory.

The discussion below is specifically about standard C. Some platforms provide special workarounds for their standard function shortcomings, but that’s irrelevant. If I need to use a non-standard function then I’m already writing platform-specific code and I might as well take full advantage of that fact, bypassing the original issue entirely by calling directly into the platform.

The rest of this article goes through the standard library listing in the C18 draft mostly in order.

assert and abort

I wrote about the assert macro last year. While C assertions are better than the same in any other language I know — a trap without first unwinding the stack — the typical implementation doesn’t have the courtesy to trap in the macro itself, creating friction. Or worse, it doesn’t trap at all and instead exits the process normally with a non-zero status. It’s not optimized for debuggers.

My non-trivial programs quickly pick up this definition instead, adjusted later as needed:

#define ASSERT(c) if (!(c)) __builtin_trap()

There’s no diagnostic, but I usually don’t want that anyway. The vast majority of the time these are caught in a debugger, and I don’t need or want a diagnostic.

I have no objections to static_assert, but it’s also not part of the runtime.

Math functions

By this I mean all the stuff in math.h, complex.h, etc. It’s good that these are, in practice, pseudo-intrinsics. They’re also one of the more challenging parts of libc to replace. It prioritizes precision more than I usually need, but that’s a reasonable default.

Character classification and mapping

Includes isalnum, isalpha, isascii, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit, tolower, and toupper. The interface is misleading, almost maliciously so, and these functions are misused in every case I’ve seen in the wild. If you see #include <ctype.h> in a source file then it’s probably defective. I’ve been guilty of it myself. When it’s up to me, these functions are banned without exception.

Their prototypes are all shaped like so:

int isXXXXX(int);

However, the domain of the input is unsigned char plus EOF. Negative arguments, aside from EOF, are undefined behavior, despite the obvious use case being strings. So this is incorrect:

char *s = ...;
if (isdigit(s[0])) {   // WRONG!
    ...
}

If char is signed, as it is on x86, then it’s undefined for arbitrary strings, s. Some implementations even crash on such inputs.

If the argument was unsigned char, then it would at least truncate into range, usually leading to the desired result. (Though not so if passing Unicode code points, which is an odd mistake to make.) Except that it has to accommodate EOF. Why that? These functions are defined for use with fgetc, not strings!

You could patch over it with truncation by masking:

if (isdigit(s[0] & 255)) {
    ...
}

However, you’re still left with locales. This is a bit of global state that changes how a number of libc functions behave, including character classification. While locales have some niche uses, most of the time the behavior is surprising and undesirable. It’s also bad for performance. I’ve developed a habit of using LC_ALL=C before some GNU programs so that they behave themselves. If you’re parsing a fixed format that doesn’t adapt to locale — virtually everything — you definitely do not want locale-based character classification of input.

Since the interface and behavior are both unsuited for most uses, you’re better off making your own range checks or lookup tables for your use case. When you name it, probably avoid starting the function with is since it’s reserved.

_Bool xisdigit(char c)
{
    return c>='0' && c<='9';
}

I used char, but this still works fine for naive UTF-8 parsing.
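
As a sketch of the lookup-table option: a 256-entry table indexed through unsigned char, so every possible char value lands in range and no locale is involved. The names xinit, xclass, and the X_* flags are made up for this illustration, and the table assumes an ASCII-compatible character set.

```c
#include <assert.h>

enum { X_DIGIT = 1<<0, X_UPPER = 1<<1, X_LOWER = 1<<2, X_SPACE = 1<<3 };

static unsigned char xtable[256];

// Fill the table once at startup. Assumes ASCII-compatible characters.
static void xinit(void)
{
    for (int c = '0'; c <= '9'; c++) xtable[c] |= X_DIGIT;
    for (int c = 'A'; c <= 'Z'; c++) xtable[c] |= X_UPPER;
    for (int c = 'a'; c <= 'z'; c++) xtable[c] |= X_LOWER;
    xtable[' ']  |= X_SPACE;
    xtable['\t'] |= X_SPACE;
    xtable['\n'] |= X_SPACE;
    xtable['\r'] |= X_SPACE;
}

// Non-zero if c belongs to any class in mask. The cast maps any char
// value, signed or not, into 0..255, so negative inputs are well-defined.
static int xclass(char c, int mask)
{
    assert(mask != 0);
    return xtable[(unsigned char)c] & mask;
}
```

Bytes outside ASCII, such as those in UTF-8 multi-byte sequences, simply belong to no class.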

errno

Without libc you don’t have to use this global, hopefully thread-local, pseudo-variable. Good riddance. Return your errors, and use a struct if necessary.

locales

As discussed, locales have some niche uses — formatting dates comes to mind — but what little use they have is trapped behind global state set by setlocale, making it sometimes impossible to use correctly.

On Windows I’ve instead used GetLocaleInfoW to get information like, “What is the local name of the current month?”

setjmp and longjmp

Sometimes tricky to use correctly, particularly with regard to qualifying local variables as volatile. It can compose with region-based allocation to automatically and instantly free all objects created between set and jump. These macros are fine, but don’t overdo it.

variable arguments

Variadic functions are occasionally useful, and the va_start/va_end macros make them possible. These are, unfortunately, notoriously complex because calling conventions do not go out of their way to make them any simpler. They require compiler assistance, and in practice they’re implemented as part of the compiler rather than libc. They’re okay, but I can live without them.

signals

While important on unix-like systems, signals as defined in the C standard library are essentially useless. If you’re dealing with signals, or even something like signals, it will be in platform-specific code that goes beyond the C standard library.

atomics

I’ve used the _Atomic qualifier in examples since it helps with conciseness, but I hardly use it in practice. In part because it has the inconvenient effect of bleeding into APIs and ABIs. As with volatile, C is using the type system to indirectly achieve a goal. Types are not atomic, loads and stores are atomic. Predating standardization, C implementations have been expressing these loads and stores using intrinsics, functions, or macros rather than through types.

The _Atomic qualifier provides access to the most basic and most strict atomic operations without libc. That is, it’s implemented purely in the compiler. However, everything outside that involves libc, and potentially even requires linking a special atomics library.

Even more, one major implementation (MSVC) still doesn’t support C11 atomics. Anywhere I care about using C atomics, I can already use the richer set of GCC built-ins, which Clang also supports. If I’m writing code intended for Windows, I’ll use the interlocked macros, which work across all the compilers for that platform.
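
To illustrate atomicity as a property of operations rather than types, here is a sketch using the GCC/Clang __atomic built-ins on a plain int. It only builds with those compilers, and the function names are hypothetical.

```c
#include <assert.h>

// Atomically take the next ticket number from a plain int counter.
// Returns the value the counter held before the increment.
static int ticket(int *counter)
{
    assert(counter);
    return __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
}

// Atomically read the counter with acquire ordering.
static int peek(int *counter)
{
    assert(counter);
    return __atomic_load_n(counter, __ATOMIC_ACQUIRE);
}
```

The counter keeps its ordinary type in structs and function signatures, so nothing atomic-specific leaks into the API or ABI.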

stdio

Standard input and output, stdio, is perhaps the primary driving factor for my own routing around libc. Nearly every program does some kind of input or output, but going through stdio makes things harder.

To read or write a file, one must first open it, e.g. fopen. However, all the implementations for one platform in particular do not allow fopen to access most of the file system, so using libc immediately limits the program’s capabilities on that platform.

The standard library distinguishes between “text” and “binary” streams. It makes no difference on unix-like platforms, but it does on others, where input and output are translated. Besides destroying your data, text streams have terrible performance. Opening everything in binary mode is a simple enough workaround, but standard input, output, and error are opened as text streams, and there is no standard function for changing them to binary streams.

When using fread, some implementations use the entire buffer as a temporary work space, even if it returns a length less than the entire buffer. So the following won’t work reliably:

char buf[N] = {0};
fread(buf, N-1, 1, f);
puts(buf);

It may print junk after the expected output because fread overwrote the zeroes beyond it.
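
A more defensive pattern, as a sketch: ask fread for bytes rather than one large object, then trust only what the returned count covers and terminate the string yourself. The helper name readstr is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

// Read up to cap-1 bytes from f into buf and null-terminate the result.
// Returns the number of bytes read. Bytes in buf beyond the terminator
// are left unspecified and must not be relied upon.
static size_t readstr(FILE *f, char *buf, size_t cap)
{
    assert(cap > 0);
    size_t len = fread(buf, 1, cap-1, f);
    buf[len] = 0;
    return len;
}
```

With the element size set to 1, the return value is a byte count, which the original call (one element of N-1 bytes) cannot provide.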

Streams are buffered, and there’s no reliable access to unbuffered input and output, such as when an application is already buffering, perhaps as a natural consequence of how it works. There’s setvbuf and _IONBF (“unbuffered”), but in at least one case this really just means “one byte at a time.” It’s common for my libc-using programs to end up with double buffering since I can’t reliably turn off stdio buffering.

Typical implementations assume streams will be used by multiple threads, and so every access goes through a mutex. This causes terrible performance for small reads and writes — exactly the case buffering is supposed to most help. Not only is this unusual, such programs are probably broken anyway — oblivious to the still-present race conditions — and so stdio is optimized for the unusual, broken case at the cost of the most needed typical case.

There is no reliable way to interactively input and display Unicode text. The C standard makes vague concessions for dealing with “wide characters” but it’s useless in practice. I’ve tried! The most common need for me is printing a path to standard error such that it displays properly to the user.

Seek offsets are limited to long. Some real implementations can’t even open files larger than 2GiB.

Rather than deal with all this, I add a couple of unbuffered I/O functions to the platform layer, then put a small buffered stream implementation in the application which flushes to the platform layer. UTF-8 for text input and output, and if the platform layer detects it’s connected to a terminal or console, it does the appropriate translation. It doesn’t take much to get something more reliable than stdio. The details are the topic for a future article, especially since you might be wondering about formatted output.

As for formatted input, don’t ever bother with scanf.

Numeric conversion

Float conversion is generally a difficult problem, especially if you care about round trips. It’s one of the better and most useful parts of libc. Though even with libc it’s still difficult to get the simplest or shortest round-trip representation. Also, this is an area where changing locales can be disastrous!

The question is then: How much does this matter in your application’s context? There’s a good chance you only need to display a rounded, low-precision representation of a float to users — perhaps displaying a player’s position in a debug window, etc. Or you only need to parse medium-precision non-integral inputs following a relatively simple format. These are not so difficult.

Parsing (atoi, strtol, strtod, etc.) requires null-terminated strings, which is generally inconvenient. These integers likely came from something not null-terminated like a file, and so I need to first append a null terminator. I can’t just feed it a token from a memory-mapped file. Even when using libc, I often write my own integer parser anyway since the libc parsers lack an appropriate interface.

Update: NRK points out that unsigned integer parsing treats negative inputs as in range. This is both surprising and rarely useful. Looking more closely at the specification, I see it is also affected by locale. Given these revelations, I would ban without exception atoi, atol, strtoul, and strtoull, and avoid strtol and strtoll.

Formatting integers is easy. Parsing integers within a narrow range (e.g. up to a million) is easy. Parsing integers to the very limits of the numeric type is tricky because every operation must guard against overflow regardless of signed or unsigned. Fortunately the first two are common and the last is rarely necessary!
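
A minimal sketch of the easy case: a decimal parser that takes a pointer and a length, so it works directly on a token inside a larger buffer, and that only accepts values in a narrow range. The name parseint and the one-million limit are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

// Parse buf[0..len) as an unsigned decimal integer in 0..1000000.
// Returns the value, or -1 if the input is empty, contains a
// non-digit, or exceeds the range. No null terminator is required.
static long parseint(const char *buf, ptrdiff_t len)
{
    assert(len >= 0);
    if (len == 0) {
        return -1;
    }
    long r = 0;
    for (ptrdiff_t i = 0; i < len; i++) {
        int digit = buf[i] - '0';
        if (digit < 0 || digit > 9) {
            return -1;
        }
        // r is at most 1000000 here, so r*10 + 9 cannot overflow a long
        r = r*10 + digit;
        if (r > 1000000) {
            return -1;
        }
    }
    return r;
}
```

Because the range check runs on every digit, the accumulator never gets near the limits of the type, which is what keeps this case simple.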

Random numbers

We have rand, srand, and RAND_MAX. As a PRNG enthusiast, I could never recommend using this under any circumstances. It’s a PRNG with mediocre output, poor performance, and global state. RAND_MAX being unknown ahead of time makes it even more difficult to make effective use of rand. You can do better on all dimensions with just a few lines of code.

To make matters worse, typical implementations expect it to be accessed concurrently from multiple threads, so they wrap it in a mutex. Again, it optimizes for the unusual, broken case — threads fighting each other over non-deterministic racy results from a deterministic PRNG — at the cost of the typical, sensible case. Programs relying on that mutex are already broken.
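
As one illustration of “a few lines of code,” here is a sketch of a truncated 64-bit linear congruential generator using the well-known constants from Knuth’s MMIX. The state belongs to the caller, so there is no global state and no lock. The names rand32 and randrange are hypothetical, and the range reduction has a slight bias that is fine for casual use but not for anything statistical or cryptographic.

```c
#include <assert.h>
#include <stdint.h>

// Advance the caller-owned state and return the upper 32 bits, which
// are the highest quality bits of a power-of-two modulus LCG.
static uint32_t rand32(uint64_t *state)
{
    *state = *state*6364136223846793005u + 1442695040888963407u;
    return (uint32_t)(*state >> 32);
}

// Return a value in [0, n) by multiply-and-shift (slightly biased).
static uint32_t randrange(uint64_t *state, uint32_t n)
{
    assert(n > 0);
    return (uint32_t)(((uint64_t)rand32(state) * n) >> 32);
}
```

The output range is known ahead of time (a full 32 bits), and each thread can simply own its own state.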

Memory allocation

Includes malloc, calloc, realloc, free, etc. Okay, but in practice used too granularly and too much such that many C programs are tangles of lifetimes. Sometimes I wish there was a standard region allocator so that independently-written libraries could speak a common, sensible, caller-controlled allocation interface.

A major standardization failure here has been not moving size computations into the allocators themselves. calloc is a start: You say how big and how many, and it works out the total allocation, checking for overflow. There should be more of this, even if just to discourage individual allocations and encourage group allocations.
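
In the spirit of calloc, a sketch of a helper that takes a count and an element size and does the overflow check itself. The name allocarray is hypothetical, and unlike calloc it does not zero the memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Allocate room for count objects of the given size. Returns null if
// count*size would overflow or if the underlying allocation fails.
static void *allocarray(ptrdiff_t count, ptrdiff_t size)
{
    assert(count > 0);
    assert(size > 0);
    if (count > PTRDIFF_MAX/size) {
        return 0;  // count*size would overflow
    }
    return malloc((size_t)count * (size_t)size);
}
```

Callers never multiply sizes themselves, so there is one overflow check in one place instead of one at every call site.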

There are some edge cases around zero sizes, like malloc(0), and the standard leaves the behavior a bit too open ended. However, if your program is so poorly structured that it may pass zero to malloc then you have bigger problems anyway.

Communication with the environment

getenv is straightforward, though I’d prefer to just access the environment block directly, a la the non-standard third argument to main.

exit is fine, but atexit is jank.

system is essentially useless in practice.

Sorting and searching

qsort is ~~fine~~ poor because it lacks a context argument. Quality varies. Not difficult to implement from scratch if necessary. I rarely need to sort.

Similar story for bsearch. Though if I need a binary search over an array, bsearch probably isn’t sufficient because I usually want to find lower and upper bounds of a range.
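
For reference, a sketch of the lower and upper bound searches that bsearch cannot express, here over a sorted array of int. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// Index of the first element >= key in sorted a[0..len), or len if none.
static ptrdiff_t lowerbound(const int *a, ptrdiff_t len, int key)
{
    assert(len >= 0);
    ptrdiff_t lo = 0, hi = len;
    while (lo < hi) {
        ptrdiff_t mid = lo + (hi - lo)/2;
        if (a[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Index of the first element > key in sorted a[0..len), or len if none.
static ptrdiff_t upperbound(const int *a, ptrdiff_t len, int key)
{
    assert(len >= 0);
    ptrdiff_t lo = 0, hi = len;
    while (lo < hi) {
        ptrdiff_t mid = lo + (hi - lo)/2;
        if (a[mid] <= key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```

The half-open range from lowerbound to upperbound covers every element equal to the key, and is empty when the key is absent.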

Multi-byte encodings and wide characters

mblen, mbtowc, wctomb, mbstowcs, and wcstombs are connected to the locale system and don’t necessarily operate on any particular encodings like UTF-8, which makes them unreliable. This is the case for all the other wide character functionality, which is quite a few functions. Fortunately I only ever need wide characters on one platform in particular, not in portable code.

More recent are mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb where the “wide” side is specified (UTF-16, UTF-32) but not the multi-byte side. Limited support in implementations and not particularly useful.

Strings

Like ctype.h, string.h is another case where everything is terrible, and some functions are virtually always misused.

memcpy, memmove, memset, and memcmp are fine except for one issue: it is undefined behavior to pass a null pointer to these functions, even with a zero size. That’s ridiculous. A null pointer legitimately and usefully points to a zero-sized object. As mentioned, even malloc(0) is permitted to behave this way. These functions would be fine if not for this one defect.
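
If you do keep memcpy, a thin wrapper sidesteps the defect by never calling it with a zero length. A minimal sketch, with copy as a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Like memcpy, but a zero length is always valid, even with null
// pointers, because memcpy is only reached when len is non-zero.
static void *copy(void *dst, const void *src, ptrdiff_t len)
{
    assert(len >= 0);
    if (len) {
        memcpy(dst, src, (size_t)len);
    }
    return dst;
}
```

This lets an empty, null-pointer string or buffer flow through copying code without special cases at each call site.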

strcpy, strncpy, strcat, and strncat have no legitimate uses and their use indicates confusion. As such, any code calling them is suspect and should receive extra scrutiny. In fact, I have yet to see a single correct use of strncpy in a real program. (Usage hint: the length argument should refer to the destination, not the source.) When it’s up to me, these functions are banned without exception. This applies equally to non-standard versions of these functions like strlcpy.

strlen has legitimate uses, but is used too often. It should only appear at system boundaries when receiving strings of unknown size (e.g. argv, getenv), and should never be applied to a static string. (Hint: you can use sizeof on those.)

When I see strchr, strcmp or strncmp I wonder why you don’t know the lengths of your strings. On the other hand, strcspn, strpbrk, strrchr, strspn, and strstr do not have mem equivalents, though the null termination requirement hurts their usefulness.

strcoll and strxfrm depend on locale and so are at best niche. Otherwise unpredictable. Avoid.

memchr is fine except for the aforementioned null pointer restriction, though it comes up less often here.

strtok has hidden global state. Besides that, how long is the returned token? It knew the length before it returned. You mean I have to call strlen to find out? Banned.

strerror has an obvious, simple, robust solution: return a pointer to a static string in a lookup table corresponding to the error number. No global state, thread-safe, re-entrant, and the returned string is good until the program exits. Some implementations do this, but unfortunately it’s not true for at least one real world implementation, which instead writes to a shared, global buffer. Hopefully you were avoiding errno anyway.

Threads

Introduced in C11, but never gained significant traction. Anywhere you can use C threads you can use pthreads, which are better anyway.

Besides, thread creation probably belongs in the platform layer anyway.

Time functions

Fairly niche, and I can’t remember using any of these except for time and clock for seeding.

Wrap-up

I hand-waved away a long list of vestigial wide character functions, but the above is pretty much all there is to the C standard library. The only things I miss when avoiding it altogether are the math functions, and occasionally setjmp/longjmp. Everything else I can do better myself, with little difficulty, starting from the platform layer.

All of the C implementations I had in mind above are very old. They will rarely, if ever, change, just accrue. There isn’t a lot of innovation happening in this space, which is fine since I like stable targets. If you would like to see interesting innovation, check out what Cosmopolitan Libc is up to. It’s what I imagine C could be if it continued evolving along practical dimensions.

u-config: a new, lean pkg-config clone

2023-01-18T06:39:51Z

This article was discussed on Hacker News.

In my common SDL2 mistakes listing, the first was about winging it instead of using the sdl2-config script. It’s just one of three official options for portably configuring SDL2, but I had dismissed the others from consideration. One is the pkg-config facility common to unix-like systems. However, the SDL maintainers recently announced SDL3, which will not have a sdl3-config. The concept has been deprecated in favor of the existing pkg-config option. I’d like to support this on w64devkit, except that it lacks pkg-config — not the first time this has come up. So last weekend I wrote a new pkg-config from scratch with first-class Windows support: u-config (“micro-config”). It will serve as pkg-config in w64devkit starting in the next release.

Ultimately pkg-config’s entire job is to find named .pc text files in one of several predetermined locations, read fields from them, then write those fields to standard output. Additional search directories may be supplied through the $PKG_CONFIG_PATH environment variable. At a high level there’s really not much to it.

As a concrete example, here’s a hypothetical example.pc which might live in /usr/lib/pkgconfig.

prefix = /usr
major = 1
minor = 2
patch = 3
version = ${major}.${minor}.${patch}

Name: Example Library
Description: An example of a .pc file
Version: ${version}
Requires: zlib >= 1.2, sdl2
Libs: -L${prefix}/lib -lexample
Libs.private: -lm
Cflags: -I${prefix}/include
Cflags.private: -DEXAMPLE_STATIC

If you invoke pkg-config with --cflags you get the Cflags field. With --libs, you get the Libs field. With --static, you also get the “private” fields. It will also recursively pull in packages mentioned in Requires. The prefix variable is more than convention and is designed to be overridden (and u-config does so by default). In theory pkg-config is supposed to be careful about maintaining argument order and removing redundant arguments, but in practice… well, pkg-config’s actual behavior often makes little sense. We’ll get to that.

For SDL2, where you might use:

$ cc app.c $(sdl2-config --cflags --libs)

You could instead use:

$ eval cc app.c $(pkg-config sdl2 --cflags --libs)

Which is still a build command that works uniformly for all supported platforms, even cross-compiling, given a correctly-configured pkg-config. For w64devkit, the first command requires placing the directory containing sdl2-config on your $PATH. The second instead requires placing the directory containing sdl2.pc in your $PKG_CONFIG_PATH. To upgrade to SDL3, replace the sdl2 with sdl3 in the second command.

Why two when you can have three?

There are already two major, mostly-compatible pkg-config implementations: the original from freedesktop.org (2001), and pkgconf (2011). Both ostensibly support Windows, but in practice this support is second class, which is a reason why I hadn’t included one in w64devkit. A lot of hassle for what is ultimately a relatively simple task.

As for the original pkg-config, I’ve been unable to produce a functioning Windows build. It’s obvious from the compiler warnings that there are many problems, and my builds immediately crash on start. I’d try debugging it, except that I’ve been cross-compiling this whole time. I cannot build it on Windows because (1) GNU Autotools and (2) pkg-config ~~requires~~ wants pkg-config as a build dependency. That’s right, you have to bootstrap pkg-config! Remember, this is a tool whose entire job is to copy some bits of text from a text file to its output. One could use pkg-config as a case study of accidental complexity, and this is just the beginning.

Update: It was pointed out that I wouldn’t need the full, two-stage bootstrap just for my debugging scenario.

The bootstrap issue is part of pkgconf’s popularity as an alternative. It’s also a tidier code base, does a far better job of sorting and arranging its outputs than the original pkg-config, and its overall behavior makes more sense. However, despite its three independent build systems, pkgconf is still annoying to build, not to mention its memory corruption bugs. We’ll get to that, too.

Considering pkg-config’s relatively simple job, obtaining one shouldn’t be this difficult! I could muddle through until one or the other worked, or I could just write my own. I’m glad I did, since I’m extremely happy with the results.

u-config implementation

As of this writing, u-config is about 2,000 lines of C. It doesn’t support every last pkg-config feature, nor will it ever. The goal is to support existing pkg-config based builds, not make more of them. So, for example, features for debugging .pc files are omitted. Some features are of dubious usefulness (--errors-to-stdout) even if they’d be simple to implement; there are already way too many flags. Other features clearly don’t work correctly, either not as documented or with results that don’t make sense, so I skipped those as well.

It comes in two flavors: “generic” C and Windows. The former works on any system with a C99 compiler. In fact, it only uses these 9 standard library functions:

exit
fclose
ferror
fflush
fopen
fread
fwrite
getenv
malloc

That is, it needs to open .pc files, read from them, close those handles, write to standard output and standard error, check for I/O errors, and exactly once call malloc to allocate a block of memory for an arena allocator. It’s not even important the streams are buffered because u-config does its own buffering. Not that it would be useful, but porting to an unhosted 16-bit microcontroller, with fopen implemented as a virtual file system, would be trivial. (You know… it could be dropped into busybox-w32 as a new app with little effort…)

It’s also a unity build — compiled as a single translation unit — so building u-config is as easy as it gets:

$ cc -o pkg-config generic_main.c

Reminder: the original pkg-config cannot even be built without a bootstrapping step.

Since standard C functions are implemented poorly on Windows, but also so that it can do some smarter self-configuration at run-time based on the .exe location, the Windows platform layer calls directly into Win32 and no C runtime (CRT) is used. Input .pc files are memory mapped. Internally u-config is all UTF-8, and the platform layer does the Unicode translations at the Win32 boundaries for paths, arguments, environment variables, and console outputs.

Building is slightly more complicated:

$ cc -o pkg-config -nostartfiles win32_main.c

Implementation highlights

Greenfield projects present a great opportunity for trying new things, and this is no exception. Contrary to my usual style, I decided I would make substantial use of typedef and capitalize all the type names.

typedef int Bool;
typedef unsigned char Byte;

typedef struct {
    Byte *s;
    Size len;
} Str;

typedef struct {
    Str head;
    Str tail;
    Bool ok;
} Cut;

I like it! It makes the type names stand apart, avoids conflicts with variable names, and cuts down the visual noise of struct. I’ve more recently realized that const is doing virtually nothing for me — it has never prevented me from making a mistake — so I left it out (aside from static lookup tables). That’s even more visual noise gone, and reduced cognitive load.

In recent years I’ve been convinced that unsigned sizes were a serious error, probably even one of the great early computing mistakes, and that sizes and subscripts should be signed. Not only that, pkg-config has no business dealing with gigantic objects! We’re talking about short strings and tiny files. If it ends up with a large object, then there’s a defect somewhere — either in itself or the system — and it should abort. Therefore sizes and subscripts are a natural int!

typedef int Size;
typedef unsigned Usize;
#define Size_MAX (Size)((Usize)-1 >> 1)
#define SIZEOF(x) (Size)(sizeof(x))

The Usize is just for the occasional bit-twiddling, like in Size_MAX, and not for regular use. However, u-config objects are no smaller by this decision because the unused space is nearly always padded on 64-bit machines. Further, the x86-64 code is about 5% larger with 32-bit sizes compared to 64-bit sizes — opposite my expectation. Curious.

You might have noticed that Str type above. Aside from interfaces with the host that make it mandatory, u-config makes no use of null-terminated strings anywhere. Every string is a pointer and a size. There’s even a macro to do this for string literals:

#define S(s) (Str){(Byte *)s, SIZEOF(s)-1}

Then I can use and pass them casually:

    if (equals(realname, S("pkg-config"))) {
        // ...
    }

    *insert(arena, &global, S("pc_sysrootdir")) = S("/");

    return startswith(arg, S("-I"));

Like strings in other languages, I can also slice out the middle of strings without copying, handy for parsing and constructing paths. It also works well with memory-mapped .pc files since I can extract tokens from them for use directly in data structures without copying.
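
To make this concrete, here is a sketch of how the Str and Cut types might be used. The helpers equals and startswith appear by name above, but the bodies here, along with cut, are guesses written for illustration and not u-config’s actual source.

```c
#include <assert.h>

typedef int Bool;
typedef unsigned char Byte;
typedef int Size;

typedef struct {
    Byte *s;
    Size len;
} Str;

typedef struct {
    Str head;
    Str tail;
    Bool ok;
} Cut;

#define SIZEOF(x) (Size)(sizeof(x))
#define S(s) (Str){(Byte *)s, SIZEOF(s)-1}

static Bool equals(Str a, Str b)
{
    assert(a.len >= 0 && b.len >= 0);
    if (a.len != b.len) {
        return 0;
    }
    for (Size i = 0; i < a.len; i++) {
        if (a.s[i] != b.s[i]) {
            return 0;
        }
    }
    return 1;
}

static Bool startswith(Str s, Str prefix)
{
    if (s.len < prefix.len) {
        return 0;
    }
    Str head = {s.s, prefix.len};  // a slice of s, no copy
    return equals(head, prefix);
}

// Split s around the first occurrence of delim. Both halves are slices
// pointing into s. If delim is absent, head is all of s and ok is false.
static Cut cut(Str s, Byte delim)
{
    Cut r = {0};
    Size i = 0;
    while (i < s.len && s.s[i] != delim) {
        i++;
    }
    r.ok = i < s.len;
    r.head.s = s.s;
    r.head.len = i;
    if (r.ok) {
        r.tail.s = s.s + i + 1;
        r.tail.len = s.len - i - 1;
    }
    return r;
}
```

Cutting a line like "Name: Example" at the colon yields the field name and its value as two slices of the original buffer, with no allocation and no null terminators.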

That leads into the next item: How does one free or manipulate a data structure where the different parts are arbitrarily allocated across static storage, heap storage, and memory mapped files? The hash tables in u-config are exactly this, the keys themselves allocated in every possible fashion. Don’t you have to keep track of how each pointed-at part is allocated? No! The individual objects do not have individual lifetimes due to the arena allocator. The gist of it:

typedef struct {
    Str mem;
    Size off;
} Arena;

static void *alloc(Arena *a, Size size)
{
    ASSERT(size >= 0);
    Size avail = a->mem.len - a->off;
    if (avail < size) {
        oom();
    }
    Byte *p = a->mem.s + a->off;
    a->off += size;
    return p;
}

Since it’s passed often, arena parameters are conventionally named a throughout the program and are always the first argument when needed. If it runs out of memory, it bails. On 32-bit and 64-bit hosts, the default arena is 256MiB. If pkg-config needs more than that, then something’s seriously wrong and it should give up.

While u-config could quite reasonably never “free” (read: reuse) memory, it does do so in practice. In some cases it computes a temporary result, then resets the arena to an earlier state to discard its allocations. A simplified, hypothetical:

    for (int i = 0; ...) {
        Arena tmparena = *a;
        // Use only tmparena in the loop
        Env env = {0};
        Str value = fmtint(&tmparena, i);
        *insert(&tmparena, &env, S("i")) = value;
        // ...
        // allocations freed when tmparena goes out of scope
    }
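
The same trick in a self-contained form, as a sketch: because the arena is a small struct, copying it yields a scratch arena over the same memory, and dropping the copy discards everything allocated through it. To keep the example testable this alloc returns null when out of memory rather than calling oom(), and it ignores alignment by handing out bytes only. The function scratchloop is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned char *mem;
    ptrdiff_t len;
    ptrdiff_t off;
} Arena;

// Bump allocator. Returns null when the arena is exhausted.
static void *alloc(Arena *a, ptrdiff_t size)
{
    assert(size >= 0);
    ptrdiff_t avail = a->len - a->off;
    if (avail < size) {
        return 0;
    }
    unsigned char *p = a->mem + a->off;
    a->off += size;
    return p;
}

// Run n iterations, each allocating 100 scratch bytes from a copy of
// the arena. Returns how many iterations got their scratch memory.
static int scratchloop(Arena *a, int n)
{
    int ok = 0;
    for (int i = 0; i < n; i++) {
        Arena tmp = *a;  // copy: allocations through tmp leave *a alone
        unsigned char *scratch = alloc(&tmp, 100);
        if (scratch) {
            scratch[0] = (unsigned char)i;
            ok++;
        }
        // tmp goes out of scope here, discarding its allocations
    }
    return ok;
}
```

With a 256-byte arena, every iteration succeeds no matter how many there are, whereas allocating directly from the original arena would fail on the third.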

I had mentioned that u-config does its own output buffering. It’s an object I call an Out, modeled loosely after a Plan 9 bio or a Go bufio.Writer. It has a destination “file descriptor”, a memory buffer, and an integer to track the fill level of the buffer.

typedef struct {
    Str buf;
    Size fill;
    Arena *a;
    int fd;
} Out;

Output bytes are copied into the buffer. When it fills, the buffer is automatically emptied into the file descriptor. The caller can manually flush the buffer at any time, and it’s up to the caller to do so before exiting the program.

But wait, what’s the Arena pointer doing in there? That’s a little extra feature of my own invention! I can open a stream on an arena, writes into the stream go into a growing buffer, and “closing” the stream gives me a string allocated in the arena with the written content. The arena is held in order to manage all this. It’s also locked out from other allocations until the stream is closed. The entire implementation is only about a dozen lines of code.

What use is this? It’s nice when I might want to output either to standard output or to a memory buffer for further use. It’s even more useful when I need to build a string but don’t know its final length ahead of time.
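
One way such an arena stream might work, as a rough sketch and not u-config’s actual implementation: the stream remembers the arena offset where it started, each write bumps the offset, and finalizing returns everything written since the start as a string. The MemOut type, outstr, and the err field are assumptions; newmembuf and finalize borrow their names from u-config’s usage in this article, but the bodies are guesses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char *mem;
    ptrdiff_t len;
    ptrdiff_t off;
} Arena;

typedef struct {
    unsigned char *s;
    ptrdiff_t len;
} Str;

typedef struct {
    Arena *a;         // arena being written into until finalize
    ptrdiff_t start;  // arena offset where the stream began
    int err;          // set if the arena ran out of room
} MemOut;

// Open a stream at the arena's current offset.
static MemOut newmembuf(Arena *a)
{
    MemOut out = {a, a->off, 0};
    return out;
}

// Append len bytes by growing the arena's used region in place.
static void outstr(MemOut *out, const char *s, ptrdiff_t len)
{
    assert(len >= 0);
    Arena *a = out->a;
    if (len > a->len - a->off) {
        out->err = 1;
        return;
    }
    if (len) {
        memcpy(a->mem + a->off, s, (size_t)len);
    }
    a->off += len;
}

// "Close" the stream: everything written since newmembuf is the result.
static Str finalize(MemOut *out)
{
    Str r = {out->a->mem + out->start, out->a->off - out->start};
    return r;
}
```

This only works because nothing else allocates from the arena while the stream is open, which is why the arena is treated as locked until the stream is finalized.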

The variable expansion function is both cases. Given a string like ${version} I want to recursively interpolate until there’s nothing left to interpolate. The output could go to standard output to print it out, or into a string for further use. For example, here I have my global variable environment global, a package pkg, its environment (pkg->env), and I want to expand its Version: field, pkg->version.

    Out mem = newmembuf(a);
    expand(&mem, global, pkg, pkg->version);
    Str version = finalize(&mem);

Or I just print it to standard output, and the value is free to expand beyond what would fit in memory since it flushes as it goes:

    Out out = newoutput(1);  // 1 == standard output
    expand(&out, global, pkg, pkg->version);
    flush(&out);

I’m particularly happy about this, and I’m sure I’ll use such “arena streams” again in the future.

Subtleties

While pkgconf tries, and succeeds at, being a faithful (if smarter) clone, in certain ways u-config more closely follows pkg-config’s behavior. For example, pkg-config behaves as though it concatenates all its positional arguments with commas in between, then re-tokenizes them like a Requires field. These commands are all equivalent:

$ pkg-config 'sdl2 > 2' --libs
$ pkg-config 'sdl2 >' --libs 2
$ pkg-config sdl2 --libs '> 2'
$ pkg-config --libs 'sdl2 > 2'
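The concatenation step might look like the following sketch. The name join_args is hypothetical and this is not taken from pkg-config or u-config; it only shows the "join with commas" half, leaving the Requires-style tokenizer out.

```c
#include <stddef.h>
#include <string.h>

// Join the positional arguments with commas into dst, whose capacity
// cap includes the null terminator. Returns the resulting length, or
// -1 if the result might not fit (the check is conservative).
static ptrdiff_t join_args(char *dst, ptrdiff_t cap, int argc, char **argv)
{
    ptrdiff_t len = 0;
    if (cap < 1) {
        return -1;
    }
    for (int i = 0; i < argc; i++) {
        ptrdiff_t n = (ptrdiff_t)strlen(argv[i]);
        if (n + 2 > cap - len) {  // comma + argument + terminator
            return -1;
        }
        if (i > 0) {
            dst[len++] = ',';
        }
        memcpy(dst + len, argv[i], (size_t)n);
        len += n;
    }
    dst[len] = 0;
    return len;
}
```

With the positional arguments 'sdl2 >' and 2 this produces "sdl2 >,2", which a tokenizer that treats commas and spaces as separators would read the same as 'sdl2 > 2'.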

pkgconf does not copy this behavior, but u-config does. Similarly, the original .pc format has undocumented, arcane quoting syntax that sort of works like shell quotes. I tried to match this closely in u-config, while pkgconf tries to be more logical. For example, pkg-config allows this:

quote = "
Cflags: "-I${prefix}/include${quote}

Where the ${quote} will actually close the quote. I retained this but pkgconf did not.

Does anyone use quoting? On my own system I have one package using quotes, but it’s probably a mistake since they’re used improperly. In theory, everyone should be quoting almost everything. For example, this is a very common Cflags:

Cflags: -I${prefix}/include

If a crazy person (or a well-known multinational corporation) comes along and puts a space in their system’s installation “prefix”, this .pc will not work. The output would be:

-I/Program Files/include

Actually, that’s a lie. I suspect that’s the intended output, and it’s the output of pkgconf and u-config, but pkg-config instead outputs this head-scratcher:

Files/include -I/Program

Seeing this sort of thing repeatedly is why I have little concern with matching every last pkg-config nuance. Regardless, this parses as two arguments, but if written with quotes:

Cflags: "-I${prefix}/include"

Then pkg-config will escape spaces in the expansion:

-I/Program\ Files/include

This will actually work correctly in the eval context where pkg-config is intended for use (read: not command substitution). I’ve made u-config automatically quote the prefix if it contains spaces, so it will work correctly despite the lack of .pc file quotes when the library is under a path containing a space.
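As a rough illustration of the escaping step, here is a sketch that backslash-escapes spaces so the result survives shell word splitting under eval. The name escape_spaces is hypothetical, this is not code from pkg-config or u-config, and it handles only spaces; a real implementation would have to consider other shell metacharacters too.

```c
#include <stddef.h>

// Copy src into dst, inserting a backslash before each space. cap is
// the capacity of dst including the null terminator. Returns the
// output length, or -1 if it might not fit (conservative check).
static ptrdiff_t escape_spaces(char *dst, ptrdiff_t cap, const char *src)
{
    ptrdiff_t len = 0;
    for (; *src; src++) {
        if (cap - len < 3) {  // worst case: backslash, byte, terminator
            return -1;
        }
        if (*src == ' ') {
            dst[len++] = '\\';
        }
        dst[len++] = *src;
    }
    if (cap - len < 1) {
        return -1;
    }
    dst[len] = 0;
    return len;
}
```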

Here’s a fun input. pkg-config has its own billion laughs:

v9=lol
v8=${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}
v7=${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}
v6=${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}
v5=${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}
v4=${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}
v3=${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}
v2=${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}
v1=${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}
v0=${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}
Name: One Billion Laughs
Version: ${v0}
Description: Don't install this!

That expands to 1,000,000,000 “lol” and in theory --modversion will print it out:

$ pkg-config --modversion lol.pc
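To put a number on it, here is a small worked calculation for the file as written above: nine levels of ten-fold repetition over a 3-byte string. The function name expanded_len is mine, purely for illustration.

```c
#include <stdint.h>

// Expanded length in bytes of the top variable in a chain of `levels`
// variables, each repeating the previous one `fanout` times, starting
// from a base string of `baselen` bytes.
static uint64_t expanded_len(uint64_t baselen, int fanout, int levels)
{
    uint64_t len = baselen;
    for (int i = 0; i < levels; i++) {
        len *= (uint64_t)fanout;
    }
    return len;
}
```

Here expanded_len(3, 10, 9) is 3,000,000,000 bytes, about 2.8 GiB for the fully expanded Version string, and each additional level multiplies that by ten.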

Some different outcomes:

pkg-config will expand it in memory and see it to the bitter end, using however many GiBs are necessary. Add a few more lines and your computer will thrash. By the way, bash-completion will ask pkg-config to load .pc files named in the command when completing further arguments. Ask me how I know.
u-config could fully output it with only a few kB of memory if directed to a “file descriptor” output, but alas, the Version field must be processed in memory for comparison with another version string, so it doesn’t attempt to do so. It runs out of arena memory and gives up. That’s a feature, especially if you’re using bash-completion.
pkgconf I had built with Address Sanitizer in case it found anything, and boy did it. This input overflows a stack variable and then ASan kills it. I’m unsure what’s supposed to happen next, but I suspect silent truncation.

But that’s a crazy edge case right? Well, it also overflows on empty .pc files, or for all sorts of inputs. I probed both pkg-config and pkgconf with weird inputs to learn how it’s supposed to work, and it was rather irritating having pkgconf crash for so many of them. Someone on the project ought to do testing with ASan sometime. Important note: This is not a security vulnerability!

Further, as you might notice when you build it, pkgconf first tries to link the system strlcpy, if it exists. Failing that, it uses its own version. That’s one of the annoying details about building it. However, using strlcpy never, ever makes sense! Now that I think about it, there’s probably a connection with those buffer overflows.

In general, neither pkg-config nor pkgconf fares well when fuzz tested with sanitizers.

Conclusions

I had a lot of fun writing u-config, and I’m excited about this new addition to w64devkit. Despite my pkg-config grumbling, it is neat that it’s established this de facto standard and encouraged a distributed database of .pc files to exist, at least as documentation if not for a mechanical process like this.

For u-config, there’s still more testing to do, and I’m still open to picking up more behaviors from pkg-config or pkgconf where they make sense. Though given its primary use case — building software on Windows without a package manager — it will probably never be stressed hard enough to matter. Further, w64devkit does not include any .pc files of its own, and since I do not intend to add libraries — that is, beyond the standard language libraries and Windows SDK — that probably won’t change.

If you’d like to try it early, build it with w64devkit, toss it on your PATH, point PKG_CONFIG_PATH at a library with .pc files, and try it out. It already works flawlessly with at least SDL2.

SDL2 common mistakes and how to avoid them

2023-01-08T02:09:26Z

This article was discussed on reddit.

SDL has grown on me over the past year. I didn’t understand its value until viewing it in the right lens: as a complete platform and runtime replacing the host’s runtime, possibly including libc. Ideally an SDL application links exclusively against SDL and otherwise not directly against host libraries, though in practice it’s somewhat porous. With care — particularly in avoiding mistakes covered in this article — that ideal is quite achievable for C applications that fit within SDL’s feature set.

SDL applications are always interesting one way or another, so I like to dig in when I come across them. The items in this article are mistakes I’ve either made myself or observed across many such passion projects in the wild.

Mistake 1: Not using `sdl2-config`

This shell script comes with SDL2 and smooths over differences between platforms, even when cross compiling. It informs your compiler where to find and how to link SDL2. The script even works on Windows if you have a unix shell, such as via w64devkit. Use it as a command substitution at the end of the build command, particularly when using --libs. A one-shot or unity build (my preference) looks like so:

$ cc app.c $(sdl2-config --cflags --libs)

Or under separate compilation:

$ cc -c app.c $(sdl2-config --cflags)
$ cc app.o $(sdl2-config --libs)

Alternatively, static link by replacing --libs with --static-libs, though this is discouraged by the SDL project. When dynamically linked, users can, and do, trivially substitute a different SDL2 binary, such as one patched for their system. In my experience, static linking works reliably on Windows but poorly on Linux.

Alternatively, use the general purpose pkg-config. Don’t forget eval!

$ eval cc app.c $(pkg-config sdl2 --cflags --libs)

I wrote a pkg-config for Windows specifically for this case.

Caveats:

Some circumstances require special treatment, and sdl2-config may be too blunt a tool. That’s fine, but generally prefer sdl2-config as the default approach.
sdl2-config does not support extensions such as SDL2_image, so you will need to use pkg-config. Personally I don’t think they’re worth the trouble when there’s stb, or QOI instead of PNG.
There’s an alternative build option using CMake, without any use of sdl2-config, but I won’t discuss it here.

Mistake 2: Including `SDL2/SDL.h`

A lot of examples, including tutorials linked from the official SDL website, have SDL2/ in their include paths. That’s because they’re making mistake 1, not using sdl2-config, and are instead relying on Linux distributions having installed SDL2 in a place coincidentally accessible through that include path.

This is annoying when SDL2 is not installed there, or if I don’t want it using the system’s SDL2. Worse, it can result in subtly broken builds as it mixes and matches different SDL installations. The correct SDL2 include is the following:

#include "SDL.h"

Note the quotes, which help prevent picking up an arbitrary system header by accident. When carefully and narrowly targeting SDL-the-platform, this will be the only “system” include anywhere in your application.

Mistake 3: Not surrendering `main`

A conventional SDL application has a main function defined in its source, but despite the name, this is distinct from C main. To smooth over platform differences, SDL may rename the application’s main to SDL_main and substitute its own C main. Because of this, main must have the conventional argc/argv prototype and must return a value. (As a special case, C permits main to implicitly return 0, so it’s an easy mistake to make.)

With this in mind, the bare minimum SDL2 application:

#include "SDL.h"

int main(int argc, char **argv)
{
    return 0;
}

Caveat: Like with sdl2-config, some special circumstances require control over the application entry point — see SDL_MAIN_HANDLED and SDL_SetMainReady — but that should be reserved until there’s a need.

One such special case is avoiding linking a CRT on Windows. In principle it’s this simple:

#include "SDL.h"

int WinMainCRTStartup(void)
{
    SDL_SetMainReady();
    // ...
    return 0;
}

Then it’s the usual compiler and linker flags:

$ cc -nostdlib -o app.exe app.c $(sdl2-config --cflags --libs)

This will create a tiny .exe that doesn’t link any system DLL, just SDL2.dll. Quite platform agnostic indeed!

$ objdump -p app.exe | grep -Fi .dll
        DLL Name: SDL2.dll

Alas, as of this writing, this does not work reliably. SDL2’s accelerated renderers on Windows do not clean up properly in SDL_QuitSubSystem or SDL_Quit, so the process cannot exit without calling ExitProcess in kernel32.dll (or similar). This is still an open experiment.

Mistake 4: Using the SDL wiki for API documentation

The SDL wiki is not authoritative documentation, merely a convenient web-linkable — and downloadable (see “offline html”) — information source. However, anyone who’s spent time on it can tell you it’s incomplete. The authoritative API documentation is the SDL headers, which fortunately are already on hand for building SDL applications. The SDL maintainers themselves use the headers, not the wiki.

If, like me, you’re using ctags, this is actually good news! With a bit of configuration, you can jump to any bit of SDL documentation at any time in your editor, treating the SDL headers like a hyperlinked wiki built into your editor. Just like building, sdl2-config can tell ctags where to find those headers:

$ ctags -a -R --kinds-c=dept $(sdl2-config --prefix)/include/SDL2

I’m using -a (--append) to append to the tags file I’ve already generated for my own program, -R (--recurse) to automatically find all the headers, and --kinds-c=dept to capture exactly the kinds of symbols I care about (#define, enum, prototypes, typedef), no more, no less.

In Vim I CTRL-] over any SDL symbol to jump to its documentation, and then I can use it again within its documentation comment to jump further still to any symbols it mentions, then finally use the jump or tag stack to return. As long as I have t in 'complete' ('cpt'), which is the default, I can also “tab”-complete any SDL symbol using the tags table. There are a few rough edges here and there, but overall it’s a solid editing paradigm.

By the way, with sdl2-config in your $PATH, all the above works out of the box in w64devkit! That’s where I’ve mostly been working with SDL.

Mistake 5: Using stdio streams

A common bit of code in real SDL programs and virtually every tutorial:

if (SDL_Init(...)) {
    fprintf(stderr, "SDL_Init(): %s\n", SDL_GetError());
    return 1;
}

This is not ideal:

fprintf is not part of the SDL platform. This is going behind SDL’s back, reaching around the abstraction to a different platform. Strictly speaking, this API may not even be available to an SDL application.
SDL applications are graphical, so stderr is likely disconnected from anything useful. Few would ever see this message.

Fortunately SDL provides two alternatives:

SDL_Log: like C printf, but SDL will strive to connect it to somewhere useful. If the application was launched from a terminal or console, SDL will find it and hook it up to the logger. On Windows, if there’s a debugger attached, SDL will use OutputDebugString to send logs to the debugger.
SDL_ShowSimpleMessageBox: using any means possible, attempt to display a message to the user. Like SDL_Log, it’s safe to use before/without initializing SDL subsystems.

If you’re paranoid, you could even use both:

if (SDL_Init(...)) {
    SDL_ShowSimpleMessageBox(
        SDL_MESSAGEBOX_ERROR, "SDL_Init()", SDL_GetError(), 0
    );
    SDL_Log("SDL_Init(): %s", SDL_GetError());
    return 1;
}

Though note that SDL_ShowSimpleMessageBox can fail, which will set a new, different error message for SDL_Log!

There’s a similar story again with fopen and loading assets. SDL has an I/O API, SDL_RWops. It’s probably better than the host’s C equivalent, particularly with regards to paths. If you’re not already embedding your assets, use the SDL API instead.

Mistake 6: Using `SDL_RENDERER_ACCELERATED`

This flag and its surrounding bit set, SDL_RendererFlags, are a subtle design flaw in the SDL2 API. Its existence is misleading and has led to widespread misuse. It does not help that the documentation, both header and wiki, is incomplete and unclear. The SDL_CreateRenderer function accepts a bit set as its third argument, and it serves two simultaneous purposes:

Indicates mandatory properties of the renderer. Examples: “must use accelerated rendering,” “must use software rendering,” “must support vertical synchronization (vsync).” Drivers without the chosen properties are skipped.
If SDL_RENDERER_PRESENTVSYNC is set, also enables vsync in the created renderer.

The common mistake is thinking that this bit indicates preference: “prefer an accelerated renderer if possible”. But it really means “accelerated renderer or bust.”

Given a zero for renderer flags, SDL will first attempt to create an accelerated renderer. Failing that, it will then attempt to create a software renderer. A software renderer fallback is exactly the behavior you want! After all, this fallback is one of the primary features of the SDL renderer API. This is so straightforward there are no caveats.

Mistake 7: Not accounting for vsync

For a game, you probably ought to enable vsync in your renderer. The hint: You’re using SDL_PollEvent in your main event loop. Otherwise you will waste lots of resources rendering thousands of frames per second. If my laptop fan spins up running your SDL application, it’s probably because you didn’t do this. The following should be the most conventional SDL renderer configuration:

r = SDL_CreateRenderer(window, -1, SDL_RENDERER_PRESENTVSYNC);

The software renderer supports vsync, so it will not be excluded from the driver search when vsync is requested.

That’s only for SDL renderers. If you’re using OpenGL, set a non-zero SDL_GL_SetSwapInterval so that SDL_GL_SwapWindow synchronizes. For the other rendering APIs, consult their documentation. (I can only speak to SDL and OpenGL from experience.)

Caveat: Beware accidentally relying on vsync for timing in your game. You don’t want your game’s physics to depend on the host’s display speed. Even the pros make this mistake from time to time.
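One common way to decouple game timing from the display is a fixed-timestep accumulator. The following is only a sketch under my own assumptions: the Sim struct, STEP_MS, and sim_advance are hypothetical names, and the frame time would come from a clock such as SDL_GetTicks rather than from counting vsync intervals.

```c
#include <stdint.h>

typedef struct {
    uint64_t acc_ms;  // elapsed time not yet consumed by the simulation
    uint64_t ticks;   // number of fixed steps taken (stand-in for game state)
} Sim;

enum { STEP_MS = 10 };  // hypothetical fixed step: 100 updates per second

// Call once per rendered frame with the measured frame time in
// milliseconds. Returns the number of fixed steps that were run.
static int sim_advance(Sim *s, uint64_t frame_ms)
{
    int steps = 0;
    s->acc_ms += frame_ms;
    while (s->acc_ms >= STEP_MS) {
        s->ticks++;  // a real game would update its physics here
        s->acc_ms -= STEP_MS;
        steps++;
    }
    return steps;
}
```

Fifty 20 ms frames and 125 frames of 8 ms both cover one second, and both advance the simulation by exactly 100 steps, so the physics no longer cares how fast the display refreshes.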

However, if you’re not making a game (perhaps instead an IMGUI application without active animations) there’s a good chance you don’t need or want vsync. The hint: You’re using SDL_WaitEvent in your main event loop.

In summary, graphical SDL applications fall into one of two cases:

SDL_PollEvent with vsync
SDL_WaitEvent without vsync

Mistake 8: Using `assert.h` instead of `SDL_assert`

Alright, this one isn’t so common, but I’d like to highlight it. The SDL_assert macro is fantastic, easily beating assert.h which doesn’t even break in the right place. It uses SDL to present a user interface to the assertion, with support for retrying and ignoring. It also works great under debuggers, breaking exactly as it should. I have nothing but praise for it, so don’t pass up the chance to use it when you can.

While I’m at it: during development and testing, always always always run your application under a debugger. Don’t close the debugger, just launch through it again after rebuilding. Also, enable UBSan and ASan when available for the extra assertions.

SDL wishlist

For months I had wondered why SDL provides no memory allocation API. I’m fine if it doesn’t have a general purpose allocator since I just want to grab a chunk of host memory for an arena. However, SDL does have allocation functions: SDL_malloc, etc. I didn’t know about them until I stopped making mistake 4.

It was the same story again with math functions: I’d like not to stray from SDL as a platform, but what if I need transcendental functions? I could whip up crude implementations myself, but I’d prefer not. SDL has those too: SDL_sin, etc. Caveat: The math.h functions are built-ins, and compilers use that information to better optimize programs, e.g. cool stuff like -mrecip, or SIMD vectorization. That cannot be done with SDL’s equivalents.

I’m surprised SDL has no random number generator considering how important it is to games. Since I prefer to handle this myself, I don’t mind that so much, but it does leave a lot of toy programs out there calling C rand. I would like it if SDL provided a single, good seed early during startup. There isn’t even a wall clock function for the classic srand(time(0)) seeding event! My solution has been to mix event timestamps into the random state:

static Uint32 rand32(Uint64 *);

Uint64 rng = 0;
for (SDL_Event e; SDL_PollEvent(&e);) {
    rng ^= e.common.timestamp;
    rand32(&rng);  // stir
    switch (e.type) { /* ... */ }
}
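The listing above only declares rand32. One possible definition, offered as an assumption rather than the generator I actually use, is a 64-bit LCG that returns its high 32 bits. It is written with the stdint.h types that SDL's Uint32 and Uint64 are typedefs for, so it stands alone.

```c
#include <stdint.h>

// Hypothetical rand32: step a 64-bit linear congruential generator
// (Knuth's MMIX multiplier) and return the upper 32 bits of the state.
static uint32_t rand32(uint64_t *s)
{
    *s = *s * 0x5851f42d4c957f2d + 1;
    return (uint32_t)(*s >> 32);
}
```

Each XOR of a timestamp followed by a call perturbs and then advances the whole state, so the sequence depends on when events arrived. The first output from an all-zero state is weak, which is one more reason to stir before use.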

As I learn more in the future, I may come back and add to this list. At the very least I expect to use SDL increasingly in my own projects.

QOI is now my favorite asset format

2022-12-18T03:45:44Z

This article was discussed on Hacker News.

The Quite OK Image (QOI) format was announced late last year and finalized into a specification a month later. Initially dismissive, a revisit has shifted my opinion to impressed. The format hits a sweet spot in the trade-off space between complexity, speed, and compression ratio. Also considering its alpha channel support, QOI has become my default choice for embedded image assets. It’s not perfect, but at the very least it’s a solid foundation.

Since I’m now working with QOI images, I need a good QOI viewer, and so I added support to my ill-named pbmview tool, which I wrote to serve the same purpose for Netpbm. I will continue to use Netpbm as an output format, especially for raw video output, but no longer will I use it for an embedded asset (nor re-invent yet another RLE over Netpbm).

I was dismissive because the website claimed, and still claims today, QOI images are “a similar size” to PNG. However, for the typical images where I would use PNG, QOI is around 3x larger, and some outliers are far worse. The 745 PNGs on my blog — a perfect test corpus for my own needs — convert to QOIs 2.8x larger on average. The official QOI benchmark has much better results, 1.3x larger, but that’s because it includes a lot of photography where PNG and QOI both do poorly, making QOI seem more comparable.

However, as I said, QOI’s strength is its trade-off sweet spot. The specification is one page, and an experienced developer can write a complete implementation from scratch in a single sitting. My own implementation is about 100 lines of libc-free C for each of the encoder and decoder. With error checking removed, my decoder is ~600 bytes of x86 object code — a great story for embedding alongside assets. It’s more complex than Netpbm or farbfeld, but it’s far simpler than BMP. I’ve already begun experimenting with converting assets to QOI, and the results have so far exceeded my expectations.

To my surprise, the encoder was easier to write than the decoder. The format is so straightforward that two different encoders will produce identical files. There’s little room for specialized optimization, and no meaningful “compression level” knob.

Criticism

There are a lot of dimensions on which QOI could be improved, but most cases involve trade-offs, e.g. more complexity for better compression. The areas where QOI could have been strictly better, the dimensions on which it is not on the Pareto frontier, are more meaningful criticisms — missed opportunities. My criticisms of this kind:

Big endian fields are an odd choice for a 2020s file format. Little endian dominates the industry, and it would have made for a slightly smaller decoder footprint on typical machines today if QOI used little endian.
The header has two flags and spends an entire byte on each. It should have instead had a flag byte, with two bits assigned to these flags. One flag indicates if the alpha channel is important, and the other selects between two color spaces (sRGB, linear). Both flags are only advisory.
The 4-channel encoded pixel format is ABGR (or RGBA), placing the alpha channel next to the blue channel. This is somewhat unconventional. A decoder is likely to use a single load into 32-bit integer, and ideally it’s already in the desired format or close to it. A few times already I’ve had to shuffle the RGB bytes within the 32-bit sample to be compatible with some other format. QOI channel ordering is arbitrary, and I would have chosen ARGB (when viewed as little endian).
The QOI hash function operates on channels individually, with individual overflow, making it slower and larger than necessary. The hash function should have been over a packed 32-bit sample. I would have used a multiplication by a carefully-chosen 32-bit integer, then a right shift using the highest 6 bits of the result for the index.
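For comparison, here is the hash as given in the QOI specification next to a sketch of the kind of packed hash described in the last item. The multiplier in packed_hash is an arbitrary odd constant chosen for illustration, not a carefully tuned one, and both function names are mine.

```c
#include <stdint.h>

// Index hash from the QOI specification: per-channel multiplies,
// summed, modulo the 64-entry table size.
static int qoi_hash(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (r*3 + g*5 + b*7 + a*11) % 64;
}

// Alternative sketch: one 32-bit multiply over the packed sample, then
// keep the highest 6 bits as the table index. The constant is a
// placeholder, not a recommendation.
static int packed_hash(uint32_t abgr)
{
    return (int)((uint32_t)(abgr * 0x9e3779b1u) >> 26);
}
```

The packed version needs no unpacking of channels, which is where the size and speed savings would come from.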

More subjective criticisms that might count as having trade-offs:

Given a “flag byte” (mentioned above) it would have been free to assign another flag bit indicating pre-multiplied alpha, also still advisory. You want to use pre-multiplied alpha for your assets, and the option to store them this way would help.
There’s an 8-byte end-of-stream marker (a bit excessive), deliberately an invalid encoding so that reads past the end of the image will result in a decoding error. I probably would have chosen a dead simple 32-bit checksum of packed 32-bit image samples, even if literally a sum.

Of course, you’re not obligated to follow QOI exactly to spec for your own assets, so you could always use a modified QOI with one or more of these tweaks. That’s what I meant about it being a solid foundation: You don’t have to start from scratch with some custom RLE. Since the format is so simple, you can easily build your own tools — as I’ve already begun doing myself — so you don’t need to rely on tools supporting your QOI fork.

Minimalist API

I’m really happy with my QOI implementation, particularly since it’s another example of a minimalist C API: no allocating, no input or output, and no standard library use. As usual, the expectation is that it’s in the same translation unit where it’s used, so it’s likely inlined into callers.

The encoder is streaming — it accepts and returns only a little bit of input and output at a time. It has three functions and one struct with no “public” fields:

struct qoiencoder qoiencoder(void *buf, int w, int h, const char *flags);
int qoiencode(struct qoiencoder *, void *buf, unsigned color);
int qoifinish(struct qoiencoder *, void *buf);

The first function initializes an encoder and writes a fixed-length header into the QOI buffer. The flags field is a mode string, like fopen. I would normally use bit flags, but this is a little experiment. The second function encodes a single pixel into the QOI buffer, returning the number of bytes written (possibly zero). The last flushes any encoding state and writes the end-of-stream marker. There are no errors. My typical use so far looks like:

char buf[16];
struct qoiencoder q = qoiencoder(buf, width, height, "a");
fwrite(buf, QOIHDRLEN, 1, file);
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        // ... compute 32-bit ABGR sample at (x, y) ...
        fwrite(buf, qoiencode(&q, buf, abgr), 1, file);
    }
}
fwrite(buf, qoifinish(&q, buf), 1, file);
fflush(file);
return ferror(file);

This appends encoder outputs to a buffered stream, but it could just as well accumulate directly into a larger buffer, advancing the write pointer a little after each call.

The decoder is two functions, but its struct has some “public” fields.

struct qoidecoder {
    int width, height;
    _Bool alpha, srgb, error;
    // ...
};
struct qoidecoder qoidecoder(const void *buf, int len);
static unsigned qoidecode(struct qoidecoder *);

The input is not streamed and the entire buffer must be loaded into memory at once — not too bad since it’s compressed, and perhaps even already loaded as part of the executable image — but the output is streamed, delivering one packed 32-bit ABGR sample per call. The decoder makes no assumptions about the output format, and the caller unpacks samples and stores them in whatever format is appropriate (shader texture, etc.).

To make it easier to use, my decoder range checks to guarantee that width and height can be multiplied without overflow. Unlike encoding, there may be errors due to invalid input, including that failed range check. The decoder error flag is “sticky” and the decoder returns zero samples when in an error state, so callers can wait to check for errors until the end. (Though if you’re only decoding embedded assets, then there are no practical errors, and checks can be removed/ignored.)
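The range check could be sketched as follows. This is not my decoder's code: qoi_dims and load_be32 are hypothetical names, and rejecting zero-sized images is a choice made for this sketch. The 14-byte header layout (magic "qoif", big endian width and height, then the two flag bytes) is from the QOI specification.

```c
#include <limits.h>
#include <stdint.h>

static uint32_t load_be32(const uint8_t *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3];
}

// Parse the dimensions from a QOI header. Returns 1 and stores them
// only when width*height fits in an int, otherwise returns 0.
static int qoi_dims(const uint8_t *hdr, int len, int *w, int *h)
{
    if (len < 14) {
        return 0;  // shorter than a QOI header
    }
    if (hdr[0] != 'q' || hdr[1] != 'o' || hdr[2] != 'i' || hdr[3] != 'f') {
        return 0;
    }
    uint32_t uw = load_be32(hdr + 4);
    uint32_t uh = load_be32(hdr + 8);
    if (uw == 0 || uh == 0) {
        return 0;  // sketch-specific choice: reject empty images
    }
    if (uw > (uint32_t)INT_MAX / uh) {
        return 0;  // width*height would overflow int
    }
    *w = (int)uw;
    *h = (int)uh;
    return 1;
}
```

After this check a caller can compute width*height with plain int arithmetic, as loadimage does.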

Example usage, copied almost verbatim from a real program:

int loadimage(Image *image, const uint8_t *qoi, int len)
{
    struct qoidecoder q = qoidecoder(qoi, len);
    if (/* image dimensions too large */) {
        return 0;
    }
    image->width  = q.width;
    image->height = q.height;
    int count = q.width * q.height;
    for (int i = 0; i < count; i++) {
        unsigned abgr = qoidecode(&q);
        image->data[4*i+0] = abgr >> 16;
        image->data[4*i+1] = abgr >>  8;
        image->data[4*i+2] = abgr >>  0;
        image->data[4*i+3] = abgr >> 24;
    }
    return !q.error;
}

Note the aforementioned awkward RGB shuffle.

It’s safe to say that I’m excited about QOI, and that it now has a permanent slot on my developer toolbelt.

I solved the Dandelions paper-and-pencil game

2022-10-12T03:02:27Z

I’ve been reading Math Games with Bad Drawings, a great book well-aligned to my interests. It’s given me a lot of new, interesting programming puzzles to consider. The first to truly nerd snipe me was Dandelions (full rules), an asymmetric paper-and-pencil game invented by the book’s author, Ben Orlin. Just as with British Square two years ago — and essentially following the same technique — I wrote a program that explores the game tree sufficiently to play either side perfectly, “solving” the game in its standard 5-by-5 configuration.

The source: dandelions.c

The game is played on a 5-by-5 grid where one player plays the dandelions, the other plays the wind. Players alternate, dandelions placing flowers and wind blowing in one of the eight directions, spreading seeds from all flowers along the direction of the wind. Each side gets seven moves, and the wind cannot blow in the same direction twice. The dandelions’ goal is to fill the grid with seeds, and the wind’s goal is to prevent this.

Try playing a few rounds with a friend, and you will probably find that dandelions is difficult, at least in your first games, as though it cannot be won. However, my engine proves the opposite: The dandelions always win with perfect play. In fact, it’s so lopsided that the dandelions’ first move is irrelevant. Every first move is winnable. If the dandelions blunder, typically wind has one narrow chance to seize control, after which wind probably wins with any (or almost any) move.

For reasons I’ll discuss later, I only solved the 5-by-5 game, and the situation may be different for the 6-by-6 variant. Also, unlike British Square, my engine does not exhaustively explore the entire game tree because it’s far too large. Instead it does a minimax search to the bottom of the tree and stops when it finds a branch where all leaves are wins for the current player. Because of this, it cannot maximize the outcome, such as winning as early as possible as dandelions or maximizing the number of empty grid spaces as wind. I also can’t quantify the exact size of the tree.

Like with British Square, my game engine only has a crude user interface for interactively exploring the game tree. While you can “play” it in a sense, it’s not intended to be played. It also takes a few seconds to initially explore the game tree, so wait for the >> prompt.

Bitboard seeding

I used bitboards of course: a 25-bit bitboard for flowers, a 25-bit bitboard for seeds, and an 8-bit set to track which directions the wind has blown. It’s especially well-suited for this game since seeds can be spread in parallel using bitwise operations. Shift the flower bitboard in the direction of the wind four times, ORing it into the seeds bitboard on each shift:

int wind;
uint32_t seeds, flowers;

flowers >>= wind;  seeds |= flowers;
flowers >>= wind;  seeds |= flowers;
flowers >>= wind;  seeds |= flowers;
flowers >>= wind;  seeds |= flowers;

Of course it’s a little more complicated than this. The flowers must be masked to keep them from wrapping around the grid, and wind may require shifting in the other direction. In order to “negative shift” I actually use a rotation (notated with >>> below). Consider, to rotate an N-bit integer left by R, one can right-rotate it by N-R — ex. on a 32-bit integer, a left-rotate by 1 is the same as a right-rotate by 31. So for a negative wind that goes in the other direction:

flowers >>> (wind & 31);
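C has no rotation operator, so the >>> notation has to be spelled out. A minimal sketch, with rotr32 as my own helper name:

```c
#include <stdint.h>

// Rotate x right by r bits. Negative r rotates left, because the
// count is reduced modulo 32 first: -1 & 31 == 31, and a right-rotate
// by 31 is a left-rotate by 1.
static uint32_t rotr32(uint32_t x, int r)
{
    r &= 31;
    return x >> r | x << (-r & 31);
}
```

Masking both shift counts keeps them in the range 0 to 31, so no shift is ever undefined, and compilers commonly recognize this pattern as a single rotate instruction.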

With such a “programmable shift” I can implement the bulk of the game rules using a couple of tables and no branches:

// clockwise, east is zero
static int8_t rot[] = {-1, -6, -5, -4, +1, +6, +5, +4};
static uint32_t mask[] = {
    0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
    0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};
f &= mask[dir];  f >>>= rot[dir] & 31;  s |= f;
f &= mask[dir];  f >>>= rot[dir] & 31;  s |= f;
f &= mask[dir];  f >>>= rot[dir] & 31;  s |= f;
f &= mask[dir];  f >>>= rot[dir] & 31;  s |= f;

The masks clear out the column/row about to be shifted “out” so that it doesn’t wrap around. Viewed in base-2, they’re 5-bit patterns repeated 5 times.
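Put together as compilable C, the seeding step might look like this sketch. The tables are the ones above; the helper names rotr32 and blow are mine, and the bit layout is an assumption read off the tables: cell (x, y) is bit y*5 + x, with x growing east and y growing south.

```c
#include <stdint.h>

// clockwise, east is zero
static const int8_t rot[8] = {-1, -6, -5, -4, +1, +6, +5, +4};
static const uint32_t mask[8] = {
    0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
    0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};

// Rotate right; a negative count rotates left.
static uint32_t rotr32(uint32_t x, int r)
{
    r &= 31;
    return x >> r | x << (-r & 31);
}

// Return the seeds bitboard after the wind blows in direction dir
// (0..7). Flowers step up to four cells along the wind, and each
// step drops seeds. The mask removes flowers about to leave the grid.
static uint32_t blow(uint32_t flowers, uint32_t seeds, int dir)
{
    uint32_t f = flowers;
    for (int i = 0; i < 4; i++) {
        f &= mask[dir];
        f = rotr32(f, rot[dir]);
        seeds |= f;
    }
    return seeds;
}
```

For example, a single flower in the top-left corner (bit 0) with an east wind seeds the rest of the top row, 0x1e, while a west wind seeds nothing because the mask clears it before the first shift.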

Bitboard packing and canonicalization

The entire game state is two 25-bit bitboards and an 8-bit set. That’s 58 bits, which fits in a 64-bit integer with bits to spare. How incredibly convenient! So I represent the game state using a 64-bit integer, using a packing like I did with British Square. The bottom 25 bits are the seeds, the next 25 bits are the flowers, and the next 8 are the wind set.

000000 WWWWWWWW FFFFFFFFFFFFFFFFFFFFFFFFF SSSSSSSSSSSSSSSSSSSSSSSSS
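A minimal sketch of that packing, assuming seeds occupy bits 0 to 24, flowers bits 25 to 49, and the wind set bits 50 to 57 as drawn above. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// Pack the three fields into one 64-bit game state.
static uint64_t pack(uint32_t seeds, uint32_t flowers, int wind)
{
    assert(seeds   <= 0x1ffffff);
    assert(flowers <= 0x1ffffff);
    assert(wind >= 0 && wind <= 255);
    return (uint64_t)wind<<50 | (uint64_t)flowers<<25 | seeds;
}

// Extract each field again.
static uint32_t get_seeds(uint64_t g)   { return (uint32_t)(g       & 0x1ffffff); }
static uint32_t get_flowers(uint64_t g) { return (uint32_t)(g >> 25 & 0x1ffffff); }
static int      get_wind(uint64_t g)    { return (int)(g >> 50 & 255); }
```

The top 6 bits stay zero here, leaving them free for other uses.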

Even more convenient, I could reuse my bitboard canonicalization code from British Square, also a 5-by-5 grid packed in the same way, saving me the trouble of working out all the bit sieves. I only had to figure out how to transpose and flip the wind bitset. Turns out that’s pretty easy, too. Here’s how I represent the 8 wind directions:

567
4 0
321

Flipping this vertically I get:

321
4 0
567

Unroll these to show how old maps onto new:

old: 01234567
new: 07654321

The new is just the old rotated and reversed. Transposition is the same story, just a different rotation. I use a small lookup table to reverse the bits, and then an 8-bit rotation. (See revrot.)
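As a sketch of the vertical flip only: direction d moves to (8 - d) mod 8, which is a bit reversal (d to 7 - d) followed by a left rotation by one. This version reverses with shifts and masks rather than the lookup table mentioned above, and the names rev8, rotl8, and wind_flipv are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// Reverse the bits of an 8-bit value.
static uint8_t rev8(uint8_t b)
{
    b = (uint8_t)((b & 0xf0) >> 4 | (b & 0x0f) << 4);
    b = (uint8_t)((b & 0xcc) >> 2 | (b & 0x33) << 2);
    b = (uint8_t)((b & 0xaa) >> 1 | (b & 0x55) << 1);
    return b;
}

// Rotate an 8-bit value left by n, 0 <= n < 8.
static uint8_t rotl8(uint8_t b, int n)
{
    assert(n >= 0 && n < 8);
    return (uint8_t)(b << n | b >> (8 - n));
}

// Vertical flip of the wind set: old bit d becomes new bit (8 - d) % 8.
static uint8_t wind_flipv(uint8_t w)
{
    return rotl8(rev8(w), 1);
}
```

Flipping twice restores the original set, which makes for a quick sanity check.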

To determine how many moves have been made, popcount the flower bitboard and wind bitset.

int moves = POPCOUNT64(g & 0x3fffffffe000000);

To test if dandelions have won:

int win = (g&0x1ffffff) == 0x1ffffff;

Since the plan is to store all the game states in a big hash table — an MSI double hash in this case — I’d like to reserve the zero value as a “null” board state. This lets me zero-initialize the hash table. To do this, I invert the wind bitset such that a 1 indicates the direction is still available. So the initial game state looks like this (in the real program this is accounted for in the previously-discussed turn popcount):

#define GAME_INIT ((uint64_t)255 << 50)

The remaining 6 bits can be used to cache information about the rest of the tree under this game state, namely who wins from this position, and this serves as the “value” in the hash table. Turns out the bitboards are already noisy enough that a single xorshift makes for a great hash function. The hash table, including hash function, is under a dozen lines of code.

// Find the hash table slot for the given game state.
uint64_t *lookup(uint64_t *ht, uint64_t g)
{
    uint64_t hash = g ^ g>>32;
    size_t mask = (1L << HASHTAB_EXP) - 1;
    size_t step = hash>>(64 - HASHTAB_EXP) | 1;
    for (size_t i = hash;;) {
        i = (i + step)&mask;
        if (!ht[i] || (ht[i]&0x3ffffffffffffff) == g) {
            return ht + i;
        }
    }
}
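Here is a sketch of how such a table might be used, with a tiny exponent so it can live on the stack. The helpers remember and recall, the STATE_MASK name, and the treatment of the top 6 bits as an opaque small integer are all assumptions for illustration. Note the parentheses around the mask in the comparison: in C, == binds tighter than &.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASHTAB_EXP 10                 // tiny table, illustration only
#define STATE_MASK  0x3ffffffffffffff  // low 58 bits: the game state

// Find the hash table slot for the given game state. The table must
// always contain at least one empty slot or this never returns.
static uint64_t *lookup(uint64_t *ht, uint64_t g)
{
    uint64_t hash = g ^ g>>32;
    size_t mask = ((size_t)1 << HASHTAB_EXP) - 1;
    size_t step = hash>>(64 - HASHTAB_EXP) | 1;
    for (size_t i = hash;;) {
        i = (i + step)&mask;
        if (!ht[i] || (ht[i]&STATE_MASK) == g) {
            return ht + i;
        }
    }
}

// Cache a 6-bit value for state g, which must be nonzero.
static void remember(uint64_t *ht, uint64_t g, int value)
{
    assert(g && g <= STATE_MASK);
    assert(value >= 0 && value < 64);
    *lookup(ht, g) = (uint64_t)value<<58 | g;
}

// Return the cached value for g, or -1 if g has not been stored.
static int recall(uint64_t *ht, uint64_t g)
{
    uint64_t slot = *lookup(ht, g);
    return slot ? (int)(slot >> 58) : -1;
}
```

Because every stored state is nonzero, a zero slot unambiguously means "empty," which is what allows the zero-initialized table.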

To explore a 6-by-6 grid I’d need to change my representation, which is part of why I didn’t do it. I can’t fit two 36-bit bitboards in a 64-bit integer, so I’d need to double my storage requirements, which are already strained.

Computational limitations

Due to the way seeds spread, game states resulting from different moves rarely converge back to a common state later in the tree, so the hash table isn’t doing much deduplication. Exhaustively exploring the entire game tree, even cutting it down to an 8th using canonicalization, requires substantial computing resources, more than I personally have available for this project. So I had to stop at the slightly weaker form: finding a winning branch rather than maximizing a “score.”

I configure the program to allocate 2GiB for the hash table, but if you run just a few dozen games off the same table (same program instance), each exploring different parts of the game tree, you’ll exhaust this table. A 6-by-6 doubles the memory requirements just to represent the game, but it also slows the search and substantially increases the width of the tree, which grows 44% faster. I’m sure it can be done, but it’s just beyond the resources available to me.

Dandelion Puzzles

As a side effect, I wrote a small routine to randomly play out games in search for “mate-in-two”-style puzzles. The dandelions have two flowers to place and can force a win with two specific placements — and only those two placements — regardless of how the wind blows. Here are two of the better ones, each involving a small trick that I won’t give away here (note: arrowheads indicate directions wind can still blow):

There are a variety of potential single-player puzzles of this form.

Cooperative: place a dandelion and pick the wind direction
Avoidance: don’t seed a particular tile
Hard ground: certain tiles can’t grow flowers (but still get seeded)
Weeding: as wind, figure out which flower to remove before blowing

There could be a whole “crossword book” of such dandelion puzzles.

How to build a WaitGroup from a 32-bit integer

2022-10-05T03:19:07Z

Go has a nifty synchronization utility called a WaitGroup, on which one or more goroutines can wait for concurrent task completion. In other languages, the usual task completion convention is joining threads doing the work. In Go, goroutines aren’t values and lack handles, so a WaitGroup replaces joins. Building a WaitGroup using typical, portable primitives is a messy affair involving constructors and destructors, managing lifetimes. However, on at least Linux and Windows, we can build a WaitGroup out of a zero-initialized integer, much like my 32-bit queue and 32-bit barrier.

In case you’re not familiar with it, a typical WaitGroup use case in Go:

var wg sync.WaitGroup
for _, task := range tasks {
    wg.Add(1)
    go func(t Task) {
        // ... do task ...
        wg.Done()
    }(task)
}
wg.Wait()

I zero-initialize the WaitGroup, the main goroutine increments the counter before starting each task goroutine, each goroutine decrements the counter when done, and the main goroutine waits until the counter reaches zero. My goal is to build the same mechanism in C:

void workfunc(task t, int *wg)
{
    // ... do task ...
    waitgroup_done(wg);
}

int main(void)
{
    // ...
    int wg = 0;
    for (int i = 0; i < ntasks; i++) {
        waitgroup_add(&wg, 1);
        go(workfunc, tasks[i], &wg);
    }
    waitgroup_wait(&wg);
    // ...
}

When it’s done, the WaitGroup is back to zero, and no cleanup is required.

I’m going to take it a little further than that: Since its meaning and contents are explicit, you may initialize a WaitGroup to any non-negative task count! In other words, waitgroup_add is optional if the total number of tasks is known up front.

    int wg = ntasks;
    for (int i = 0; i < ntasks; i++) {
        go(workfunc, tasks[i], &wg);
    }
    waitgroup_wait(&wg);

A sneak peek at the full source: waitgroup.c

The four elements (of synchronization)

To build this WaitGroup, we’re going to need four primitives from the host platform, each operating on an int. The first two are atomic operations, and the second two interact with the system scheduler. To port the WaitGroup to a platform you need only implement these four functions, typically as one-liners.

static int  load(int *);           // atomic load
static int  addfetch(int *, int);  // atomic add-then-fetch
static void wait(int *, int);      // wait on change at address
static void wake(int *);           // wake all waiters by address

The first two should be self-explanatory. The wait function waits for the pointed-at integer to change its value, and the second argument is its expected current value. The scheduler will double-check the integer before putting the thread to sleep in case it changes at the last moment — in other words, an atomic check-then-maybe-sleep. The wake function is the other half. After changing the integer, a thread uses it to wake all threads waiting for the pointed-at integer to change. Together, this mechanism is known as a futex.

I’m going to simplify the WaitGroup semantics a bit in order to make my implementation even simpler. Go’s WaitGroup allows adding negatives, and the Add method essentially does double-duty. My version forbids adding negatives. That means the “add” operation is just an atomic increment:

void waitgroup_add(int *wg, int delta)
{
    addfetch(wg, delta);
}

Since it cannot bring the counter to zero, there’s nothing else to do. The “done” operation can decrement to zero:

void waitgroup_done(int *wg)
{
    if (!addfetch(wg, -1)) {
        wake(wg);
    }
}

If the atomic decrement brought the count to zero, we finished the last task, so we need to wake the waiters. We don’t know if anyone is actually waiting, but that’s fine. Some futex use cases will avoid making the relatively expensive system call if nobody’s waiting — i.e. don’t waste time on a system call for each unlock of an uncontended mutex — but in the typical WaitGroup case we expect a waiter when the count finally goes to zero. That’s the common case.

The most complicated of the three is waiting:

void waitgroup_wait(int *wg)
{
    for (;;) {
        int c = load(wg);
        if (!c) {
            break;
        }
        wait(wg, c);
    }
}

First check if the count is already zero and return if it is. Otherwise use the futex to wait for it to change. Unfortunately that’s not exactly the semantics we want, which would be to wait for a certain target. This doesn’t break the wait, but it’s a potential source of inefficiency. If a thread finishes a task between our load and wait, we don’t go to sleep, and instead try again. However, in practice, I ran thousands of threads through this thing concurrently and I couldn’t observe such a “miss.” As far as I can tell, it’s so rare it doesn’t matter.

If this was a concern, the WaitGroup could instead be a pair of integers: the counter and a “latch” that is either 0 or 1. Waiters wait on the latch, and the latch is modified (atomically) when the counter transitions to or from zero. That gives waiters a stable value on which to wait, proxying the counter. However, since this doesn’t seem to matter in practice, I prefer the elegance and simplicity of the single-integer WaitGroup.

Four elements: Linux

With the WaitGroup done at a high level, we now need the per-platform parts. Both GCC and Clang support GNU-style atomics, so I’ll just assume these are available on Linux without worrying about the compiler. The first two functions wrap these built-ins:

static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static int addfetch(int *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

For wait and wake we need the futex(2) system call. In an attempt to discourage its direct use, glibc doesn’t wrap this system call in a function, so we must make the system call ourselves.

static void wait(int *p, int current)
{
    syscall(SYS_futex, p, FUTEX_WAIT, current, 0, 0, 0);
}

static void wake(int *p)
{
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, 0, 0, 0);
}

The INT_MAX means “wake as many as possible.” The other common value is 1 for waking a single waiter. Also, these system calls can’t meaningfully fail, so there’s no need to check the return value. If wait wakes up early (e.g. EINTR), it’s going to check the counter again anyway. In fact, if your kernel is more than 20 years old, predating futexes, and returns ENOSYS (“Function not implemented”), it will still work correctly, though it will be incredibly inefficient.
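For reference, here is how the Linux pieces might be assembled into one translation unit, including the headers the fragments above leave out. This is a sketch assuming Linux with GCC or Clang. It renames wait and wake to futex_wait and futex_wake (hypothetical names, chosen to sidestep any clash with POSIX wait(2) if sys/wait.h gets included), and the assertion on delta is an addition.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static int addfetch(int *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

// Sleep only if *p still equals current. Errors are deliberately ignored.
static void futex_wait(int *p, int current)
{
    syscall(SYS_futex, p, FUTEX_WAIT, current, 0, 0, 0);
}

// Wake every thread waiting on p.
static void futex_wake(int *p)
{
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, 0, 0, 0);
}

void waitgroup_add(int *wg, int delta)
{
    assert(delta >= 0);  // this simplified WaitGroup forbids negatives
    addfetch(wg, delta);
}

void waitgroup_done(int *wg)
{
    if (!addfetch(wg, -1)) {
        futex_wake(wg);
    }
}

void waitgroup_wait(int *wg)
{
    for (;;) {
        int c = load(wg);
        if (!c) {
            break;
        }
        futex_wait(wg, c);
    }
}
```

Task threads call waitgroup_done when finished, and any thread may call waitgroup_wait. With a count of zero the wait returns immediately without touching the kernel.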

Four elements: Windows

Windows didn’t support futexes until Windows 8 in 2012, and Microsoft was still supporting Windows versions without them into 2020, so they’re still relatively “new” for this platform. Nonetheless, they’re now mature enough that we can count on them being available.

I’d like to support both GCC-ish (via Mingw-w64) and MSVC-ish compilers. Mingw-w64 provides a compatible intrin.h, so I can stick to MSVC-style atomics and cover both at once. On the other hand, MSVC doesn’t define atomics for int (or even int32_t), strictly long, so I have to sneak in a little cast. (Recall: sizeof(long) == sizeof(int) on every version of Windows supporting futexes.) The other option is to typedef the WaitGroup so that it’s int on Linux (for the futex) and long on Windows (for atomics).

static int load(int *p)
{
    return _InterlockedOr((long *)p, 0);
}

static int addfetch(int *p, int addend)
{
    return addend + _InterlockedExchangeAdd((long *)p, addend);
}

The official, sanctioned futex functions are WaitOnAddress and WakeByAddressAll. They used to be in kernel32.dll, but as of this writing they live in API-MS-Win-Core-Synch-l1-2-0.dll, linked via -lsynchronization. Gross. Since I can’t stomach this, I instead call the low-level RTL functions where it’s actually implemented: RtlWaitOnAddress and RtlWakeAddressAll. These live in the nice neighborhood of ntdll.dll. They’re undocumented as far as I can tell, but thankfully Wine comes to the rescue, providing both documentation and several different implementations. Reading through it is educational, and hints at ways to construct futexes on systems lacking them.

These functions aren’t declared in any headers, so I have to do it myself. On the plus side, so far I haven’t paid the substantial compile-time costs of including windows.h, and so I can continue avoiding it. These functions are listed in the ntdll.dll import library, so I don’t need to invent the import library entries.

__declspec(dllimport)
long __stdcall RtlWaitOnAddress(void *, void *, size_t, void *);
__declspec(dllimport)
long __stdcall RtlWakeAddressAll(void *);

Rather conveniently, the semantics perfectly line up with Linux futexes!

static void wait(int *p, int current)
{
    RtlWaitOnAddress(p, &current, sizeof(*p), 0);
}

static void wake(int *p)
{
    RtlWakeAddressAll(p);
}

Like with Linux, there’s no meaningful failure, so the return values don’t matter.

That’s the whole implementation. Considering just a single platform, a flexible, lightweight, and easy-to-use synchronization facility in ~50 lines of relatively simple code is a pretty good deal if you ask me!

Illuminating synchronization edges for ThreadSanitizer

2022-10-03T03:09:38Z

Sanitizers are powerful development tools which complement debuggers and fuzzing. I typically have at least one sanitizer active during development. They’re particularly useful during code review, where they can identify issues before I’ve even begun examining the code carefully — sometimes in mere minutes under fuzzing. Accordingly, it’s a good idea to have your own code in good agreement with sanitizers before review. For ThreadSanitizer (TSan), that means dealing with false positives in programs relying on synchronization invisible to TSan.

This article’s motivation is multi-threaded epoll. I mitigate TSan false positives each time it comes up, enough to have gotten the hang of it, so I ought to document it. On Windows I would also run into the same issue with the Win32 message queue, crossing the synchronization edge between PostMessage (release) and GetMessage (acquire), except for the general lack of TSan support in Windows tooling. The same technique would work there as well.

My typical epoll scenario looks like so:

Create an epoll file descriptor (epoll_create1).
Create worker threads, passing the epoll file descriptor.
Worker threads loop on epoll_wait.
Main thread loops on accept, adding sockets to epoll (epoll_ctl).

Between accept and EPOLL_CTL_ADD, the main thread allocates and initializes the client session state, then attaches it to the epoll event. The client socket is added with the EPOLLONESHOT flag, and the session state is not touched after the call to epoll_ctl (note: sans error checks):

for (;;) {
    int fd = accept(...);
    struct session *session = ...;
    session->fd = fd;
    // ...
    struct epoll_event event;
    event.events = EPOLLET | EPOLLONESHOT | ...;
    event.data.ptr = session;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
}

In this example, struct session is defined by the application to contain all the state for handling a session (file descriptor, buffers, state machine, parser state, allocation arena, etc.). Everything else is part of the epoll interface.

When a socket is ready, one of the worker threads receives it. Due to EPOLLONESHOT, it’s immediately disabled and no other thread can receive it. The thread does as much work as possible (i.e. read/write until EAGAIN), then reactivates it with epoll_ctl:

for (;;) {
    struct epoll_event event;
    epoll_wait(epfd, &event, 1, -1);
    struct session *session = event.data.ptr;
    int fd = session->fd;
    // ...
    event.events = EPOLLET | EPOLLONESHOT | ...;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
}

The shared variables in session are passed between threads through epoll using the event’s .data.ptr. These variables are potentially read and mutated by every thread, but it’s all perfectly safe without any further synchronization, i.e. no need for mutexes, etc. All the necessary synchronization is implicit in epoll.

In the initial hand-off, that EPOLL_CTL_ADD must happen before the corresponding epoll_wait in a worker thread. This establishes that the main thread and worker thread do not touch session variables concurrently. After all, how could the worker see an event on the file descriptor before it’s been added to epoll? The synchronization in epoll itself will also ensure all the architecture-level stores are visible to other threads before the hand-off. We can call the “add” a release and the “wait” an acquire, forming a synchronization edge.

Similarly, in the hand-off between worker threads, the EPOLL_CTL_MOD that reactivates the file descriptor must happen before the wait that observes the next event because, until reactivation, it’s disabled. The EPOLL_CTL_MOD is another release in relation to the acquire wait.

Unfortunately TSan won’t see things this way. It can’t see into the kernel, and it doesn’t know these subtle epoll semantics, so it can’t see these synchronization edges. As far as it can tell, threads might be accessing a session concurrently, and TSan will reliably produce warnings about it. You could shrug your shoulders and give up on using TSan in this case, but there’s an easy solution: introduce redundant, semantically identical synchronization edges, but only when TSan is looking.

WARNING: ThreadSanitizer: data race

Redundant synchronization

I prefer to solve this by introducing the weakest possible synchronization so that I’m not synchronizing beyond epoll’s semantics. This will help TSan catch real mistakes that stronger synchronization might hide.

The weakest option is memory fences. These wouldn’t introduce extra loads or stores. At most it would be a fence instruction. I would use GCC’s built-in __atomic_thread_fence for the job. However, TSan does not currently understand thread fences, so that defeats the purpose. Instead, I introduce a new field to struct session:

struct session {
    int fd;
    // ...
    int _sync;
};

Then just before epoll_ctl I’ll do a release store on this field, “releasing” the session. All session stores are ordered before the release.

    // main thread
    // ...
    __atomic_store_n(&session->_sync, 0, __ATOMIC_RELEASE);
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);

    // worker thread
    // ...
    __atomic_store_n(&session->_sync, 0, __ATOMIC_RELEASE);
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);

After epoll_wait I add an acquire load, “acquiring” the session. All session loads are ordered after the acquire.

    epoll_wait(epfd, &event, 1, -1);
    struct session *session = event.data.ptr;
    __atomic_load_n(&session->_sync, __ATOMIC_ACQUIRE);
    int fd = session->fd;
    // ...

For this to work, the thread must not touch session variables in any way before the acquire or after the release. For example, note how I obtained the client file descriptor before the release, i.e. no session->fd argument in the epoll_ctl call.

That’s it! This redundantly establishes the happens-before relationship already implicit in epoll, but now it’s visible to TSan. However, I don’t want to pay for this unless I’m actually running under TSan, so some macros are in order. GCC automatically defines __SANITIZE_THREAD__ when compiling with -fsanitize=thread:

#if __SANITIZE_THREAD__
# define TSAN_SYNCED     int _sync
# define TSAN_ACQUIRE(s) __atomic_load_n(&(s)->_sync, __ATOMIC_ACQUIRE)
# define TSAN_RELEASE(s) __atomic_store_n(&(s)->_sync, 0, __ATOMIC_RELEASE)
#else
# define TSAN_SYNCED
# define TSAN_ACQUIRE(s)
# define TSAN_RELEASE(s)
#endif
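GCC defines __SANITIZE_THREAD__, while Clang has traditionally reported the sanitizer through __has_feature(thread_sanitizer). Below is a sketch of a detection covering both, assuming GNU-style atomics. TSAN_ENABLED, publish, and receive are hypothetical names. This variant also folds the semicolon into TSAN_SYNCED so that normal builds don't leave an empty declaration inside the struct.

```c
#include <assert.h>

#if defined(__SANITIZE_THREAD__)
#  define TSAN_ENABLED 1
#elif defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    define TSAN_ENABLED 1
#  endif
#endif
#ifndef TSAN_ENABLED
#  define TSAN_ENABLED 0
#endif

#if TSAN_ENABLED
#  define TSAN_SYNCED     int _sync;
#  define TSAN_ACQUIRE(s) (void)__atomic_load_n(&(s)->_sync, __ATOMIC_ACQUIRE)
#  define TSAN_RELEASE(s) __atomic_store_n(&(s)->_sync, 0, __ATOMIC_RELEASE)
#else
#  define TSAN_SYNCED
#  define TSAN_ACQUIRE(s) ((void)0)
#  define TSAN_RELEASE(s) ((void)0)
#endif

struct session {
    int fd;
    TSAN_SYNCED  // expands to nothing in normal builds
};

// Last touch of the session before handing it to epoll_ctl.
static void publish(struct session *s, int fd)
{
    assert(fd >= 0);
    s->fd = fd;
    TSAN_RELEASE(s);
}

// First touch of the session after epoll_wait returns it.
static int receive(struct session *s)
{
    TSAN_ACQUIRE(s);
    return s->fd;
}
```

In a normal build both macros compile to nothing, so publish and receive reduce to a plain store and load.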

This also makes it more readable, and intentions clearer:

struct session {
    int fd;
    // ...
    TSAN_SYNCED;
};

    // main thread
    for (;;) {
        // ...
        TSAN_RELEASE(session);
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
    }

    // worker thread
    for (;;) {
        epoll_wait(epfd, &event, 1, -1);
        struct session *session = event.data.ptr;
        TSAN_ACQUIRE(session);
        int fd = session->fd;
        // ...
        TSAN_RELEASE(session);
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
    }

Now I can use TSan again, and it didn’t cost anything in normal builds.

The quick and practical "MSI" hash table

2022-08-08T23:57:08Z

Follow-up: Solving “Two Sum” in C with a tiny hash table

I generally prefer C, so I’m accustomed to building whatever I need on the fly, such as heaps, linked lists, and especially hash tables. Few programs use more than a small subset of a data structure’s features, making their implementation smaller, simpler, and more efficient than the general case, which must handle every edge case. A typical hash table tutorial will describe a relatively lengthy program, but in practice, bespoke hash tables are only a few lines of code. Over the years I’ve worked out some basic principles for hash table construction that aid in quick and efficient implementation. This article covers the technique and philosophy behind what I’ve come to call the “mask-step-index” (MSI) hash table, which is my standard approach.

MSI hash tables are nothing novel, just a double hashed, open address hash table layered generically atop an external array. It’s best regarded as a kind of database index — a lookup index over an existing array. The array exists independently, and the hash table provides an efficient lookup into that array over some property of its entries.

The core of the MSI hash table is this iterator function:

// Compute the next candidate index. Initialize idx to the hash.
int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (64 - exp)) | 1;
    return (idx + step) & mask;
}

The name should now make sense. I literally sound it out in my head when I type it, like a mnemonic. Compute a mask, then a step size, finally an index. The exp parameter is a power-of-two exponent for the hash table size, which may look familiar. I’ve used int32_t for the index, but it’s easy to substitute, say, size_t. I try to optimize for the common case, where a 31-bit index is more than sufficient, and a signed type since subscripts should be signed. Internally it uses unsigned types since overflow is both expected and harmless thanks to the power-of-two hash table size.

It’s the caller’s responsibility to compute the hash, and the MSI iterator tells the caller where to look next. For insertion, the caller (maybe) looks either for an existing entry to overwrite, or an empty slot. For lookup, the caller looks for a matching entry, giving up as soon as it finds an empty slot. An insertion loop looks like this string intern table:

#define EXP 15

// Initialize all slots to an "empty" value (null)
#define HT_INIT { {0}, 0 }
struct ht {
    char *ht[1<<EXP];
    int32_t len;
};

char *intern(struct ht *t, char *key)
{
    uint64_t h = hash(key, strlen(key)+1);
    for (int32_t i = h;;) {
        i = ht_lookup(h, EXP, i);
        if (!t->ht[i]) {
            // empty, insert here
            if ((uint32_t)t->len+1 == (uint32_t)1<<EXP) {
                return 0;  // out of memory
            }
            t->len++;
            t->ht[i] = key;
            return key;
        } else if (!strcmp(t->ht[i], key)) {
            // found, return canonical instance
            return t->ht[i];
        }
    }
}

The caller initializes the iterator to the hash result. This will probably be out of range, even negative, but that doesn’t matter. The iterator function will turn it into a valid index before use. This detail is key to double hashing: The low bits of the hash tell it where to start, and the high bits tell it how to step. The hash table size is a power of two, and the step size is forced to an odd number (via | 1), so it’s guaranteed to visit each slot in the table exactly once before restarting. It’s important that the search halts before looping, such as by guaranteeing the existence of an empty slot (i.e. the “out of memory” check).
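The full-cycle guarantee holds because an odd step shares no factor with a power-of-two table size. A small sketch that checks it empirically follows. The helper visits_all is hypothetical, and the 16-bit limit on exp exists only to keep the scratch array small.

```c
#include <assert.h>
#include <stdint.h>

// The MSI iterator from the text.
static int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (64 - exp)) | 1;
    return (int32_t)((idx + step) & mask);
}

// Return 1 if the first 2^exp probes for this hash never repeat a
// slot, which means every slot was visited exactly once.
static int visits_all(uint64_t hash, int exp)
{
    assert(exp >= 1 && exp <= 16);
    static unsigned char seen[1 << 16];
    int32_t n = (int32_t)1 << exp;
    for (int32_t k = 0; k < n; k++) {
        seen[k] = 0;
    }
    int32_t i = (int32_t)hash;  // possibly out of range; fixed on first step
    for (int32_t k = 0; k < n; k++) {
        i = ht_lookup(hash, exp, i);
        if (seen[i]) {
            return 0;
        }
        seen[i] = 1;
    }
    return 1;
}
```

Dropping the | 1 breaks the property: an even step would only ever reach a fraction of the slots.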

Note: The example out of memory check pushes the hash table to the absolute limit, and in practice you’d want to stop at a smaller load factor — perhaps even as low as 50% since that’s simple and fast. Otherwise it degrades into a linear search as the table approaches capacity.

Even if two keys start or land at the same place, they’ll quickly diverge due to differing steps. For a while I used plain linear probing (i.e. step=1), but double hashing came out ahead every time I benchmarked, steering me towards this “MSI” construction. Ideally ht_lookup would be placed so that it’s inlined (e.g. in the same translation unit) so that the mask and step are not actually recomputed each iteration.

Deletion

What about deletion? First, consider how infrequently you delete entries from a hash table. When was the last time you used del on a dictionary in Python, or delete on a map in Go? This operation is rarely needed. However, when you do need it, reserve a gravestone value in addition to the empty value.

static char gravestone[] = "(deleted)";

char *intern(struct ht *t, char *key)
{
    char **dest = 0;
    // ...
        if (!t->ht[i]) {
            // ...
            dest = dest ? dest : &t->ht[i];
            *dest = key;
            return key;
        } else if (t->ht[i] == gravestone) {
            dest = dest ? dest : &t->ht[i];
        } else if (!strcmp(...)) {
            // ...
        }
    // ...
}

char *unintern(struct ht *t, char *key)
{
    // ...
        if (!t->ht[i]) {
            return 0;
        } else if (t->ht[i] == gravestone) {
            // skip over
        } else if (!strcmp(...)) {
            char *old = t->ht[i];
            t->ht[i] = gravestone;
            return old;
        }
    // ...
}

When searching, skip over gravestones. Note that gravestones are compared with == (identity), so this does not preclude a string "(deleted)". When inserting, use the first gravestone found if no entry was found.
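The fragments above elide the loops, so here is one way they might be filled in, as a sketch rather than a definitive implementation. One assumption worth flagging: the counter, renamed used here, counts gravestones as well as live entries, and only grows when a truly empty slot is consumed. That keeps at least one empty slot in the table so every search terminates.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXP 4  // 16 slots, deliberately tiny

static char gravestone[] = "(deleted)";

struct ht {
    char *ht[1<<EXP];
    int32_t used;  // non-empty slots: live entries plus gravestones
};

static uint64_t hash(const char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}

static int32_t ht_lookup(uint64_t h, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(h >> (64 - exp)) | 1;
    return (int32_t)((idx + step) & mask);
}

char *intern(struct ht *t, char *key)
{
    assert(key != gravestone);
    char **dest = 0;
    uint64_t h = hash(key, (int32_t)strlen(key)+1);
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, EXP, i);
        if (!t->ht[i]) {
            if (!dest) {
                // consuming a fresh slot, so check capacity
                if (t->used+1 == 1<<EXP) {
                    return 0;  // out of memory
                }
                t->used++;
                dest = &t->ht[i];
            }
            *dest = key;
            return key;
        } else if (t->ht[i] == gravestone) {
            dest = dest ? dest : &t->ht[i];  // remember first gravestone
        } else if (!strcmp(t->ht[i], key)) {
            return t->ht[i];  // found, return canonical instance
        }
    }
}

char *unintern(struct ht *t, const char *key)
{
    uint64_t h = hash(key, (int32_t)strlen(key)+1);
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, EXP, i);
        if (!t->ht[i]) {
            return 0;  // not present
        } else if (t->ht[i] == gravestone) {
            // skip over
        } else if (!strcmp(t->ht[i], key)) {
            char *old = t->ht[i];
            t->ht[i] = gravestone;
            return old;
        }
    }
}
```

A table that sees heavy churn will slowly fill with gravestones under this scheme, at which point rebuilding it from the live entries clears them out.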

As a database index

Iterating over the example string intern table is simple: Iterate over the underlying array, skipping empty slots (and maybe gravestones). Entries will be in a random order rather than, say, insertion order. This is a useful introductory example, but this isn’t where MSI most shines. As mentioned, it’s best when treated like a database index.

Let’s take a step back and consider the caller of intern. How does it allocate these strings? Perhaps they’re appended to a buffer, and intern indicates whether or not the string is unique so far.

struct buf {
    // lookup table over the buffer
    struct ht ht;

    // a collection of strings
    int32_t len;
    char buf[BUFLEN];
};

Strings are only appended to the buffer when unique, and the hash table can make that determination in constant time.

char *buf_push(struct buf *b, char *s)
{
    size_t len = strlen(s) + 1;
    if (b->len+len > sizeof(b->buf)) {
        return 0;  // out of memory
    }

    char *candidate = b->buf + b->len;
    memcpy(candidate, s, len);

    char *result = intern(&b->ht, candidate);
    if (result == candidate) {
        // string is unique, keep it
        b->len += len;
    }
    return result;
}

In my first example, EXP was fixed. This could be converted into a dynamic allocation and the hash table resized as needed. Here’s a new constructor, which I’m including since I think it’s instructive:

struct ht {
    int32_t len;
    int exp;
    char **ht;
};

static struct ht
ht_new(int exp)
{
    struct ht ht = {0, exp, 0};

    assert(exp >= 0);
    if (exp >= 32) {
        return ht;  // request too large
    }

    ht.ht = calloc((size_t)1<<exp, sizeof(ht.ht[0]));
    return ht;
}

If intern fails, the hash table can be replaced with a new table twice as large, and since, like a database index, its contents are entirely redundant, the hash table can be discarded and rebuilt from scratch. The new and old table don’t need to exist simultaneously. Here’s a routine to populate an empty hash table from the buffer:

void buf_rehash(struct buf *b)
{
    assert(b->ht.len == 0);
    for (int32_t off = 0; off < b->len;) {
        char *s = b->buf + off;
        int32_t len = strlen(s) + 1;
        off += len;
        uint64_t h = hash(s, len);
        for (int32_t i = h;;) {
            i = ht_lookup(h, b->ht.exp, i);
            if (!b->ht.ht[i]) {
                b->ht.len++;
                b->ht.ht[i] = s;
                break;
            }
        }
    }
}

Note how this iterates in insertion order, which may be useful in other cases, too. On the rehash it doesn’t need to check for existing entries, as all entries are already known to be unique. Later when intern hits its capacity:

    char *result = intern(&b->ht, candidate);
    if (!result) {
        free(b->ht.ht);
        b->ht = ht_new(b->ht.exp+1);
        if (!b->ht.ht) {
            return 0;  // out of memory
        }
        buf_rehash(b);
        result = intern(&b->ht, candidate);  // cannot fail
    }

I freed and reallocated the table, but it would be trivial to use a realloc instead, unlike the case where the old table isn’t redundant.
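To see the pieces working together, here is a condensed sketch of the growable index over a string buffer. It follows the fragments above, but the arrangement is an assumption: BUFLEN is arbitrary, the caller is expected to set b->ht with ht_new before the first push, and the early return on a missing table is an addition.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BUFLEN (1<<12)
struct ht  { int32_t len; int exp; char **ht; };
struct buf { struct ht ht; int32_t len; char buf[BUFLEN]; };

static uint64_t hash(const char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}

static int32_t ht_lookup(uint64_t h, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(h >> (64 - exp)) | 1;
    return (int32_t)((idx + step) & mask);
}

static struct ht ht_new(int exp)
{
    struct ht ht = {0, exp, 0};
    assert(exp >= 1 && exp < 32);
    ht.ht = calloc((size_t)1<<exp, sizeof(ht.ht[0]));
    return ht;
}

static char *intern(struct ht *t, char *key)
{
    uint64_t h = hash(key, (int32_t)strlen(key)+1);
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, t->exp, i);
        if (!t->ht[i]) {
            if ((uint32_t)t->len+1 == (uint32_t)1<<t->exp) {
                return 0;  // table full
            }
            t->len++;
            return t->ht[i] = key;
        } else if (!strcmp(t->ht[i], key)) {
            return t->ht[i];
        }
    }
}

// Populate an empty table from the strings already in the buffer.
static void buf_rehash(struct buf *b)
{
    assert(b->ht.len == 0);
    for (int32_t off = 0; off < b->len;) {
        char *s = b->buf + off;
        int32_t len = (int32_t)strlen(s) + 1;
        off += len;
        uint64_t h = hash(s, len);
        for (int32_t i = (int32_t)h;;) {
            i = ht_lookup(h, b->ht.exp, i);
            if (!b->ht.ht[i]) {
                b->ht.len++;
                b->ht.ht[i] = s;
                break;
            }
        }
    }
}

char *buf_push(struct buf *b, const char *s)
{
    size_t len = strlen(s) + 1;
    if (!b->ht.ht || len > sizeof(b->buf) - (size_t)b->len) {
        return 0;  // no table, or buffer out of memory
    }
    char *candidate = b->buf + b->len;
    memcpy(candidate, s, len);
    char *result = intern(&b->ht, candidate);
    if (!result) {  // index full: discard it and rebuild twice as large
        int exp = b->ht.exp + 1;
        free(b->ht.ht);
        b->ht = ht_new(exp);
        if (!b->ht.ht) {
            return 0;  // out of memory
        }
        buf_rehash(b);
        result = intern(&b->ht, candidate);  // cannot fail
    }
    if (result == candidate) {
        b->len += (int32_t)len;  // string is unique, keep it
    }
    return result;
}
```

Pointers returned by buf_push stay valid across rehashes since only the index is rebuilt. The strings themselves never move.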

Multimaps

An MSI hash table is trivially converted into a multimap, a hash table with multiple values per key. Callers just make one small change: Don’t stop searching until an empty slot is found. Each match is an additional multimap value. The “value array” is stored along the hash table itself, in insertion order, without additional allocations.

For example, imagine the strings in the string buffer have a namespace prefix, delimited by a colon, like city:Austin and state:Texas. We’d like a fast lookup of all strings under a particular namespace. The solution is to add another hash table as you would an index to a database table.

struct buf {
    // ..
    struct ht ns;
    // ..
};

When a unique string is appended it’s also registered in the namespace multimap. It doesn’t check for an existing key, only for an empty slot, since it’s a multimap:

    // Check outside the loop since it always inserts.
    if (/* ... ns multimap lacks capacity ... */) {
        // ... grow+rehash ns multimap ...
    }

    int32_t nslen = strcspn(s, ":") + 1;
    uint64_t h = hash(s, nslen);
    for (int32_t i = h;;) {
        i = ht_lookup(h, b->ns.exp, i);
        if (!b->ns.ht[i]) {
            b->ns.len++;
            b->ns.ht[i] = s;
            break;
        }
    }

It includes the : as a terminator which simplifies lookups. Here’s a lookup loop to print all strings under a namespace (includes terminal : in the key):

    char *ns = "city:";
    int32_t nslen = strlen(ns);
    // ...

    uint64_t h = hash(ns, nslen);
    for (int32_t i = h;;) {
        i = ht_lookup(h, b->ns.exp, i);
        if (!b->ns.ht[i]) {
            break;
        } else if (!strncmp(b->ns.ht[i], ns, nslen)) {
            puts(b->ns.ht[i]+nslen);
        }
    }
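Both multimap loops can be exercised together in a small fixed-size sketch. The struct ns and the ns_add and ns_count helpers are hypothetical names, the table does not grow, and every string is assumed to carry a namespace prefix.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NS_EXP 5  // 32 slots, fixed for this sketch

struct ns {
    char *ht[1<<NS_EXP];
    int32_t len;
};

static uint64_t hash(const char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}

static int32_t ht_lookup(uint64_t h, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(h >> (64 - exp)) | 1;
    return (int32_t)((idx + step) & mask);
}

// Register s (e.g. "city:Austin") under its namespace, ':' included.
// Returns 0 when the multimap lacks capacity.
static int ns_add(struct ns *t, char *s)
{
    assert(strchr(s, ':'));
    if (t->len+1 == 1<<NS_EXP) {
        return 0;
    }
    int32_t nslen = (int32_t)strcspn(s, ":") + 1;
    uint64_t h = hash(s, nslen);
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, NS_EXP, i);
        if (!t->ht[i]) {  // only look for an empty slot
            t->len++;
            t->ht[i] = s;
            return 1;
        }
    }
}

// Count entries under a namespace such as "city:".
static int ns_count(struct ns *t, const char *ns)
{
    int32_t nslen = (int32_t)strlen(ns);
    uint64_t h = hash(ns, nslen);
    int count = 0;
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, NS_EXP, i);
        if (!t->ht[i]) {
            return count;  // empty slot ends the search
        }
        count += !strncmp(t->ht[i], ns, (size_t)nslen);
    }
}
```

The strncmp is still needed during lookup because entries from other namespaces can share parts of the same probe sequence.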

An alternative approach to multimaps is to additionally key over a value subscript. For example, the first city is keyed {"city", 0}, the next {"city", 1}, etc. The value subscript could be mixed into the string hash with an integer permutation (more on this below):

uint64_t h = hash64(val_idx ^ hash(s, nslen));

The lookup loop would compare both the string and the value subscript, and stop when it finds a match. The underlying hash table is not truly a multimap, but rather a plain hash table with a larger key. This requires extra bookkeeping — tracking individual subscripts and the number of values per key — but provides constant time random access on the multimap value array.

Hash functions

The MSI iterator leaves hashing up to the caller, who has better knowledge about the input and how to hash it, though this takes a bit of knowledge of how to build a hash function. The good news is that it’s easy, and less is more. Better to do too little than too much, and a faster, weaker hash function is worth a few extra collisions.

The first rule is to never lose sight of the goal: The purpose of the hash function is to uniformly distribute entries over a table. The better you know and exploit your input, the less you need to do in the hash function. Sometimes your keys already contain random data, and so your hash function can be the identity function! For example, if your keys are “version 4” UUIDs, don’t waste time hashing them, just load a few bytes from the end as an integer and you’re done.

// "Hash" a v4 UUID
uint64_t uuid4_hash(unsigned char uuid[16])
{
    uint64_t h;
    memcpy(&h, uuid+8, 8);
    return h;
}

A reasonable start for strings is FNV-1a, such as this possible implementation for my hash() function above:

uint64_t hash(char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}

The hash state is initialized to a basis, some arbitrary value. This is a useful place to introduce a seed or hash key. It’s best that at least one bit above the low mix-in bits is set so that it’s not trivially stuck at zero. Above, I’ve chosen the most trivial basis with reasonable results, though often I’ll use the digits of π.

Next XOR some input into the low bits. This could be a byte, a Unicode code point, etc. More is better, since otherwise you’re stuck doing more work per unit, the main weakness of FNV-1a. Carefully note the byte mask, & 255, which inhibits sign extension. Do not mix sign-extended inputs into FNV-1a — a widespread implementation mistake.

Multiply by a large, odd random-ish integer. A prime is a reasonable choice, and I usually pick my favorite prime, shown above: 19 ones in base 10.

Finally, my own touch, an xorshift finalizer. The high bits are much better mixed than the low bits, so this improves the overall quality. Though if you take time to benchmark, you might find that this finalizer isn’t necessary. Remember, do just enough work to keep the number of collisions low — not lowest — and no more.
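
To make the sign extension warning concrete, here is a small sketch comparing the masked mix-in against an unmasked one. Both function names are made up for this comparison, and the input is typed as signed char so the effect shows up regardless of whether plain char is signed on a given platform.

```c
#include <stdint.h>

// Masked: each byte contributes only to the low 8 bits, as intended.
static uint64_t hash_masked(const signed char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}

// Unmasked: a negative char converts to a uint64_t with all of the upper
// bits set, so a single byte flips 56 extra bits of the hash state.
static uint64_t hash_signext(const signed char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= (uint64_t)s[i];
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}
```

The two agree on pure ASCII input and diverge as soon as a byte at or above 0x80 appears, which is why the mistake tends to go unnoticed until the hash must match another implementation or another platform.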

If your input is made of integers, or is a short, fixed length, use an integer permutation, particularly multiply-xorshift. It takes very little to get a sufficient distribution. Sometimes one multiplication does the trick. Fixed-sized, integer-permutation hashes tend to be the fastest, easily beating fancier SIMD-based hashes, including AES-NI. For example:

// Hash a timestamp-based, version 1 UUID
uint64_t uuid1_hash(unsigned char uuid[16])
{
    uint64_t s[2];
    memcpy(s, uuid, 16);
    s[0] += 0x3243f6a8885a308d;  // digits of pi
    s[0] *= 1111111111111111111;
    s[0] ^= s[0] >> 33;
    s[0] += s[1];
    s[0] *= 1111111111111111111;
    s[0] ^= s[0] >> 33;
    return s[0];
}

If I benchmarked this in a real program, I would probably cut it down even further, deleting hash operations one at a time and measuring the overall hash table performance. This memcpy trick works well with floats, too, especially packing two single precision floats into one 64-bit integer.
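
For the float case, a minimal sketch might look like the following, assuming 32-bit floats and a hypothetical vec2_hash name. It packs both floats into one 64-bit integer with memcpy, then applies a single multiply-xorshift round.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

// Hash a pair of single precision floats by their bit patterns
static uint64_t vec2_hash(float x, float y)
{
    float v[2] = {x, y};
    uint64_t h;
    memcpy(&h, v, sizeof(h));
    h *= 1111111111111111111;
    h ^= h >> 33;
    return h;
}
```

Since this hashes bit patterns rather than values, 0.0f and -0.0f compare equal yet hash differently, so keys that might contain both probably need normalizing first.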

If you ever hesitate to build a hash table when the situation calls, I hope the MSI technique will make the difference next time. I have more hash table tricks up my sleeve, but since they’re not specific to MSI I’ll save them for a future article.

Benchmarks

There have been objections to my claims about performance, so I’ve assembled some benchmarks. These demonstrate that:

AES-NI is slower than an integer permutation, at least for short keys.
A custom, 10-line MSI hash table is easily an order of magnitude faster than a typical generic hash table from your language’s standard library. This isn’t because the standard hash table is inferior, but because it wasn’t written for your specific problem.

My new debugbreak command

2022-07-31T12:59:59Z

I previously mentioned the Windows feature where pressing F12 in a debuggee window causes it to break in the debugger. It works with any debugger — GDB, RemedyBG, Visual Studio, etc. — since the hotkey simply raises a breakpoint structured exception. It’s been surprisingly useful, and I’ve wanted it available in more contexts, such as console programs or even on Linux. The result is a new debugbreak command, now included in w64devkit. Though, of course, you already have everything you need to build it and try it out right now. I’ve also worked out a Linux implementation.

It’s named after an MSVC intrinsic and Win32 function. It takes no arguments, and its operation is indiscriminate: It raises a breakpoint exception in all debuggee processes system-wide. Reckless? Perhaps, but certainly convenient. You don’t need to tell it which process you want to pause. It just works, and a good debugging experience is one of ease and convenience.

The linchpin is DebugBreakProcess. The command walks the process list and fires this function at each process. Nothing happens for programs without a debugger attached, so it doesn’t even bother checking if it’s a debuggee. It couldn’t be simpler. I’ve used it on everything from Windows XP to Windows 11, and it’s worked flawlessly.

HANDLE s = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
PROCESSENTRY32W p = {sizeof(p)};
for (BOOL r = Process32FirstW(s, &p); r; r = Process32NextW(s, &p)) {
    HANDLE h = OpenProcess(PROCESS_ALL_ACCESS, 0, p.th32ProcessID);
    if (h) {
        DebugBreakProcess(h);
        CloseHandle(h);
    }
}

I use it almost exclusively from Vim, where I’ve given it a leader mapping. With the editor focused, I can type backslash then d to pause the debuggee.

map <leader>d :call system("debugbreak")<cr>

With the debuggee paused, I’m free to add new breakpoints or watchpoints, or print the call stack to see what the heck it’s busy doing. The mechanism behind DebugBreakProcess is to create a new thread in the target, with that thread raising the breakpoint exception. The debugger will be stopped in this new thread. In GDB you can use the thread command to switch over to the thread that actually matters, usually thr 1.

debugbreak on Linux

On unix-like systems the equivalent of a breakpoint exception is a SIGTRAP. There’s already a standard command for sending signals, kill, so a debugbreak command can be built using nothing more than a few lines of shell script. However, unlike DebugBreakProcess, signaling every process with SIGTRAP will only end in tears. The script will need a way to determine which processes are debuggees.

Linux exposes processes in the file system as virtual files under /proc, where each process appears as a directory. Its status file includes a TracerPid field, which will be non-zero for debuggees. The script inspects this field, and if non-zero sends a SIGTRAP.

#!/bin/sh
set -e
for pid in $(find /proc -maxdepth 1 -printf '%f\n' | grep '^[0-9]\+$'); do
    grep -q '^TracerPid:\s[^0]' /proc/$pid/status 2>/dev/null &&
        kill -TRAP $pid
done

This script, now part of my dotfiles, has worked very well so far, and effectively smoothes over some debugging differences between Windows and Linux, reducing my context switching mental load. There’s probably a better way to express this script, but that’s the best I could do so far. On the BSDs you’d need to parse the output of ps, though each system seems to do its own thing for distinguishing debuggees.
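
If the command were written in C instead of shell, the field check might be sketched as below. The tracer_pid name is hypothetical, and it operates on the text of a status file that has already been read into memory, leaving out the /proc walk and the kill(2) call.

```c
#include <stdlib.h>
#include <string.h>

// Return the TracerPid value from the text of a /proc/PID/status file,
// or -1 if the field is absent. Non-zero means the process is a debuggee.
static long tracer_pid(const char *status)
{
    static const char field[] = "TracerPid:";
    for (const char *line = status; line;) {
        if (!strncmp(line, field, sizeof(field)-1)) {
            // strtol skips the tab that follows the colon
            return strtol(line + sizeof(field)-1, 0, 10);
        }
        line = strchr(line, '\n');
        if (line) {
            line++;
        }
    }
    return -1;
}
```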

A missing feature

I had originally planned for one flag, -k. Rather than breakpoint debuggees, it would terminate all debuggee processes. This is especially important on Windows where debuggee processes block builds due to file locking shenanigans. I’d just run debugbreak -k as part of the build. However, it’s not possible to terminate debuggees paused in the debugger — the common situation. I’ve given up on this for now.

Assertions should be more debugger-oriented

2022-06-26T18:51:04Z

Prompted by a 20 minute video, over the past month I’ve improved my debugger skills. I’d shamefully acquired a bad habit: avoiding a debugger until exhausting dumber, insufficient methods. My first choice should be a debugger, but I had allowed a bit of friction to dissuade me. With some thoughtful practice and deliberate effort clearing the path, my bad habit is finally broken — at least when a good debugger is available. It feels like I’ve leveled up and, like touch typing, this was a skill I’d neglected far too long. One friction point was the less-than-optimal assert feature in basically every programming language implementation. It ought to work better with debuggers.

An assertion verifies a program invariant, and so if one fails then there’s undoubtedly a defect in the program. In other words, assertions make programs more sensitive to defects, allowing problems to be caught more quickly and accurately. Counter-intuitively, crashing early and often makes for more robust and reliable software in the long run. For exactly this reason, assertions go especially well with fuzzing.

assert(i >= 0 && i < len);   // bounds check
assert((ssize_t)size >= 0);  // suspicious size_t
assert(cur->next != cur);    // circular reference?

They’re sometimes abused for error handling, which is a reason they’ve also been (wrongfully) discouraged at times. For example, failing to open a file is an error, not a defect, so an assertion is inappropriate.

Normal programs have implicit assertions all over, even if we don’t usually think of them as assertions. In some cases they’re checked by the hardware. Examples of implicit assertion failures:

Out-of-bounds indexing
Dereferencing null/nil/None
Dividing by zero
Certain kinds of integer overflow (e.g. -ftrapv)

Programs are generally not intended to recover from these situations because, had they been anticipated, the invalid operation wouldn’t have been attempted in the first place. The program simply crashes because there’s no better alternative. Sanitizers, including Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan), are in essence additional, implicit assertions, checking invariants that aren’t normally checked.

Ideally a failing assertion should have these two effects:

Execution should immediately stop. The program is in an unknown state, so it’s neither safe to “clean up” nor attempt to recover. Additional execution will only make debugging more difficult, and may obscure the defect.
When run under a debugger — or visited as a core dump — it should break exactly at the failed assertion, ready for inspection. I should not need to dig around the call stack to figure out where the failure occurred. I certainly shouldn’t need to manually set a breakpoint and restart the program hoping to fail the assertion a second time. The whole reason for using a debugger is to save time, so if it’s wasting my time then it’s failing at its primary job.

I examined standard assert features across various language implementations, and none strictly meet the criteria. Fortunately, in some cases, it’s trivial to build a better assertion, and you can substitute your own definition. First, let’s discuss the way assertions disappoint.

A test assertion

My test for C and C++ is minimal but establishes some state and gives me a variable to inspect:

#include <assert.h>

int main(void)
{
    for (int i = 0; i < 10; i++) {
        assert(i < 5);
    }
}

Then I compile and debug in the most straightforward way:

$ cc -g -o test test.c
$ gdb test
(gdb) r
(gdb) bt

The r in GDB stands for run, which immediately breaks because of the assert. The bt prints a backtrace. On a typical Linux distribution that shows this backtrace:

#0  __GI_raise
#1  __GI_abort
#2  __assert_fail_base
#3  __GI___assert_fail
#4  main

Well, actually, it’s much messier than this, but I manually cleaned it up:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linu
x/raise.c:50
#1  0x00007ffff7df4537 in __GI_abort () at abort.c:79
#2  0x00007ffff7df440f in __assert_fail_base (fmt=0x7ffff7f5d
128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x
55555555600b "i < 5", file=0x555555556004 "test.c", line=6, f
unction=) at assert.c:92
#3  0x00007ffff7e03662 in __GI___assert_fail (assertion=0x555
55555600b "i < 5", file=0x555555556004 "test.c", line=6, func
tion=0x555555556011 <__PRETTY_FUNCTION__.0> "main") at assert
.c:101
#4  0x0000555555555178 in main () at test.c:6

That’s a lot to take in at a glance, and about 95% of it is noise that will never contain useful information. Most notably, GDB didn’t stop at the failing assertion. Instead there are four stack frames of libc junk I have to navigate before I can even begin debugging.

(gdb) up
(gdb) up
(gdb) up
(gdb) up

I must wade through this for every assertion failure. This is some of the friction that made me avoid the debugger in the first place. glibc loves indirection, so maybe the other libc implementations do better? How about musl?

#0  setjmp
#1  raise
#2  ??
#3  ??
#4  ??
#5  ??
#6  ??
#7  ??
#8  ??
#9  ??
#10 ??
#11 ??

Oops, without musl debugging symbols I can’t debug assertions at all because GDB can’t read the stack, so it’s lost. If you’re on Alpine you can install musl-dbg, but otherwise you’ll probably need to build your own from source. With debugging symbols, musl is no better than glibc:

#0  __restore_sigs
#1  raise
#2  abort
#3  __assert_fail
#4  main

Same with FreeBSD:

#0  thr_kill
#1  raise
#2  abort
#3  __assert
#4  main

OpenBSD has one fewer frame:

#0  thrkill
#1  _libc_abort
#2  _libc___assert2
#3  main

How about on Windows with Mingw-w64?

[Inferior 1 (process 7864) exited with code 03]

Oops, on Windows GDB doesn’t break at all on assert. You must first set a breakpoint on abort:

(gdb) b abort

Besides that, it’s the most straightforward so far:

#0 msvcrt!abort
#1 msvcrt!_assert
#2 main

With MSVC (default CRT) I get something slightly different:

#0 abort
#1 common_assert_to_stderr
#2 _wassert
#3 main
#4 __scrt_common_main_seh

RemedyBG leaves me at the abort like GDB does elsewhere. Visual Studio recognizes that I don’t care about its stack frames and instead puts the focus on the assertion, ready for debugging. The other stack frames are there, but basically invisible. It’s the only case that practically meets all my criteria!

I can’t entirely blame these implementations. The C standard requires that assert print a diagnostic and call abort, and that abort raises SIGABRT. There’s not much implementations can do, and it’s up to the debugger to be smarter about it.

Sanitizers

ASan doesn’t break GDB on assertion failures, which is yet another source of friction. You can work around this with an environment variable:

export ASAN_OPTIONS=abort_on_error=1:print_legend=0

This works, but it’s the worst case of all: I get 7 junk stack frames on top of the failed assertion. It’s also very noisy when it traps, so the print_legend=0 helps to cut it down a bit. I want this variable so often that I set it in my shell’s .profile so that it’s always set.

With UBSan you can use -fsanitize-undefined-trap-on-error, which behaves like the improved assertion. It traps directly on the defect with no junk frames, though it prints no diagnostic. As a bonus, it also means you don’t need to link libubsan. Thanks to the bonus, it fully supplants -ftrapv for me on all platforms.

Update November 2022: This “stop” hook eliminates ASan friction by popping runtime frames — functions with the reserved __ prefix — from the call stack so that they’re not in the way when GDB takes control. It requires Python support, which is the purpose of the feature-sniff outer condition.

if !$_isvoid($_any_caller_matches)
    define hook-stop
        while $_thread && $_any_caller_matches("^__")
            up-silently
        end
    end
end

This is now part of my .gdbinit.

A better assertion

At least when under a debugger, here’s a much better assertion macro for GCC and Clang:

#define assert(c) if (!(c)) __builtin_trap()

__builtin_trap inserts a trap instruction — a built-in breakpoint. By not calling a function to raise a signal, there are no junk stack frames and no need to breakpoint on abort. It stops exactly where it should as quickly as possible. This definition works reliably with GCC across all platforms, too. On MSVC the equivalent is __debugbreak. If you’re really in a pinch then do whatever it takes to trigger a fault, like dereferencing a null pointer. A more complete definition might be:

#ifdef DEBUG
#  if __GNUC__
#    define assert(c) if (!(c)) __builtin_trap()
#  elif _MSC_VER
#    define assert(c) if (!(c)) __debugbreak()
#  else
#    define assert(c) if (!(c)) *(volatile int *)0 = 0
#  endif
#else
#  define assert(c)
#endif

None of these print a diagnostic, but that’s unnecessary when a debugger is involved.
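
One hazard with a bare if in a macro is that an else following the assertion pairs with the macro’s hidden if. A possible variant, sketched here under the made-up names ASSERT and TRAP so it doesn’t collide with the definitions above, wraps the check in do/while so that it behaves as a single statement:

```c
// Hypothetical names for this sketch, not the article's definitions
#if defined(__GNUC__)
#  define TRAP() __builtin_trap()
#elif defined(_MSC_VER)
#  define TRAP() __debugbreak()
#else
#  define TRAP() (*(volatile int *)0 = 0)
#endif
#define ASSERT(c) do { if (!(c)) TRAP(); } while (0)

// With a bare "if" macro the else below would attach to the macro's own
// if, returning -1 on the valid path. Wrapped in do/while, it attaches
// to the visible if as written.
static int checked_get(const int *a, int len, int i, int check)
{
    if (check)
        ASSERT(i >= 0 && i < len);
    else
        return -1;
    return a[i];
}
```

The trade-off is a slightly noisier macro, and code that always braces its if statements never runs into the problem in the first place.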

Other languages

Unfortunately the situation mostly gets worse with other language implementations, and it’s generally not possible to build a better assertion. Assertions typically have exception-like semantics, if not literally just another exception, and so they are far less reliable. If a failed assertion raises an exception, then the program won’t stop until it’s unwound the stack — running destructors and such along the way — all the way to the top level looking for a handler. It only knows there’s a problem when nobody was there to catch it.

Go officially doesn’t have assertions, though panics are a kind of assertion. However, panics have exception-like semantics, and so suffer the problems of exceptions. A Go version of my test:

func main() {
    defer fmt.Println("DEFER")
    for i := 0; i < 10; i++ {
        if i >= 5 {
            panic(i)
        }
    }
}

If I run this under Go’s premier debugger, Delve, the unrecovered panic causes it to break. So far so good. However, I get two junk frames:

#0 runtime.fatalpanic
#1 runtime.gopanic
#2 main.main
#3 runtime.main
#4 runtime.goexit

It only knows to stop because the Go runtime called fatalpanic, but the backtrace is a fiction: The program continued to run after the panic, enough to run all the registered defers (including printing “DEFER”), unwinding the stack to the top level, and only then did it fatalpanic. Fortunately it’s still possible to inspect all those stack frames even if some variables may have changed while unwinding, but it’s more like inspecting a core dump than a paused process.

The situation in Python is similar: assert raises AssertionError — a plain old exception — and pdb won’t break until the stack has unwound, exiting context managers and such. Only once the exception reaches the top level does it enter “post mortem debugging,” like a core dump. At least there are no junk stack frames on top. If you’re using asyncio then your program may continue running for quite a while before the right tasks are scheduled and the exception finally propagates to the top level, if ever.

The worst offender of all is Java. First, jdb never breaks for unhandled exceptions. It’s up to you to set a breakpoint before the exception is thrown. But it gets worse: assertions are disabled under jdb. The Java assert statement is worse than useless.

Addendum: Don’t exit the debugger

The largest friction-reducing change I made is never exiting the debugger. Previously I would enter GDB, run my program, exit, edit/rebuild, repeat. However, there’s no reason to exit GDB! It automatically and reliably reloads symbols and updates breakpoints on symbols. It remembers your run configuration, so re-running is just r rather than interacting with shell history.

My workflow on all platforms (including Windows) is a vertically maximized Vim window and a vertically maximized terminal window. The new part for me: The terminal runs a long-term GDB session exclusively, with file set to the program I’m writing, usually set by the initial command line.

$ gdb myprogram
gdb>

Alternatively use file after starting GDB. Occasionally useful if my project has multiple binaries, and I want to examine a different program.

gdb> file myprogram

I use make and Vim’s :mak command for building from within the editor, so I don’t need to change context to build. The quickfix list takes me straight to warnings/errors. Often I’m writing something that takes input from standard input. So I use the run (r) command to set this up (along with any command line arguments).

gdb> r



You can redirect standard output as well. It remembers these settings for
plain run later, so I can test my program by entering r and nothing
else.

gdb> r


My usual workflow is edit, :mak, r, repeat. If I want to test a
different input or use different options, change the run configuration
using run again:

gdb> r -a -b -c 


On Windows you cannot recompile while the program is running. If GDB is
sitting on a breakpoint but I want to build, use kill (k) to stop it
without exiting GDB.

gdb> k


GDB has an annoying, flow-breaking yes/no prompt for this, so I recommend
set confirm no in your .gdbinit to disable it.

Sometimes a program is stuck in a loop and I need it to break in the
debugger. I try to avoid CTRL-C in the terminal since it can confuse
GDB. A safer option is to signal the process from Vim with pkill, which
GDB will catch (except on Windows):

:!pkill myprogram


I suspect many people don’t know this, but if you’re on Windows and
developing a graphical application, you can press F12 in the
debuggee’s window to immediately break the program in the attached
debugger. This is a general platform feature and works with any native
debugger. I’ve been using it quite a lot.

On that note, you can run commands from GDB with !, which is another way
to avoid having an extra terminal window around:

gdb> !git diff


In any case, GDB will re-read the binary on the next run and update
breakpoints, so it’s mostly seamless. If there’s a function I want to
debug, I set a breakpoint on it, then run.

gdb> b somefunc
gdb> r


Alternatively I’ll use a line number, which I read from Vim. Though GDB,
not being involved in the editing process, cannot track how that line
moves between builds.

An empty command repeats the last command, so once I’m at a breakpoint,
I’ll type next (n) — or step (s) to enter function calls — then
press enter each time I want to advance a line, often with my eye on the
context in Vim in the other window:

gdb> n
gdb>
gdb>


(I wish GDB could print a source listing around the breakpoint as
context, like Delve, but no such feature exists. The woeful list command
is inadequate. Update: GDB’s TUI is a reasonable compromise for GUI
applications or terminal applications running under a separate tty/console
with either tty or set new-console. I can access it everywhere since
w64devkit now supports GDB TUI.)

If I want to advance to the next breakpoint, I use continue (c):

gdb> c


If I’m walking through a loop, I want to see how variables change, but
it’s tedious to keep printing (p) the same variables again and again.
So I use display (disp) to display an expression with each prompt,
much like the “watch” window in Visual Studio. For example, if my loop
variable is i over some string str, this will show me the current
character in character format (/c).

gdb> disp/c str[i]


You can accumulate multiple expressions. Use undisplay to remove them.

Too many breakpoints? Use info breakpoints (i b) to list them, then
delete (d) the unwanted ones by ID.

gdb> i b
gdb> d 3 5 8


GDB has many more features than this, but 10 commands cover 99% of use
cases: r, c, n, s, disp, k, b, i, d, p.



My take on "where's all the code"
2022-05-22T23:59:59Z
This article was discussed on Lobsters.

Earlier this month Ted Unangst researched compiling the OpenBSD kernel
50% faster, which involved stubbing out the largest, extraneous
branches of the source tree. To find the lowest-hanging fruit, he wrote a
tool called watc — where’s all the code — that displays an
interactive “usage” summary of a source tree oriented around line count. A
followup post about exploring the tree in parallel got me thinking
about the problem, especially since I had just written about a concurrent
queue. Turning it over in my mind, I saw opportunities for interesting
data structures and memory management, and so I wanted to write my own
version of the tool, watc.c, which is the subject of this
article.



The original watc is interactive and written in idiomatic Go. My version
is non-interactive, written in C, and currently only supports Windows. Not
only do I prefer batch programs generally, building an interactive user
interface would be complicated and distract from the actual problem I
wanted to tackle. As for the platform restriction, it has some convenient
constraints (for implementers), and my projects are often about shooting
multiple birds with one stone:


  
    The longest path is MAX_PATH, a meager 260 pseudo-UTF-16 code points,
which is nice and short. Technically users can now opt in to a maximum path
length of 32,767, but so little software supports it, including
much of Windows itself, that it’s not worth considering. Even with the
upper limit, each path component is still restricted by MAX_PATH. I
can rely on this platform restriction in my design.
  
  
    Symbolic links, an annoying edge case, are outside of consideration.
Technically Windows has them, but they’re sufficiently locked away that
they don’t come up in practice.
  
  
    After years of deliberating, I was finally convinced to buy and
try RemedyBG, a super slick Windows debugger. I especially wanted
to try out its multi-threading support, and I knew I’d be using multiple
threads in this project. Since it’s incompatible with my development
kit, my program also supports the MSVC compiler.
  
  
    The very same day I improved GDB support in my development kit,
and this was a great opportunity to dogfood the changes. I’ve used my
kit so much these past two years, especially since both it and I have
matured enough that I’m nearly as productive in it as I am on Linux.
  
  
    It’s practice and experience with the wide API, and the tool
fully supports Unicode paths. Perhaps a bit unnecessary considering how
few source trees stray beyond ASCII, even just in source text — just too
many ways things go wrong otherwise.
  


Running my tool on nearly the same source tree as the original example
yields:

C:\openbsd>watc sys
. 6.89MLOC 364.58MiB
├─dev 5.69MLOC 332.75MiB
│ ├─pci 4.46MLOC 293.80MiB
│ │ ├─drm 3.99MLOC 280.25MiB
│ │ │ ├─amd 3.33MLOC 261.24MiB
│ │ │ │ ├─include 2.61MLOC 238.48MiB
│ │ │ │ │ ├─asic_reg 2.53MLOC 235.07MiB
│ │ │ │ │ │ ├─nbio 689.56kLOC 69.33MiB
│ │ │ │ │ │ ├─dcn 583.67kLOC 58.60MiB
│ │ │ │ │ │ ├─gc 290.26kLOC 28.90MiB
│ │ │ │ │ │ ├─dce 210.16kLOC 16.81MiB
│ │ │ │ │ │ ├─mmhub 155.60kLOC 16.03MiB
│ │ │ │ │ │ ├─dpcs 123.90kLOC 12.97MiB
│ │ │ │ │ │ ├─gca 105.91kLOC 5.87MiB
│ │ │ │ │ │ ├─bif 71.45kLOC 4.41MiB
│ │ │ │ │ │ ├─gmc 64.24kLOC 3.41MiB
│ │ │ │ │ │ └─(other) 230.99kLOC 18.73MiB
│ │ │ │ │ └─(other) 2.10kLOC 139.29kiB
│ │ │ │ └─(other) 718.93kLOC 22.76MiB
│ │ │ └─(other) 583.63kLOC 16.86MiB
│ │ └─(other) 8.53kLOC 259.07kiB
│ └─(other) 1.20MLOC 38.34MiB
└─(other) 1.20MLOC 31.83MiB


In place of interactivity it has -n (lines) and -d (depth) switches to
control tree pruning, where branches are summarized as (other) entries.
My idea is for users to run the tool repeatedly with different cutoffs and
filters to get a feel for where’s all the code. (It could really use
more such knobs.) Repeated counting makes performance all the more
important. On my machine, and a hot cache, the above takes ~180ms to count
those 6.89 million lines of code across 8,607 source files.

Each directory is treated like one big source file of its recursively
concatenated contents, so the tool only needs to track directories. Each
directory entry comprises a variable-length string name, line and byte
totals, and tree linkage such that it can be later navigated for sorting
and printing. That linkage has a clever solution, which I’ll get to later.
First, let’s deal with strings.

String management

It’s important to get out of the null-terminated string business early,
only reverting to their use at system boundaries, such as constructing
paths for the operating system. Better to handle strings as offset/length
pairs into a buffer. Definitely avoid silly things like allocating many
individual strings, as encouraged by strdup — and most other
programming language idioms — and certainly avoid useless functions like
strcpy.

When the operating system provides a path component that I need to track
for later, I intern it into a single, large buffer. That buffer looks like
so:

#define BUF_MAX  (1 << 22)
struct buf {
    int32_t len;
    wchar_t buf[BUF_MAX];
};


Empirically I determined that even large source trees cumulatively total
on the order of 10,000 characters of directory names. The OpenBSD kernel
source tree is only 2,992 characters of names.

$ find sys -type d -printf %f | wc -c
2992


The biggest I found was the LLVM source tree at 121,720 characters, not
only because of its sheer volume but also because it generally has
relatively long names. So for my maximum buffer size I just maxed it out
(explained in a moment) and called it good. Even with UTF-16, that’s only
8MiB which is perfectly reasonable to allocate all at once up front. Since
my string handles don’t contain pointers, this buffer could be freely
relocated in the case of realloc.

The operating system provides a null-terminated string. The buffer makes a
copy and returns a handle. A handle is a 32-bit integer encoding offset
and length.

int32_t buf_push(struct buf *b, wchar_t *s)
{
    int32_t off = b->len;
    int32_t len = wcslen(s);
    if (b->len+len > BUF_MAX) {
        return -1;  // out of memory
    }
    memcpy(b->buf+off, s, len*sizeof(*s));
    b->len += len;
    return len<<22 | off;
}


The negative range is reserved for errors, leaving 31 bits. I allocate 9
to the length — enough for MAX_PATH of 260 — and the remaining 22 bits
for the buffer offset, exactly matching the range of my BUF_MAX.
Splitting on a nibble boundary would have displayed more nicely in
hexadecimal during debugging, but oh well.

A couple of helper functions are in order:

int     str_len(int32_t s) { return s >> 22;      }
int32_t str_off(int32_t s) { return s & 0x3fffff; }
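
To show the handle round trip end to end, here is a self-contained sketch that repeats the buffer and helpers from above and adds one made-up helper, str_equals, for comparing an interned string against a null-terminated one.

```c
#include <stdint.h>
#include <string.h>
#include <wchar.h>

#define BUF_MAX  (1 << 22)
struct buf {
    int32_t len;
    wchar_t buf[BUF_MAX];
};

// Same interning function as above, repeated so the sketch stands alone
static int32_t buf_push(struct buf *b, const wchar_t *s)
{
    int32_t off = b->len;
    int32_t len = (int32_t)wcslen(s);
    if (b->len+len > BUF_MAX) {
        return -1;  // out of memory
    }
    memcpy(b->buf+off, s, len*sizeof(*s));
    b->len += len;
    return len<<22 | off;
}

static int     str_len(int32_t s) { return s >> 22;      }
static int32_t str_off(int32_t s) { return s & 0x3fffff; }

// Hypothetical helper: does handle s name the same string as z?
static int str_equals(struct buf *b, int32_t s, const wchar_t *z)
{
    size_t len = wcslen(z);
    return len == (size_t)str_len(s) &&
           !memcmp(b->buf+str_off(s), z, len*sizeof(*z));
}
```

Pushing L"sys" and then L"dev" into an empty buffer yields the handles 3<<22 | 0 and 3<<22 | 3: both have length 3, and the second begins where the first ended.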


Rather than allocate the string buffer on the heap, it’s a static (read:
too big for the stack) scoped to main. I consistently call it b.

static struct buf b;


That’s string management solved efficiently in a dozen lines of code. I
briefly considered a hash table to de-duplicate strings in the buffer, but
real source trees aren’t redundant enough to make up for the hash table
itself, plus there’s no reason here to make that sort of time/memory
trade-off.

Directory entries

I settled on 24-byte directory entries:

struct dir {
    uint64_t nbytes;
    uint32_t nlines;
    int32_t  name;
    int32_t  link;
    int32_t  nsubdirs;
};


For nbytes I teetered between 32 bits and 64 bits for the byte count. No
source tree I found overflows an unsigned 32-bit integer, but LLVM comes
close, just barely overflowing a signed 31-bit integer as of this year.
Since I wanted 10x over the worst case I could find, that left me with a
64-bit integer for bytes.

For nlines, 32 bits has plenty of headroom. More importantly, this field
is updated concurrently and atomically by multiple threads — line counting
is parallelized — and I want this program to work on 32-bit hosts limited
to 32-bit atomics.

The name is the string handle for that directory’s name.

The link and nsubdirs fields are the tree linkage. The link field is an
index, and serves two different purposes at different times. Initially it
will identify the directory’s parent directory, and I had originally named
it parent. nsubdirs is the number of subdirectories, but there is
initially no link to a directory’s children.

Like with the buffer, I pre-allocate all the directory entries I’ll need:

#define DIRS_MAX  (1 << 17)
int32_t ndirs = 0;
static struct dir dirs[DIRS_MAX];


A directory handle is just an index into dirs. The link field is one
such handle. Like string handles, directory entries contain no pointers,
and so this dirs buffer could be freely relocated, a la realloc, if
the context called for such flexibility. In my program, rather than
allocate this on the heap, it’s just a static (read: too big for the
stack) scoped to main.

For DIRS_MAX, I again looked at the worst case I could find, LLVM, which
requires 12,163 entries. I had hoped for 16-bit directory handles, but
that would limit source trees to 32,768 directories — not quite 10x over
the worst case. I settled on 131,072 entries: 3MiB. At only 11MiB total so
far, in the very worst case, it hardly matters that I couldn’t shave off
these extra few bytes.

$ find llvm-project -type d | wc -l
12163
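
The sizing arithmetic above can be checked mechanically. This is only a sketch: it repeats the struct from earlier, and dirs_bytes is a hypothetical helper introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIRS_MAX (1 << 17)

struct dir {
    uint64_t nbytes;
    uint32_t nlines;
    int32_t  name;
    int32_t  link;
    int32_t  nsubdirs;
};

// 8 + 4 + 4 + 4 + 4 bytes, with no padding on typical ABIs.
static_assert(sizeof(struct dir) == 24, "expected 24-byte entries");

// Total size of the pre-allocated table in bytes (hypothetical helper).
size_t dirs_bytes(void)
{
    return (size_t)DIRS_MAX * sizeof(struct dir);
}
```

With 131,072 entries of 24 bytes each this comes to exactly 3MiB, and the entry count is a little over 10x the 12,163 directories counted in LLVM.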


Allocating a directory entry is just a matter of bumping the ndirs
counter. Reading a directory into dirs looks roughly like so:

int32_t glob = buf_push(&b, L"*");
static struct dir dirs[DIRS_MAX];

int32_t parent = ...;  // an existing directory handle
wchar_t path[MAX_PATH];
buildpath(path, &b, dirs, parent, glob);

WIN32_FIND_DATAW fd;
HANDLE h = FindFirstFileW(path, &fd);

do {
    if (FILE_ATTRIBUTE_DIRECTORY & fd.dwFileAttributes) {
        int32_t name = buf_push(&b, fd.cFileName);
        if (name < 0 || ndirs == DIRS_MAX) {
            // out of memory
        }
        int32_t i = ndirs++;
        dirs[i].name = name;
        dirs[i].link = parent;
        dirs[parent].nsubdirs++;
    } else {
        // ... process file ...
    }
} while (FindNextFileW(h, &fd));

FindClose(h);  // search handles are closed with FindClose, not CloseHandle


Mentally bookmark that “process file” part. It will be addressed later.

The buildpath function walks the link fields, copying (memcpy) path
components from the string buffer into the path, separated by
backslashes.
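
As a rough, portable sketch of that walk: the following is not the program's buildpath but a stand-in with assumed names (sketch_buildpath, struct sketch_dir, SKETCH_PATH_MAX) and an assumed interface that takes the string storage directly. It collects handles while following link towards the root, then writes the components in reverse.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

#define SKETCH_PATH_MAX 260  // stand-in for MAX_PATH

struct sketch_dir { int32_t name, link; };  // link is the parent, -1 at root

// Same handle layout as str_len and str_off.
static int     sk_len(int32_t s) { return s >> 22;      }
static int32_t sk_off(int32_t s) { return s & 0x3fffff; }

// Writes "root\...\dir\file" into path and returns its length, or -1 if
// it does not fit. strings is the string buffer's storage.
int sketch_buildpath(wchar_t *path, const wchar_t *strings,
                     const struct sketch_dir *dirs, int32_t dir, int32_t file)
{
    assert(dir >= 0);
    int32_t parts[SKETCH_PATH_MAX/2];
    int n = 0;
    parts[n++] = file;
    for (int32_t d = dir; d >= 0; d = dirs[d].link) {
        if (n == SKETCH_PATH_MAX/2) return -1;
        parts[n++] = dirs[d].name;
    }

    int len = 0;
    while (n > 0) {
        int32_t s = parts[--n];
        int need = sk_len(s) + 1;  // component plus separator or terminator
        if (len + need > SKETCH_PATH_MAX) return -1;
        memcpy(path + len, strings + sk_off(s), sk_len(s) * sizeof(wchar_t));
        len += sk_len(s);
        path[len++] = n ? L'\\' : 0;
    }
    return len - 1;  // exclude the terminator
}
```

The last argument can be any string handle, which is how the same routine can serve both the "*" glob and, later, individual file names.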

Breadth-first tree traversal

At the top-level the program must first traverse a tree. There are two
strategies for traversing a tree (or any graph):


  Depth-first: stack-oriented (lends to recursion)
  Breadth-first: queue-oriented


Recursion makes me nervous, but besides this, a queue is already a natural
fit for this problem. The tree I build in dirs is also the breadth-first
processing queue. (Note: This is entirely distinct from the message
queue that I’ll introduce later, and is not a concurrent queue.) Further,
building the tree in dirs via breadth-first traversal will have useful
properties later.

The queue is initialized with the root directory, then iterated over until
the iterator reaches the end. Additional directories may be added during
iteration, per the last section.

int32_t root = ndirs++;
dirs[root].name = buf_push(&b, L".");
dirs[root].link = -1;  // terminator

for (int32_t parent = 0; parent < ndirs; parent++) {
    // ... FindFirstFileW / FindNextFileW ...
}


When the loop exits, the program has traversed the full tree. Counts are
now propagated up the tree using the link field, pointing from leaves to
root. In this direction it’s just a linked list. Propagation starts at the
root and works towards leaves to avoid multiple-counting, and the
breadth-first dirs is already ordered for this.

for (int32_t i = 1; i < ndirs; i++) {
    for (int32_t j = dirs[i].link; j >= 0; j = dirs[j].link) {
        dirs[j].nbytes += dirs[i].nbytes;
        dirs[j].nlines += dirs[i].nlines;
    }
}


Since this is really another traversal, this could be done during the
first traversal. However, line counting will be done concurrently, and
it’s easier, and probably more efficient, to propagate concurrent results
after the concurrent part of the code is complete.

Inverting the tree links

Printing the graph will require a depth-first traversal. Given an entry,
the program will iterate over its children. However, the tree links are
currently backwards, pointing from child to parent:



To traverse from root to leaves, those links will need to be inverted:



However, there’s only one link on each node, but potentially multiple
children. The breadth-first traversal comes to the rescue: All child nodes
for a given directory are adjacent in dirs. If link points to the
first child, finding the rest is trivial. There’s an implicit link between
siblings by virtue of position:



An entry’s first child immediately follows the previous entry’s last
child. So to flip the links around, manually establish the root’s link
field, then walk the tree breadth-first and hook link up to each entry’s
children based on the previous entry’s link and nsubdirs:

dirs[0].link = 1;
for (int32_t i = 1; i < ndirs; i++) {
    dirs[i].link = dirs[i-1].link + dirs[i-1].nsubdirs;
}
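
To see the inversion on concrete numbers, here is a small self-contained sketch. The struct is trimmed to the two linkage fields and the function name invert_links is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct node {
    int32_t link;      // parent index before inversion, first child after
    int32_t nsubdirs;  // number of children
};

// Rewrite parent links into first-child links. Assumes the nodes are
// stored in breadth-first order with the root at index 0.
void invert_links(struct node *dirs, int32_t ndirs)
{
    assert(ndirs > 0);
    dirs[0].link = 1;
    for (int32_t i = 1; i < ndirs; i++) {
        dirs[i].link = dirs[i-1].link + dirs[i-1].nsubdirs;
    }
}
```

Take a six-entry tree in breadth-first order: the root (0) has children 1 and 2, entry 1 has child 3, and entry 2 has children 4 and 5. After inversion the links of entries 0, 1, and 2 are 1, 3, and 4. Leaves end up with a link too, but with nsubdirs of zero it is never followed.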


The tree is now restructured for sorting and depth-first traversal.

Sort by line count

I won’t include it here, but I have a qsort-compatible comparison
function, dircmp, that compares by line count descending, then by name
ascending. Siblings in a file system tree cannot have equal names.

int dircmp(const void *, const void *);
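
For a feel of what such a comparator involves, here is a simplified sketch. It is not the real dircmp: the entry type is cut down, names are plain pointers rather than string handles, and entrycmp and sort_entries are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

// Simplified entry for illustration: the name is a plain pointer here
// rather than a handle into the string buffer.
struct entry {
    uint32_t       nlines;
    const wchar_t *name;
};

// qsort-compatible: line count descending, then name ascending.
int entrycmp(const void *p0, const void *p1)
{
    const struct entry *a = p0;
    const struct entry *b = p1;
    if (a->nlines != b->nlines) {
        return a->nlines < b->nlines ? +1 : -1;
    }
    return wcscmp(a->name, b->name);
}

void sort_entries(struct entry *e, size_t n)
{
    assert(e != NULL || n == 0);
    qsort(e, n, sizeof(*e), entrycmp);
}
```

A comparator working on real handles would also need access to the string buffer, which qsort cannot pass along, so it would have to reach the buffer through a variable visible at file scope.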


Since child entries are adjacent, it’s trivial to qsort each entry’s
children. A loop sorts the whole tree:

for (int32_t i = 0; i < ndirs; i++) {
    struct dir *beg = dirs + dirs[i].link;
    qsort(beg, dirs[i].nsubdirs, sizeof(*dirs), dircmp);
}


We’re almost to the finish line.

Depth-first traversal

As I said, recursion makes me nervous, so I took the slightly more
complicated route of an explicit stack. Path components must be separated
by a backslash delimiter, so the deepest possible stack is MAX_PATH/2.
Each stack element tracks a directory handle (d) and a subdirectory
index (i).

I have a function, printstat, to output an entry. It takes an entry, the string
buffer, and a depth for indentation level.

void printstat(struct dir *d, struct buf *b, int depth);


Here’s a simplified depth-first traversal calling printstat. (The real
one has to make decisions about when to stop and summarize, and it’s
dominated by edge cases.) I initialize the stack with the root directory,
then loop until it’s empty.

int n = 0;  // top of stack
struct {
    int32_t d;
    int32_t i;
} stack[MAX_PATH/2];

stack[n].d = 0;
stack[n].i = 0;
printstat(dirs+0, &b, n);

while (n >= 0) {
    int32_t d = stack[n].d;
    int32_t i = stack[n].i++;
    if (i >= dirs[d].nsubdirs) {
        n--;  // pop
    } else {
        int32_t cur = dirs[d].link + i;
        printstat(dirs+cur, &b, n);
        n++;  // push
        stack[n].d = cur;
        stack[n].i = 0;
    }
}
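
The same loop can be exercised without any output code by recording the visit order instead of printing. In this sketch the struct is reduced to the linkage fields, and dfs_order and DEPTH_MAX are assumed names.

```c
#include <assert.h>
#include <stdint.h>

#define DEPTH_MAX 130  // stand-in for MAX_PATH/2

struct node {
    int32_t link;      // index of first child
    int32_t nsubdirs;  // number of children
};

// Write the pre-order visit sequence into out (room for every node) and
// return how many nodes were visited.
int32_t dfs_order(const struct node *dirs, int32_t *out)
{
    int32_t count = 0;
    int n = 0;  // top of stack
    struct { int32_t d, i; } stack[DEPTH_MAX];

    stack[n].d = 0;
    stack[n].i = 0;
    out[count++] = 0;

    while (n >= 0) {
        int32_t d = stack[n].d;
        int32_t i = stack[n].i++;
        if (i >= dirs[d].nsubdirs) {
            n--;  // pop
        } else {
            int32_t cur = dirs[d].link + i;
            out[count++] = cur;
            n++;  // push
            assert(n < DEPTH_MAX);
            stack[n].d = cur;
            stack[n].i = 0;
        }
    }
    return count;
}
```

For a root with children 1 and 2, where 1 has child 3 and 2 has children 4 and 5, the recorded order is 0, 1, 3, 2, 4, 5: each directory is followed immediately by its own subtree.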


Concurrency

At this point the “process file” part of traversal was a straightforward
CreateFile, ReadFile loop, CloseHandle. I suspected it spent most of
its time in the loop counting newlines since I didn’t do anything special,
like SIMD, aside from not over-constraining code
generation.

However after taking some measurements, I found the program was spending
99.9% of its time waiting on Win32 functions. CreateFile was the most
expensive at nearly 50% of the total run time, and even CloseHandle was
a substantial blocker. These two alone meant overlapped I/O wouldn’t help
much, and threads were necessary to run these Win32 blockers concurrently.
Counting newlines, even over gigabytes of data, was practically free, and
so required no further attention.

So I set up my lock-free work queue.

#define QUEUE_LEN (1<<15)
struct queue {
    uint32_t q;
    int32_t d[QUEUE_LEN];
    int32_t f[QUEUE_LEN];
};


As before, q here is the atomic. The maximum size for QUEUE_LEN worked
best in my tests: larger queues were rarely full, and rarely empty except
at startup and shutdown. Queue elements are a pair of directory handle (d)
and file string handle (f), stored in separate arrays.

I didn’t need to push the file name strings into the string buffer before,
but now it’s a great way to supply strings to other threads. I push the
string into the buffer, then send the handle through the queue. The
recipient re-constructs the path on its end using the directory tree and
this file name. Unfortunately this puts more stress on the string buffer,
which is why I had to max out the size, but it’s worth it.

The “process file” part now looks like this:

dirs[parent].nbytes += fd.nFileSizeLow;
dirs[parent].nbytes += (uint64_t)fd.nFileSizeHigh << 32;

int32_t name = buf_push(&b, fd.cFileName);
if (!queue_send(&queue, parent, name)) {
    wchar_t path[MAX_PATH];
    buildpath(path, &b, dirs, parent, name);
    processfile(path, dirs, parent);
}


If queue_send() returns false then the queue is full, so it processes
the job itself. There might be room later for the next file.

Worker threads look similar, spinning until an item arrives in the queue:

    for (;;) {
        int32_t d;
        int32_t name;
        while (!queue_recv(q, &d, &name));
        if (d == -1) {
            return 0;
        }
        wchar_t path[MAX_PATH];
        buildpath(path, buf, dirs, d, name);
        processfile(path, dirs, d);
    }


A special directory entry handle of -1 tells the worker to exit. When
traversal completes, the main thread becomes a worker until the queue
empties, pushes one termination handle for each worker thread, then joins
the worker threads — a synchronization point that indicates all work is
complete, and the main thread can move on to propagation and sorting.
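
The queue_send and queue_recv functions are not shown here. One way they might be built on a packed 32-bit head/tail queue is sketched below, under assumptions: the sk_ names are hypothetical, the slot arrays are made atomic so that racing consumers never perform a plain data race, and a failed receive covers both "empty" and "lost the race", which suits a worker that simply retries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_EXP 15
#define SK_LEN (1 << SK_EXP)
static_assert(SK_EXP <= 15, "head needs a spare bit for overflow");

struct sk_queue {
    _Atomic uint32_t q;          // head in the low half, tail in the high half
    _Atomic int32_t  d[SK_LEN];  // directory handles
    _Atomic int32_t  f[SK_LEN];  // file name string handles
};

// Producer side (one thread only). Returns false if the queue is full.
bool sk_queue_send(struct sk_queue *q, int32_t d, int32_t f)
{
    uint32_t r    = atomic_load_explicit(&q->q, memory_order_acquire);
    uint32_t mask = (1u << SK_EXP) - 1;
    uint32_t head = r & mask;
    uint32_t tail = r >> 16 & mask;
    if (((head + 1) & mask) == tail) {
        return false;
    }
    if (r & 0x8000) {  // keep the head from carrying into the tail
        atomic_fetch_and(&q->q, ~(uint32_t)0x8000);
    }
    atomic_store_explicit(&q->d[head], d, memory_order_relaxed);
    atomic_store_explicit(&q->f[head], f, memory_order_relaxed);
    atomic_fetch_add_explicit(&q->q, 1, memory_order_release);
    return true;
}

// Consumer side (any number of threads). Returns false if the queue is
// empty or another consumer won the race; callers simply retry.
bool sk_queue_recv(struct sk_queue *q, int32_t *d, int32_t *f)
{
    uint32_t r    = atomic_load_explicit(&q->q, memory_order_acquire);
    uint32_t mask = (1u << SK_EXP) - 1;
    uint32_t head = r & mask;
    uint32_t tail = r >> 16 & mask;
    if (head == tail) {
        return false;
    }
    *d = atomic_load_explicit(&q->d[tail], memory_order_relaxed);
    *f = atomic_load_explicit(&q->f[tail], memory_order_relaxed);
    return atomic_compare_exchange_strong(&q->q, &r, r + 0x10000);
}
```

On a failed receive the output values may have been written with stale data, so a caller should only use them after a true return.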

This was a substantial performance boost. At least on my system, running
just 4 threads total is enough to saturate the Win32 interface, and
additional threads do not make the program faster despite more available
cores.

Aside from overall portability, I’m quite happy with the results.




A lock-free, concurrent, generic queue in 32 bits
2022-05-14T04:22:24Z
This article was discussed on Hacker News.

While considering concurrent queue design I came up with a generic,
lock-free queue that fits in a 32-bit integer. The queue is “generic” in
that a single implementation supports elements of any arbitrary type,
despite an implementation in C. It’s lock-free in that there is guaranteed
system-wide progress. It can store up to 32,767 elements at a time — more
than enough for message queues, which must always be bounded. I
will first present a single-consumer, single-producer queue, then expand
support to multiple consumers at a cost. Like my lightweight barrier,
I’m not presenting this as a packaged solution, but rather as a technique
you can apply when circumstances call.



How can the queue store so many elements when it’s just 32 bits? It only
handles the indexes of a circular buffer. The caller is responsible
for allocating and manipulating the queue’s storage, which, in the
single-consumer case, doesn’t require anything fancy. Synchronization is
managed by the queue.

Like a typical circular buffer, it has a head index and a tail index. The
head is the next element to be pushed, and the tail is the next element to
be popped. The queue storage must have a power-of-two length, but the
capacity is one less than the length. If the head and tail are equal then
the queue is empty. This “wastes” one element, which is why the capacity
is one less than the length of the storage. So already there are some
notable constraints imposed by this design, but I believe the main use
case for such a queue — a job queue for CPU-bound jobs — has no problem
with these constraints.
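
The index arithmetic behind those rules is small enough to sketch. The helper below is hypothetical (ring_count is not part of the queue's interface) and works on plain head and tail indexes rather than on the queue itself.

```c
#include <assert.h>
#include <stdint.h>

// Number of queued elements in a circular buffer of length 1<<exp, given
// its head (next to push) and tail (next to pop) indexes.
int ring_count(uint32_t head, uint32_t tail, int exp)
{
    assert(exp >= 1 && exp <= 15);
    uint32_t mask = (1u << exp) - 1;
    return (int)((head - tail) & mask);
}
```

With a 64-slot buffer, equal indexes give 0 (empty), and the largest possible result is 63, when the head sits one slot behind the tail. That is the "wasted" slot: a 64th element would make the head equal the tail again, which is indistinguishable from empty.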

Since this is a concurrent queue it’s worth noting “ownership” of storage
elements. The consumer owns elements from the tail up to, but excluding,
the head. The producer owns everything else. Both pushing and popping
involve a “commit” step that transfers ownership of an element to the
other thread. No elements are accessed concurrently, which makes things
easy for either caller.

Queue usage

Pushing (to the front) and popping (from the back) are each a three-step
process:


  Obtain the element index
  Access that element
  Commit the operation


I’ll be using C11 atomics for my implementation, but it should be easy to
translate these into something else no matter the programming language. As
I mentioned, the queue fits in a 32-bit integer, and so it’s represented
by an _Atomic uint32_t. Here’s the entire interface:

int  queue_pop(_Atomic uint32_t *queue, int exp);
void queue_pop_commit(_Atomic uint32_t *queue);

int  queue_push(_Atomic uint32_t *queue, int exp);
void queue_push_commit(_Atomic uint32_t *queue);


Both queue_pop and queue_push return -1 if the queue is empty/full.

To create a queue, initialize an atomic 32-bit integer to zero. Also
choose a size exponent and allocate some storage. Here’s a 63-element
queue of jobs:

#define EXP 6  // note: 2**6 == 64
struct job slots[1<<EXP];
_Atomic uint32_t q = 0;


Rather than a length, the queue functions accept a base-2 exponent, which
is why I’ve defined EXP. If you don’t like this, you can just accept a
length in your own implementation, though remember it’s constrained to
powers of two. The producer might look like so:

for (;;) {
    int i;
    do {
        i = queue_push(&q, EXP);
    } while (i < 0);  // note: busy-wait while full
    slots[i] = job_create();
    queue_push_commit(&q);
}


This is a busy-wait loop, which makes for a simple illustration but isn’t
ideal. In a real program I’d have the producer run a job while it
waits for a queue slot, or just have it turn into a consumer (if this
wasn’t a single-consumer queue). Similarly, if the queue is empty, then
maybe a consumer turns into the producer. It all depends on the context.

The consumer might look like so:

for (;;) {
    int i;
    do {
        i = queue_pop(&q, EXP);
    } while (i < 0);  // note: busy-wait while empty
    struct job job = slots[i];
    queue_pop_commit(&q);
    job_run(job);
}


In either case it’s important that neither touches the element after
committing since that transfers ownership away.

Pop operation

The queue is actually a pair of 16-bit integers, head and tail, stored in
the low and high halves of the 32-bit integer, respectively. So the first
thing to do is atomically load the integer, then extract these “fields.”

If for some reason a capacity of 32,767 is insufficient, you can trivially
upgrade your queue to an Enterprise Queue: a 64-bit integer with a
capacity of over 2 billion elements. I’m going to stick with the 32-bit
queue.

Starting with the pop operation since it’s simpler:

int queue_pop(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;  // consider "acquire"
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    return head == tail ? -1 : tail;
}


If the indexes are equal, the queue is empty. Otherwise return the tail
field. The *q is an atomic load since it’s qualified _Atomic. The load
might be more efficient if this were an explicit “acquire” operation,
which is what I used in some of my tests.

To complete the pop, atomically increment the tail index so that the
element falls out of the range of elements owned by the consumer. The tail
is the high half of the integer so add 0x10000 rather than just 1.

void queue_pop_commit(_Atomic uint32_t *q)
{
    *q += 0x10000;  // consider "release"
}


It’s harmless if this overflows since it’s congruent with the power-of-two
storage length, and an overflow won’t affect the head index. The increment
might be more efficient if this were an explicit “release” operation,
which, again, is what I used in some of my tests.

Push operation

Pushing is a little more complex. As is typical with circular buffers,
before doing anything it must ensure the result won’t ambiguously create
an empty queue.

int queue_push(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;  // consider "acquire"
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    int next = (head + 1u) & mask;
    if (r & 0x8000) {  // avoid overflow on commit
        *q &= ~0x8000;
    }
    return next == tail ? -1 : head;
}


It’s important that incrementing the head field won’t overflow into the
tail field, so it atomically clears the high bit if set, giving the
increment headroom into which it can overflow.

void queue_push_commit(_Atomic uint32_t *q)
{
    *q += 1;  // consider "release"
}
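
A quick way to convince yourself that the counters wrap safely is to run the four functions on a single thread for more iterations than either 16-bit field can count. This sketch copies the functions from above and adds a hypothetical roundtrip driver. It exercises only the index arithmetic, not the concurrency.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define EXP 6

static int queue_pop(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    return head == tail ? -1 : tail;
}

static void queue_pop_commit(_Atomic uint32_t *q)
{
    *q += 0x10000;
}

static int queue_push(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    int next = (head + 1u) & mask;
    if (r & 0x8000) {  // avoid overflow on commit
        *q &= ~0x8000;
    }
    return next == tail ? -1 : head;
}

static void queue_push_commit(_Atomic uint32_t *q)
{
    *q += 1;
}

// Push and immediately pop n values through a 64-slot queue on one
// thread. Returns the number of values that came back wrong.
long roundtrip(long n)
{
    static long slots[1 << EXP];
    _Atomic uint32_t q = 0;
    long errors = 0;
    for (long v = 0; v < n; v++) {
        int i = queue_push(&q, EXP);
        assert(i >= 0);  // never full: at most one element is queued
        slots[i] = v;
        queue_push_commit(&q);

        int j = queue_pop(&q, EXP);
        assert(j >= 0);  // never empty: an element was just pushed
        errors += slots[j] != v;
        queue_pop_commit(&q);
    }
    return errors;
}
```

Running it for a few hundred thousand iterations takes the head through its spare bit several times and the tail past the top of the 32-bit integer, and every value should still come back intact.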


Multiple-consumers

The single producer and single consumer required neither locks nor atomic
accesses to the storage array since the queue guaranteed that accesses at
the specified index were not concurrent. However, this is not the case
with multiple-consumers. Consumers race when popping. The loser’s access
might occur after the winner’s commit, making its access concurrent with
the producer. Both producer and consumers must account for this.

_Atomic struct job slots[1<<EXP];


To prepare for multiple consumers, the array now has an atomic qualifier:
one of the costs of multiple consumers. Fortunately these new atomic
accesses can use a “relaxed” ordering since there are no required ordering
constraints. Even if it wasn’t atomic, and the load was torn, we’d
detect it when attempting to commit. It’s simply against the rules to have
a data race, and I don’t know how else to avoid it other than dropping
into assembly.

The next cost is that committing can fail. Another consumer might have won
the race, which means you must start over. Here’s my multiple-consumer
interface, which I’ve uncreatively called mpop (“multiple-consumer
pop”). Besides a _Bool for indicating failure, the main change is a new
save parameter:

int   queue_mpop(_Atomic uint32_t *, int, uint32_t *save);
_Bool queue_mpop_commit(_Atomic uint32_t *, uint32_t save);


The caller must carry some temporary state (save), which is how failures
are detected, ultimately communicated by that _Bool return.

for (;;) {
    int i;
    uint32_t save;
    struct job job;
    do {
        do {
            i = queue_mpop(&q, EXP, &save);
        } while (i < 0);  // note: busy-wait while empty
        job = slots[i];
    } while (!queue_mpop_commit(&q, save));
    job_run(job);
}


It’s important that the consumer doesn’t attempt to use job until a
successful commit, since it might not be valid. As noted, that load could
be relaxed (what a mouthful):

job = atomic_load_explicit(slots+i, memory_order_relaxed);


Here’s the pop implementation:

int queue_mpop(_Atomic uint32_t *q, int exp, uint32_t *save)
{
    uint32_t r = *save = *q;
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    return head == tail ? -1 : tail;
}


So far it’s exactly the same, except it stores a full snapshot of the
queue state in *save. This is needed for a compare-and-swap (CAS) in the
commit, which checks that the queue hasn’t been modified concurrently
(i.e. by another consumer):

_Bool queue_mpop_commit(_Atomic uint32_t *q, uint32_t save)
{
    return atomic_compare_exchange_strong(q, &save, save+0x10000);
}


As always with CAS, we must be wary of the ABA problem. Imagine
that between starting to pop and this CAS that the producer and another
consumer looped over the entire queue and ended up back at exactly the
same spot as where we started. The queue would look like we expect, and
the commit would “succeed” despite reading a garbage value.

Fortunately this matches the entire 32-bit state, and so a small queue
capacity is not at a greater risk. The tail counter is always 16 bits, and
the head counter is 15 bits (due to keeping the 16th clear for overflow).
The chance of them landing at exactly the same count is low. Though if
those odds aren’t low enough, as mentioned you can always upgrade to the
64-bit Enterprise Queue with larger counters.

There’s a notable performance defect with this particular design. If the
producer concurrently pushes a new value, the commit will fail even if
there was no real race since only the head field changed. It would be
better if the head field was isolated from the tail field…

A less cheeky design

You might have noticed that there’s little reason to pack two 16-bit
counters into a 32-bit integer. These could just be fields in a structure:

struct queue {
    _Atomic uint16_t head;
    _Atomic uint16_t tail;
};


While this entire structure can be atomically loaded just like the 32-bit
integer, C11 (and later) do not permit non-atomic accesses to these atomic
fields in an unshared copy loaded from an atomic. So I’d either use
compiler-specific built-ins for atomics — much more flexible, and what I
prefer anyway — or just load them individually:

int queue_mpop(struct queue *q, int exp, uint16_t *save)
{
    int mask = (1u << exp) - 1;
    int head = q->head & mask;
    int tail = (*save = q->tail) & mask;
    return head == tail ? -1 : tail;
}


Technically with two loads this could extract a head/tail pair that
were never contemporaneous. The worst case is the queue appears empty even
if it was never actually empty.

_Bool queue_mpop_commit(struct queue *q, uint16_t save)
{
    return atomic_compare_exchange_strong(&q->tail, &save, save+1);
}


Since the head index isn’t part of the CAS, the producer can’t interfere
with the commit. (Though there’s still certainly false sharing happening.)

Real implementation and tests

If you want to try it out, especially with my tests: queue.c.
It has both single-consumer and multiple-consumer queues, and supports at
least:


  atomics: C11, GNU, MSC
  threads: pthreads, win32
  compilers: GCC, Clang, MSC
  hosts: Linux, Windows, BSD


That breadth is because I wanted to test across a variety of implementations,
especially under Thread Sanitizer (TSan). On a similar note, I also implemented a
concurrent queue shared between C and Go: queue.go.




Luhn algorithm using SWAR and SIMD
2022-04-30T17:53:05Z
Ever been so successful that credit card processing was your bottleneck?
Perhaps you’ve wondered, “If only I could compute check digits three times
faster using the same hardware!” Me neither. But if that ever happens
someday, then this article is for you. I will show how to compute the
Luhn algorithm in parallel using SIMD within a register, or
SWAR.

If you want to skip ahead, here’s the full source, tests, and benchmark:
luhn.c

The Luhn algorithm isn’t just for credit card numbers, but they do make a
nice target for a SWAR approach. The major payment processors use 16
digit numbers — i.e. 16 ASCII bytes — and typical machines today have
8-byte registers, so the input fits into two machine registers. In this
context, the algorithm works like so:


  
    Consider the digits of the number as an array, and double every other digit
starting with the first. For example, 6543 becomes 12, 5, 8, 3.
  
  
    Sum individual digits in each element. The example becomes 3 (i.e.
1+2), 5, 8, 3.
  
  
    Sum the array mod 10. Valid inputs sum to zero. The example sums to 9.
  


I will implement this algorithm in C with this prototype:

int luhn(const char *s);


It assumes the input is 16 bytes and only contains digits, and it will
return the Luhn sum. Callers either validate a number by comparing the
result to zero, or use it to compute a check digit when generating a
number. (Read: You could use SWAR to rapidly generate valid numbers.)
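
For reference, a straightforward digit-by-digit version of the three steps might look like the following. This is a sketch with an assumed name, luhn_ref, and is not necessarily the baseline used in the benchmark.

```c
#include <assert.h>

// Digit-by-digit Luhn sum over exactly 16 ASCII digits, doubling every
// other digit starting with the first.
int luhn_ref(const char *s)
{
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        assert(s[i] >= '0' && s[i] <= '9');
        int d = s[i] - '0';
        if (i % 2 == 0) {
            d *= 2;
            if (d > 9) {
                d -= 9;  // same as adding the two digits, e.g. 12 -> 1+2
            }
        }
        sum += d;
    }
    return sum % 10;
}
```

Padding the earlier example out to 16 digits, "6543000000000000" gives 3+5+8+3 = 19, so the function returns 9, while the common test number "4111111111111111" returns 0.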

The plan is to process the 16-digit number in two halves, and so first
load the halves into 64-bit registers, which I’m calling hi and lo:

uint64_t hi =
    (uint64_t)(s[ 0]&255) <<  0 | (uint64_t)(s[ 1]&255) <<  8 |
    (uint64_t)(s[ 2]&255) << 16 | (uint64_t)(s[ 3]&255) << 24 |
    (uint64_t)(s[ 4]&255) << 32 | (uint64_t)(s[ 5]&255) << 40 |
    (uint64_t)(s[ 6]&255) << 48 | (uint64_t)(s[ 7]&255) << 56;
uint64_t lo =
    (uint64_t)(s[ 8]&255) <<  0 | (uint64_t)(s[ 9]&255) <<  8 |
    (uint64_t)(s[10]&255) << 16 | (uint64_t)(s[11]&255) << 24 |
    (uint64_t)(s[12]&255) << 32 | (uint64_t)(s[13]&255) << 40 |
    (uint64_t)(s[14]&255) << 48 | (uint64_t)(s[15]&255) << 56;


This looks complicated and possibly expensive, but it’s really just an
idiom for loading a little endian 64-bit integer from a buffer. Breaking
it down:


  
    The input, *s, is char, which may be signed on some architectures. I
chose this type since it’s the natural type for strings. However, I do
not want sign extension, so I mask the low byte of the possibly-signed
result by ANDing with 255. It’s as though *s was unsigned char.
  
  
    The shifts assemble the 64-bit result in little endian byte order
regardless of the host machine byte order. In other words, this
will produce correct results even on big endian hosts.
  
  
    I chose little endian since it’s the natural byte order for all the
architectures I care about. Big endian hosts may pay a cost on this load
(byte swap instruction, etc.). The rest of the function could just as
easily be computed over a big endian load if I was primarily targeting a
big endian machine instead.
  
  
    I could have used unsigned long long (i.e. at least 64 bits) since
no part of this function requires exactly 64 bits. I chose uint64_t
since it’s succinct, and in practice, every implementation supporting
long long also defines uint64_t.
  


Both GCC and Clang figure this all out and produce perfect code. On
x86-64, just one instruction for each statement:

    mov  rax, [rdi+0]
    mov  rdx, [rdi+8]


Or, more impressively, loading both using a single instruction on ARM64:

    ldp  x0, x1, [x0]


The next step is to decode ASCII into numeric values. This is trivial and
common in SWAR, and only requires subtracting '0' (0x30). So long
as there is no overflow, this can be done lane-wise.

hi -= 0x3030303030303030;
lo -= 0x3030303030303030;


Each byte of the register now contains values in 0–9. Next, double every
other digit. Multiplication in SWAR is not easy, but doubling just means
adding the odd lanes to themselves. I can mask out the lanes that are not
doubled. Regarding the mask, recall that the least significant byte is the
first byte (little endian).

hi += hi & 0x00ff00ff00ff00ff;
lo += lo & 0x00ff00ff00ff00ff;


Each byte of the register now contains values in 0–18. Now for the tricky
problem of folding the tens place into the ones place. Unlike 8 or 16, 10
is not a particularly convenient base for computers, especially since SWAR
lacks lane-wide division or modulo. Perhaps a lane-wise binary-coded
decimal could solve this. However, I have a better trick up my
sleeve.

Consider that the tens place is either 0 or 1. In other words, we really
only care if the value in the lane is greater than 9. If I add 6 to each
lane, the 5th bit (value 16) will definitely be set in any lanes that were
previously at least 10. I can use that bit as the tens place.

hi += (hi + 0x0006000600060006)>>4 & 0x0001000100010001;
lo += (lo + 0x0006000600060006)>>4 & 0x0001000100010001;


This code adds 6 to the doubled lanes, shifts the 5th bit to the least
significant position in the lane, masks for just that bit, and adds it
lane-wise to the total. Only applying this to doubled lanes is a style
decision, and I could have applied it to all lanes for free.

The astute might notice I’ve strayed from the stated algorithm. A lane
that was holding, say, 12 now holds 13 rather than 3. Since the final
result of the algorithm is modulo 10, leaving the tens place alone is
harmless, so this is fine.
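
The trick is easier to check on a single lane in isolation. In this sketch fold_tens is a hypothetical scalar stand-in for what happens inside each doubled byte.

```c
#include <assert.h>

// One lane of the tens trick: v is a doubled digit in 0..18. The result
// may keep a stray tens place, which the final modulo 10 discards.
int fold_tens(int v)
{
    assert(v >= 0 && v <= 18);
    return v + ((v + 6) >> 4 & 1);
}
```

For every doubled digit the result agrees, modulo 10, with summing the two decimal digits: 8 stays 8, 12 becomes 13 (3 mod 10, matching 1+2), and 18 becomes 19 (9 mod 10, matching 1+8).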

At this point each lane contains values in 0–19. Now that the tens
processing is done, I can combine the halves into one register with a
lane-wise sum:

hi += lo;


Each lane contains values in 0–38. I would have preferred to do this
sooner, but that would have complicated tens place handling. Even if I had
rotated the doubled lanes in one register to even out the sums, some lanes
may still have had a 2 in the tens place.

The final step is a horizontal sum reduction using the typical SWAR
approach. Add the top half of the register to the bottom half, then the
top half of what’s left to the bottom half, etc.

hi += hi >> 32;
hi += hi >> 16;
hi += hi >>  8;


Before the sum I said each lane was 0–38, so couldn’t this sum be as high
as 304 (8x38)? It would overflow the lane, giving an incorrect result.
Fortunately the actual range is 0–18 for normal lanes and 0–38 for doubled
lanes. That’s a maximum of 224, which fits in the result lane without
overflow. Whew! I’ve been tracking the range all along to guard against
overflow like this.

Finally mask the result lane and return it modulo 10:

return (hi&255) % 10;


On my machine, SWAR is around 3x faster than a straightforward
digit-by-digit implementation.
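
Putting the steps above together gives something like the following. It is assembled from the snippets in this article rather than copied from luhn.c, the load is written as a loop in a hypothetical load64le helper, and the function is named luhn_swar here to keep it distinct.

```c
#include <assert.h>
#include <stdint.h>

// Little endian 64-bit load, written so it works on any host.
static uint64_t load64le(const char *s)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r |= (uint64_t)(s[i] & 255) << (8 * i);
    }
    return r;
}

// Luhn sum of exactly 16 ASCII digits, computed eight lanes at a time.
int luhn_swar(const char *s)
{
    assert(s != 0);
    uint64_t hi = load64le(s + 0);
    uint64_t lo = load64le(s + 8);

    // decode ASCII: lanes in 0..9
    hi -= 0x3030303030303030;
    lo -= 0x3030303030303030;

    // double every other digit: doubled lanes in 0..18
    hi += hi & 0x00ff00ff00ff00ff;
    lo += lo & 0x00ff00ff00ff00ff;

    // add the tens place to the ones place: doubled lanes in 0..19
    hi += (hi + 0x0006000600060006)>>4 & 0x0001000100010001;
    lo += (lo + 0x0006000600060006)>>4 & 0x0001000100010001;

    // combine halves, then horizontal sum into the lowest lane
    hi += lo;
    hi += hi >> 32;
    hi += hi >> 16;
    hi += hi >>  8;
    return (hi&255) % 10;
}
```

The range comments follow the overflow bookkeeping above: no lane exceeds 224 at any point, so nothing carries into a neighboring lane before the final mask.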

Usage examples

int is_valid(const char *s)
{
    return luhn(s) == 0;
}

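void random_credit_card(char *s)
{
    // rand64() is assumed to return a 64-bit unsigned random number
    sprintf(s, "%015llu0", (unsigned long long)(rand64()%1000000000000000));
    s[15] = '0' + (10 - luhn(s)) % 10;  // a sum of 0 must keep the digit at 0
}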


SIMD

Conveniently, all the SWAR operations translate directly into SSE2
instructions. If you understand the SWAR version, then this is easy to
follow:

int luhn(const char *s)
{
    __m128i r = _mm_loadu_si128((void *)s);

    // decode ASCII
    r = _mm_sub_epi8(r, _mm_set1_epi8(0x30));

    // double every other digit
    __m128i m = _mm_set1_epi16(0x00ff);
    r = _mm_add_epi8(r, _mm_and_si128(r, m));

    // extract and add tens digit
    __m128i t = _mm_set1_epi16(0x0006);
    t = _mm_add_epi8(r, t);
    t = _mm_srai_epi32(t, 4);
    t = _mm_and_si128(t, _mm_set1_epi8(1));
    r = _mm_add_epi8(r, t);

    // horizontal sum
    r = _mm_sad_epu8(r, _mm_set1_epi32(0));
    r = _mm_add_epi32(r, _mm_shuffle_epi32(r, 2));
    return _mm_cvtsi128_si32(r) % 10;
}


On my machine, the SIMD version is around another 3x increase over SWAR,
and so nearly an order of magnitude faster than a digit-by-digit
implementation.

Update: Const-me on Hacker News suggests a better option for
handling the tens digit in the function above, shaving off 7% of the
function’s run time on my machine:

    // if (digit > 9) digit -= 9
    __m128i nine = _mm_set1_epi8(9);
    __m128i gt = _mm_cmpgt_epi8(r, nine);
    r = _mm_sub_epi8(r, _mm_and_si128(gt, nine));


Update: u/aqrit on reddit has come up with a more optimized SSE2
solution, 12% faster than mine on my machine:

int luhn(const char *s)
{
    __m128i v = _mm_loadu_si128((void *)s);
    __m128i m = _mm_cmpgt_epi8(_mm_set1_epi16('5'), v);
    v = _mm_add_epi8(v, _mm_slli_epi16(v, 8));
    v = _mm_add_epi8(v, m);  // subtract 1 if less than 5
    v = _mm_sad_epu8(v, _mm_setzero_si128());
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 2));
    return (_mm_cvtsi128_si32(v) - 4) % 10;
    // (('0' * 24) - 8) % 10 == 4
}





A flexible, lightweight, spin-lock barrier
2022-03-13T23:55:08Z
This article was discussed on Hacker News.

The other day I wanted to try the famous memory reordering experiment
for myself. It’s the double-slit experiment of concurrency, where a
program can observe an “impossible” result on common hardware, as
though a thread had time-traveled. While getting thread timing as tight as
possible, I designed a possibly-novel thread barrier. It’s purely
spin-locked, the entire footprint is a zero-initialized integer, it
automatically resets, it can be used across processes, and the entire
implementation is just three to four lines of code.



Here’s the entire barrier implementation for two threads in C11.

// Spin-lock barrier for two threads. Initialize *barrier to zero.
void barrier_wait(_Atomic uint32_t *barrier)
{
    uint32_t v = ++*barrier;
    if (v & 1) {
        for (v &= 2; (*barrier&2) == v;);
    }
}


Or in Go:

func BarrierWait(barrier *uint32) {
    v := atomic.AddUint32(barrier, 1)
    if v&1 == 1 {
        v &= 2
        for atomic.LoadUint32(barrier)&2 == v {
        }
    }
}


Even more, these two implementations are compatible with each other. C
threads and Go goroutines can synchronize on a common barrier using these
functions. Also note how it only uses two bits.

When I was done with my experiment, I did a quick search online for other
spin-lock barriers to see if anyone came up with the same idea. I found a
couple of subtly-incorrect spin-lock barriers, and some
straightforward barrier constructions using a mutex spin-lock.

Before diving into how this works, and how to generalize it, let’s discuss
the circumstance that led to its design.

Experiment

Here’s the setup for the memory reordering experiment, where w0 and w1
are initialized to zero.

thread#1    thread#2
w0 = 1      w1 = 1
r1 = w1     r0 = w0


Considering all the possible orderings, it would seem that at least one of
r0 or r1 is 1. There seems to be no ordering where r0 and r1 could
both be 0. However, if raced precisely, this is a frequent or possibly
even majority occurrence on common hardware, including x86 and ARM.
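The "no ordering" claim can be checked by brute force. The sketch below is my own illustration, not part of the experiment: it assumes sequential consistency, meaning each thread's two steps run in program order and every step takes effect instantly, then enumerates all six interleavings. The names `THREADS`, `run`, and `schedules` are hypothetical.

```python
from itertools import permutations

# Each thread runs two steps in program order: a store, then a load.
THREADS = {
    1: [("store", "w0"), ("load", "w1", "r1")],
    2: [("store", "w1"), ("load", "w0", "r0")],
}

def run(schedule):
    """Execute one interleaving; schedule is a sequence of thread ids."""
    mem = {"w0": 0, "w1": 0}
    regs = {}
    pc = {1: 0, 2: 0}
    for tid in schedule:
        op = THREADS[tid][pc[tid]]
        pc[tid] += 1
        if op[0] == "store":
            mem[op[1]] = 1
        else:
            regs[op[2]] = mem[op[1]]
    return regs["r0"], regs["r1"]

# All distinct orderings of two steps from each thread (6 in total).
schedules = sorted(set(permutations((1, 1, 2, 2))))
outcomes = {run(s) for s in schedules}
print(outcomes)  # (0, 0) is absent under this model
```

Under this idealized model the pair (0, 0) never shows up, which is exactly why observing it on real hardware is evidence of reordering.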

How to go about running this experiment? These are concurrent loads and
stores, so it’s tempting to use volatile for w0 and w1. However,
this would constitute a data race — undefined behavior in at least C and
C++ — and so we couldn’t really reason much about the results, at least
not without first verifying the compiler’s assembly. These are variables
in a high-level language, not architecture-level stores/loads, even with
volatile.

So my first idea was to use a bit of inline assembly for all accesses that
would otherwise be data races. x86-64:

static int experiment(int *w0, int *w1)
{
    int r1;
    __asm volatile (
        "movl  $1, %1\n"
        "movl  %2, %0\n"
        : "=r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}


ARM64 (to try on my Raspberry Pi):

static int experiment(int *w0, int *w1)
{
    int r1 = 1;
    __asm volatile (
        "str  %w0, %1\n"
        "ldr  %w0, %2\n"
        : "+r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}


This is from the point-of-view of thread#1, but I can swap the arguments
for thread#2. I’m expecting this to be inlined, and encouraging it with
static.

Alternatively, I could use C11 atomics with a relaxed memory order:

static int experiment(_Atomic int *w0, _Atomic int *w1)
{
    atomic_store_explicit(w0, 1, memory_order_relaxed);
    return atomic_load_explicit(w1, memory_order_relaxed);
}


Since this is a race and I want both threads to run their two experiment
instructions as simultaneously as possible, it would be wise to use some
sort of starting barrier… exactly the purpose of a thread barrier! It
will hold the threads back until they’re both ready.

int w0, w1, r0, r1;

// thread#1                   // thread#2
w0 = w1 = 0;
BARRIER;                      BARRIER;
r1 = experiment(&w0, &w1);    r0 = experiment(&w1, &w0);
BARRIER;                      BARRIER;

if (!r0 && !r1) {
    puts("impossible!");
}


The second thread goes straight into the barrier, but the first thread
does a little more work to initialize the experiment and a little more at
the end to check the result. The second barrier ensures they’re both done
before checking.

Running this only once isn’t so useful, so each thread loops a few million
times, hence the re-initialization in thread#1. The barriers keep them
in lockstep.

Barrier selection

On my first attempt, I made the obvious decision for the barrier: I used
pthread_barrier_t. I was already using pthreads for spawning the
extra thread, including on Windows, so this was convenient.

However, my initial results were disappointing. I only observed an
“impossible” result around one in a million trials. With some debugging I
determined that the pthreads barrier was just too damn slow, throwing off
the timing. This was especially true with winpthreads, bundled with
Mingw-w64, which in addition to the per-barrier mutex, grabs a global
lock twice per wait to manage the barrier’s reference counter.

All pthreads implementations I used were quick to yield to the system
scheduler. The first thread to arrive at the barrier would go to sleep,
the second thread would wake it up, and it was rare they’d actually race
on the experiment. This is perfectly reasonable for a pthreads barrier
designed for the general case, but I really needed a spin-lock barrier.
That is, the first thread to arrive spins in a loop until the second
thread arrives, and it never interacts with the scheduler. This happens so
frequently and quickly that it should only spin for a few iterations.

Barrier design

Spin locking means atomics. By default, atomics have sequentially
consistent ordering and will provide the necessary synchronization for the
non-atomic experiment variables. Stores (e.g. to w0, w1) made before
the barrier will be visible to all other threads upon passing through the
barrier. In other words, the initialization will propagate before either
thread exits the first barrier, and results propagate before either thread
exits the second barrier.

I know statically that there are only two threads, simplifying the
implementation. The plan: When threads arrive, they atomically increment a
shared variable to indicate such. The first to arrive will see an odd
number, telling it to atomically read the variable in a loop until the
other thread changes it to an even number.

At first with just two threads this might seem like a single bit would
suffice. If the bit is set, the other thread hasn’t arrived. If clear,
both threads have arrived.

void broken_wait1(_Atomic unsigned *barrier)
{
    ++*barrier;
    while (*barrier&1);
}

Or to avoid an extra load, use the result directly:

void broken_wait2(_Atomic unsigned *barrier)
{
    if (++*barrier & 1) {
        while (*barrier&1);
    }
}


Neither of these works correctly, and the other mutex-free barriers I found
all have the same defect. Consider the broader picture: Between atomic
loads in the first thread's spin-lock loop, suppose the second thread
arrives, passes through the barrier, does its work, hits the next barrier,
and increments the counter. Both threads see an odd counter simultaneously
and deadlock. No good.
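To make that interleaving concrete, here is a single-threaded replay of it. This is only a sketch: the `Counter` class is a hypothetical stand-in for the shared atomic integer, and the "threads" are just a hand-ordered sequence of operations.

```python
class Counter:
    """Stand-in for the shared atomic integer (hypothetical helper)."""
    def __init__(self):
        self.value = 0

    def increment(self):          # models ++*barrier
        self.value += 1
        return self.value

    def load(self):               # models an atomic load of *barrier
        return self.value

def would_keep_spinning(barrier):
    """The spin condition shared by broken_wait1 and broken_wait2."""
    return barrier.load() & 1 == 1

barrier = Counter()
trace = []

barrier.increment()                          # A arrives at barrier #1: counter = 1
trace.append(would_keep_spinning(barrier))   # A sees odd and spins, as intended

barrier.increment()                          # B arrives at barrier #1: counter = 2
# B sees an even counter and passes. A has not loaded again yet.
barrier.increment()                          # B arrives at barrier #2: counter = 3
trace.append(would_keep_spinning(barrier))   # A finally loads: odd, keeps spinning
trace.append(would_keep_spinning(barrier))   # B loads: odd, spins as well

# Neither thread will increment again, so both spin forever.
deadlocked = trace[1] and trace[2]
print(deadlocked)
```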

To fix this, the wait function must also track the phase. The first
barrier is the first phase, the second barrier is the second phase, etc.
Conveniently the rest of the integer acts like a phase counter!
Writing this out more explicitly:

void barrier_wait(_Atomic unsigned *barrier)
{
    unsigned observed = ++*barrier;
    unsigned thread_count = observed & 1;
    if (thread_count != 0) {
        // not last arrival, watch for phase change
        unsigned init_phase = observed >> 1;
        for (;;) {
            unsigned current_phase = *barrier >> 1;
            if (current_phase != init_phase) {
                break;
            }
        }
    }
}


The key: When the last thread arrives, it overflows the thread counter to
zero and increments the phase counter in one operation.

By the way, I’m using unsigned since it may eventually overflow, and
even _Atomic int overflow is undefined for the ++ operator. However,
if you use atomic_fetch_add or C++ std::atomic then overflow is
defined and you can use int.

Threads can never be more than one phase apart by definition, so only one
bit is needed for the phase counter, making this effectively a two-phase,
two-bit barrier. In my final implementation, rather than shift (>>), I
mask (&) the phase bit with 2.
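One way to gain confidence in the two-bit design is to explore every interleaving of a small model. The following is a sketch under my own assumptions: each increment and each spin-loop load is one indivisible step, both threads pass through `ROUNDS` barriers, and the names `step` and `explore` are hypothetical. It looks for states where nobody can make progress and records how far apart the threads ever get.

```python
ROUNDS = 3  # barriers each thread passes through (arbitrary small bound)

def step(counter, thread):
    """Advance one thread by one atomic operation.

    A thread is (rounds_done, waiting_phase). waiting_phase is None when
    the thread is about to arrive, else the phase bit it saw on arrival.
    """
    done, phase = thread
    if phase is None:                        # v = ++*barrier
        counter = (counter + 1) & 0xFFFFFFFF
        if counter & 1:                      # first to arrive: start spinning
            return counter, (done, counter & 2)
        return counter, (done + 1, None)     # last to arrive: pass through
    if (counter & 2) != phase:               # the load saw a phase change
        return counter, (done + 1, None)
    return counter, thread                   # still spinning, nothing changes

def explore():
    """Visit every reachable state; return (max_skew, stuck_states)."""
    start = (0, (0, None), (0, None))
    seen, todo = {start}, [start]
    max_skew, stuck = 0, []
    while todo:
        state = todo.pop()
        counter, threads = state[0], state[1:]
        max_skew = max(max_skew, abs(threads[0][0] - threads[1][0]))
        progressed = False
        for i in (0, 1):
            if threads[i][0] == ROUNDS:
                continue                     # this thread has finished
            new_counter, new_thread = step(counter, threads[i])
            nxt = list(threads)
            nxt[i] = new_thread
            nxt_state = (new_counter, *nxt)
            if nxt_state != state:
                progressed = True
                if nxt_state not in seen:
                    seen.add(nxt_state)
                    todo.append(nxt_state)
        finished = all(t[0] == ROUNDS for t in threads)
        if not progressed and not finished:
            stuck.append(state)
    return max_skew, stuck

max_skew, stuck = explore()
print(max_skew, stuck)
```

In this model no stuck state is reachable and the threads never differ by more than one completed barrier, which matches the one-bit phase argument. It says nothing about memory ordering on real hardware.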

With this spin-lock barrier, the experiment observes r0 = r1 = 0 in ~10%
of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.

Generalizing to more threads

Two threads required two bits. This generalizes to log2(n)+1 bits for
n threads, where n is a power of two. You may have already figured out
how to support more threads: spend more bits on the thread counter.

// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
void barrier_waitn(_Atomic unsigned *barrier, int n)
{
    unsigned v = ++*barrier;
    if (v & (n - 1)) {
        for (v &= n; (*barrier&n) == v;);
    }
}


Note: It never makes sense for n to exceed the logical core count!
If it does, then at least one thread must not be actively running. The
spin-lock ensures it does not get scheduled promptly, and the barrier will
waste lots of resources doing nothing in the meantime.

If the barrier is used little enough that you won’t overflow the overall
barrier integer — maybe just use a uint64_t — an implementation could
support arbitrary thread counts with the same principle using modular
division instead of the & operator. The denominator is ideally a
compile-time constant in order to avoid paying for division in the
spin-lock loop.
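The modular variant can be sketched with ordinary Python threads. Treat this as a model of the arithmetic only: Python has no atomic integers, so a lock stands in for the atomic increment and load, the spin loop yields with `time.sleep(0)` so the interpreter can switch threads (a true spin-lock would not), and `SpinBarrier` and `run` are hypothetical names.

```python
import threading
import time

class SpinBarrier:
    """Counter-based barrier for any thread count n (hypothetical helper)."""
    def __init__(self, n):
        self.n = n
        self.counter = 0              # Python ints never overflow
        self.lock = threading.Lock()  # emulates atomic access

    def _increment(self):
        with self.lock:
            self.counter += 1
            return self.counter

    def _load(self):
        with self.lock:
            return self.counter

    def wait(self):
        v = self._increment()
        if v % self.n != 0:                   # not the last arrival
            phase = v // self.n
            while self._load() // self.n == phase:
                time.sleep(0)                 # yield; a real spin-lock would not

def run(n_threads=3, rounds=100):
    barrier = SpinBarrier(n_threads)
    arrivals = [0] * rounds                   # threads that reached each round
    violations = []

    def worker():
        for r in range(rounds):
            with barrier.lock:
                arrivals[r] += 1
            barrier.wait()
            # Past the barrier, every thread must have arrived at round r.
            with barrier.lock:
                if arrivals[r] != n_threads:
                    violations.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations, barrier.counter

violations, counter = run()
print(violations, counter)
```

Here the remainder `v % n` plays the role of the thread counter and the quotient `v // n` the role of the phase, which is why the whole integer must not wrap.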

While C11 _Atomic seems like it would be useful, unsurprisingly it is
not supported by one major, stubborn implementation. If you’re
using C++11 or later, then go ahead and use std::atomic since it’s
well-supported. In real, practical C programs, I will continue using dual
implementations: interlocked functions on MSVC, and GCC built-ins (also
supported by Clang) everywhere else.

#if __GNUC__
#  define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
#  define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
#  define BARRIER_INC(x) _InterlockedIncrement(x)
#  define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif

// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
static void barrier_wait(int *barrier, int n)
{
    int v = BARRIER_INC(barrier);
    if (v & (n - 1)) {
        for (v &= n; (BARRIER_GET(barrier)&n) == v;);
    }
}


This has the nice bonus that the interface does not have the _Atomic
qualifier, nor std::atomic template. It’s just a plain old int, making
the interface simpler and easier to use. It’s something I’ve grown to
appreciate from Go.

If you’d like to try the experiment yourself: reorder.c. If
you’d like to see a test of Go and C sharing a thread barrier:
coop.go.

I’m intentionally not providing the spin-lock barrier as a library. First,
it’s too trivial and small for that, and second, I believe context is
everything. Now that you understand the principle, you can whip up
your own, custom-tailored implementation when the situation calls for it,
just as the one in my experiment is hard-coded for exactly two threads.




Compressing and embedding a Wordle word list
2022-03-07T03:22:41Z
Wordle is all the rage, resulting in an explosion of hobbyist clones,
with new ones appearing every day. At the current rate I estimate by the
end of 2022 that 99% of all new software releases will be Wordle clones.
That’s no surprise since the rules are simple, it’s more fun to implement
and study than to actually play, and the hard part is building a decent
user interface. Such implementations go back at least 30 years.
Implementers get to decide on a platform, language, and the particular
subject of this article: how to handle the word list. Is it a separate
file/database or embedded in the program? If embedded, is it
worth compressing? In this article I’ll present a simple, tailored Wordle
list compression strategy that beats general purpose compressors.

Last week one particular QuickBASIC clone, WorDOSle, caught my
eye. It embeds its word list despite the dire constraints of its 16-bit
platform. The original Wordle list (1, 2) has 12,972 words which,
naively stored, would consume 77,832 bytes (5 letters, plus newline).
Sadly this exceeds a 16-bit address space. Eliminating the redundant
newline delimiter brings it down to 64,860 bytes — just small enough to
fit in an 8086 segment, but probably still difficult to manage from
QuickBASIC.

The author made a trade-off, reducing the word list to a more manageable,
if meager, 2,318 words, wisely excluding delimiters. Otherwise no further
effort was made towards reducing the size. The list is sorted, and the program
cleverly tests words against the list in place using a binary search.

Compaction baseline

Before getting into any real compression technologies, there’s low hanging
fruit to investigate. Words are exactly five, case-insensitive, English
language letters: a–z. To illustrate, here are the first 100 5-letter
words from a short Wordle word list.

abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands


In ASCII/UTF-8 form it’s 8 bits per letter, 5 bytes per word, but I only
need 5 bits per letter, or more specifically, ~4.7 bits (log2(26)) per
letter. If I instead treat each word as a base-26 number, I can pack each
word into 3 bytes (26**5 is ~23.5 bits). A 40% savings just by using a
smarter representation.

With 12,972 words, that’s 38,916 bytes for the whole list. Any
compression I apply must at least beat this size in order to be worth
using.
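As an illustration of the compact form, here is one way the base-26 packing could look. The function names `pack` and `unpack` and the big-endian byte order are my own choices for this sketch, not something the article specifies.

```python
def pack(word):
    """Encode a 5-letter lowercase word as 3 bytes (base-26, big-endian)."""
    n = 0
    for c in word:
        n = n * 26 + (ord(c) - ord("a"))
    return n.to_bytes(3, "big")

def unpack(data):
    """Decode 3 bytes produced by pack() back into the word."""
    n = int.from_bytes(data, "big")
    letters = []
    for _ in range(5):
        n, r = divmod(n, 26)
        letters.append(chr(r + ord("a")))
    return "".join(reversed(letters))

print(pack("abbey").hex(), unpack(pack("abbey")))
```

With the most significant letter first and big-endian bytes, packed words compare in the same order as the original strings, so a sorted list stays sorted and binary search keeps working on the 3-byte records.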

Letter frequency

Not all letters occur at the same frequency. Here’s the letter frequency
for the original Wordle word list:

a:5990  e:6662  i:3759  m:1976  q: 112  u:2511  y:2074
b:1627  f:1115  j: 291  n:2952  r:4158  v: 694  z: 434
c:2028  g:1644  k:1505  o:4438  s:6665  w:1039
d:2453  h:1760  l:3371  p:2019  t:3295  x: 288


When encoding a word, I can save space by spending fewer bits on frequent
letters like e at the cost of spending more bits on infrequent letters
like q. There are multiple approaches, but the simplest is Huffman
coding. It’s not the most efficient, but it’s so easy I can
almost code it in my sleep.

While my ultimate target is C, I did the frequency analysis, explored the
problem space, and implemented my compressors in Python. I don’t normally
like to use Python, but it is good for one-shot, disposable data
science-y stuff like this. The decompressor will be implemented in C,
partially via meta-programming: Python code generating my C code. Here’s
my letter histogram code:

import collections, itertools, sys

words = [line[:5] for line in sys.stdin]
hist = collections.defaultdict(int)
for c in itertools.chain(*words):
    hist[c] += 1


To build a Huffman coding tree, I’ll need a min-heap (priority queue)
initially filled with nodes representing each letter and its frequency.
While the heap has more than one element, I pop off the two lowest
frequency nodes, create a new parent node with the sum of their
frequencies, and push it into the heap. When the heap has one element, the
remaining element is the root of the Huffman coding tree.

def huffman(hist):
    heap = [(n, c) for c, n in hist.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        node = a[0]+b[0], (a[1], b[1])
        heapq.heappush(heap, node)
    return heap[0][1]

tree = huffman(hist)


(By the way, I love that heapq operates directly on a plain list
rather than being its own data structure.) This produces the following
Huffman coding tree (via pprint):

((('e', 's'),
  (('t', 'l'), (('g', ('v', 'w')), ('h', 'm')))),
 ((('i', ('p', 'c')),
   ('r', ('y', ('f', ('z', ('j', ('q', 'x'))))))),
  (('o', ('d', 'u')), ('a', ('n', ('k', 'b'))))))


It would be more useful to actually see the encodings.

def flatten(tree, prefix=""):
    if isinstance(tree, tuple):
        return flatten(tree[0], prefix+"0") + \
               flatten(tree[1], prefix+"1")
    else:
        return [(tree, prefix)]


I used isinstance to distinguish leaves (str) from internal nodes
(tuple). With sorted(flatten(tree)), I get something like Morse Code:

[('a', '1110'),       ('j', '10111110'),   ('s', '001'),
 ('b', '111111'),     ('k', '111110'),     ('t', '0100'),
 ('c', '10011'),      ('l', '0101'),       ('u', '11011'),
 ('d', '11010'),      ('m', '01111'),      ('v', '011010'),
 ('e', '000'),        ('n', '11110'),      ('w', '011011'),
 ('f', '101110'),     ('o', '1100'),       ('x', '101111111'),
 ('g', '01100'),      ('p', '10010'),      ('y', '10110'),
 ('h', '01110'),      ('q', '101111110'),  ('z', '1011110')]
 ('i', '1000'),       ('r', '1010'),


In terms of encoded bit length, what is the shortest and longest?

codes = dict(flatten(tree))
lengths = [(sum(len(codes[c]) for c in w), w) for w in words]


min(lengths) is “esses” at 15 bits, and max(lengths) is “qajaq” at 34
bits. In other words, the worst case is worse than the compact, 24-bit
representation! However, the total is better: sum(w[0] for w in lengths)
reports 281,956 bits, or 35,245 bytes. Packed appropriately, that shaves
off ~3.5kB, though it comes at the cost of losing random access, and
therefore binary search.
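The totals can be cross-checked without the word list, using only the letter histogram and the code table printed above: every occurrence of a letter costs the length of its code. This sketch copies both tables from the article; the helper `word_bits` is a hypothetical name.

```python
# Letter counts and Huffman codes copied from the tables in this article.
hist = {
    "a": 5990, "b": 1627, "c": 2028, "d": 2453, "e": 6662, "f": 1115,
    "g": 1644, "h": 1760, "i": 3759, "j": 291, "k": 1505, "l": 3371,
    "m": 1976, "n": 2952, "o": 4438, "p": 2019, "q": 112, "r": 4158,
    "s": 6665, "t": 3295, "u": 2511, "v": 694, "w": 1039, "x": 288,
    "y": 2074, "z": 434,
}
codes = {
    "a": "1110", "b": "111111", "c": "10011", "d": "11010", "e": "000",
    "f": "101110", "g": "01100", "h": "01110", "i": "1000",
    "j": "10111110", "k": "111110", "l": "0101", "m": "01111",
    "n": "11110", "o": "1100", "p": "10010", "q": "101111110",
    "r": "1010", "s": "001", "t": "0100", "u": "11011", "v": "011010",
    "w": "011011", "x": "101111111", "y": "10110", "z": "1011110",
}

def word_bits(word):
    """Encoded length of one word in bits."""
    return sum(len(codes[c]) for c in word)

total_letters = sum(hist.values())                     # 12,972 words * 5
total_bits = sum(hist[c] * len(codes[c]) for c in hist)
total_bytes = -(-total_bits // 8)                      # round up
print(total_letters, total_bits, total_bytes)
print(word_bits("esses"), word_bits("qajaq"))
```

The average works out to roughly 4.35 bits per letter, a little under the ~4.7 bits of the fixed-width base-26 form.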

Speaking of bit packing, I’m ready to compress the entire word list into a
bit stream:

bits = "".join("".join(codes[c] for c in w) for w in words)


Where bits begins with:

11101110011100001101011101110010110001000111011101...


On the C side I’ll pack these into 32-bit integers, least significant bit
first. I abused textwrap to dice it up, and I also need to reverse each
set of bits before converting to an integer.

u32 = [int(b[::-1], 2) for b in textwrap.wrap(bits, width=32)]
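To see what the reversal buys, here is a small sketch using only the 50-bit prefix shown above. It slices the string directly rather than going through textwrap, and the names `pack_lsb_first` and `bit_at` are hypothetical. The point is that stream bit n ends up at bit position n mod 32 of integer n div 32.

```python
# First 50 bits of the stream, copied from the article.
prefix = "11101110011100001101011101110010110001000111011101"

def pack_lsb_first(bits, width=32):
    """Split a bit string into chunks; the first bit of each lands in bit 0."""
    chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
    return [int(chunk[::-1], 2) for chunk in chunks]

def bit_at(packed, n, width=32):
    """Read stream bit n back out of the packed integers."""
    return (packed[n // width] >> (n % width)) & 1

packed = pack_lsb_first(prefix)
print([f"0x{u:08x}" for u in packed])

# Reading the bits back in order reproduces the original stream.
recovered = "".join(str(bit_at(packed, n)) for n in range(len(prefix)))
print(recovered == prefix)
```

The first integer comes out as 0x4eeb0e77, matching the first entry of the generated C table, and a short trailing chunk simply leaves its upper bits zero.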


I now have my compressed data as a sequence of 32-bit integers. Next, some
meta-programming:

print(f"static const uint32_t words[{len(u32)}] =", "{", end="")
for i, u in enumerate(u32):
    if i%6 == 0:
        print("\n    ", end="")
    print(f"0x{u:08x},", end="")
print("\n};")


That produces a C table, the beginnings of my decompressor. The array
length isn’t necessary since the C compiler can figure it out, but being
explicit allows human readers to know the size at a glance, too. Observe
how the final 32-bit integer isn’t entirely filled.

static const uint32_t words[8812] = {
    0x4eeb0e77,0xb8caee23,0xffb892bb,0x397fddf2,0xddfcbfee,0x5ff7997f,
    // ...
    0x7b4e66bd,0x35ebcccd,0x8f9af60f,0x0000000c,
};


Now, how to go about building the rest of the decompressor? I have a
Huffman coding tree, which is an awful lot like a state machine,
eh? I can even have Python generate a state transition table from the
Huffman tree:

def transitions(tree, states, state):
    if isinstance(tree, tuple):
        child = len(states)
        states[state] = -child
        states.extend((None, None))
        transitions(tree[0], states, child+0)
        transitions(tree[1], states, child+1)
    else:
        states[state] = ord(tree)
    return states

states = transitions(tree, [None], 0)


The central idea: positive entries are leaves, and negative entries are
internal nodes. The negated value is the index of the left child, with the
right child immediately following. In transitions, the caller reserves
space in the state table for callees, hence starting with [None]. I’ll
show the actual table in C form after some more meta-programming:

print(f"static const int8_t states[{len(states)}] =", "{", end="")
for i, s in enumerate(states):
    if i%12 == 0:
        print("\n    ", end="")
    print(f"{s:4},", end="")
print("\n};")


I chose int8_t since I know these values will all fit in an octet, and
it must be signed because of the negatives. The result:

static const int8_t states[51] = {
      -1,  -3, -19,  -5,  -7, 101, 115,  -9, -11, 116, 108, -13,
     -17, 103, -15, 118, 119, 104, 109, -21, -39, -23, -27, 105,
     -25, 112,  99, 114, -29, 121, -31, 102, -33, 122, -35, 106,
     -37, 113, 120, -41, -45, 111, -43, 100, 117,  97, -47, 110,
     -49, 107,  98,
};


The first node is -1, meaning if you read a 0 bit then transition to state
1, else state 2 (i.e. immediately following 1). The decompressor reads one
bit at a time, walking the state table until it hits a positive value,
which is an ASCII code. I’ve decided on this function prototype:

int32_t next(char word[5], int32_t n);


The n is the bit index, which starts at zero. The function decodes the
word at the given index, then returns the bit index for the next word.
Callers can iterate the entire word list without decompressing the whole
list at once. Finally the decompressor code:

int32_t next(char word[5], int32_t n)
{
    for (int i = 0; i < 5; i++) {
        int state = 0;
        for (; states[state] < 0; n++) {
            int b = words[n>>5]>>(n&31) & 1;  // next bit
            state = b - states[state];
        }
        word[i] = states[state];
    }
    return n;
}


When compiled, this is about 80 bytes of instructions, both x86-64 and
ARM64. This, along with the 51 bytes for the state table, should be
counted against the compression size. That’s 35,376 bytes total.

Trying it out, this program indeed reproduces the original word list:

int main(void)
{
    int32_t state = 0;
    char word[] = ".....\n";
    for (int i = 0; i < 12972; i++) {
        state = next(word, state);
        fwrite(word, 6, 1, stdout);
    }
}
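The table walk is easy to try without the full data. This is a Python transliteration of the decoder, offered as a sketch: the state table and the first two packed integers are copied from the listings above, which gives 64 bits, just enough for the first three words. `next_word` is a hypothetical name.

```python
# State table and first two packed integers, copied from the article.
STATES = [
     -1,  -3, -19,  -5,  -7, 101, 115,  -9, -11, 116, 108, -13,
    -17, 103, -15, 118, 119, 104, 109, -21, -39, -23, -27, 105,
    -25, 112,  99, 114, -29, 121, -31, 102, -33, 122, -35, 106,
    -37, 113, 120, -41, -45, 111, -43, 100, 117,  97, -47, 110,
    -49, 107,  98,
]
WORDS = [0x4eeb0e77, 0xb8caee23]

def next_word(n):
    """Decode one word starting at bit index n; return (word, next_n)."""
    letters = []
    for _ in range(5):
        state = 0
        while STATES[state] < 0:
            bit = (WORDS[n >> 5] >> (n & 31)) & 1   # next bit, LSB first
            state = bit - STATES[state]             # left child, or left+1
            n += 1
        letters.append(chr(STATES[state]))          # leaf: an ASCII code
    return "".join(letters), n

n = 0
decoded = []
for _ in range(3):
    word, n = next_word(n)
    decoded.append(word)
print(decoded, n)
```

The returned bit index is the only state a caller needs to carry between words, which is what makes word-at-a-time iteration possible in this scheme.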


Searching 12,972 words linearly isn’t too bad, even for an old 16-bit
machine. However, if you really need to speed it up, you could build a
little run time index to track various bit positions in the list. For
example, the first word starting with b is at bit offset 15,743. If the
word I’m looking up begins with b then I can start there and stop at the
first c, decompressing just 909 words.

Taking it to the next level: run-length encoding

Here’s the 100-word word list sample again. The sorting is deliberate:

abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands


If I look at words column-wise, I see a long run of a, then a long run
of b, etc. Even the second column has long runs. I should really exploit
this somehow. The first scheme would have worked equally well on a
shuffled list as a sorted list, which is an indication that it’s storing
unnecessary information, namely the word list order. (Rule of thumb:
Compression should work better on sorted inputs.)

For this second scheme, I’ll pivot the whole list so that I can encode it
in column-order. (This is roughly how one part of bzip2 works, by the
way.) I’ll use run-length encoding (RLE) to communicate “91 ‘a’, 135 ‘b’,
etc.”, then I’ll encode these RLE tokens using Huffman coding, per the
first scheme, since there will be lots of repeated tokens.

First, pivot the word list:

pivot = "".join("".join(w[i] for w in words) for i in range(5))


Next compute the RLE token stream. The stream works in pairs, first
indicating a letter (1–26), then the run length.

tokens = []
offset = 0
while offset < len(pivot):
    c = pivot[offset]
    start = offset
    while offset < len(pivot) and pivot[offset] == c:
        offset += 1
    tokens.append(ord(c) - ord('a') + 1)
    tokens.append(offset - start)


I’ve biased the letter representation by 1 — i.e. 1–26 instead of 0–25 —
since I’m going to encode all the tokens using the same Huffman tree.
(Exercise for the reader: Does compression improve with two distinct
Huffman trees, one for letters and the other for runs?) There are no
zero-length runs, and I want there to be as few unique tokens as possible.

tokens looks like so (e.g. 737 ‘a’, 909 ‘b’, …):

[1, 737, 2, 909, 3, 922, 4, 685, 5, 303, 6, 598, ...]
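The pivot and RLE steps are easy to try on a small input. This sketch uses the first ten words of the sample list and swaps the explicit loop for `itertools.groupby`, which is my substitution rather than the code used here. `expand` is a hypothetical helper that inverts the encoding.

```python
from itertools import groupby

# First ten words of the sample list shown in the article.
sample = ["abbey", "abide", "about", "above", "abuse",
          "acids", "acorn", "acres", "acted", "actor"]

# Column-major pivot: all first letters, then all second letters, etc.
pivot = "".join("".join(w[i] for w in sample) for i in range(5))

# Flattened (letter, run length) pairs, with letters biased to 1..26.
tokens = []
for letter, run in groupby(pivot):
    tokens.append(ord(letter) - ord("a") + 1)
    tokens.append(len(list(run)))

def expand(tokens):
    """Invert the RLE: rebuild the pivoted string from token pairs."""
    pairs = zip(tokens[0::2], tokens[1::2])
    return "".join(chr(t + ord("a") - 1) * n for t, n in pairs)

print(pivot)
print(tokens)
```

Even on ten words the first two columns collapse into three runs, while the later columns are mostly runs of length one, which is where Huffman coding of the tokens has to do the work.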


The original Wordle list results in 139 unique tokens. A few tokens appear
many times, but most of them appear only once. Reusing my Huffman coding tree
builder from before:

tree = huffman(collections.Counter(tokens))


This makes for a more complex and interesting tree:

(1,
 ((((18, 20), (25, (((10, 24), (26, 22)), 8))),
   (5,
    ((11,
      ((23,
        ((17,
          (((35, (46, 76)), ((82, 93), (104, 111))),
           (((165, 168), 27), (28, (((30, 39), 31), 38))))),
         ((((((40, 41), ((44, 48), 45)),
             ((53, (54, 56)), 55)),
            ((((57, 59), 58), ((60, 61), (62, 63))),
             ((64, (65, 66)), ((67, 70), 68)))),
           (((((71, 75), 74), (77, (78, 79))),
             (((80, 85), 87), 81)),
            ((((90, 91), (92, 97)), (96, (99, 100))),
             (((101, 103), 102),
              ((105, 106), (109, 110)))))),
          ((((((113, 114), 117), ((120, 121), (125, 129))),
             (((130, 133), (137, 139)), (138, (140, 142)))),
            ((((144, 145), (147, 153)), (148, (166, 175))),
             (((181, 183), (187, 189)),
              ((193, 202), (220, 242))))),
           (((((262, 303), (325, 376)),
              ((413, 489), (577, 598))),
             (((628, 638), (685, 693)),
              ((737, 815), (859, 909)))),
            ((((922, 1565), 29), 32), (34, (33, 43)))))))),
       6)),
     3))),
  ((19, 2),
   ((4, (15, (21, 16))), ((14, 9), (12, (13, 7)))))))


Peeking at the first 21 elements of sorted(flatten(tree)), which chops
off the long tail of large-valued, single-occurrence tokens:

[(1, '0'),            (8, '100111'),       (15, '111010'),
 (2, '1101'),         (9, '111101'),       (16, '1110111'),
 (3, '10111'),        (10, '10011000'),    (17, '1011010100'),
 (4, '11100'),        (11, '101100'),      (18, '10000'),
 (5, '1010'),         (12, '111110'),      (19, '1100'),
 (6, '1011011'),      (13, '1111110'),     (20, '10001'),
 (7, '1111111'),      (14, '111100'),      (21, '1110110')]


Huffman-encoding the RLE stream is more straightforward:

codes = dict(flatten(tree))
bits = "".join(codes[token] for token in tokens)


This time len(bits) is 164,958, or 20,620 bytes! A huge difference,
around 40% additional savings!

Slicing and dicing 32-bit integers and printing the table works the same
as before. However, this time the state table has larger values (e.g. that
run of 909), and so the state table will be int16_t. I copy-pasted the
original meta-programming code and made the appropriate adjustments:

static const int16_t states[277] = {
      -1,   1,  -3,  -5,-257,  -7, -21,  -9, -11,  18,  20,  25,
     -13, -15,   8, -17, -19,  10,  24,  26,  22,   5, -23, -25,
       3,  11, -27, -29,   6,  23, -31, -33, -63,  17, -35, -37,
     -49, -39, -43,  35, -41,  46,  76, -45, -47,  82,  93, 104,
     111, -51, -55, -53,  27, 165, 168,  28, -57, -59,  38, -61,
      31,  30,  39, -65,-155, -67,-109, -69, -85, -71, -79, -73,
     -75,  40,  41, -77,  45,  44,  48, -81,  55,  53, -83,  54,
      56, -87, -99, -89, -93, -91,  58,  57,  59, -95, -97,  60,
      61,  62,  63,-101,-105,  64,-103,  65,  66,-107,  68,  67,
      70,-111,-129,-113,-123,-115,-119,-117,  74,  71,  75,  77,
    -121,  78,  79,-125,  81,-127,  87,  80,  85,-131,-143,-133,
    -139,-135,-137,  90,  91,  92,  97,  96,-141,  99, 100,-145,
    -149,-147, 102, 101, 103,-151,-153, 105, 106, 109, 110,-157,
    -213,-159,-185,-161,-173,-163,-167,-165, 117, 113, 114,-169,
    -171, 120, 121, 125, 129,-175,-181,-177,-179, 130, 133, 137,
     139, 138,-183, 140, 142,-187,-199,-189,-195,-191,-193, 144,
     145, 147, 153, 148,-197, 166, 175,-201,-207,-203,-205, 181,
     183, 187, 189,-209,-211, 193, 202, 220, 242,-215,-245,-217,
    -231,-219,-225,-221,-223, 262, 303, 325, 376,-227,-229, 413,
     489, 577, 598,-233,-239,-235,-237, 628, 638, 685, 693,-241,
    -243, 737, 815, 859, 909,-247,-253,-249,  32,-251,  29, 922,
    1565,  34,-255,  33,  43,-259,-261,  19,   2,-263,-269,   4,
    -265,  15,-267,  21,  16,-271,-273,  14,   9,  12,-275,  13,
       7,
};


(Since 277 is prime it will never wrap to a nice rectangle no matter what
width I plug in. Ugh.)

With column-wise compression it’s not possible to iterate a word at a
time. The entire list must be decompressed at once. The interface now
looks like so, where the caller supplies a 12972*5-byte buffer to be
filled:

void decompress(char *);


Exercise for the reader: Modify this to decompress into the 24-bit compact
form, so the caller only needs a 12972*3-byte buffer.

Here’s my decoder, much like before:

void decompress(char *buf)
{
    for (int32_t x = 0, y = 0, i = 0; i < 164958;) {
        // Decode letter
        int state = 0;
        for (; states[state] < 0; i++) {
            int b = words[i>>5]>>(i&31) & 1;
            state = b - states[state];
        }
        int c = states[state] + 96;

        // Decode run-length
        state = 0;
        for (; states[state] < 0; i++) {
            int b = words[i>>5]>>(i&31) & 1;
            state = b - states[state];
        }
        int len = states[state];

        // Fill columns
        for (int n = 0; n < len; n++, y++) {
            buf[y*5+x] = c;
        }
        if (y == 12972) {
            y = 0;
            x++;
        }
    }
}


And my new test exactly reproduces the original list:

int main(void)
{
    char buf[12972*5L];
    decompress(buf);

    char word[] = ".....\n";
    for (int i = 0; i < 12972; i++) {
        memcpy(word, buf+i*5, 5);
        fwrite(word, 6, 1, stdout);
    }
}
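The column fill can be sketched on its own, leaving out the Huffman layer. Below, `fill_columns` is a hypothetical helper that takes already-decoded token pairs. One deliberate difference from the C decoder: the wrap check sits inside the fill loop, so a run that crosses a column boundary is handled. The C version checks only after each run, which evidently suffices for the article's list since its test reproduces it.

```python
def fill_columns(tokens, nwords, width=5):
    """Rebuild row-major words from flattened (letter, run) token pairs."""
    buf = [""] * (nwords * width)
    x = y = 0                          # x: column, y: row (word index)
    for letter, run in zip(tokens[0::2], tokens[1::2]):
        c = chr(letter + 96)           # 1..26 -> 'a'..'z'
        for _ in range(run):
            buf[y * width + x] = c
            y += 1
            if y == nwords:            # column full: move to the next one
                y = 0
                x += 1
    return ["".join(buf[i:i + width]) for i in range(0, len(buf), width)]

# Tokens for abbey, abide, about, above. The run of five 'b' covers the
# whole second column plus the first letter of the third.
tokens = [1, 4, 2, 5, 9, 1, 15, 2, 5, 1, 4, 1, 21, 1, 22, 1,
          25, 1, 5, 1, 20, 1, 5, 1]
words = fill_columns(tokens, nwords=4)
print(words)
```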


Totalling it up:


  Compressed data is 20,620 bytes
  State table is 554 bytes
  Decompressor is about 200 bytes


That’s a total of 21,374 bytes. Surprisingly this beats general purpose
compressors!

PROGRAM     VERSION   SIZE
bzip2 -9    1.0.8     33,752
gzip -9     1.10      30,338
zstd -19    1.4.8     27,098
brotli -9   1.0.9     26,031
xz -9e      5.2.5     16,656
lzip -9     1.22      16,608


Only xz and lzip come out ahead on the raw compressed data, but lose
if accounting for an embedded decompressor (on the order of 10kB). Clearly
there’s an advantage to customizing compression to a particular dataset.

Update: Johannes Rudolph has pointed out a compression scheme for
a Game Boy Wordle clone last month that gets it down to 17,871 bytes,
and supports iteration. I improved on this scheme to further
reduce it to 16,659 bytes.




The wild west of Windows command line parsing
2022-02-18T03:52:12Z
I’ve been experimenting again lately with writing software without a
runtime aside from the operating system itself, both on Linux and
Windows. Another way to look at it: I write and embed a bespoke, minimal
runtime within the application. One of the runtime’s core jobs is
retrieving command line arguments from the operating system. On Windows
this is a deeper rabbit hole than I expected, and far more complex than I
realized. There is no standard, and every runtime does it a little
differently. Five different applications may see five different sets of
arguments — even different argument counts — from the same input, and this
is before any sort of option parsing. It’s truly a modern day Tower of
Babel: “Confound their command line parsing, that they may not understand
one another’s arguments.”

Unix-like systems pass the argv array directly from parent to child. On
Linux it’s literally copied onto the child’s stack just above the stack
pointer on entry. The runtime just bumps the stack pointer address a few
bytes and calls it argv. Here’s a minimalist x86-64 Linux runtime in
just 6 instructions (22 bytes):

_start: mov   edi, [rsp]     ; argc
        lea   rsi, [rsp+8]   ; argv
        call  main
        mov   edi, eax
        mov   eax, 60        ; SYS_exit
        syscall


It’s 5 instructions (20 bytes) on ARM64:

_start: ldr  w0, [sp]        ; argc
        add  x1, sp, 8       ; argv
        bl   main
        mov  w8, 93          ; SYS_exit
        svc  0


On Windows, argv is passed in serialized form as a string. That’s how
MS-DOS did it (via the Program Segment Prefix), because that’s how
CP/M did it. It made more sense when processes were mostly launched
directly by humans: The string was literally typed by a human operator,
and somebody has to parse it after all. Today, processes are nearly
always launched by other programs, but despite this, must still serialize
the argument array into a string as though a human had typed it out.
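For a feel of what that serialization looks like, Python's standard library exposes the routine it uses when launching processes on Windows, `subprocess.list2cmdline`, which follows the quoting rules of Microsoft's C runtime. It is importable on any platform. The argument values below are made up for illustration.

```python
import subprocess

# A hypothetical argument array a parent process wants to pass along.
argv = ["prog.exe", "plain", "two words", 'say "hi"']

# Serialize the array into one command line string, as a launcher
# targeting a C-runtime-style parser would have to.
cmdline = subprocess.list2cmdline(argv)
print(cmdline)
```

Arguments with spaces get wrapped in double quotes and embedded quotes get a backslash. Whether the child recovers the same array depends entirely on the child's own parser agreeing with these rules.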

Windows itself provides an operating system routine for parsing command
line strings: CommandLineToArgvW. Fetch the command line string
with GetCommandLineW, pass it to this function, and you have your
argc and argv. Plus maybe LocalFree to clean up. It’s only available
in “wide” form, so if you want to work in UTF-8 you’ll also need
WideCharToMultiByte. It’s around 20 lines of C rather than 6 lines of
assembly, but it’s not too bad.

My GetCommandLineW

GetCommandLineW returns a pointer into static storage, which is why it
doesn’t need to be freed. More specifically, it comes from the Process
Environment Block. This got me thinking: Could I locate this address
myself without the API call? First I needed to find the PEB. After some
research I found a PEB pointer in the Thread Information Block,
itself found via the gs register (x64, fs on x86), an old 386 segment
register. Buried in the PEB is a UNICODE_STRING, with the
command line string address. I worked out all the offsets for both x86 and
x64, and the whole thing is just three instructions:

wchar_t *cmdline_fetch(void)
{
    void *cmd = 0;
    #if __amd64
    __asm ("mov %%gs:(0x60), %0\n"
           "mov 0x20(%0), %0\n"
           "mov 0x78(%0), %0\n"
           : "=r"(cmd));
    #elif __i386
    __asm ("mov %%fs:(0x30), %0\n"
           "mov 0x10(%0), %0\n"
           "mov 0x44(%0), %0\n"
           : "=r"(cmd));
    #endif
    return cmd;
}


From Windows XP through Windows 11, this returns exactly the same address
as GetCommandLineW. There’s little reason to do it this way other than to
annoy Raymond Chen, but it’s still neat and maybe has some super niche
use. Technically some of these offsets are undocumented and/or subject to
change, except Microsoft’s own static link CRT also hardcodes all these
offsets. It’s easy to find: disassemble any statically linked program,
look for the gs register, and you’ll find it using these offsets, too.

If you look carefully at the UNICODE_STRING you’ll see the length is
given by a USHORT in units of bytes, despite being a 16-bit wchar_t
string. This is the source of Windows’ maximum command line length
of 32,767 characters (including terminator).

GetCommandLineW is from kernel32.dll, but CommandLineToArgvW is a bit
more off the beaten path in shell32.dll. If you wanted to avoid linking
to shell32.dll for important reasons, you’d need to do the
command line parsing yourself. Many runtimes, including Microsoft’s own
CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s
messier than I expected, and when I started digging into it I wasn’t
expecting it to involve a few days of research.

The CommandLineToArgvW documentation has a rough explanation: split
arguments on whitespace (not defined), quoting is involved, and there’s
something about counting backslashes, but only if they stop on a quote.
It’s not quite enough to implement your own, and if you test against it,
it’s quickly apparent that this documentation is at best incomplete. It
links to a deprecated page about parsing C++ command line arguments with a
few more details. Unfortunately the algorithm described on this page is
not the algorithm used by CommandLineToArgvW, nor is it used by any
runtime I could find. It even varies between Microsoft’s own CRTs. There
is no canonical command line parsing result, not even a de facto standard.

I eventually came across David Deley’s How Command Line Parameters Are
Parsed, which is the closest there is to an authoritative document on
the matter. Unfortunately it focuses on runtimes rather than
CommandLineToArgvW, and so some of those details aren’t captured. In
particular, the first argument (i.e. argv[0]) follows entirely different
rules, which really confused me for a while. The Wine documentation
was helpful particularly for CommandLineToArgvW. As far as I can tell,
they’ve re-implemented it perfectly, matching it bug-for-bug as they do.

My CommandLineToArgvW

Before finding any of this, I started building my own implementation,
which I now believe matches CommandLineToArgvW. These other documents
helped me figure out what I was missing. In my usual fashion, it’s a
little state machine: cmdline.c. The interface:

int cmdline_to_argv8(const wchar_t *cmdline, char **argv);


Unlike the others, mine encodes straight into WTF-8, a superset of
UTF-8 that can round-trip ill-formed UTF-16. The WTF-8 part is negative
lines of code: invisible since it involves not reacting to ill-formed
input. If you use the new-ish UTF-8 manifest Win32 feature then your
program cannot handle command line strings with ill-formed UTF-16, a
problem solved by WTF-8.
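To make the "negative lines of code" point concrete, here is a minimal sketch of a UTF-16 to WTF-8 encoder. The name wtf8_encode and its signature are hypothetical and not taken from cmdline.c. A strict UTF-8 encoder would need an extra branch to reject unpaired surrogates; this one simply lets them fall through to the ordinary 3-byte case.

```c
#include <stddef.h>
#include <stdint.h>

// Encode n UTF-16 code units as WTF-8. Writes at most 3*n bytes to dst
// and returns the number of bytes written. Hypothetical helper, not the
// interface from cmdline.c.
size_t wtf8_encode(unsigned char *dst, const uint16_t *src, size_t n)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        // A high surrogate followed by a low surrogate is one code point.
        if (c >= 0xd800 && c <= 0xdbff && i + 1 < n &&
            src[i+1] >= 0xdc00 && src[i+1] <= 0xdfff) {
            c = 0x10000 + ((c - 0xd800) << 10) + (src[++i] - 0xdc00);
        }
        // Unpaired surrogates are not rejected: they are encoded like any
        // other 3-byte value, the only difference from strict UTF-8.
        if (c < 0x80) {
            dst[len++] = (unsigned char)c;
        } else if (c < 0x800) {
            dst[len++] = (unsigned char)(0xc0 | (c >> 6));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        } else if (c < 0x10000) {
            dst[len++] = (unsigned char)(0xe0 | (c >> 12));
            dst[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        } else {
            dst[len++] = (unsigned char)(0xf0 | (c >> 18));
            dst[len++] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        }
    }
    return len;
}
```

With this sketch a lone 0xD800 becomes the three bytes ED A0 80, which a matching decoder can turn back into the same ill-formed code unit.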

As documented, that argv must be a particular size — a pointer-aligned,
224kB (x64) or 160kB (x86) buffer — which covers the absolute worst case.
That’s not too bad when the command line is limited to 32,766 UTF-16
characters. The worst case argument is a single long sequence of 3-byte
UTF-8. 4-byte UTF-8 requires 2 UTF-16 code units, so there would only be
half as many. The worst case argc is 16,383 (plus one more argv slot
for the null pointer terminator), which is one argument for each pair of
command line characters. The second half (roughly) of the argv is
actually used as a char buffer for the arguments, so it’s all a single,
fixed allocation. There is no error case since it cannot fail.

int mainCRTStartup(void)
{
    static char *argv[CMDLINE_ARGV_MAX];
    int argc = cmdline_to_argv8(cmdline_fetch(), argv);
    return main(argc, argv);
}
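The buffer sizes quoted above can be checked with a little arithmetic. This is a rough accounting under my own assumptions (one terminator byte for the single longest argument, pointer size passed in as a parameter), not the exact formula behind CMDLINE_ARGV_MAX, and the names are hypothetical.

```c
#include <stddef.h>

enum {
    CMD_UNITS_MAX = 32766,              // UTF-16 code units, no terminator
    ARGC_MAX      = CMD_UNITS_MAX / 2,  // one argument per two code units
};

// Worst-case bytes for the combined argv: pointer slots plus char storage.
size_t argv_worst_case(size_t pointer_size)
{
    size_t slots = ARGC_MAX + 1;                    // plus the null slot
    size_t chars = 3 * (size_t)CMD_UNITS_MAX + 1;   // all 3-byte UTF-8
    return slots*pointer_size + chars;
}
```

Here argv_worst_case(8) comes to 229,371 bytes and argv_worst_case(4) to 163,835 bytes, each just under 224 and 160 times 1,024 respectively.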


Also: Note the FUZZ option in my source. It has been pretty thoroughly
fuzz tested. It didn’t find anything, but it does make me more
confident in the result.

I also peeked at some language runtimes to see how others handle it. Just
as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft
CRT. Also expected, CPython implicitly does whatever the underlying C
runtime does, so its exact command line behavior depends on which version
of Visual Studio was used to build the Python binary. OpenJDK
pragmatically calls CommandLineToArgvW. Go (gc) does its own
parsing, with behavior mixed between CommandLineToArgvW and some of
Microsoft’s CRTs, but not quite matching either.

Building a command line string

I’ve always been boggled as to why there’s no complementary inverse to
CommandLineToArgvW. When spawning processes with arbitrary arguments,
everyone is left to implement the inverse of this under-specified and
non-trivial command line format to serialize an argv. Hopefully the
receiver parses it compatibly! There’s no falling back on a system routine
to help out. This has led to a lot of repeated effort: it’s not limited
to high level runtimes, but almost any extensible application (itself a
kind of runtime). Fortunately serializing is not quite as complex as
parsing since many of the edge cases simply don’t come up if done in a
straightforward way.

Naturally, I also wrote my own implementation (same source):

int cmdline_from_argv8(wchar_t *cmdline, int len, char **argv);


Like before, it accepts a WTF-8 argv, meaning it can correctly pass
through ill-formed UTF-16 arguments. It returns the actual command line
length. Since this one can fail when argv is too large, it returns
zero for an error.

char *argv[] = {"python.exe", "-c", code, 0};
wchar_t cmd[CMDLINE_CMD_MAX];
if (!cmdline_from_argv8(cmd, CMDLINE_CMD_MAX, argv)) {
    return "argv too large";
}
if (!CreateProcessW(0, cmd, /*...*/)) {
    return "CreateProcessW failed";
}
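For a sense of what serializing involves, here is a minimal sketch of quoting a single argument using the widely used backslash-doubling convention. It is not cmdline_from_argv8: it works on plain char strings for brevity, the name quote_arg is hypothetical, and the set of characters that trigger quoting (space, tab, newline, vertical tab, quote) is an assumption. It is meant for arguments after argv[0], which follows different rules.

```c
#include <stddef.h>
#include <string.h>

// Append one argument to dst, quoting as needed. Returns the number of
// bytes written (no terminator), or 0 if dst is too small.
size_t quote_arg(char *dst, size_t cap, const char *arg)
{
    size_t len = 0;
    #define PUT(c) do { if (len == cap) return 0; dst[len++] = (c); } while (0)

    // Simple, non-empty arguments pass through untouched.
    if (*arg && !strpbrk(arg, " \t\n\v\"")) {
        for (; *arg; arg++) PUT(*arg);
        return len;
    }

    PUT('"');
    for (;;) {
        size_t backslashes = 0;
        for (; *arg == '\\'; arg++) backslashes++;
        if (!*arg) {
            // Trailing backslashes precede the closing quote: double them.
            for (size_t i = 0; i < 2*backslashes; i++) PUT('\\');
            break;
        } else if (*arg == '"') {
            // Double the backslashes, then escape the quote itself.
            for (size_t i = 0; i < 2*backslashes + 1; i++) PUT('\\');
        } else {
            // Backslashes not followed by a quote are literal.
            for (size_t i = 0; i < backslashes; i++) PUT('\\');
        }
        PUT(*arg++);
    }
    PUT('"');
    return len;
    #undef PUT
}
```

A full serializer would call this for each argument with a space between them. Since even an empty argument produces two quote characters, zero is free to signal failure.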


How do others handle this?


  
    The aged Emacs implementation is written in C rather than Lisp,
steeped in history with vestigial wrong turns. Emacs still only calls
the “narrow” CreateProcessA despite having every affordance to do
otherwise, and uses the wrong encoding at that. A personal
source of headaches.
  
  
    CPython uses Python rather than C via subprocess.list2cmdline.
While undocumented, it’s accessible on any platform and easy to
test against various inputs. Try it out!
  
  
    Go (gc) is just as delightfully boring as I’d expect.
  
  
    OpenJDK optimistically optimizes for command line strings under
80 bytes, and like Emacs, displays the weathering of long use.
  


I don’t plan to write a language implementation anytime soon, where this
might be needed, but it’s nice to know I’ve already solved this problem
for myself!




A new protocol and tool for PNG file attachments
2021-12-31T22:17:26Z
When my articles include diagrams to illustrate a concept, such as a
state machine, I will check the Graphviz, gnuplot, or SVG
source into source control alongside the image in case I need to make
changes in the future. Sometimes I even make the image itself a link to
its source file. I’ve thought it would be convenient if the raster image
somehow contained its own source as metadata so that they don’t get
separated. I looked around and wasn’t satisfied with the solutions I
found, so I wrote one: pngattach.

My approach introduces a new private chunk type atCh (“attachment”)
which contains a file name, a flag to indicate if the attachment is
compressed, and an optionally DEFLATE-compressed blob of file contents. I
tried to follow the spirit of PNG chunk formatting, but without the
constraints I hoped to avoid. A single PNG can contain multiple
attachments, e.g. source file, Makefile, README, license file, etc. The
protocol places constraints on the file names to keep it simple and to
avoid shenanigans: no control bytes (anything below ASCII space), no
directories, and cannot start with a period (no special hidden files). If
that’s too constraining, you could attach a ZIP or TAR.

PNG chunk format

PNG files begin with a fixed 8-byte header followed by a series of
chunks. Each chunk has an 8-byte header and 4-byte footer. The chunk
header is a 32-bit big endian chunk length (not counting header or footer)
and a 4-byte tag identifying its type. The length allows implementations
to skip chunks they don’t recognize.

LLLL TTTT ...chunk... CCCC


The footer is a big endian CRC-32 checksum of the 4-byte type tag and the
chunk body itself.
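As a sketch of the footer computation, here is a bitwise CRC-32 using the reflected polynomial 0xEDB88320, the same checksum zlib computes. The function names are hypothetical, and a real tool would more likely use a table-driven version or the one from its DEFLATE library.

```c
#include <stddef.h>
#include <stdint.h>

// Update a running CRC-32 with more bytes. Start with crc = 0.
uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++) {
            // Shift out one bit, applying the polynomial if it was set.
            crc = (crc >> 1) ^ (0xedb88320 & -(crc & 1));
        }
    }
    return ~crc;
}

// Checksum for a chunk footer: the type tag followed by the body.
uint32_t chunk_crc(const char tag[4], const void *body, size_t len)
{
    uint32_t crc = crc32_update(0, tag, 4);
    return crc32_update(crc, body, len);
}
```

For the empty IEND chunk, chunk_crc("IEND", "", 0) yields 0xAE426082, which is why every PNG ends with the same bytes.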

Chunk tags are interpreted as 4 ASCII characters, where the capitalization
of each letter encodes 4 additional boolean flags. The flags in my tag,
atCh, indicate it’s a non-critical private chunk which doesn’t depend on
the image data.
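The four flags can be read straight off the tag bytes, since in ASCII bit 5 (0x20) is what distinguishes lowercase from uppercase. A small sketch, with a hypothetical struct and function name:

```c
#include <stdbool.h>

struct tagflags {
    bool ancillary;     // 1st letter lowercase: not critical to display
    bool private_use;   // 2nd letter lowercase: not a public chunk type
    bool reserved;      // 3rd letter lowercase: reserved, normally false
    bool safe_to_copy;  // 4th letter lowercase: independent of image data
};

// Decode the flag bit from each of the four tag letters.
struct tagflags tag_flags(const char tag[4])
{
    struct tagflags f;
    f.ancillary    = tag[0] & 0x20;
    f.private_use  = tag[1] & 0x20;
    f.reserved     = tag[2] & 0x20;
    f.safe_to_copy = tag[3] & 0x20;
    return f;
}
```

For atCh this reports ancillary, private, and safe to copy, with the reserved bit clear. For IHDR all four are false.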

PNG always ends with a zero-length IEND chunk, which works out to a kind
of 12-byte constant footer.
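Putting the layout together, here is a minimal sketch of walking the chunks of a PNG already loaded into memory. The function name is hypothetical and CRC validation is left out to keep it short. It relies only on the length field to hop from chunk to chunk, which is exactly how a decoder skips chunk types it has never heard of.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t load32be(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3];
}

// Count the chunks in an in-memory PNG, stopping at IEND. Returns -1 if
// the signature is wrong or a chunk runs past the end of the buffer.
long png_count_chunks(const unsigned char *png, size_t len)
{
    static const unsigned char sig[8] = {
        0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'
    };
    if (len < 8 || memcmp(png, sig, 8)) {
        return -1;
    }

    long count = 0;
    size_t off = 8;
    for (;;) {
        if (len - off < 12) {
            return -1;  // no room for a header and footer
        }
        uint32_t size = load32be(png + off);
        if (size > len - off - 12) {
            return -1;  // body extends past the end
        }
        count++;
        if (!memcmp(png + off + 4, "IEND", 4)) {
            return count;
        }
        off += 12 + (size_t)size;  // skip header, body, and footer
    }
}
```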

Existing chunk types

The PNG standard currently defines three kinds of chunks for storing text
metadata: tEXt, zTXt, and iTXt. The first two are limited to Latin-1 with
LF newlines, and so cannot store UTF-8 source text. The last was
introduced in the PNG 1.2 specification (November 1999), and allows (only)
UTF-8 content with LF newlines. All three have a 1 to 79-byte Latin-1
“key” field, and iTXt has some additional fields describing the
language of the text.

The key field is null-terminated, making it 80 bytes maximum when treated
as a null-terminated string. I believe this constraint exists to aid
implementations, which can rely on this hard upper limit for the key
lengths they’re expected to handle. Otherwise a key could have been up to
4GiB in length.

I had considered using part of the key as a file name, prefixed with a
custom namespace (ex. attachment:FILENAME) to distinguish it from other
text chunks. However, I didn’t like the constraints this placed on the
file name, plus I wanted to support arbitrary file content, not limited to
a particular subformat.

As prior art, there’s a draw.io/diagrams.net format which embeds a source
string without file name. The source string is encoded in base64 (i.e.
unconstrained by PNG), wrapped in XML, then incorrectly encoded as an
iTXt chunk. The XML alone was enough to keep me away from using this
format.

pngattach details

In my attachment protocol, the file name is an arbitrary length,
null-terminated byte string (preferably UTF-8), much like a key field,
with the previously-mentioned anti-shenanigans restrictions. The file name
is followed by a byte, 0 or 1, indicating if the content is compressed
using PNG’s officially-supported compression format. The rest is the
arbitrary content bytes, which presumably the recipient will know how to
use.

LLLL atCh example.txt 0 F ...contents... CCCC


I expect any experienced programmer could write a basic attachment
extractor in their language of choice inside of 30 or so minutes. Hooking
up a DEFLATE library for decompression would be the most difficult part.
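As a starting point for such an extractor, here is a minimal sketch that splits an atCh chunk body into its parts and applies the file name rules. The struct and function names are hypothetical. Treating both / and \ as directory separators and rejecting an empty name are my assumptions about what the rules imply. Decompression is left to the caller.

```c
#include <stddef.h>
#include <string.h>

struct attachment {
    const char *name;           // null-terminated, points into the body
    int compressed;             // 0 = stored, 1 = DEFLATE-compressed
    const unsigned char *data;  // content bytes (not decompressed here)
    size_t len;
};

// Parse an atCh chunk body. Returns 1 on success, 0 if the body is
// malformed or the file name breaks the naming rules.
int atch_parse(struct attachment *a, const unsigned char *body, size_t len)
{
    const unsigned char *end = memchr(body, 0, len);
    if (!end || end == body) {
        return 0;  // missing terminator or empty name
    }
    if (body[0] == '.') {
        return 0;  // no hidden files
    }
    for (const unsigned char *p = body; p < end; p++) {
        // Assumption: "no directories" means no path separators.
        if (*p < ' ' || *p == '/' || *p == '\\') {
            return 0;
        }
    }

    size_t namelen = (size_t)(end - body) + 1;  // includes the terminator
    if (len - namelen < 1 || body[namelen] > 1) {
        return 0;  // missing or invalid compression flag
    }
    a->name = (const char *)body;
    a->compressed = body[namelen];
    a->data = body + namelen + 1;
    a->len = len - namelen - 1;
    return 1;
}
```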

Since it supports multiple attachments and behaves like an archive format,
my tool supports flags much like tar: -c to create attachments
(default and implicit), -t to list attachments, and -x to extract
attachments. PNG data is always passed on standard input and standard
output.

For example, to render a Graphviz diagram and attach the source all at
once:

$ dot -Tpng graph.dot | pngattach graph.dot >graph.png


Later on someone might extract it and tweak it, like so (-v verbose,
lists files as they’re extracted, like tar):

$ pngattach -xv <graph.png


Like tar, it can also write attachments to standard output with -O.
For example, to re-render the image as an SVG:

$ pngattach -xO <graph.png | dot -Tsvg >graph.svg


Strictly processing standard input to output, rather than taking the input
as an argument, is something I’ve been trying lately. I’m pretty happy
with my command line design for pngattach. The real test will
happen in the future, when I’ve forgotten the details and have to figure
it out again from my own documentation.

Curiously, lots of common software refuses to handle PNGs containing large
chunks, and so your PNG may not display if you attach a file even as small
as a few MiB. A defense against denial of service?

Example PNG

I haven’t gone back and embedded attachments in any older articles, but I
may do so in future articles. If you wanted to try it out for yourself,
either with my tool or writing your own for fun, this PNG contains a
compressed attachment:



I produced it like so (with the help of ImageMagick):

$ echo P3 1 1 1 0 1 0 |
      convert ppm:- -resize 200 png:- |
      pngattach message.txt >atch-test.png


Error handling (addendum)

Another technique I’ve been trying is Go-style error value returns in C
programs, where the errors-as-values are const char * pointers to static
string buffers. The contents contain an error message to be displayed to
the user, and errors may be wrapped in more context (what file, what
operation, etc.) as the stack unwinds. A null pointer means no error, i.e.
nil. I’ve used this extensively in pngattach. Examples of the style:

    int *p;

    if (nelem > (size_t)-1/sizeof(*p)) {
        return "out of memory";  // overflow
    }

    p = malloc(nelem*sizeof(*p));
    if (!p) {
        return "out of memory";
    }

    // ...

    if (!fwrite(buf, len, 1, stdout)) {
        free(p);
        return "write error";
    }


An errwrap() function builds a new error string in a static buffer. This
simple solution wouldn’t work in a multi-threaded program, but that’s not
the case here. Mine toggles between two static buffers so that it can wrap
recursively.

const char *
errwrap(const char *pre, const char *post)
{
    static char errtmp[2][256], i;
    int n = i = !i;  // toggle between two static buffers
    snprintf(errtmp[n], sizeof(errtmp[n]), "%s: %s", pre, post);
    return errtmp[n];
}


Then I can do stuff like:

    FILE *f = fopen(path, "wb");
    if (!f) {
        return errwrap("failed to open file", path);
    }


And that can keep being wrapped on the way up:

    err = png_write(path);
    if (err) {
        return errwrap("writing PNG", err);
    }


So that ultimately the user sees something like:

pngattach: writing PNG: failed to open file: example.png


That’s always printed by a single error printout block at the top level,
where all errors are ultimately routed.

int main(int argc, char **argv)
{
    // ...

    err = run(options);
    if (err) {
        fprintf(stderr, "pngattach: %s\n", err);
        return 1;
    }
    return 0;
}
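To see the two-buffer toggle at work, here is a small self-contained sketch that chains the pieces from this section. The helper open_output and this stand-in png_write are hypothetical, not the functions from pngattach. The inner error lives in one static buffer while the outer wrap is written into the other, so the wrapped message is never read and overwritten at the same time.

```c
#include <stdio.h>

// Same idea as errwrap() above: alternate between two static buffers so
// that an already-wrapped error can be wrapped again.
static const char *errwrap(const char *pre, const char *post)
{
    static char errtmp[2][256], i;
    int n = i = !i;
    snprintf(errtmp[n], sizeof(errtmp[n]), "%s: %s", pre, post);
    return errtmp[n];
}

// Hypothetical leaf operation: returns a null pointer on success.
static const char *open_output(const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) {
        return errwrap("failed to open file", path);
    }
    fclose(f);
    return 0;
}

// Hypothetical caller adding its own context on the way up.
const char *png_write(const char *path)
{
    const char *err = open_output(path);
    if (err) {
        return errwrap("writing PNG", err);
    }
    return 0;
}
```

Calling png_write with a path inside a directory that doesn't exist returns the fully wrapped message, ready for the top-level printout to prefix with the program name.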





Some sanity for C and C++ development on Windows
2021-12-30T23:25:53Z
A hard reality of C and C++ software development on Windows is that there
has never been a good, native C or C++ standard library implementation for
the platform. A standard library should abstract over the underlying host
facilities in order to ease portable software development. On Windows, C
and C++ is so poorly hooked up to operating system interfaces that most
portable or mostly-portable software — programs which work perfectly
elsewhere — are subtly broken on Windows, particularly outside of the
English-speaking world. The reasons are almost certainly political,
originally motivated by vendor lock-in, rather than technical, which adds
insult to injury. This article is about what’s wrong, how it’s wrong, and some
easy techniques to deal with it in portable software.

There are multiple C implementations, so how could they all be
bad, even the early ones? Microsoft’s C runtime has defined how
the standard library should work on the platform, and everyone else
followed along for the sake of compatibility. I’m excluding Cygwin and
its major fork, MSYS2, despite not inheriting any of these flaws. They
change so much that they’re effectively whole new platforms, not truly
“native” to Windows.

In practice, C++ standard libraries are implemented on top of a C standard
library, which is why C++ shares the same problems. CPython dodges these
issues: Though written in C, on Windows it bypasses the broken C standard
library and directly calls the proprietary interfaces. Other language
implementations, such as “gc” Go, simply aren’t built on C at all, and
instead do things correctly in the first place — the behaviors the C
runtimes should have had all along.

If you’re just working on one large project, bypassing the C runtime isn’t
such a big deal, and you’re likely already doing so to access important
platform functionality. You don’t really even need a C runtime. However,
if you write many small programs, as I do, writing the same
special Windows support for each one ends up being most of the work, and
honestly makes properly supporting Windows not worth the trouble. I end up
just accepting the broken defaults most of the time.

Before diving into the details, if you’re looking for a quick-and-easy
solution for the Mingw-w64 toolchain, including w64devkit, which
magically makes your C and C++ console programs behave well on Windows,
I’ve put together a “library” named libwinsane. It solves all
problems discussed in this article, except for one. No source changes
required, simply link it into your program.

What exactly is broken?

The Windows API comes in two flavors: narrow with an “A” (“ANSI”) suffix,
and wide (Unicode, UTF-16) with a “W” suffix. The former is the legacy
API, where an active code page maps 256 bytes onto (up to) 256 specific
characters. On typical machines configured for European languages, this
means code page 1252. Roughly speaking, Windows
internally uses UTF-16, and calls through the narrow interface use the
active code page to translate the narrow strings to wide strings. The
result is that calls through the narrow API have limited access to the
system.

The UTF-8 encoding was invented in 1992 and standardized by January 1993.
UTF-8 was adopted by the unix world over the following years due to its
backwards-compatibility with its existing interfaces. Programs
could read and write Unicode data, access Unicode paths, pass Unicode
arguments, and get and set Unicode environment variables without needing
to change anything. Today UTF-8 has become the dominant text encoding
format in the world, in large part due to the world wide web.

In July 1993, Microsoft introduced the wide Windows API with the release
of Windows NT 3.1, placing all their bets on UCS-2 (later UTF-16) rather
than UTF-8. This turned out to be a mistake, since UTF-16 is inferior to
UTF-8 in practically every way, though admittedly some problems
weren’t so obvious at the time.

The major problem: The C and C++ standard libraries only hook up to the
narrow Windows interfaces. The standard library, and therefore typical
portable software on Windows, cannot handle anything but ASCII. The
effective result is that these programs:


  Cannot accept non-ASCII arguments
  Cannot get/set non-ASCII environment variables
  Cannot access non-ASCII paths
  Cannot read and write non-ASCII on a console


Doing any of these requires calling proprietary functions, treating
Windows as a special target. It’s part of what makes correctly porting
software to Windows so painful.

The sensible solution would have been for the C runtime to speak UTF-8 and
connect to the wide API. Alternatively, the narrow API could have been
changed over to UTF-8, phasing out the old code page concept. In theory
this is what the UTF-8 “code page” is about, though it doesn’t always
work. There would have been compatibility problems with abruptly making
such a change, but until very recently, this wasn’t even an option. Why
couldn’t there be a switch I could flip to get sane behavior that works
like every other platform?

How to mostly fix Unicode support

In 2019, Microsoft introduced a feature to allow programs to request
UTF-8 as their active code page on start, along with supporting
UTF-8 on more narrow API functions. This is like the magic switch I
wanted, except that it involves embedding some ugly XML into your binary
in a particular way. At least it’s now an option.

For Mingw-w64, that means writing a resource file like so:

#include <winuser.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "utf8.xml"


Compiling it with windres:

$ windres -o manifest.o manifest.rc


Then linking that into your program. Amazingly it mostly works! Programs
can access Unicode arguments, Unicode environment variables, and Unicode
paths, including with fopen, just as it’s worked on other platforms for
decades. Since the active code page is set at load time, it happens before
argv is constructed (from GetCommandLineA), which is why that works
out.

Alternatively you could create a “side-by-side assembly” placing that XML
in a file with the same name as your EXE but with .manifest suffix
(after the .exe suffix), then placing that next to your EXE. Just be
mindful that there’s a “side-by-side” cache (WinSxS), and so it might not
immediately pick up your changes.

What doesn’t work is console input and output since the console is
external to the process, and so isn’t covered by the process’s active code
page. It must be configured separately using a proprietary call:

SetConsoleOutputCP(CP_UTF8);


Annoying, but at least it’s not that painful. This only covers output,
though, meaning programs can only print UTF-8. Unfortunately UTF-8 input
still doesn’t work, and setting the input code page doesn’t do
anything despite reporting success:

SetConsoleCP(CP_UTF8);  // doesn't work


If you care about reading interactive Unicode input, you’re stuck
bypassing the C runtime since it’s still broken.

Text stream translation

Another long-standing issue is that C and C++ on Windows has distinct
“text” and “binary” streams, which it inherited from DOS. Mainly this
means automatic newline conversion between CRLF and LF. The C standard
explicitly allows for this, though unix-like platforms have never actually
distinguished between text and binary streams.

The standard also specifies that standard input, output, and error are all
open as text streams, and there’s no portable method to change the stream
mode to binary — a serious deficiency with the standard. On unix-likes
this doesn’t matter, but on Windows it means programs can’t read or write
binary data on standard streams without calling a non-standard function.
It also means reading and writing standard streams is slow, frequently a
bottleneck unless I route around it.

Personally, I like writing binary data to standard output,
including video, and sometimes binary filters that also read
binary input. I do it so often that in probably half my C programs I have
this snippet in main just so they work correctly on Windows:

    #ifdef _WIN32
    int _setmode(int, int);
    _setmode(0, 0x8000);
    _setmode(1, 0x8000);
    #endif


That incantation sets standard input and output in the C runtime to binary
mode without the need to include a header, making it compact, simple, and
self-contained.

This built-in newline translation, along with the Windows standard text
editor, Notepad, lagging decades behind, meant that many other
programs, including Git, grew their own, annoying, newline conversion
misfeatures that cause other problems.

libwinsane

I introduced libwinsane at the beginning of the article, which fixes all
this simply by being linked into a program. It includes the magic XML
manifest .rsrc section, configures the console for UTF-8 output, and
sets standard streams to binary before main (via a GCC constructor). I
called it a “library”, but it’s actually a single object file. It can’t be
a static library since it must be linked into the program despite not
actually being referenced by the program.

So normally this program:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char *arg = argv[argc-1];
    size_t len = strlen(arg);
    printf("%zu %s\n", len, arg);
}


Compiled and run:

C:\>cc -o example example.c
C:\>example π
1 p


As usual, the Unicode argument is silently mangled into one byte. Linked
with libwinsane, it just works like everywhere else:

C:\>gcc -o example example.c libwinsane.o
C:\>example π
2 π


If you’re maintaining a substantial program, you probably want to copy and
integrate the necessary parts of libwinsane into your project and build,
rather than always link against this loose object file. This is more for
convenience and for succinctly capturing the concept. You may even want to
enable ANSI escape processing in your version.

Update December 2024: Pavel Galkin demonstrates how libwinsane.o
changes the console state, which affects all processes associated
with the terminal. This is mostly unavoidable, and it’s one reason I’ve
since concluded that UTF-8 manifests are a poor solution. Better to solve
the problem using a platform layer.




Fast CSV processing with SIMD
2021-12-04T01:13:33Z
This article was discussed on Hacker News.

I recently learned of csvquote, a tool that encodes troublesome
CSV characters such that unix tools can correctly process them. It
reverses the encoding at the end of the pipeline, recovering the original
input. The original implementation handles CSV quotes using the
straightforward, naive method. However, there’s a better approach that is
not only simpler, but around 3x faster on modern hardware. Even more,
there’s yet another approach using SIMD intrinsics, plus some bit
twiddling tricks, which increases the processing speed by an order of
magnitude. My csvquote implementation includes both
approaches.



Background

Records in CSV data are separated by line feeds, and fields are separated
by commas. Fields may be quoted.

aaa,bbb,ccc
xxx,"yyy",zzz


Fields containing a line feed (U+000A), quotation mark (U+0022), or comma
(U+002C), must be quoted, otherwise they would be ambiguous with the CSV
formatting itself. Quoted quotation marks are turned into a pair of
quotes. For example, here are two records with two fields apiece:

"George Herman ""Babe"" Ruth","1919–1921, 1923, 1926"
"Frankenstein;
or, The Modern Prometheus",Mary Shelley


A CSV-unaware tool splitting on commas and line feeds (ex. awk) would
process these records improperly. So csvquote translates quoted line feeds
into record separators (U+001E) and commas into unit separators (U+001F).
These control characters rarely appear in normal text data, and can be
trivially processed in UTF-8-encoded text without decoding or encoding.
The above records become:

"George Herman ""Babe"" Ruth","1919–1921\x1f 1923\x1f 1926"
"Frankenstein;\x1eor\x1f The Modern Prometheus",Mary Shelley


I’ve used \x1e and \x1f here to illustrate the control characters.

The data is exactly the same length since it’s a straight byte-for-byte
replacement. Quotes are left entirely untouched. The challenge is parsing
the quotes to track whether the two special characters fall inside or
outside pairs of quotes.

State machine improvements

The original csvquote walks the input a byte at a time and is in one of
three states:


  Outside quotes (initial state)
  Inside quotes
  On a possibly “escaped” quote (the first " in a "")


Since I love state machines so much, here it is translated into a
switch-based state machine:

// Return the next state given an input character.
int next(int state, int c)
{
    switch (state) {
    case 1: return c == '"' ? 2 : 1;
    case 2: return c == '"' ? 3 : 2;
    case 3: return c == '"' ? 2 : 1;
    }
}




The real program also has more conditions for potentially making a
replacement. It’s an awful lot of performance-killing branching.

However, this context is about finding “in” and “out” — not validating
the CSV — so the “escape” state is unnecessary. I need only match up pairs
of quotes. An “escaped” quote can be considered terminating a quoted
region and immediately starting a new quoted region. That means there’s
just the first two states in a trivial arrangement:

int next(int state, int c)
{
    switch (state) {
    case 1: return c == '"' ? 2 : 1;
    case 2: return c == '"' ? 1 : 2;
    }
}




Since the text can be processed as bytes, there are only 256 possible
inputs. With 2 states and 256 inputs, this state machine, with
replacement machinery, can be implemented with a 512-byte table and no
branches. Here’s the table initialization:

unsigned char table[2][256];

void init(void)
{
    for (int i = 0; i < 256; i++) {
        table[0][i] = i;
        table[1][i] = i;
    }
    table[1]['\n'] = 0x1e;
    table[1][',']  = 0x1f;
}


In the first state, characters map onto themselves. In the second state,
characters map onto their replacements. This is the entire encoder and
decoder:

void encode(unsigned char *buf, size_t len)
{
    int state = 0;
    for (size_t i = 0; i < len; i++) {
        state ^= (buf[i] == '"');
        buf[i] = table[state][buf[i]];
    }
}


Well, strictly speaking, the decoder need not process quotes. By my
benchmark (csvdump in my implementation) this processes at ~1 GiB/s on
my laptop — 3x faster than the original. However, there’s still
low-hanging fruit to be picked!
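Since the decoder need not track quotes, it can be even simpler than the encoder. A minimal sketch follows, paired with a table-free equivalent of the encoder so that it stands alone. It assumes every 0x1e and 0x1f byte in the stream was produced by the encoder.

```c
#include <stddef.h>

// Table-free equivalent of the encoder above.
void encode(unsigned char *buf, size_t len)
{
    int state = 0;
    for (size_t i = 0; i < len; i++) {
        state ^= (buf[i] == '"');
        if (state && buf[i] == '\n') buf[i] = 0x1e;
        if (state && buf[i] == ',')  buf[i] = 0x1f;
    }
}

// Restore the original bytes. No quote tracking is needed because the
// replacement characters only ever appear where the encoder put them.
void decode(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        switch (buf[i]) {
        case 0x1e: buf[i] = '\n'; break;
        case 0x1f: buf[i] = ',';  break;
        }
    }
}
```

Encoding and then decoding a buffer returns it to its original contents, as long as the input had no stray 0x1e or 0x1f bytes to begin with.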

SIMD and two’s complement

Any decent SIMD implementation is going to make use of masking. Find the
quotes, compute a mask over quoted regions, compute another mask for
replacement matches, combine the masks, then use that mask to blend the
input with the replacements. Roughly:

quotes    = find_quoted_regions(input)
linefeeds = input == '\n'
commas    = input == ','
output    = blend(input, '\x1e', quotes & linefeeds)
output    = blend(output, '\x1f', quotes & commas)


The hard part is computing the quote mask, and also somehow handle quoted
regions straddling SIMD chunks (not pictured), and do all that without
resorting to slow byte-at-time operations. Fortunately there are some
bitwise tricks that can resolve each issue.

Imagine I load 32 bytes into a SIMD register (e.g. AVX2), and I compute a
32-bit mask where each bit corresponds to one byte. If that byte contains
a quote, the corresponding bit is set.
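For experimenting without SIMD, the same mask can be built a byte at a time. This is only a portable stand-in with a hypothetical name: a real implementation would use a vector byte comparison followed by a movemask instruction. It places the first byte in the highest bit, matching the way the bits are drawn next to the text in this section.

```c
#include <stdint.h>

// Build the 32-bit quote mask for one 32-byte chunk, first byte in the
// highest bit. Scalar stand-in for a SIMD compare plus movemask.
uint32_t quote_mask(const unsigned char chunk[32])
{
    uint32_t mask = 0;
    for (int i = 0; i < 32; i++) {
        mask = mask << 1 | (chunk[i] == '"');
    }
    return mask;
}
```

Applied to the 32-byte example chunk used in this section, the sketch yields 0x8001860a, the bit pattern drawn beneath the text.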

"George Herman ""Babe"" Ruth","1
10000000000000011000011000001010


That last/lowest 1 corresponds to the beginning of a quoted region. For my
mask, I’d like to set all bits following that bit. I can do this by
subtracting 1.

"George Herman ""Babe"" Ruth","1
10000000000000011000011000001001


Using the Kernighan technique I can also remove this bit from the
original input by ANDing them together.

"George Herman ""Babe"" Ruth","1
10000000000000011000011000001000


Now I’m left with a new bottom bit. If I repeat this, I build up layers of
masks, one for each input quote.

10000000000000011000011000001001
10000000000000011000011000000111
10000000000000011000010111111111
10000000000000011000001111111111
10000000000000010111111111111111
10000000000000001111111111111111
01111111111111111111111111111111


Remember how I use XOR in the state machine above to toggle between
states? If I XOR all these together, I toggle the quotes on and off,
building up quoted regions:

"George Herman ""Babe"" Ruth","1
01111111111111100111100111110001


However, for reasons I’ll explain shortly, it’s critical that the opening
quote is included in this mask. If I XOR the pre-subtracted value with the
mask when I compute the mask, I can toggle the remaining quotes on and off
such that the opening quotes are included. Here’s my function:

uint32_t find_quoted_regions(uint32_t x)
{
    uint32_t r = 0;
    while (x) {
        r ^= x;
        r ^= x - 1;
        x &= x - 1;
    }
    return r;
}


Which gives me exactly what I want:

"George Herman ""Babe"" Ruth","1
11111111111111101111101111110011


It’s important that the opening quote is included because it means a
region that begins on the last byte will have that last bit set. I can use
that last bit to determine if the next chunk begins in a quoted state. If
a chunk begins in a quoted state, I need only NOT the whole result to
reverse the quoted regions.

How can I “sign extend” a 1 into all bits set, or do nothing for zero?
Negate it!

    uint32_t carry  = -(prev & 1);
    uint32_t quotes = find_quoted_regions(input) ^ carry;
    // ...
    prev = quotes;


That takes care of computing quoted regions and chaining them between
chunks. The loop will unfortunately cause branch prediction penalties if
the input has lots of quotes, but I couldn’t find a way around this.

However, I’ve made a serious mistake. I’m using _mm256_movemask_epi8 and
it puts the first byte in the lowest bit. Doh! That means it looks like
this:

1","htuR ""ebaB"" namreH egroeG"
01010000011000011000000000000001


There’s no efficient way to flip the bits around, so I just need to find a
way to work in the other direction. To flip the bits to the left of a set
bit, negate it.

00000000000000000000000010000000 = +0x00000080
11111111111111111111111110000000 = -0x00000080


Unlike before, this keeps the original bit set, so I need to XOR the
original value into the input to flip the quotes. This is as simple as
initializing to the input rather than zero. The new loop:

uint32_t find_quoted_regions(uint32_t x)
{
    uint32_t r = x;
    while (x) {
        r ^= -x ^ x;
        x &= x - 1;
    }
    return r;
}


The result:

1","htuR ""ebaB"" namreH egroeG"
11001111110111110111111111111111


The carry now depends on the high bit rather than the low bit:

uint32_t carry = -(prev >> 31);
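
To see how the pieces fit together, here is a minimal scalar sketch of the chunk loop. It is not the SIMD implementation: byte_mask is a hypothetical byte-at-a-time stand-in for the compare plus _mm256_movemask_epi8, encode is an illustrative name, and the replacement bytes 0x1e and 0x1f are an assumption about which two control characters are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Same loop as above: the first byte of the chunk is the lowest bit.
static uint32_t find_quoted_regions(uint32_t x)
{
    uint32_t r = x;
    while (x) {
        r ^= -x ^ x;
        x &= x - 1;
    }
    return r;
}

// Hypothetical stand-in for compare + movemask: bit i is set if buf[i] == c.
static uint32_t byte_mask(const unsigned char *buf, size_t n, unsigned char c)
{
    assert(n <= 32);
    uint32_t m = 0;
    for (size_t i = 0; i < n; i++) {
        m |= (uint32_t)(buf[i] == c) << i;
    }
    return m;
}

// Replace quoted linefeeds and commas in place, 32 bytes per chunk.
// The 0x1e/0x1f replacement bytes are an assumption for illustration.
void encode(unsigned char *buf, size_t len)
{
    uint32_t prev = 0;
    for (size_t off = 0; off < len; off += 32) {
        size_t n = len - off < 32 ? len - off : 32;
        unsigned char *p = buf + off;
        uint32_t carry  = -(prev >> 31);  // all ones if still inside quotes
        uint32_t quotes = find_quoted_regions(byte_mask(p, n, '"')) ^ carry;
        uint32_t lf     = quotes & byte_mask(p, n, '\n');
        uint32_t comma  = quotes & byte_mask(p, n, ',');
        for (size_t i = 0; i < n; i++) {  // scalar "blend"
            if (lf    >> i & 1) p[i] = 0x1e;
            if (comma >> i & 1) p[i] = 0x1f;
        }
        prev = quotes;
    }
}
```

A short final chunk is handled by building and applying only the low n bits of each mask.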


Reversing movemask

The next problem: for reasons I don’t understand, AVX2 does not include
the inverse of _mm256_movemask_epi8. Converting the bit-mask back into a
byte-mask requires some clever shuffling. Fortunately I’m not the first
to have this problem, and so I didn’t have to figure it out from
scratch.

First fill the 32-byte register with repeated copies of the 32-bit mask.

abcdabcdabcdabcdabcdabcdabcdabcd


Shuffle the bytes so that the first 8 register bytes have the same copy of
the first bit-mask byte, etc.

aaaaaaaabbbbbbbbccccccccdddddddd


In byte 0 I care only about bit 0, in byte 1 only about bit 1, and so on:
in byte N I care only about bit N%8. I can pre-compute a mask to
isolate each of these bits and produce a proper byte-wise mask from the
bit-mask. Fortunately all this isn’t too bad: four instructions instead of
the one I had wanted. It looks like a lot of code, but it’s really only a
few instructions.
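
The steps above can be sketched in plain C, one loop per SIMD step. This is a scalar illustration of the data movement under my own naming (expand_mask is hypothetical), not the intrinsics themselves. In real AVX2 the broadcast matters because the byte shuffle only moves bytes within each 16-byte lane, so every lane needs its own copy of the mask.

```c
#include <assert.h>
#include <stdint.h>

// Expand a 32-bit mask into 32 bytes of 0x00 or 0xff, mirroring each SIMD
// step with a plain loop. All names here are illustrative.
void expand_mask(unsigned char out[32], uint32_t mask)
{
    assert(out);

    // 1. Broadcast: repeated copies of the 4 mask bytes (abcdabcd...).
    unsigned char reg[32];
    for (int i = 0; i < 32; i++) {
        reg[i] = (unsigned char)(mask >> 8*(i % 4));
    }

    // 2. Shuffle: byte i takes mask byte i/8 (aaaaaaaabbbbbbbb...).
    unsigned char shuf[32];
    for (int i = 0; i < 32; i++) {
        shuf[i] = reg[i / 8];
    }

    // 3. AND with a pre-computed pattern isolating bit i%8 in byte i, then
    // 4. compare with that pattern so a surviving bit becomes 0xff.
    for (int i = 0; i < 32; i++) {
        unsigned char bit = (unsigned char)(1 << (i % 8));
        out[i] = (shuf[i] & bit) == bit ? 0xff : 0x00;
    }
}
```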

Results

In my benchmark, which includes randomly occurring quoted fields, the SIMD
version processes at ~4 GiB/s — 10x faster than the original. I haven’t
profiled, but I expect mispredictions on the bit-mask loop are the main
obstacle preventing the hypothetical 32x speedup.

My version also optionally rejects inputs containing the two special
control characters since the encoding would be irreversible. This is
implemented in SIMD when available, and it slows processing by around 10%.

Followup: PCLMULQDQ

Geoff Langdale and others have graciously pointed out PCLMULQDQ,
which can compute the quote masks using carryless multiplication
(also) entirely in SIMD and without a loop. I haven’t yet quite
worked out exactly how to apply it, but it should be much faster.
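
One way to read that suggestion, offered as a sketch rather than a worked-out implementation: the quote mask is a prefix XOR, where bit i is the parity of the quote bits at positions 0 through i, and a carryless multiplication by an all-ones constant computes exactly that. The shift-and-XOR ladder below is a portable stand-in for the multiplication, not PCLMULQDQ itself, and prefix_xor and check_same are names made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

// The loop from above, first byte in the lowest bit.
uint32_t find_quoted_regions(uint32_t x)
{
    uint32_t r = x;
    while (x) {
        r ^= -x ^ x;
        x &= x - 1;
    }
    return r;
}

// Loop-free equivalent: bit i of the result is the XOR of bits 0..i of x.
uint32_t prefix_xor(uint32_t x)
{
    x ^= x <<  1;
    x ^= x <<  2;
    x ^= x <<  4;
    x ^= x <<  8;
    x ^= x << 16;
    return x;
}

// Abort if the two functions ever disagree on x.
void check_same(uint32_t x)
{
    assert(prefix_xor(x) == find_quoted_regions(x));
}
```

The five steps are fixed regardless of how many quotes appear in the chunk, which is the property that removes the unpredictable branch.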




OpenBSD's pledge and unveil from Python
2021-09-15T02:46:56Z
This article was discussed on Hacker News.

Years ago, OpenBSD gained two new security system calls, pledge(2)
(originally tame(2)) and unveil. In both, an application
surrenders capabilities at run-time. The idea is to perform initialization
like usual, then drop capabilities before handling untrusted input,
limiting unwanted side effects. This feature is applicable even where memory
safety isn’t an issue, such as Python, where a program might still get
tricked into accessing sensitive files or making network connections when
it shouldn’t. So how can a Python program access these system calls?

As discussed previously, it’s quite easy to access C APIs from
Python through its ctypes package, and this is no exception.
In this article I show how to do it. Here’s the full source if you want to
dive in: openbsd.py.



I’ve chosen these extra constraints:


  
    As extra safety features, unnecessary for correctness, attempts to call
these functions on systems where they don’t exist will silently do
nothing, as though they succeeded. They’re provided as a best effort.
  
  
    Systems other than OpenBSD may support these functions, now or in the
future, and it would be nice to automatically make use of them when
available. This means no checking for OpenBSD specifically but instead
feature sniffing for their presence.
  
  
    The interfaces should be Pythonic as though they were implemented in
Python itself. Raise exceptions for errors, and accept strings since
they’re more convenient than bytes.
  


For reference, here are the function prototypes:

int pledge(const char *promises, const char *execpromises);
int unveil(const char *path, const char *permissions);


The string-oriented interface of pledge will make this a whole
lot easier to implement.

Finding the functions

The first step is to grab functions through ctypes. Like a lot of Python
documentation, this area is frustratingly imprecise and under-documented.
I want to grab a handle to the already-linked libc and search for either
function. However, getting that handle is a little different on each
platform, and in the process I saw four different exceptions, only one of
which is documented.

I came up with passing None to ctypes.CDLL, which ultimately just passes
NULL to dlopen(3). That’s really all I wanted. Currently on
Windows this is a TypeError. Once the handle is in hand, try to access the
pledge attribute, which will fail with AttributeError if it doesn’t
exist. In the event of any exception, just assume the behavior isn’t
available. If found, I also define the function prototype for ctypes.

_pledge = None
try:
    _pledge = ctypes.CDLL(None, use_errno=True).pledge
    _pledge.restype = ctypes.c_int
    _pledge.argtypes = ctypes.c_char_p, ctypes.c_char_p
except Exception:
    _pledge = None


Catching a broad Exception isn’t great, but it’s the best we can do since
the documentation is incomplete. From this block I’ve seen TypeError,
AttributeError, FileNotFoundError, and OSError. I wouldn’t be surprised if
there are more possibilities, and I don’t want to risk missing them.

Note that I’m catching Exception rather than using a bare except. My
code will not catch KeyboardInterrupt nor SystemExit. This is deliberate,
and I never want to catch these.
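
For comparison, here is roughly what that lookup amounts to when written directly in C with dlopen(3) and dlsym(3). This is a sketch of the same feature-sniffing idea, not what ctypes literally executes. The names find_symbol, pledge_fn, and try_pledge are hypothetical, and older glibc versions may need -ldl when linking.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

typedef int pledge_fn(const char *promises, const char *execpromises);

// Search the running program and its loaded libraries for a symbol, which
// is what passing NULL to dlopen(3) sets up. Returns NULL if absent.
static void *find_symbol(const char *name)
{
    assert(name);
    void *self = dlopen(NULL, RTLD_LAZY);
    return self ? dlsym(self, name) : NULL;
}

// Best effort, like the Python wrapper: returns 0 when pledge succeeds or
// does not exist on this system, and -1 with errno set when it fails.
int try_pledge(const char *promises, const char *execpromises)
{
    pledge_fn *f = (pledge_fn *)find_symbol("pledge");
    return f ? f(promises, execpromises) : 0;
}
```

On a system without pledge, try_pledge("stdio", NULL) quietly returns 0, matching the first constraint above.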

The same story for unveil:

_unveil = None
try:
    _unveil = ctypes.CDLL(None, use_errno=True).unveil
    _unveil.restype = ctypes.c_int
    _unveil.argtypes = ctypes.c_char_p, ctypes.c_char_p
except Exception:
    _unveil = None


Pythonic wrappers

The next and final step is to wrap the low-level calls in interfaces that
hide their C and ctypes nature.

Python strings must be encoded to bytes before they can be passed to C
functions. Rather than make the caller worry about this, we’ll let them
pass friendly strings and have the wrapper do the conversion. Either may
also be NULL, so None is allowed.

def pledge(promises: Optional[str], execpromises: Optional[str]):
    if not _pledge:
        return  # unimplemented

    r = _pledge(None if promises is None else promises.encode(),
                None if execpromises is None else execpromises.encode())
    if r == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))


As usual, a return of -1 means there was an error, in which case we fetch
errno and raise the appropriate OSError.

unveil works a little differently since the first argument is a path.
Python functions that accept paths, such as open, generally accept
either strings or bytes. On unix-like systems, paths are fundamentally
bytestrings and not necessarily Unicode, so it’s necessary to accept
bytes. Since strings are nearly always more convenient, they take both.
The unveil wrapper here will do the same. If it’s a string, encode it,
otherwise pass it straight through.

def unveil(path: Union[str, bytes, None], permissions: Optional[str]):
    if not _unveil:
        return  # unimplemented

    r = _unveil(path.encode() if isinstance(path, str) else path,
                None if permissions is None else permissions.encode())
    if r == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))


That’s it!

Trying it out

Let’s start with unveil. Initially a process has access to the whole
file system with the usual restrictions. On the first call to unveil
it’s immediately restricted to some subset of the tree. Each call reveals
a little more until a final NULL which locks it in place for the rest of
the process’s existence.

Suppose a program has been tricked into accessing your shell history,
perhaps by mishandling a path:

def hackme():
    try:
        with open(pathlib.Path.home() / ".bash_history"):
            print("You've been hacked!")
    except FileNotFoundError:
        print("Blocked by unveil.")

hackme()


If you’re a Bash user, this prints:

You've been hacked!


Using our new feature to restrict the program’s access first:

# restrict access to static program data
unveil("/usr/share", "r")
unveil(None, None)

hackme()


On OpenBSD this now prints:

Blocked by unveil.


Working just as it should!

With pledge we declare what abilities we’d like to keep by supplying a
list of promises, pledging to use only those abilities afterward. A
common case is the stdio promise which allows reading and writing of
open files, but not opening files. A program might open its log file,
then drop the ability to open files while retaining the ability to write
to its log.

An invalid or unknown promise is an error. Does that work?

>>> pledge("doesntexist", None)
OSError: [Errno 22] Invalid argument


So far so good. How about the functionality itself?

pledge("stdio", None)
hackme()


The program is instantly killed when making the disallowed system call:

Abort trap (core dumped)


If you want something a little softer, include the error promise:

pledge("stdio error", None)
hackme()


Instead it’s an exception, which is a lot easier to debug in Python:

OSError: [Errno 78] Function not implemented


The core dump isn’t going to be much help to a Python program, so you
probably always want to use this promise. In general you need to be extra
careful about pledge in complex runtimes like Python’s which may
reasonably need to do many arbitrary, undocumented things at any time.




Billions of Code Name Permutations in 32 bits
2021-09-14T21:06:59Z
My friend over at Possibly Wrong created a code name generator. By
coincidence I happened to be thinking about code names myself while
recently replaying XCOM: Enemy Within (2012/2013). The game
generates a random code name for each mission, and I wondered how often it
repeats. The UFOpaedia page on the topic gives the word lists: 53
adjectives and 76 nouns, for a total of 4028 possible code names. A
typical game has around 60 missions, and if code names are generated
naively on the fly, then per the birthday paradox roughly a third of all
games will see a repeated mission code name! Fortunately this is easy to avoid,
and the particular configuration here lends itself to an interesting
implementation.
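
The birthday estimate is easy to check numerically. This small sketch assumes each mission draws uniformly and independently from all code names, which is my assumption about "naive" generation. The function name repeat_probability is made up for illustration.

```c
#include <assert.h>

// Probability that at least two of `draws` uniformly random picks from
// `names` equally likely code names coincide.
double repeat_probability(int names, int draws)
{
    assert(names > 0 && draws >= 0);
    if (draws > names) {
        return 1.0;  // pigeonhole: a repeat is certain
    }
    double unique = 1.0;
    for (int k = 0; k < draws; k++) {
        unique *= (double)(names - k) / names;
    }
    return 1.0 - unique;
}
```

Under that assumption, repeat_probability(4028, 60) comes out to roughly 0.36.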

Mission code names are built using “adjective noun”. Some examples
from the game’s word list:


  Fading Hammer
  Fallen Jester
  Hidden Crown


To generate a code name, we could select a random adjective and a random
noun, but as discussed it wouldn’t take long for a collision. The naive
approach is to keep a database of previously-generated names, and to
consult this database when generating new names. That works, but there’s
an even better solution: use a random permutation. Done well, we don’t
need to keep track of previous names, and the generator won’t repeat until
it’s exhausted all possibilities.

Further, the total number of possible code names, 4028, is suspiciously
shy of 4,096, a power of two (2**12). That makes designing and
implementing an efficient permutation that much easier.

A linear congruential generator

A classic, obvious solution is a linear congruential generator
(LCG). A full-period, 12-bit LCG is nothing more than a permutation of the
numbers 0 to 4,095. When generating names, we can skip over the extra 68
values and pretend it’s a permutation of 4,028 elements. An LCG is
constructed like so:

f(n) = (f(n-1)*A + C) % M


Typically the seed is used for f(0). M is selected based on the problem
space or implementation efficiency, and usually a power of two. In this
case it will be 4,096. Then there are some rules for choosing A and C.

Simply choosing a random f(0) per game isn’t great. The code name order
will always be the same, and we’re only choosing where in the cycle to
start. It would be better to vary the permutation itself, which we can do
by also choosing unique A and C constants per game.

Choosing C is easy: It must be relatively prime with M, i.e. it must be
odd. Since it’s addition modulo M, there’s no reason to choose C >= M
since the results are identical to a smaller C. If we think of C as a
12-bit integer, 1 bit is locked in, and the other 11 bits are free to
vary:

xxxxxxxxxxx1


Choosing A is more complicated: it must be odd, A-1 must be divisible by 4,
and A-1 should not be divisible by 8 (better results). Again, thinking of
this in terms of a 12-bit number, this locks in 3 bits and leaves 9 bits
free:

xxxxxxxxx101


This ensures all the must and should properties of A.
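
The "must" rules can be checked by brute force since there are only 4,096 states. The sketch below walks a 12-bit LCG from zero and reports whether it visits every value before repeating. The function name lcg_full_period is hypothetical, and it only tests period length, not the quality difference behind the "should" rule.

```c
#include <assert.h>

// Returns 1 if the 12-bit LCG with multiplier a and increment c visits
// every value in 0..4095 exactly once before returning to the start.
int lcg_full_period(long a, long c)
{
    assert(a >= 0 && a < 4096 && c >= 0 && c < 4096);
    unsigned char seen[4096] = {0};
    long s = 0;
    for (int i = 0; i < 4096; i++) {
        if (seen[s]) {
            return 0;  // cycled early
        }
        seen[s] = 1;
        s = (s*a + c) & 0xfff;
    }
    return s == 0;
}
```

Any A built as (k << 3 | 5) with an odd C passes, while for example A = 3 or C = 2 does not.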

Finally 0 <= f(0) < M. Because of modular arithmetic, larger values are
redundant, and all possible values are valid since the LCG, being
full-period, will cycle through all of them. This is just choosing the
starting point in a particular permutation cycle. As a 12-bit number, all
12 bits are free:

xxxxxxxxxxxx


That’s 9 + 11 + 12 = 32 free bits to fill randomly: again, how
incredibly convenient! Every 32-bit integer defines some unique code name
permutation… almost. Any 32-bit descriptor where f(0) >= 4028 will
collide with at least one other due to skipping, and so around 1.7% of the
state space is redundant. A small loss that should shrink with slightly
better word list planning. I don’t think anyone will notice.

Slice and dice

I love compact state machines, and this is an opportunity to put one
to good use. My code name generator will be just one function:

uint32_t codename(uint32_t state, char *buf);


This takes one of those 32-bit permutation descriptors, writes the first
code name to buf, and returns a descriptor for another permutation that
starts with the next name. All we have to do is keep track of that 32-bit
number and we’ll never need to worry about repeating code names until all
have been exhausted.

First, let’s extract A, C, and f(0), which I’m calling S. The low bits
are A, middle bits are C, and high bits are S. Note the OR with 1 and 5 to
lock in the hard-set bits.

long a = (state <<  3 | 5) & 0xfff;  //  9 bits
long c = (state >>  8 | 1) & 0xfff;  // 11 bits
long s =  state >> 20;               // 12 bits


Next iterate the LCG until we have a number in range:

do {
    s = (s*a + c) & 0xfff;
} while (s >= 4028);


Once we have an appropriate LCG state, compute the adjective/noun indexes
and build a code name:

int i = s % 53;
int j = s / 53;
sprintf(buf, "%s %s", adjvs[i], nouns[j]);


Finally assemble the next 32-bit state. Since A and C don’t change, these
are passed through while the old S is masked out and replaced with the new
S.

return (state & 0xfffff) | (uint32_t)s<<20;


Putting it all together:

static const char *adjvs[] = { /* ... */ };
static const char *nouns[] = { /* ... */ };

uint32_t codename(uint32_t state, char *buf)
{
    long a = (state <<  3 | 5) & 0xfff;  //  9 bits
    long c = (state >>  8 | 1) & 0xfff;  // 11 bits
    long s =  state >> 20;               // 12 bits

    do {
        s = (s*a + c) & 0xfff;
    } while (s >= COUNTOF(adjvs)*COUNTOF(nouns));

    int i = s % COUNTOF(adjvs);
    int j = s / COUNTOF(adjvs);
    sprintf(buf, "%s %s", adjvs[i], nouns[j]);
    return (state & 0xfffff) | (uint32_t)s<<20;
}


The caller just needs to generate an initial 32-bit integer. Any 32-bit
integer is valid — even zero — so this could just be, say, the unix epoch
(time(2)), but adjacent values will have similar-ish permutations. I
intentionally placed S in the high bits, which are least likely to vary,
since it only affects where the cycle begins, while A and C have a much
more dramatic impact and so are placed at more variable locations.

Regardless, it would be better to hash such an input so that adjacent time
values map to distant states. It also helps hide poorer (less random)
choices for A multipliers. I happen to have designed some great functions
for exactly this purpose. Here’s one of my best:

static uint32_t
hash32(uint32_t x)
{
    x += 0x3243f6a8U; x ^= x >> 15;
    x *= 0xd168aaadU; x ^= x >> 15;
    x *= 0xaf723597U; x ^= x >> 15;
    return x;
}


This would be perfectly reasonable for generating all possible names in a
random order:

uint32_t state = hash32(time(0));
for (int i = 0; i < 4028; i++) {
    char buf[32];
    state = codename(state, buf);
    puts(buf);
}
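
The no-repeat claim can be tested directly without any word lists. This sketch uses an index-only variant of the function, where codename_index and never_repeats are hypothetical names and 4028 is hard-coded in place of the COUNTOF product, and checks that 4,028 consecutive calls produce 4,028 distinct indexes.

```c
#include <assert.h>
#include <stdint.h>

// Index-only variant of codename(): stores the name index (0..4027) in
// *index instead of formatting a string, and returns the next state.
uint32_t codename_index(uint32_t state, int *index)
{
    long a = (state <<  3 | 5) & 0xfff;  //  9 bits
    long c = (state >>  8 | 1) & 0xfff;  // 11 bits
    long s =  state >> 20;               // 12 bits

    do {
        s = (s*a + c) & 0xfff;
    } while (s >= 4028);

    *index = (int)s;
    return (state & 0xfffff) | (uint32_t)s<<20;
}

// Returns 1 if 4028 consecutive calls yield 4028 distinct indexes.
int never_repeats(uint32_t state)
{
    unsigned char seen[4028] = {0};
    for (int i = 0; i < 4028; i++) {
        int index;
        state = codename_index(state, &index);
        assert(index >= 0 && index < 4028);
        if (seen[index]) {
            return 0;
        }
        seen[index] = 1;
    }
    return 1;
}
```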


To further help cover up poorer A multipliers, it’s better for the word
list to be pre-shuffled in its static storage. If that underlying order
happens to show through, at least it will be less obvious (i.e. not in
alphabetical order). Shuffling the string list in my source is just a few
keystrokes in Vim, so this is easy enough.

Robustness

If you’re set on making the codename function easier to use such that
consumers don’t need to think about hashes, you could “encode” and
“decode” the descriptor going in and out of the function:

uint32_t codename(uint32_t state, char *buf)
{
    state += 0x3243f6a8U; state ^= state >> 17;
    state *= 0x9e485565U; state ^= state >> 16;
    state *= 0xef1d6b47U; state ^= state >> 16;

    // ...

    state = (state & 0xfffff) | (uint32_t)s<<20;
    state ^= state >> 16; state *= 0xeb00ce77U;
    state ^= state >> 16; state *= 0x88ccd46dU;
    state ^= state >> 17; state -= 0x3243f6a8U;
    return state;
}


This permutes the state coming in, and reverses that permutation on the
way out (read: inverse hash). This breaks up similar starting points.
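
Since a mistake in either half would silently corrupt the state, it is worth checking that the two halves really are inverses. The sketch below pulls them out into standalone functions, with hash_in, hash_out, and check_round_trip being names invented here, and spot-checks the round trip in both directions.

```c
#include <assert.h>
#include <stdint.h>

// The permutation applied to the state on the way in.
uint32_t hash_in(uint32_t x)
{
    x += 0x3243f6a8U; x ^= x >> 17;
    x *= 0x9e485565U; x ^= x >> 16;
    x *= 0xef1d6b47U; x ^= x >> 16;
    return x;
}

// Its inverse, applied on the way out: the same steps undone in reverse.
uint32_t hash_out(uint32_t x)
{
    x ^= x >> 16; x *= 0xeb00ce77U;
    x ^= x >> 16; x *= 0x88ccd46dU;
    x ^= x >> 17; x -= 0x3243f6a8U;
    return x;
}

// Spot-check the round trip in both directions over a spread of inputs.
void check_round_trip(void)
{
    uint32_t x = 0;
    for (long i = 0; i < 1000000; i++) {
        assert(hash_out(hash_in(x)) == x);
        assert(hash_in(hash_out(x)) == x);
        x += 0x9e3779b9U;
    }
}
```

It works because each multiplier pair multiplies to 1 modulo 2**32, and an xorshift right by 16 or more is its own inverse on 32 bits.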

A random-access code name permutation

Of course this isn’t the only way to build a permutation. I recently
picked up another trick: Kensler permutation. The key insight
is cycle-walking, allowing for random-access to a permutation of a smaller
domain (e.g. 4,028 elements) through permutation of a larger domain (e.g.
4096 elements).

Here’s such a code name generator built around a bespoke 12-bit
xorshift-multiply permutation. I used 4 “rounds” since xorshift-multiply
is less effective the smaller the permutation.

// Generate the nth code name for this seed.
void codename_n(char *buf, uint32_t seed, int n)
{
    uint32_t i = n;
    do {
        i ^= i >> 6; i ^= seed >>  0; i *= 0x325; i &= 0xfff;
        i ^= i >> 6; i ^= seed >>  8; i *= 0x3f5; i &= 0xfff;
        i ^= i >> 6; i ^= seed >> 16; i *= 0xa89; i &= 0xfff;
        i ^= i >> 6; i ^= seed >> 24; i *= 0x85b; i &= 0xfff;
        i ^= i >> 6;
    } while (i >= COUNTOF(adjvs)*COUNTOF(nouns));

    int a = i % COUNTOF(adjvs);
    int b = i / COUNTOF(adjvs);
    snprintf(buf, 22, "%s %s", adjvs[a], nouns[b]);
}


While this is more flexible, avoids poorer permutations, and doesn’t have
state space collisions, I still have a soft spot for my LCG-based state
machine generator.
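
As with the LCG, the permutation property is cheap to verify exhaustively. This sketch uses an index-only variant of the function above, with the hypothetical names permute_n and is_permutation and 4028 hard-coded, and checks that for a given seed every n maps to a distinct in-range slot.

```c
#include <assert.h>
#include <stdint.h>

// Index-only variant of codename_n(): maps n in 0..4027 to 0..4027.
int permute_n(uint32_t seed, int n)
{
    assert(n >= 0 && n < 4028);
    uint32_t i = n;
    do {
        i ^= i >> 6; i ^= seed >>  0; i *= 0x325; i &= 0xfff;
        i ^= i >> 6; i ^= seed >>  8; i *= 0x3f5; i &= 0xfff;
        i ^= i >> 6; i ^= seed >> 16; i *= 0xa89; i &= 0xfff;
        i ^= i >> 6; i ^= seed >> 24; i *= 0x85b; i &= 0xfff;
        i ^= i >> 6;
    } while (i >= 4028);  // cycle-walk back into range
    return (int)i;
}

// Returns 1 if no two inputs map to the same output for this seed.
int is_permutation(uint32_t seed)
{
    unsigned char seen[4028] = {0};
    for (int n = 0; n < 4028; n++) {
        int i = permute_n(seed, n);
        if (seen[i]) {
            return 0;
        }
        seen[i] = 1;
    }
    return 1;
}
```

Each round is invertible on 12 bits (xorshift, XOR with a constant, multiply by an odd number), which is why cycle-walking always terminates and never collides.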

Source code

You can find the complete, working source code with both generators here:
codename.c. I used real US Secret Service code names for
my word list. Some sample outputs:


  PLASTIC HUMMINGBIRD
  BLACK VENUS
  SILENT SUNBURN
  BRONZE AUTHOR
  FADING MARVEL





Test cross-architecture without leaving home
2021-08-21T23:59:33Z
I like to test my software across different environments, on strange
platforms, and with alternative implementations. Each has its
own quirks and oddities that can shake bugs out earlier. C is particularly
good at this since it has such a wide selection of compilers and runs on
everything. For instance I count at least 7 distinct C compilers in Debian
alone. One advantage of writing portable software is access to a
broader testing environment, and it’s one reason I prefer to target
standards rather than specific platforms.

However, I’ve long struggled with architecture diversity. My work and
testing has been almost entirely on x86, with ARM as a distant second
(Raspberry Pi and friends). Big endian hosts are particularly rare.
However, I recently learned a trick for quickly and conveniently accessing
many different architectures without even leaving my laptop: QEMU User
Emulation. Debian and its derivatives support this very well and
require almost no setup or configuration.



Cross-compilation Example

While there are many options, my main cross-testing architecture has been
PowerPC. It’s 32-bit big endian, while I’m generally working on 64-bit
little endian, which is exactly the sort of mismatch I’m going for. I use
a Debian-supplied cross-compiler and qemu-user tools. The binfmt
support is especially slick, so that’s how I usually use it.

# apt install gcc-powerpc-linux-gnu qemu-user-binfmt


binfmt_misc is a kernel module that teaches Linux how to recognize
arbitrary binary formats. For instance, there’s a Wine binfmt so that
Linux programs can transparently exec(3) Windows .exe binaries. In the
case of QEMU User Mode, binaries for foreign architectures are loaded into
a QEMU virtual machine configured in user mode. In user mode there’s no
guest operating system, and instead the virtual machine translates guest
system calls to the host operating system.

The first package gives me powerpc-linux-gnu-gcc. The prefix is the
architecture tuple describing the instruction set and system ABI.
To try this out, I have a little test program that inspects its execution
environment:

#include <stdio.h>

int main(void)
{
    char *w = "?";
    switch (sizeof(void *)) {
    case 1: w = "8";  break;
    case 2: w = "16"; break;
    case 4: w = "32"; break;
    case 8: w = "64"; break;
    }

    char *b = "?";
    switch (*(char *)(int []){1}) {
    case 0: b = "big";    break;
    case 1: b = "little"; break;
    }

    printf("%s-bit, %s endian\n", w, b);
}


When I run this natively on x86-64:

$ gcc test.c
$ ./a.out
64-bit, little endian


Running it on PowerPC via QEMU:

$ powerpc-linux-gnu-gcc -static test.c
$ ./a.out
32-bit, big endian


Thanks to binfmt, I could execute it as though the PowerPC binary were a
native binary. With just a couple of environment variables in the right
place, I could pretend I’m developing on PowerPC — aside from emulation
performance penalties of course.

However, you might have noticed I pulled a sneaky on ya: -static. So far
what I’ve shown only works with static binaries. There’s no dynamic loader
available to run dynamically-linked binaries. Fortunately this is easy to
fix in two steps. The first step is to install the dynamic linker for
PowerPC:

# apt install libc6-powerpc-cross


The second is to tell QEMU where to find it since, unfortunately, it
cannot currently do so on its own.

$ export QEMU_LD_PREFIX=/usr/powerpc-linux-gnu


Now I can leave out the -static:

$ powerpc-linux-gnu-gcc test.c
$ ./a.out
32-bit, big endian


A practical example: Remember binitools? I’m now ready to run its
fuzz-generated test suite on this cross-testing platform.

$ git clone https://github.com/skeeto/binitools
$ cd binitools/
$ make check CC=powerpc-linux-gnu-gcc
...
PASS: 668/668


Or if I’m going to be running make often:

$ export CC=powerpc-linux-gnu-gcc
$ make -e check


Recall: make’s -e flag gives environment variables precedence over
assignments in the makefile, so I don’t need to pass CC=... on the command
line each time.

When setting up a test suite for your own programs, consider how difficult
it would be to run the tests under customized circumstances like this. The
easier it is to run your tests, the more they’re going to be run. I’ve run
into many projects with such overly-complex test builds that even enabling
sanitizers in the test suite was a pain, let alone cross-architecture
testing.

Dependencies? There might be a way to use Debian’s multiarch support
to install these packages, but I haven’t been able to figure it out. You
likely need to build dependencies yourself using the cross compiler.

Testing with Go

None of this is limited to C (or even C++). I’ve also successfully used
this to test Go libraries and programs cross-architecture. This isn’t
nearly as important since it’s harder to write unportable Go than C — e.g.
dumb pointer tricks are literally labeled “unsafe”. However, Go
(gc) trivializes cross-compilation and is statically compiled, so it’s
incredibly simple. Once you’ve installed qemu-user-binfmt it’s entirely
transparent:

$ GOARCH=mips64 go test


That’s all there is to cross-platform testing. If for some reason binfmt
doesn’t work (WSL) or you don’t want to install it, there’s just one extra
step (package named example):

$ GOARCH=mips64 go test -c
$ qemu-mips64-static example.test


The -c option builds a test binary but doesn’t run it, instead allowing
you to choose where and how to run it.

It even works with cgo — if you’re willing to jump through the same
hoops as with C of course:

package main

// #include <stdint.h>
// uint16_t v = 0x1234;
// char *hi = (char *)&v + 0;
// char *lo = (char *)&v + 1;
import "C"
import "fmt"

func main() {
	fmt.Printf("%02x %02x\n", *C.hi, *C.lo)
}


With go run on x86-64:

$ CGO_ENABLED=1 go run example.go
34 12


Via QEMU User Mode:

$ export CGO_ENABLED=1
$ export GOARCH=mips64
$ export CC=mips64-linux-gnuabi64-gcc
$ export QEMU_LD_PREFIX=/usr/mips64-linux-gnuabi64
$ go run example.go
12 34


I was pleasantly surprised how well this all works.

One dimension

Despite the variety, all these architectures are still “running” the same
operating system, Linux, and so they only vary on one dimension. For most
programs primarily targeting x86-64 Linux, PowerPC Linux is practically
the same thing, while x86-64 OpenBSD is foreign territory despite sharing
an architecture and ABI (System V). Testing across operating
systems still requires spending the time to install, configure, and
maintain these extra hosts. That’s an article for another time.




strcpy: a niche function you don't need
2021-07-30T19:37:48Z
The C strcpy function is a common sight in typical C programs.
It’s also a source of buffer overflow defects, so linters and code
reviewers commonly recommend alternatives such as strncpy
(difficult to use correctly; mismatched semantics), strlcpy
(non-standard, flawed), or C11’s optional strcpy_s (no correct or
practical implementations). Besides their individual shortcomings,
these answers are incorrect. strcpy and friends are, at best, incredibly
niche, and the correct replacement is memcpy.

If strcpy is not easily replaced with memcpy then the code is
fundamentally wrong. Either it’s not using strcpy correctly or it’s
doing something dumb and should be rewritten. Highlighting such problems
is part of what makes memcpy such an effective replacement.

Note: Everything here applies just as much to strcat and
friends.

Clarification update: This article is about correctness (objective), not
safety (subjective). If the word “safety” comes to mind then you’ve missed
the point.

Common cases

Buffer overflows arise when the destination is smaller than the source.
Safe use of strcpy requires a priori knowledge of the source string
length. Usually this knowledge is the exact source string
length. If so, memcpy is not only a trivial substitute, it’s faster
since it will not simultaneously search for a null terminator.

char *my_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *c = malloc(len);
    if (c) {
        strcpy(c, s);  // BAD
    }
    return c;
}

char *my_strdup_v2(const char *s)
{
    size_t len = strlen(s) + 1;
    char *c = malloc(len);
    if (c) {
        memcpy(c, s, len);  // GOOD
    }
    return c;
}


A more benign case is a static source string, i.e. trusted input.

struct err {
    char message[16];
};

void set_oom(struct err *err)
{
    strcpy(err->message, "out of memory");  // BAD
}


The size is a compile time constant, so exploit it as such! Even more, a
static assertion (C11) can catch mistakes at compile time rather than run
time.

void set_oom_v2(struct err *err)
{
    static const char oom[] = "out of memory";
    static_assert(sizeof(err->message) >= sizeof(oom), "message too small");
    memcpy(err->message, oom, sizeof(oom));
}

// Or using a macro:

void set_oom_v3(struct err *err)
{
    #define OOM "out of memory"
    static_assert(sizeof(err->message) >= sizeof(OOM), "message too small");
    memcpy(err->message, OOM, sizeof(OOM));
}

// Or assignment (implicit memcpy):

void set_oom_v4(struct err *err)
{
    static const struct err oom = {"out of memory"};
    *err = oom;
}


This covers the vast majority of cases of already-correct strcpy.

Less common cases

strcpy can still be correct without knowing the exact source string
length. It is enough to know its upper bound does not exceed the
destination length. In this example — assuming the input is guaranteed to
be null-terminated — this strcpy is correct without ever knowing the
source string length:

struct reply {
    char message[32];
    int x, y;
};

struct log {
    time_t timestamp;
    char message[32];
};

void log_reply(struct log *e, const struct reply *r)
{
    e->timestamp = time(0);
    strcpy(e->message, r->message);
}


This is a rare case where strncpy has the right semantics. It zeros out
unused destination bytes, destroying any previous contents.

    strncpy(e->message, r->message, sizeof(e->message));

    // In this case, same as:
    memset(e->message, 0, sizeof(e->message));
    strcpy(e->message, r->message);


It’s not a general strcpy replacement because strncpy might not write
a null terminator. If the source string does not null-terminate within the
destination length, then neither will the destination string.
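
Both strncpy behaviors, the zero padding and the missing terminator, are easy to observe. In this sketch is_terminated and copy_checked are hypothetical helpers written for illustration. The second one reports whether the destination ended up as a valid string.

```c
#include <assert.h>
#include <string.h>

// Returns 1 if buf holds a null terminator within its first n bytes.
int is_terminated(const char *buf, size_t n)
{
    return memchr(buf, 0, n) != NULL;
}

// strncpy followed by an explicit check, since strncpy does not guarantee
// termination. Returns 1 if dst is a valid string, 0 if it was truncated.
int copy_checked(char *dst, size_t dstlen, const char *src)
{
    assert(dstlen > 0);
    strncpy(dst, src, dstlen);
    return is_terminated(dst, dstlen);
}
```

Copying "abcdef" into a 4-byte buffer returns 0 and leaves "abcd" with no terminator, while copying "ab" into an 8-byte buffer returns 1 and zeros the remaining six bytes.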

As before, we can do better with memcpy!

    static_assert(sizeof(e->message) >= sizeof(r->message), "log too small");
    memcpy(e->message, r->message, sizeof(r->message));


This unconditionally copies 32 bytes. But doesn’t it waste time copying
bytes it won’t need? No! On modern hardware it’s far better to copy a
large, fixed number of bytes than a small, variable number of bytes. After
all, branching is expensive. Searching for and handling that null
terminator has a cost. This fixed-size copy is literally two instructions
on x86-64 (output of clang -march=x86-64-v3 -O3):

vmovups  ymm0, [rsi]
vmovups  [rdi + 8], ymm0


It’s faster and there’s no strcpy to attract complaints.

Niche cases

So where is strcpy useful? Only where all of the following apply:


  
    You only know the upper bound of the source string length.
  
  
    It’s undesirable to read beyond that length. Maybe storage is limited
to the exact length of the string, or the upper bound is very large so
an unconditional copy is too expensive.
  
  
    The source string is so long, and the function so hot, that it’s worth
avoiding two passes: strlen followed by memcpy.
  


These circumstances are very unusual which makes strcpy a niche function
you probably don’t need. This is the best case I can imagine, and it’s
pretty dumb:

struct doc {
    unsigned long long id;
    char body[1L<<20];
};

// Create a new document from a buffer.
//
// If body is more than 1MiB, the behavior is undefined.
struct doc *doc_create(const char *body)
{
    struct doc *c = calloc(1, sizeof(*c));
    if (c) {
        c->id = id_gen();
        assert(strlen(body) < sizeof(c->body));
        strcpy(c->body, body);
    }
    return c;
}


If you’re dealing with such large null-terminated strings that (2) and (3)
apply then you’re already doing something fundamentally wrong and
self-contradictory. The pointer and length should be kept and passed
together. It’s especially essential for a hot function.

struct doc_v2 {
    unsigned long long id;
    size_t len;
    char body[];
};
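
A constructor for this layout might look like the following sketch. The name doc_v2_create and its signature are hypothetical, and the id is passed in rather than generated so that the example stands alone. Since the caller supplies the length, there is no strlen and no strcpy, just one allocation and one memcpy:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct doc_v2 {
    unsigned long long id;
    size_t len;
    char body[];  // flexible array member, exactly len bytes
};

// Hypothetical constructor: the caller passes pointer and length
// together. Returns null on allocation failure or size overflow.
struct doc_v2 *doc_v2_create(unsigned long long id,
                             const char *body, size_t len)
{
    if (len > SIZE_MAX - sizeof(struct doc_v2)) {
        return 0;  // header plus body would overflow size_t
    }
    struct doc_v2 *d = malloc(sizeof(*d) + len);
    if (d) {
        d->id = id;
        d->len = len;
        if (len) {
            memcpy(d->body, body, len);
        }
    }
    return d;
}
```

Note that the body is not null-terminated here. The length field makes the terminator unnecessary.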


Bonus: *_s isn’t helping you

C11 introduced “safe” string functions as an optional “Annex K”, each
named by adding a _s suffix to its “unsafe” counterpart. Here is the
prototype for strcpy_s:

errno_t strcpy_s(char *restrict s1,
                 rsize_t s1max,
                 const char *restrict s2);


The rsize_t is a size_t with a “restricted” range (RSIZE_MAX,
probably SIZE_MAX/2) intended to catch integer underflows. If you
accidentally compute a negative length, it will be a very large
number in unsigned form. (An indicator that size_t should have
originally been defined as signed.) This will be outside the
restricted range, and so the operation isn’t attempted due to a likely
underflow.
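
The range check is easy to sketch. To be clear, this is not Annex K's strcpy_s nor any real implementation of it: the function name, the return convention, and the SKETCH_RSIZE_MAX constant are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Assumed limit, mirroring the "probably SIZE_MAX/2" guess above.
#define SKETCH_RSIZE_MAX (SIZE_MAX >> 1)

// Hypothetical bounded copy. Returns 0 on success, nonzero if the
// arguments are rejected or the source doesn't fit.
int sketch_strcpy(char *dst, size_t dstmax, const char *src)
{
    if (!dst || !src || dstmax == 0 || dstmax > SKETCH_RSIZE_MAX) {
        return 1;  // a "negative" size lands here and nothing is written
    }
    size_t len = strlen(src);
    if (len >= dstmax) {
        dst[0] = 0;  // leave an empty string rather than a partial copy
        return 1;
    }
    memcpy(dst, src, len + 1);
    return 0;
}
```

Passing (size_t)-1 as the destination size produces the largest possible size_t, which exceeds the limit, and so the copy is refused before touching the destination.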

These “safe” functions were modeled after functions of the same name in
MSVC. However, as noted, there are no practical implementations of Annex
K. The functions in MSVC have different semantics and behavior, and they
do not attempt to implement the standard.

Worse, they don’t even do what’s promised in their documentation.
The following program should cause a runtime-constraint violation since
-1 is an invalid rsize_t in any reasonable implementation:

#define __STDC_WANT_LIB_EXT1__ 1
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[8] = {0};
    errno_t r = strcpy_s(buf, -1, "hello");
    printf("%d %s\n", (int)r, buf);
}


With the latest MSVC as of this writing (VS 2019), this program prints “0
hello”. Using strcpy_s did not make my program any safer than had I
just used strcpy. If anything, it’s less safe due to a false sense of
security. Don’t use these functions.




More DLL fun with w64devkit: Go, assembly, and Python
2021-06-29T21:50:30Z
My previous article explained how to work with dynamic-link libraries
(DLLs) using w64devkit. These techniques also apply to other
circumstances, including with languages and ecosystems outside of C and
C++. In particular, w64devkit is a great complement to Go and reliably
fulfills all the needs of cgo (Go’s C interop) and can even
bootstrap Go itself. As before, this article is in large part an exercise
in capturing practical information I’ve picked up over time.

Go: bootstrap and cgo

The primary Go implementation, confusingly named “gc”, is an
incredible piece of software engineering. This is apparent when
building the Go toolchain itself, a process that is fast, reliable, easy,
and simple. It was originally written in C, but was re-written in Go
starting with Go 1.5. The C compiler in w64devkit can build the original C
implementation which then can be used to bootstrap any more recent
version. It’s so easy that I personally never use official binary releases
and always bootstrap from source.

You will need the Go 1.4 source, go1.4-bootstrap-20171003.tar.gz.
This “bootstrap” tarball is the last Go 1.4 release plus a few additional
bugfixes. You will also need the source of the actual version of Go you
want to use, such as Go 1.16.5 (latest version as of this writing).

Start by building Go 1.4 using w64devkit. On Windows, Go is built using a
batch script and no special build system is needed. Since it shouldn’t be
invoked with the BusyBox ash shell, I use cmd.exe explicitly.

$ tar xf go1.4-bootstrap-20171003.tar.gz
$ mv go/ bootstrap
$ (cd bootstrap/src/ && cmd /c make)


In about 30 seconds you’ll have a fully-working Go 1.4 toolchain. Next use
it to build the desired toolchain. You can move this new toolchain after
it’s built if necessary.

$ export GOROOT_BOOTSTRAP="$PWD/bootstrap"
$ tar xf go1.16.5.src.tar.gz
$ (cd go/src/ && cmd /c make)


At this point you can delete the bootstrap toolchain. You probably also
want to put Go on your PATH.

$ rm -rf bootstrap/
$ printf 'PATH="$PATH;%s/go/bin"\n' "$PWD" >>~/.profile
$ source ~/.profile


Not only is Go now available, so is the full power of cgo. (Including its
costs if used.)

Vim suggestions

Since w64devkit is oriented so much around Vim, here’s my personal Vim
configuration for Go. I don’t need or want fancy plugins, just access to
goimports and a couple of corrections to Vim’s built-in Go support ([[
and ]] navigation). The included ctags understands Go, so tags
navigation works the same as it does with C. \i saves the current
buffer, runs goimports, and populates the quickfix list with any errors.
Similarly :make invokes go build and, as expected, populates the
quickfix list.

autocmd FileType go setlocal makeprg=go\ build
autocmd FileType go map <silent> <buffer> <leader>i
    \ :update \|
    \ :cexpr system("goimports -w " . expand("%")) \|
    \ :silent edit<cr>
autocmd FileType go map <buffer> [[
    \ ?^\(func\\|var\\|type\\|import\\|package\)\><cr>
autocmd FileType go map <buffer> ]]
    \ /^\(func\\|var\\|type\\|import\\|package\)\><cr>


Go only comes with gofmt but goimports is just one command away, so
there’s little excuse not to have it:

$ go install golang.org/x/tools/cmd/goimports@latest


Thanks to GOPROXY, all Go dependencies are accessible without (or before)
installing Git, so this tool installation works with nothing more than
w64devkit and a bootstrapped Go toolchain.

cgo DLLs

The intricacies of cgo are beyond the scope of this article, but the gist
is that a Go source file contains C source in a comment followed by
import "C". The imported C object provides access to C types and
functions. Go functions marked with an //export comment, as well as the
commented C code, are accessible to C. The latter means we can use Go to
implement a C interface in a DLL, and the caller will have no idea they’re
actually talking to Go.

To illustrate, here’s a little C interface. To keep it simple, I’ve
specifically sidestepped some more complicated issues, particularly
involving memory management.

// Which DLL am I running?
int version(void);

// Generate 64 bits from a CSPRNG.
unsigned long long rand64(void);

// Compute the Euclidean norm.
float dist(float x, float y);


Here’s a C implementation which I’m calling “version 1”.

#include <math.h>
#include <windows.h>
#include <ntsecapi.h>

__declspec(dllexport)
int
version(void)
{
    return 1;
}

__declspec(dllexport)
unsigned long long
rand64(void)
{
    unsigned long long x;
    RtlGenRandom(&x, sizeof(x));
    return x;
}

__declspec(dllexport)
float
dist(float x, float y)
{
    return sqrtf(x*x + y*y);
}


As discussed in the previous article, each function is exported using
__declspec so that they’re available for import. As before:

$ cc -shared -Os -s -o hello1.dll hello1.c


Side note: This could be trivially converted into a C++ implementation
just by adding extern "C" to each declaration. It disables C++ features
like name mangling, and follows the C ABI so that the C++ functions appear
as C functions. Compiling the C++ DLL is exactly the same.

Suppose we wanted to implement this in Go instead of C. We already have
all the tools needed to do so. Here’s a Go implementation, “version 2”:

package main

import "C"
import (
	"crypto/rand"
	"encoding/binary"
	"math"
)

//export version
func version() C.int {
	return 2
}

//export rand64
func rand64() C.ulonglong {
	var buf [8]byte
	rand.Read(buf[:])
	r := binary.LittleEndian.Uint64(buf[:])
	return C.ulonglong(r)
}

//export dist
func dist(x, y C.float) C.float {
	return C.float(math.Sqrt(float64(x*x + y*y)))
}

func main() {
}


Note the use of C types for all arguments and return values. The main
function is required since this is the main package, but it will never be
called. The DLL is built like so:

$ go build -buildmode=c-shared -o hello2.dll hello2.go


Without the -o option, the DLL will lack an extension. This works fine
since it’s mostly only convention on Windows, but it may be confusing
without it.

What if we need an import library? This will be required when linking with
the MSVC toolchain. In the previous article we asked Binutils to generate
one using --out-implib. For Go we have to handle this ourselves via
gendef and dlltool.

$ gendef hello2.dll
$ dlltool -l hello2.lib -d hello2.def


The only way anyone upgrading would know version 2 was implemented in Go
is that the DLL is a lot bigger (a few MB vs. a few kB) since it now
contains an entire Go runtime.

NASM assembly DLL

We could also go the other direction and implement the DLL using plain
assembly. It won’t even require linking against a C runtime.

w64devkit includes two assemblers: GAS (Binutils) which is used by GCC,
and NASM which has friendlier syntax. I prefer the latter whenever
possible — exactly why I included NASM in the distribution. So here’s how
I implemented “version 3” in NASM assembly.

bits 64

section .text

global DllMainCRTStartup
export DllMainCRTStartup
DllMainCRTStartup:
	mov eax, 1
	ret

global version
export version
version:
	mov eax, 3
	ret

global rand64
export rand64
rand64:
.retry:
	rdrand rax
	jnc .retry	; CF=0 means no random value was ready, so try again
	ret

global dist
export dist
dist:
	mulss  xmm0, xmm0
	mulss  xmm1, xmm1
	addss  xmm0, xmm1
	sqrtss xmm0, xmm0
	ret


The global directive is common in NASM assembly and causes the named
symbol to have the external linkage needed when linking the DLL. The
export directive is Windows-specific and is equivalent to dllexport in
C.

Every DLL must have an entrypoint, usually named DllMainCRTStartup. The
return value indicates if the DLL successfully loaded. So far this has
been handled automatically by the C implementation, but at this low level
we must define it explicitly.

Here’s how to assemble and link the DLL:

$ nasm -fwin64 -o hello3.o hello3.s
$ ld -shared -s -o hello3.dll hello3.o


Call the DLLs from Python

Python has a nice, built-in C interop, ctypes, that allows Python to
call arbitrary C functions in shared libraries, including DLLs, without
writing C to glue it together. To tie this all off, here’s a Python
program that loads all of the DLLs above and invokes each of the
functions:

import ctypes

def load(version):
    hello = ctypes.CDLL(f"./hello{version}.dll")
    hello.version.restype = ctypes.c_int
    hello.version.argtypes = ()
    hello.dist.restype = ctypes.c_float
    hello.dist.argtypes = (ctypes.c_float, ctypes.c_float)
    hello.rand64.restype = ctypes.c_ulonglong
    hello.rand64.argtypes = ()
    return hello

for hello in load(1), load(2), load(3):
    print("version", hello.version())
    print("rand   ", f"{hello.rand64():016x}")
    print("dist   ", hello.dist(3, 4))


After loading the DLL with CDLL the program defines each function
prototype so that Python knows how to call it. Unfortunately it’s not
possible to build Python with w64devkit, so you’ll also need to install
the standard CPython distribution in order to run it. Here’s the output:

$ python finale.py
version 1
rand    b011ea9bdbde4bdf
dist    5.0
version 2
rand    f7c86ff06ae3d1a2
dist    5.0
version 3
rand    2a35a05b0482c898
dist    5.0


That output is the result of four different languages interfacing in one
process: C, Go, x86-64 assembly, and Python. Pretty neat if you ask me!




How to build and use DLLs on Windows
2021-05-31T02:13:40Z
I’ve recently been involved with a couple of discussions about Windows’
dynamic linking. One was with Joe Nelson, who was considering how to make
libderp accessible on Windows, and the other was about w64devkit,
my Mingw-w64 distribution. I use these techniques so infrequently that I
need to figure it all out again each time I need it. Unfortunately there’s
a whole lot of outdated and incorrect information online which gets in the
way every time this happens. While it’s all fresh in my head, I will now
document what I know works.

In this article, all commands and examples are being run in the context of
w64devkit (1.8.0).

Mingw-w64

If all you care about is the GNU toolchain then DLLs are straightforward,
working mostly like shared objects on other platforms. To illustrate,
let’s build a “square” library with one “exported” function, square,
that returns the square of its input (square.c):

long square(long x)
{
    return x * x;
}


The header file (square.h):

#ifndef SQUARE_H
#define SQUARE_H

long square(long);

#endif


To build a stripped, size-optimized DLL, square.dll:

$ cc -shared -Os -s -o square.dll square.c


Now a test program to link against it (main.c), which “imports” square
from square.dll:

#include <stdio.h>
#include "square.h"

int main(void)
{
    printf("%ld\n", square(2));
}


Linking and testing it:

$ cc -Os -s main.c square.dll
$ ./a
4


It’s that simple. Or more traditionally, using the -l flag:

$ cc -Os -s -L. main.c -lsquare


Given -lxyz GCC will look for xyz.dll in the library path.

Viewing exported symbols

Printing a list of the functions exported by a DLL is not so
straightforward. For ELF shared objects there’s nm -D, but despite what
the internet will tell you, this tool does not support DLLs. objdump
will print the exports as part of the “private” headers (-p). A bit of
awk can cut this down to just a list of exports. Since we’ll need this a
few times, here’s a script, exports.sh, that composes objdump and
awk into the tool I want:

#!/bin/sh
set -e
printf 'LIBRARY %s\nEXPORTS\n' "$1"
objdump -p "$1" | awk '/^$/{t=0} {if(t)print$NF} /^\[O/{t=1}'


Running this on square.dll above:

$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
square


This can be helpful when debugging. It also works outside of Windows, such
as on Linux. By the way, the output format is no accident: This is the
.def file format, which will be particularly
useful in a moment.

Mingw-w64 has a gendef tool to produce the above output, and this tool
is now included in w64devkit. To print the exports to standard output:

$ gendef - square.dll
LIBRARY "square.dll"
EXPORTS
square


Alternatively Visual Studio provides dumpbin. It’s not as concise as
exports.sh but it’s a lot less verbose than objdump -p.

$ dumpbin /nologo /exports square.dll
...
          1    0 000012B0 square
...


Mingw-w64 (improved)

You can get by without knowing anything more, which is usually enough for
those looking to support Windows as a secondary platform, even just as a
cross-compilation target. However, with a bit more work we can do better.
Imagine doing the above with a non-trivial program. GCC doesn’t know which
functions are part of the API and which are not. Obviously static
functions should not be exported, but what about non-static functions
visible between translation units (i.e. object files)?

For instance, suppose square.c also has this function which is not part
of its API but may be called by another translation unit.

void internal_func(void) {}


Now when I build:

$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
internal_func
square


On the other side, when I build main.c how does it know which functions
are imported from a DLL and which will be found in another translation
unit? GCC makes it work regardless, but it can generate more efficient
code if it knows at compile time (vs. link time).

On Windows both are solved by adding __declspec notation on both sides.
In square.c the exports are marked as dllexport:

__declspec(dllexport)
long square(long x)
{
    return x * x;
}

void internal_func(void) {}


In the header, it’s marked as an import:

__declspec(dllimport)
long square(long);


The mere presence of dllexport tells the linker to only export those
functions marked as exports, and so internal_func disappears from the
exports list. Convenient!

On the import side, during compilation of the original program, GCC
assumed square wasn’t an import and generated a local function call.
When the linker later resolved the symbol to the DLL, it generated a
trampoline to fill in as that local function (like a PLT). With
dllimport, GCC knows it’s an imported function and so doesn’t go through
a trampoline.

While generally unnecessary for the GNU toolchain, it’s good hygiene to
use __declspec. It’s also mandatory when using MSVC, in case you
care about that as well.

MSVC

Mingw-w64-compiled DLLs will work with LoadLibrary out of the box, which
is sufficient in many cases, such as for dynamically-loaded plugins. For
example (loadlib.c):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    HMODULE h = LoadLibrary("square.dll");
    long (*square)(long) = (long (*)(long))GetProcAddress(h, "square");
    printf("%ld\n", square(2));
}


Compiled with MSVC cl (via vcvars.bat):

$ cl /nologo loadlib.c
$ ./loadlib
4


However, the MSVC linker, unlike Binutils ld, cannot link directly with
DLLs. It requires an import library. Conventionally this matches the DLL
name but has a .lib extension — square.lib in this case. The Mingw-w64
ecosystem conventionally uses .dll.a, as in square.dll.a, in order to
distinguish it from a static library, but it’s the same format. The most
convenient way to get an import library is to ask GCC to generate one at
link-time via --out-implib:

$ cc -shared -Wl,--out-implib,square.lib -o square.dll square.c


Back to cl, just add square.lib as another input. You don’t actually
need square.dll present at link time.

$ cl /nologo /Os main.c square.lib
$ ./main
4


What if you already have the DLL and you just need an import library? GNU
Binutils’ dlltool can do this, though not without help. It cannot
generate an import library from a DLL alone since it requires a .def
file enumerating the exports. (Why?) What luck that we have a tool for
this!

$ ./exports.sh square.dll >square.def
$ dlltool --input-def square.def --output-lib square.lib


Reversing directions

Going the other way, building a DLL with MSVC and linking it with
Mingw-w64, is nearly as easy as the pure Mingw-w64 case, though it
requires that all exports are tagged with dllexport. The /LD option
(case-sensitive) is just like GCC’s -shared.

$ cl /nologo /LD /Os square.c
$ cc -Os -s main.c square.dll
$ ./a
4


cl outputs three files: square.dll, square.lib, and square.exp.
The last can be discarded, and the second will be needed if linking with
MSVC, but as before, Mingw-w64 requires only the first.

This all demonstrates that Mingw-w64 and MSVC are quite interoperable — at
least for C interfaces that don’t share CRT objects.

Tying it all together

If your program is designed to be portable, those __declspec will get in
the way. That can be tidied up with some macros, but even better, those
macros can be used to control ELF symbol visibility so that the library
has good hygiene on, say, Linux as well.

The strategy will be to mark all API functions with SQUARE_API and
expand that to whatever is necessary at the time. When building a library,
it will expand to dllexport, or default visibility on unix-likes. When
consuming a library it will expand to dllimport, or nothing outside of
Windows. The new square.h:

#ifndef SQUARE_H
#define SQUARE_H

#if defined(SQUARE_BUILD)
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllexport)
#  elif defined(__ELF__)
#    define SQUARE_API __attribute__ ((visibility ("default")))
#  else
#    define SQUARE_API
#  endif
#else
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllimport)
#  else
#    define SQUARE_API
#  endif
#endif

SQUARE_API
long square(long);

#endif


The new square.c:

#define SQUARE_BUILD
#include "square.h"

SQUARE_API
long square(long x)
{
    return x * x;
}


main.c remains the same. When compiling on unix-like systems, add
-fvisibility=hidden to hide all symbols by default so that this macro
can reveal them.

$ cc -shared -Os -fvisibility=hidden -s -o libsquare.so square.c
$ cc -Os -s main.c ./libsquare.so
$ ./a.out
4


Makefile ideas

While Mingw-w64 hides a lot of the differences between Windows and
unix-like systems, when it comes to dynamic libraries it can only do so
much, especially if you care about import libraries. If I were maintaining
a dynamic library — unlikely since I strongly prefer embedding or static
linking — I’d probably just use different Makefiles per toolchain
and target. Aside from the SQUARE_API type of macros, the source code
can fortunately remain fairly agnostic about it.

Here’s what I might use as NMakefile for MSVC nmake:

CC     = cl /nologo
CFLAGS = /Os

all: main.exe square.dll square.lib

main.exe: main.c square.h square.lib
	$(CC) $(CFLAGS) main.c square.lib

square.dll: square.c square.h
	$(CC) /LD $(CFLAGS) square.c

square.lib: square.dll

clean:
	-del /f main.exe square.dll square.lib square.exp


Usage:

nmake /nologo /f NMakefile


For w64devkit and cross-compiling, Makefile.w64, which includes
import library generation for the sake of MSVC consumers:

CC      = cc
CFLAGS  = -Os
LDFLAGS = -s
LDLIBS  =

all: main.exe square.dll square.lib

main.exe: main.c square.dll square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c square.dll $(LDLIBS)

square.dll: square.c square.h
	$(CC) -shared -Wl,--out-implib,$(@:dll=lib) \
	    $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

square.lib: square.dll

clean:
	rm -f main.exe square.dll square.lib


Usage:

make -f Makefile.w64


And a Makefile for everyone else:

CC      = cc
CFLAGS  = -Os -fvisibility=hidden
LDFLAGS = -s
LDLIBS  =

all: main libsquare.so

main: main.c libsquare.so square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c ./libsquare.so $(LDLIBS)

libsquare.so: square.c square.h
	$(CC) -shared $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

clean:
	rm -f main libsquare.so


Now that I have this article, I’m glad I won’t have to figure this all out
again next time I need it!




A guide to Windows application development using w64devkit
2021-03-11T01:40:31Z
There’s a trend of building services where a monolithic application is
better suited, or using JavaScript and Python then being stumped by their
troublesome deployment story. This leads to solutions like bundling an
entire web browser with an application, or using containers to
circumscribe a sprawling dependency tree made of mystery meat.

My small development distribution for Windows, w64devkit,
is my own little way of pushing back against this trend where it affects
me most. Following in the footsteps of projects like Handmade Hero
and Making a Video Game from Scratch, this is my guide to
no-nonsense software development using my development kit. It’s an
overview of the tooling and development workflow, and I’ve tried not to
assume too much knowledge of the reader. Being a guide rather than a manual,
it is incomplete on its own, and I link to substantial external resources
to fill in the gaps. The guide is capped with a small game I wrote
entirely using my development kit, serving as a demonstration of what
sorts of things are not only possible, but quite reasonably attainable.






Game repository: https://github.com/skeeto/asteroids-demo

Guide to source: Understanding Asteroids

Initial setup

Of course you cannot use the development kit if you don’t have it yet. Go
to the releases section and download the latest release. It will be
a .zip file named w64devkit-x.y.z.zip where x.y.z is the version.

You will need to unzip the development kit before using it. Windows has
built-in support for .zip files, so you can either right-click to access
“Extract All…” or navigate into it as a folder then drag-and-drop the
w64devkit directory somewhere outside the .zip file. It doesn’t care
where it’s unzipped (aka it’s “portable”), so put it wherever is
convenient: your desktop, user profile directory, a thumb drive, etc. You
can move it later if you change your mind just so long as you’re not
actively running it. If you decide you don’t need it anymore then delete
it.

Entering the development environment

There is a w64devkit.exe in the unzipped w64devkit directory. This is
the easiest way to enter the development environment, and will not require
system configuration changes. This program puts the kit’s programs in the
PATH environment variable then runs a Bourne shell — the standard unix
shell. Aside from the text editor, this is the primary interface for
developing software. In time you may even extend this environment with
your own tools.

If you want an additional “terminal” window, run w64devkit.exe again. If
you use it a lot, you may want to create a shortcut and even pin it to
your task bar.

Whether on Windows or unix-like systems, when you type a command into the
system shell it uses the PATH environment variable to locate the actual
program to run for that command. In practice, the PATH variable is a
concatenation of multiple directories, and the shell searches these
directories in order. On unix-like systems, PATH elements are separated
by colons. However, Windows uses colons to delimit drive letters, so its
PATH elements are separated by semicolons.

# Prepending to PATH on unix
PATH="$HOME/bin:$PATH"

# Prepending to PATH on Windows (w64devkit)
PATH="$HOME/bin;$PATH"
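
To see why the separator matters, here is a small C sketch of how a program might pick apart a PATH-style list. The function path_element is hypothetical and written just for this illustration. It is not how any particular shell implements its search.

```c
#include <stddef.h>
#include <string.h>

// Copy the n-th (zero-based) element of a PATH-style list into out.
// sep is ':' on unix-like systems and ';' on Windows.
// Returns 1 on success, 0 if there is no such element or it won't fit.
int path_element(const char *path, char sep, int n,
                 char *out, size_t outlen)
{
    for (;;) {
        const char *end = strchr(path, sep);
        size_t len = end ? (size_t)(end - path) : strlen(path);
        if (n-- == 0) {
            if (len >= outlen) {
                return 0;
            }
            memcpy(out, path, len);
            out[len] = 0;
            return 1;
        }
        if (!end) {
            return 0;  // ran out of elements
        }
        path = end + 1;
    }
}
```

Splitting "C:/w64devkit/bin;C:/Windows" on semicolons yields the two directories as expected. Splitting the same string on colons yields "C" as the first element, which is exactly the drive letter conflict that led Windows to use semicolons.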


For more advanced users: Rather than use w64devkit.exe, you could “Edit
environment variables for your account” and manually add w64devkit’s bin
directory to your PATH, making the tools generally available everywhere
on your system. If you’ve gone this route, you can start a Bourne shell at
any time with sh -l. (The -l option requests a login shell.)

Also borrowed from the unix world is the concept of a home directory,
specified by the HOME environment variable. By default this will be your
user profile directory, typically C:/Users/$USER. Login shells always
start in the home directory. This directory is often indicated by tilde
(~), and many programs automatically expand a leading tilde to the home
directory.
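
Tilde expansion is simple enough to sketch in a few lines of C. The function expand_tilde below is hypothetical, and it takes the home directory as a parameter so that it stands alone. A real program would likely pass the value of getenv("HOME"). It only handles a bare leading tilde, not the ~user form.

```c
#include <stddef.h>
#include <string.h>

// Expand a leading "~" or "~/" in path using the given home directory.
// Anything else, including "~user", is copied through unchanged.
// Returns 1 on success, 0 if the result doesn't fit in out.
int expand_tilde(const char *path, const char *home,
                 char *out, size_t outlen)
{
    const char *prefix = "";
    if (path[0] == '~' && (path[1] == '/' || path[1] == 0)) {
        prefix = home;
        path++;  // skip the tilde, keep the rest
    }
    size_t plen = strlen(prefix);
    size_t rlen = strlen(path);
    if (plen + rlen >= outlen) {
        return 0;
    }
    memcpy(out, prefix, plen);
    memcpy(out + plen, path, rlen + 1);  // includes the terminator
    return 1;
}
```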

Shell basics

The shell is a command interpreter. It’s named such because it was
originally a shell around the operating system kernel — the user
interface to the kernel. Your system’s graphical interface — Windows
Explorer, or Explorer.exe — is really just a kind of shell, too. That
shell is oriented around the mouse and graphics. This is fine for some
tasks, but a keyboard-oriented command shell is far better suited for
development tasks. It’s more efficient, but more importantly its features
are composable: Complex operations and processes can be constructed
from simple, easy-to-understand tools. Embrace it!

In the shell you can navigate between directories with cd, make
directories with mkdir, remove files with rm, run regular expression text
searches with grep, etc. Run busybox to see a listing of the available
standard commands. Unfortunately there are no manual pages, but you can
access basic usage information for any command with busybox CMD --help.

Windows’ standard command shell is cmd.exe. Unfortunately this shell is
terrible and exists mostly for legacy compatibility. The intended
replacement is PowerShell for users who regularly use a shell. However,
PowerShell is fundamentally broken, does virtually everything incorrectly,
and manages to be even worse than cmd.exe. Besides, sticking to POSIX
shell conventions significantly improves build portability, and unix tool
knowledge is transferable to basically every other operating system.

Unix’s standard shell was the Bourne shell, sh. The shells in use today
are Bourne shell clones with a superset of its features. The most popular
interactive shells are Bash and Zsh. On Linux, dash (Debian Almquist
shell) has become popular for non-interactive use (scripting). The shell
included with w64devkit is the BusyBox fork of the Almquist shell (ash),
closely related to dash. The Almquist shell has almost no non-interactive
features beyond the standard Bourne shell, and so as far as scripts are
concerned can be regarded as a plain Bourne shell clone. That’s why I
typically refer to it by the name sh.

However, BusyBox’s Almquist shell has interactive features much like Bash,
and Bash users should be quite comfortable. It’s not just tab-completion
but a slew of Emacs-like keybindings:


  Ctrl-r: search backwards in history
  Ctrl-s: search forwards in history
  Ctrl-p: previous command (Up)
  Ctrl-n: next command (Down)
  Ctrl-a: cursor to the beginning of line (Home)
  Ctrl-e: cursor to the end of line (End)
  Alt-b: cursor back one word
  Alt-f: cursor forward one word
  Ctrl-l: clear the screen
  Alt-d: delete word after the cursor
  Ctrl-w: delete the word before the cursor
  Ctrl-k: delete to the end of the line
  Ctrl-u: delete to the beginning of the line
  Ctrl-f: cursor forward one character (Right)
  Ctrl-b: cursor backward one character (Left)
  Ctrl-d: delete character under the cursor (Delete)
  Ctrl-h: delete character before the cursor (Backspace)


Take special note of Ctrl-r, which is the most important and powerful
shortcut of the bunch. Frequent use is a good habit. Don’t mash the up
arrow to search through the command history.

Special note for Cygwin and MSYS2 users: the shell is aware of Windows
paths and does not present a virtual unix file system scheme. This has
important consequences for scripting, both good and bad. The shell even
supports backslash as a directory separator, though you should of course
prefer forward slashes.

Shell customization

Login shells (-l) evaluate the contents of ~/.profile on startup. This
is your chance to customize the shell configuration, such as setting
environment variables or defining aliases and functions. For instance, if
you wanted the prompt to show the working directory in green you’d set
PS1 in your ~/.profile:

PS1="$(printf '\x1b[32;1m\\w\x1b[0m$ ')"


If you find yourself using the same command sequences or set of options
again and again, you might consider putting those commands into a script,
and then installing that script somewhere on your PATH so that you can
run it as a new command. First make a directory to hold your scripts, say
in ~/bin:

mkdir ~/bin


In ~/.profile prepend it to your PATH:

PATH="$HOME/bin;$PATH"


If you don’t want to start a fresh shell to try it out, then load the new
configuration in your current shell:

source ~/.profile


Suppose you keep getting the tar switches mixed up and you’d like to
just have an untar command that does the right thing. Create a file
named untar or untar.sh in ~/bin with these contents:

#!/bin/sh
set -e
tar -xaf "$@"


Now a command like untar something.tar.gz will extract the archive
contents.

To learn more about Bourne shell scripting, the POSIX shell command
language specification is a good reference. All of the features
listed in that document are available to your shell scripts.

Text editing

The development kit includes the powerful and popular text editor
Vim. It takes effort to learn, but is well worth the investment.
It’s packed with features, but since you only need a small number of them
on a regular basis it’s not as daunting as it might appear. Using Vim
effectively, you will write and edit text so much more quickly than
before. That includes not just code, but prose: READMEs, documentation,
etc.

(The catch: Non-modal editing will forever feel frustratingly inefficient.
That’s not because you will become unpracticed at it, or even have trouble
code switching between input styles, but because you’ll now be aware how
bad it is. Ignorance is bliss.)

Vim includes its own tutorial for absolute beginners which you can access
with the vimtutor command. It will run in the console window and guide
you through the basics in about half an hour. Do not be afraid to return
to the tutorial at any time since this is the stuff you need to know by
heart.

When it comes time to actually use Vim to write code, you can continue
writing code via the terminal interface (vim), or you can run the
graphical interface (gvim). The latter is recommended since it has some
nice quality-of-life features, but it’s not strictly necessary. When
starting the GUI, put an ampersand (&) on the command so that it runs in
the background. For instance this brings up the editor with two files open
but leaves the shell running in the foreground so you can continue using
it while you edit:

gvim main.c Makefile &


Vim’s defaults are good but imperfect. Before getting started with
actually editing code you should establish at least the following minimal
configuration in ~/_vimrc. (To understand these better, use :help to
jump to the built-in documentation.)

set hidden encoding=utf-8 shellslash
filetype plugin indent on
syntax on


The graphical interface defaults to a white background. Many people prefer
“dark mode” when editing code, so inverting this is simply a matter of
choosing a dark color scheme. Vim comes with a handful of color schemes,
around half of which have dark backgrounds. Use :colorscheme to change
it, and put it in your ~/_vimrc to persist it.

colorscheme slate


The default graphical interface includes a menu bar and tool bar. There
are better ways to accomplish all these operations, none of which require
touching the mouse, so consider removing all that junk:

set guioptions=ac


Finally, since the development kit is oriented around C and C++, here’s my
own entire Vim configuration for C which makes it obey my own style:

set cinoptions+=t0,l1,:0 cinkeys-=0#


Once you’re comfortable with the basics, the best next step is to read
Practical Vim: Edit Text at the Speed of Thought by Drew Neil.
It’s an opinionated guide to Vim that instills good habits. If you want
something cost-free to whet your appetite, check out Seven habits of
effective text editing.

Writing an application

We’ve established a shell and text editor. Next is the development
workflow for writing an actual application. Ultimately you will invoke a
compiler from within Vim, which will parse compiler messages and take you
directly to the parts of your source code that need attention. Before we
get that far, let’s start with the basics.

The classic example is the “hello world” program, which we’ll suppose is
in a file called hello.c:

#include <stdio.h>

int main(void)
{
    puts("Hello, world!");
}


While this development kit provides a version of the GNU compiler, gcc,
this guide mostly speaks of it in terms of the generic unix C compiler
name, cc. Unix-like systems install cc as an alias for the system’s
default C compiler, and w64devkit is no exception.

cc -o hello.exe hello.c


This command creates hello.exe from hello.c. Since this is not (yet?)
on your PATH, you must invoke it via a path name (i.e. the command must
include a slash), since otherwise the shell will search for it via the
PATH variable. Typically this means putting ./ in front of the program
name, meaning “run the program in the current directory”. As a convenience
you do not need to include the .exe extension:

./hello


Unlike the untar shell script from before, this hello.exe is entirely
independent of w64devkit. You can share it with anyone running Windows and
they’ll be able to execute it. There’s a little bit of runtime embedded in
the executable, but the bulk of the runtime is in the operating system
itself. I want to highlight this point because most programming languages
don’t work like this, or at least doing so is unnatural with lots of
compromises. The users of your software do not need to install a runtime
or other supporting software. They just run the executable you give them!

That executable is probably pretty small, less than 50kB — basically a
miracle by today’s standards. Sure, it’s hardly doing anything right now,
but you can add a whole lot more functionality without that executable
getting much bigger. In fact, it’s entirely unoptimized right now and
could be even smaller. Passing the -Os flag tells the compiler to
optimize for size, and the -s flag tells the linker to strip out unneeded
information.

cc -Os -s -o hello.exe hello.c


That cuts the program down to around a third of its previous size. If
necessary you can still do even better than this, but that’s outside the
scope of this guide.

So far the program could still be valid enough to compile but contain
obvious mistakes. The compiler can warn about many of these mistakes, and
so it’s always worth enabling these warnings. This requires two flags:
-Wall (“all” warnings) and -Wextra (extra warnings).

cc -Wall -Wextra -o hello.exe hello.c


When you’re working on a program, you often don’t want optimization
enabled since it makes it more difficult to debug. However, some warnings
aren’t fired unless optimization is enabled. Fortunately there’s an
optimization level to resolve this, -Og (optimize for debugging).
Combine this with -g3 to embed debug information in the program. This
will be handy later.

cc -Wall -Wextra -Og -g3 -o hello.exe hello.c


These are the compiler flags you typically want to enable while developing
your software. When you distribute it, you’d use either -Os -s (optimize
for size) or -O3 -s (optimize for speed).

Makefiles

I mentioned running the compiler from Vim. This isn’t done directly but
via a special build script called a Makefile. You invoke the make program
from Vim, which invokes the compiler as above. The simplest Makefile would
look like this, in a file literally named Makefile:

hello.exe: hello.c
	cc -Wall -Wextra -Og -g3 -o hello.exe hello.c


This tells make that the file named hello.exe is derived from another
file called hello.c, and the tab-indented line is the recipe for doing
so. Running the make command will run the compiler command if and only
if hello.c is newer than hello.exe.

To run make from Vim, use the :make command inside Vim. It will not
only run make but also capture its output in an internal buffer called
the quickfix list. If there is any warning or error, Vim will jump to
it. Use :cn (next) and :cp (prev) to move between issues and correct
them, or :cc to re-display the current issue. When you’re done fixing
the issues, run :make again to start the cycle over.

Try that now by changing the printed message and recompiling from within
Vim. Intentionally create an error (bad syntax, too many arguments, etc.)
and see what happens.

Makefiles are a powerful and conventional way to build C and C++ software.
Since the development kit includes the standard set of unix utilities,
it’s very easy to write portable Makefiles that work across a variety of
operating systems and environments. Your software isn’t necessarily tied
to Windows just because you’re using a Windows-based development
environment. If you want to learn how Makefiles work and how to use them
effectively, read A Tutorial on Portable Makefiles. From here on
I’ll assume you’ve read that tutorial.

Ultimately I’d probably write my “hello world” Makefile like so:

.POSIX:
CC      = cc
CFLAGS  = -Wall -Wextra -Og -g3
LDFLAGS =
LDLIBS  =
EXE     = .exe

hello$(EXE): hello.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c $(LDLIBS)


When building a release, optimize for size or speed:

make CFLAGS=-Os LDFLAGS=-s


This is very much a Windows-first style of Makefile, but still allows it
to be comfortably used on other systems. On Linux this make invocation
strips away the .exe extension:

make EXE=


For a Windows-second Makefile, remove the line with EXE = .exe. This
allows EXE to come from the environment. So, for instance, I already
define the EXE environment variable in my w64devkit ~/.profile:

export EXE=.exe


On Linux running make does the right thing, as does running make on
Windows. No special configuration required.

If my software is truly limited to Windows, I’m likely still interested in
supporting cross-compilation. A common convention for GNU toolchains is a
CROSS Makefile macro. For example:

.POSIX:
CROSS   =
CC      = $(CROSS)gcc
CFLAGS  = -Wall -Wextra -Og -g3
LDFLAGS =
LDLIBS  =

hello.exe: hello.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c $(LDLIBS)


On Windows I just run make, but on Linux I’d set CROSS appropriately.

make CROSS=x86_64-w64-mingw32-


Navigating

What happens if you’re working on a larger program and you need to jump to
the definition of a function, macro, or variable? It would be tedious to
use grep all the time to find definitions. The development kit includes
a solid implementation of ctags for building a tags database that lists the
locations for various kinds of definitions, and Vim knows how to read this
database. Most often you’ll want to run it recursively like so:

ctags -R


You can of course do this from Vim, too: :!ctags -R

With the cursor over an identifier, press CTRL-] to jump to a definition
for that name. Use :tn and :tp to move between different definitions
(e.g. when the name is overloaded). Or if you have a tag in mind rather
than a name listed in the buffer, use the :tag command to jump by name.
Vim maintains a tag stack and jump list for going back and forth, like the
backward and forward buttons in a browser.

Debugging

I had mentioned that the -g3 option embeds extra information in the
executable. This is for debuggers, and the development kit includes the
GNU Debugger, gdb, to help you debug your programs. To use it, invoke
GDB on your executable:

gdb hello.exe


From here you can set breakpoints and such, then run the program with
start or run, then step through it line by line. See Beej’s Quick
Guide to GDB for a guide. During development, always run your
program through GDB, and never exit GDB. See also: Assertions should be
more debugger-oriented.

Learning C and C++

So far this guide hasn’t actually assumed any C knowledge. One of the best
ways to learn C is by reading the highly-regarded The C Programming
Language and doing the exercises. Alternatively, cost-free options
are Beej’s Guide to C Programming and Modern C (more
advanced). You can use the development kit to go through any of these.

I’ve focused on C, but everything above also applies to C++. To learn C++
A Tour of C++ is a safe bet.

Demonstration

To illustrate how much you can do with nothing more than this 76MB
development kit, here’s a taste in the form of a weekend project: an
Asteroids Clone for Windows. That’s the game in the video at the
top of this guide.

The development kit doesn’t include Git so you’d need to install it
separately in order to clone the repository, but you could at least skip
that and download a .zip snapshot of the source. It has no third-party
dependencies yet it includes hardware-accelerated graphics, real-time
sound mixing, and gamepad input. Building a larger and more complex game
is much less about tooling and more about time and skill. That’s what I
mean about w64devkit being (almost) everything you need.




Well-behaved alias commands on Windows
2021-02-08T20:32:45Z
Since its inception I’ve faced a dilemma with w64devkit, my
all-in-one Mingw-w64 toolchain and development environment
distribution for Windows. A major goal of the project is no
installation: unzip anywhere and it’s ready to go as-is. However, full
functionality requires alias commands, particularly for BusyBox applets,
and the usual solutions are neither available nor viable. It seemed that
an installer was needed to assemble this last puzzle piece. This past
weekend I finally discovered a tidy and complete solution that solves this
problem for good.

That solution is a small C source file, alias.c. This article is
about why it’s necessary and how it works.

Hard and symbolic links

Some alias commands are for convenience, such as a cc alias for gcc so
that build systems need not assume any particular C compiler. Others are
essential, such as an sh alias for “busybox sh” so that it’s available
as a shell for make. These aliases are usually created with links, hard
or symbolic. A GCC installation might include (roughly) a symbolic link
created like so:

ln -s gcc cc


BusyBox looks at its argv[0] on startup, and if it names an applet
(ls, sh, awk, etc.), it behaves like that applet. Typically BusyBox
aliases are installed as hard links to the original binary, and there’s
even a busybox --install to set these up. Both kinds of aliases are
cheap and effective.

ln busybox sh
ln busybox ls
ln busybox awk
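
As a rough illustration of the argv[0] dispatch described above, here is a minimal C sketch. It is not BusyBox's actual code: `applet_name`, `find_applet`, and the three-entry applet table are hypothetical names chosen for this example, and the `.exe` check is case-sensitive for brevity.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reduce argv[0] to a bare applet name by dropping
 * leading directories and a trailing ".exe". Writes at most cap-1 bytes
 * plus a terminator into out (cap must be at least 1) and returns out. */
char *applet_name(char *out, size_t cap, const char *argv0)
{
    const char *base = argv0;
    for (const char *p = argv0; *p; p++) {
        if (*p == '/' || *p == '\\') {
            base = p + 1;  /* remember the start of the last component */
        }
    }
    size_t len = strlen(base);
    if (len > 4 && !strcmp(base + len - 4, ".exe")) {
        len -= 4;
    }
    if (len >= cap) {
        len = cap - 1;
    }
    memcpy(out, base, len);
    out[len] = 0;
    return out;
}

/* Hypothetical applet table lookup: returns the applet's index, or -1
 * when the name is not an applet. */
int find_applet(const char *name)
{
    static const char *const applets[] = {"awk", "ls", "sh"};
    for (int i = 0; i < (int)(sizeof(applets) / sizeof(*applets)); i++) {
        if (!strcmp(name, applets[i])) {
            return i;
        }
    }
    return -1;
}
```

A multi-call `main` would pass `argv[0]` through `applet_name`, look up the result with `find_applet`, and jump to the matching applet's entry point, so the same binary behaves differently under each link name.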


Unfortunately links are not supported by .zip files on Windows. They’d
need to be created by a dedicated installer. As a result, I’ve strongly
recommended that users run “busybox --install” at some point to
establish the BusyBox alias commands. While w64devkit works without them,
it works better with them. Still, that’s an installation step!

An alternative option is to simply include a full copy of the BusyBox
binary for each applet — all 150 of them — simulating hard links. BusyBox
is small, around 4kB per applet on average, but it’s not quite that
small. Since the .zip format doesn’t use block compression — files are
compressed individually — this duplication will appear in the .zip itself.
My 573kB BusyBox build duplicated 150 times would double the distribution
size and increase the installation footprint by 25%. It’s not worth the
cost.

Since .zip is so limited, perhaps I should use a different distribution
format that supports links. However, another w64devkit goal is making no
assumptions about what other tools are installed. Windows natively
supports .zip, even if that support isn’t so great (poor performance, low
composability, missing features, etc.). With nothing more than the
w64devkit .zip on a fresh, offline Windows installation, you can begin
efficiently developing professional, native applications in under a
minute.

Scripts as aliases

With links off the table, the next best option is a shell script. On
unix-like systems shell scripts are an effective tool for creating complex
alias commands. Unlike links, they can manipulate the argument list. For
instance, w64devkit includes a c99 alias to invoke the C compiler
configured to use the C99 standard. To do this with a shell script:

#!/bin/sh
exec cc -std=c99 "$@"


This prepends -std=c99 to the argument list and passes through the rest
untouched via the Bourne shell’s special case "$@". Because I used
exec, the shell process becomes the compiler in place. The shell
doesn’t hang around in the background. It’s just gone. This is really quite
elegant and powerful.
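
For comparison, the argument handling that the two-line script gets for free can be sketched in C. This is only an illustration of what "prepend and pass through" means at the argv level; `c99_argv` is a hypothetical name, and the sketch assumes a unix-like system where an exec function is available.

```c
#include <assert.h>
#include <stdlib.h>

/* Build a new NULL-terminated argv: "cc", "-std=c99", then the caller's
 * arguments after argv[0]. Returns NULL if allocation fails. The caller
 * frees the returned array, but not the strings it points to. */
char **c99_argv(int argc, char **argv)
{
    assert(argc >= 1);
    /* two new leading entries, argc-1 passed through, one terminator */
    char **out = malloc(sizeof(*out) * ((size_t)argc + 2));
    if (!out) {
        return NULL;
    }
    out[0] = "cc";
    out[1] = "-std=c99";
    for (int i = 1; i < argc; i++) {
        out[i + 1] = argv[i];
    }
    out[argc + 1] = NULL;
    return out;
}
```

On a unix-like system, calling `execvp(out[0], out)` (from `<unistd.h>`) with the result would then replace the current process with the compiler, just as `exec` does in the script.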

The closest available on Windows is a .bat batch file. However, like some
other parts of DOS and Windows, the Batch language was designed as though
its designer once glimpsed at someone using a unix shell, perhaps looking
over their shoulder, then copied some of the ideas without understanding
them. As a result, it’s not nearly as useful or powerful. Here’s the Batch
equivalent:

@cc -std=c99 %*


The @ is necessary because Batch prints its commands by default (Bourne
shell’s -x option), and @ disables it. Windows lacks the concept of
exec(3), so the Batch file interpreter cmd.exe continues running alongside
the compiler. A little wasteful but that hardly matters. What does matter
though is that cmd.exe doesn’t behave itself! If you, say, Ctrl+C to
cancel compilation, you will get the infamous “Terminate batch job (Y/N)?”
prompt which interferes with other programs running in the same console.
The so-called “batch” script isn’t a batch job at all: It’s interactive.

I tried to use Batch files for BusyBox applets, but this issue came up
constantly and made this approach impractical. Nearly all BusyBox applets
are non-interactive, and lots of things break when they aren’t. Worst of
all, you can easily end up with layers of cmd.exe clobbering each other
to ask if they should terminate. It was frustrating.

The prompt is hardcoded in cmd.exe and cannot be disabled. Since so much
depends on cmd.exe remaining exactly the way it is, Microsoft will never
alter this behavior either. After all, that’s why they made PowerShell a
new, separate tool.

Speaking of PowerShell, could we use that instead? Unfortunately not:


  
1. It’s installed by default on Windows, but is not necessarily enabled.
   One of my own use cases for w64devkit involves systems where PowerShell
   is disabled by policy. A common policy is it can be used interactively
   but not run scripts (“Running scripts is disabled on this system”).

2. PowerShell is not a first class citizen on Windows, and will likely
   never be. Even under the friendliest policy it’s not normally possible
   to put a PowerShell script on the PATH and run it by name. (I’m sure
   there are ways to make this work via system-wide configuration, but
   that’s off the table.)

3. Everything in PowerShell is broken. For example, it does not support
   input redirection with files, and instead you must use the cat-like
   command, Get-Content, to pipe file contents. However, Get-Content
   translates its input and quietly damages your data. There is no way to
   disable this “feature” in the version of PowerShell that ships with
   Windows, meaning it cannot accomplish the simplest of tasks. This is
   just one of many ways that PowerShell is broken beyond usefulness.
  


Item (2) also affects w64devkit. It has a Bourne shell, but shell scripts
are still not first class citizens since Windows doesn’t know what to do
with them. Fixing this would require system-wide configuration, antithetical to
the philosophy of the project.

Solution: compiled shell “scripts”

My working solution is inspired by an insanely clever hack used by my
favorite media player, mpv. The Windows build is strange at first
glance, containing two binaries, mpv.exe (large) and mpv.com (tiny).
Is that COM as in an old-school 16-bit DOS binary? No, that’s just
a trick that works around a Windows limitation.

The Windows technology is broken up into subsystems. Console programs run
in the Console subsystem. Graphical programs run in the Windows subsystem.
The original WSL was a subsystem. Unfortunately this design means
that a program must statically pick a subsystem, hardcoded into the binary
image. The program cannot select a subsystem dynamically. For example,
this is why Java installations have both java.exe and javaw.exe, and
Emacs has emacs.exe and runemacs.exe. Different binaries for different
subsystems.

On Linux, a program that wants to do graphics just talks to the Xorg
server or Wayland compositor. It can dynamically choose to be a terminal
application or a graphical application. Or even both at once. This is
exactly the behavior of mpv, and it faces a dilemma on Windows: With
subsystems, how can it be both?

The trick is based on the environment variable PATHEXT which tells
Windows how to prioritize executables with the same base name but
different file extensions. If I type mpv and it finds both mpv.exe and
mpv.com, which binary will run? It will be the first listed in
PATHEXT, and by default that starts with:

PATHEXT=.COM;.EXE;.BAT;...


So it will run mpv.com, which is actually a plain old PE+ .exe
in disguise. The Windows subsystem mpv.exe gets the shortcut and file
associations while Console subsystem mpv.com catches command line
invocations and serves as console liaison as it invokes the real
mpv.exe. Ingenious!
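
The lookup order described above can be modeled in a few lines of C. This is a simplified model of the rule, not how Windows implements it: `pick_by_pathext` is a hypothetical function that, given a PATHEXT-style string and the extensions that actually exist for a base name, returns the one that wins.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n bytes of a and b. */
static int same_ext(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i])) {
            return 0;
        }
    }
    return 1;
}

/* Return the index into exts[] of the extension listed earliest in the
 * semicolon-delimited pathext string, or -1 if none is listed. */
int pick_by_pathext(const char *pathext, const char *const exts[], int n)
{
    assert(pathext && exts && n >= 0);
    const char *p = pathext;
    while (*p) {
        size_t len = strcspn(p, ";");  /* length of this PATHEXT entry */
        for (int i = 0; i < n; i++) {
            if (strlen(exts[i]) == len && same_ext(p, exts[i], len)) {
                return i;
            }
        }
        p += len;
        if (*p == ';') {
            p++;
        }
    }
    return -1;
}
```

With `exts` set to `{".exe", ".com"}` and the default `.COM;.EXE;.BAT` ordering, the function selects the `.com` entry, matching the mpv behavior.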

I realized I can pull a similar trick to create command aliases — not the
.com trick, but the miniature flagger program. If only I could compile
each of those Batch files to tiny, well-behaved .exe files so that it
wouldn’t rely on the badly-behaved cmd.exe…

Tiny C programs

Years ago I wrote about tiny, freestanding Windows executables.
That research paid off here since that’s exactly what I want. The alias
command program need only manipulate its command line, invoke another
program, then wait for it to finish. This doesn’t require the C library,
just a handful of kernel32.dll calls. My alias command programs can be
so small that it would no longer matter that I have 150 of them, and I get
complete control over their behavior.

To compile, I use -nostdlib and -ffreestanding to disable all system
libraries, -lkernel32 to pull that one back in, -Os (optimize for
size), and -s (strip) all to make the result as small as possible.

I don’t want to write a little program for each alias command. Instead
I’ll use a couple of C defines, EXE and CMD, to inject the target
command at compile time. So this Batch file:

@target arg1 arg2 %*


Is equivalent to this alias compilation:

gcc -DEXE="target.exe" -DCMD="target arg1 arg2" \
    -s -Os -nostdlib -ffreestanding -o alias.exe alias.c -lkernel32


The EXE string is the actual module name, so the .exe extension is
required. The CMD string replaces the first complete token of the
command line string (think argv[0]) and may contain arbitrary additional
arguments (e.g. -std=c99). Both are handled as wide strings (L"...")
since the alias program uses the wide Win32 API in order to be fully
transparent. Though unfortunately at this time it makes no difference: All
currently aliased programs use the “ANSI” API since the underlying C and
C++ standard libraries only use the ANSI API. (As far as I know, nobody
has ever written fully-functional C and C++ standard libraries for
Windows, not even Microsoft.)

You might wonder why the heck I’m gluing strings together for the
arguments. These will need to be parsed (word split, etc.) by someone
else, so shouldn’t I construct an argv array instead? That’s not how it
works on Windows: Programs receive a flat command string and are expected
to parse it themselves following the format specification. When
you write a C program, the C runtime does this for you to provide the
usual argv array.

This is upside down. The caller creating the process already has arguments
split into an argv array — or something like it — but Win32 requires the
caller to encode the argv array as a string following a special format so
that the recipient can immediately decode it. Why marshal rather than
pass structured data in the first place? Why does Win32 only supply a
decoder (CommandLineToArgvW) and not an encoder (e.g. the missing
ArgvToCommandLine)? Hey, I don’t make the rules; I just have to live
with them.
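
To make the encoding side concrete, here is a minimal sketch of quoting one argument so that the usual Microsoft C runtime parsing rules decode it back unchanged. It is an illustration under that assumption, not a Win32 API: `quote_arg` is a hypothetical name, and it works on narrow strings for brevity where a real caller of CreateProcessW would need the wide equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write arg to dst as a single command line token and return a pointer
 * to the new null terminator. dst needs room for 2*strlen(arg)+3 bytes. */
char *quote_arg(char *dst, const char *arg)
{
    assert(dst && arg);
    if (*arg && !strpbrk(arg, " \t\n\v\"")) {
        size_t len = strlen(arg);  /* nothing special: copy verbatim */
        memcpy(dst, arg, len + 1);
        return dst + len;
    }
    *dst++ = '"';
    for (;;) {
        size_t slashes = 0;
        while (*arg == '\\') {
            arg++;
            slashes++;
        }
        if (!*arg) {
            /* Backslashes before the closing quote must be doubled. */
            for (size_t i = 0; i < slashes * 2; i++) *dst++ = '\\';
            break;
        } else if (*arg == '"') {
            /* Double the backslashes, plus one more to escape the quote. */
            for (size_t i = 0; i < slashes * 2 + 1; i++) *dst++ = '\\';
        } else {
            /* Backslashes not followed by a quote are literal. */
            for (size_t i = 0; i < slashes; i++) *dst++ = '\\';
        }
        *dst++ = *arg++;
    }
    *dst++ = '"';
    *dst = 0;
    return dst;
}
```

A full encoder would call this once per argument, writing a space between tokens. The subtle part is that backslashes are only special when they precede a double quote, which is why a trailing backslash in a quoted path must be doubled.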

You can look at the original source for the details, but the summary is
that I supply my own xstrlen(), xmemcpy(), and partial Win32 command
line parser — just enough to identify the first token, even if that token
is quoted. It glues the strings together, calls CreateProcessW, waits
for it to exit (WaitForSingleObject), retrieves the exit code
(GetExitCodeProcess), and exits with the same status. (The stuff that
comes for free with exec(3).)
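
The token replacement step can be sketched in portable C. This is not the code from alias.c: the function names are hypothetical, narrow strings stand in for the wide strings the real program uses, and the parser assumes the common convention that a first token starting with a double quote runs to the next double quote, while an unquoted one ends at the first space or tab.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer just past the first token of a command line. */
const char *skip_first_token(const char *cmd)
{
    assert(cmd);
    if (*cmd == '"') {
        cmd++;
        while (*cmd && *cmd != '"') {
            cmd++;
        }
        if (*cmd == '"') {
            cmd++;  /* step over the closing quote */
        }
    } else {
        while (*cmd && *cmd != ' ' && *cmd != '\t') {
            cmd++;
        }
    }
    return cmd;
}

/* Write alias_cmd followed by everything after the first token of
 * cmdline into dst. Returns 1 on success, 0 if dst is too small. */
int replace_first_token(char *dst, size_t cap,
                        const char *alias_cmd, const char *cmdline)
{
    const char *rest = skip_first_token(cmdline);
    size_t n0 = strlen(alias_cmd);
    size_t n1 = strlen(rest);
    if (n0 + n1 + 1 > cap) {
        return 0;
    }
    memcpy(dst, alias_cmd, n0);
    memcpy(dst + n0, rest, n1 + 1);  /* includes the terminator */
    return 1;
}
```

The resulting string is what would be handed to CreateProcessW as the command line, followed by WaitForSingleObject and GetExitCodeProcess on the new process as described above.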

This all compiles to a 4kB executable, mostly padding, which is small
enough for my purposes. These compress to an acceptable 1kB each in the
.zip file. Smaller would be nicer, but this would require at minimum a
custom linker script, and even smaller would require hand-crafted
assembly.

This lingering issue solved, w64devkit now works better than ever. The
alias.c source is included in the kit in case you need to make any of
your own well-behaved alias commands.




Single-primitive authenticated encryption for fun
2021-01-30T03:39:10Z
Just as a fun exercise, I designed and implemented from scratch a
standalone, authenticated encryption tool, including key derivation with
stretching, using a single cryptographic primitive. Or, more specifically,
half of a primitive. That primitive is the encryption function of the
XXTEA block cipher. The goal was to pare both design and
implementation down to the bone without being broken in practice — I
hope — and maybe learn something along the way. This article is the tour
of my design. Everything here will be nearly the opposite of the right
answers.

The tool itself is named xxtea (lowercase), and it’s supported
on all unix-like and Windows systems. It’s trivial to compile, even on
the latter. The code should be easy to follow from top to bottom,
with commentary about specific decisions along the way, though I’ll quote
the most important stuff inline here.

The command line options follow the usual conventions. The two
modes of operation are encrypt (-E) and decrypt (-D). It defaults to
using standard input and standard output so it works great in pipelines.
Supplying -o sends output elsewhere (automatically deleted if something
goes wrong), and the optional positional argument indicates an alternate
input source.

usage: xxtea <-E|-D> [-h] [-o FILE] [-p PASSWORD] [FILE]

examples:
    $ xxtea -E -o file.txt.xxtea file.txt
    $ xxtea -D -o file.txt file.txt.xxtea


If no password is provided (-p), it prompts for a UTF-8-encoded
password. Of course it’s not normally a good idea to supply a
password via command line argument, but it’s been useful for testing.

XXTEA block cipher

TEA stands for Tiny Encryption Algorithm and XXTEA is the second attempt
at fixing weaknesses in the cipher — with partial success. The remaining
weaknesses should not matter for this particular application. XXTEA
supports a variable block size, but I’ve hardcoded my implementation to a
128-bit block size, along with some unrolling. I’ve also discarded the
unneeded decryption function. There are no data-dependent lookups or
branches so it’s immune to speculation attacks.

XXTEA operates on 32-bit words and has a 128-bit key, meaning both block
and key are four words apiece. My implementation is about a dozen lines
long. Its prototype:

// Encrypt a 128-bit block using 128-bit key
void xxtea128_encrypt(const uint32_t key[4], uint32_t block[4]);


All cryptographic operations are built from this function. Another way to
think about it is that it accepts two 128-bit inputs and returns a 128-bit
result:

uint128 r = f(uint128 key, uint128 block);


Tuck that away in the back of your head since this will be important
later.
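
For readers who want to see what such a function might look like, here is a sketch that follows the published XXTEA reference routine (commonly called btea) with the block length fixed at four words, which gives 6 + 52/4 = 19 rounds. The name `xxtea128_encrypt_sketch` is hypothetical, and the tool's own unrolled implementation may differ in form.

```c
#include <stdint.h>

/* Encrypt a 128-bit block in place using a 128-bit key, following the
 * XXTEA reference routine specialized to a four-word block. */
void xxtea128_encrypt_sketch(const uint32_t key[4], uint32_t v[4])
{
    uint32_t sum = 0;
    uint32_t z = v[3];  /* the word "before" v[0], wrapping around */
    for (int round = 0; round < 19; round++) {
        sum += 0x9e3779b9;
        uint32_t e = (sum >> 2) & 3;
        for (uint32_t p = 0; p < 4; p++) {
            uint32_t y = v[(p + 1) & 3];  /* the word "after" v[p] */
            v[p] += (((z >> 5) ^ (y << 2)) + ((y >> 3) ^ (z << 4))) ^
                    ((sum ^ y) + (key[(p & 3) ^ e] ^ z));
            z = v[p];
        }
    }
}
```

Note that there are no table lookups or branches that depend on the key or block contents, only shifts, additions, and XORs on 32-bit words.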

Encryption

If I tossed the decryption function, how are messages decrypted? I’m sure
many have already guessed: XXTEA will be used in counter mode, or CTR
mode. Rather than encrypt the plaintext directly, encrypt a 128-bit block
counter and treat it like a stream cipher. The message is XORed with the
encrypted counter values for both encryption and decryption.


  Only half the cipher is needed.
  No padding scheme is necessary. With other block modes, if message
lengths may not be exactly a multiple of the block size then you need
some scheme for padding the last block.


A 128-bit increment with 32-bit limbs is easy:

void
increment(uint32_t ctr[4])
{
    /* 128-bit increment, first word changes fastest */
    if (!++ctr[0]) if (!++ctr[1]) if (!++ctr[2]) ++ctr[3];
}


In xxtea, words are always marshalled in little endian byte order (least
significant byte first). With the first word as the least significant
limb, the entire 128-bit counter is itself little endian.
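
Little endian marshalling of a 32-bit word is short enough to show in full. The code later in this article calls a `loadu32` helper without showing it, so the definition below is one plausible version rather than the tool's own, and `storeu32` is a hypothetical counterpart for writing words back out.

```c
#include <stdint.h>

/* Load a 32-bit word from four bytes, least significant byte first. */
uint32_t loadu32(const unsigned char *buf)
{
    return (uint32_t)buf[0] <<  0 | (uint32_t)buf[1] <<  8 |
           (uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;
}

/* Store a 32-bit word as four bytes, least significant byte first. */
void storeu32(unsigned char *buf, uint32_t x)
{
    buf[0] = (unsigned char)(x >>  0);
    buf[1] = (unsigned char)(x >>  8);
    buf[2] = (unsigned char)(x >> 16);
    buf[3] = (unsigned char)(x >> 24);
}
```

Building the word from individual bytes, rather than casting a pointer, gives the same result regardless of the host's byte order or the buffer's alignment.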

The counter doesn’t start at zero, but at some randomly-selected 128-bit
nonce called the initialization vector (IV), wrapping around to zero if
necessary (incredibly unlikely). The IV will be included with the message
in the clear. This nonce allows one key (password) to be used with
multiple messages, as they’ll all be encrypted using different,
randomly-chosen regions of an enormous keystream. It also provides
semantic security: encrypt the same file more than once and the
ciphertext will always be completely different.

for (/* ... */) {
    uint32_t cover[4] = {ctr[0], ctr[1], ctr[2], ctr[3]};
    xxtea128_encrypt(key, cover);
    block[i+0] ^= cover[0];
    block[i+1] ^= cover[1];
    block[i+2] ^= cover[2];
    block[i+3] ^= cover[3];
    increment(ctr);
}


Hash function

That’s encryption, but there’s still the matter of authentication and a key
derivation function (KDF). To deal with both I’ll need to devise a hash
function. Since I’m only using the one primitive, somehow I need to build
a hash function from a block cipher. Fortunately there’s a tool for doing
just that: the Merkle–Damgård construction.

Recall that xxtea128_encrypt accepts two 128-bit inputs and returns a
128-bit result. In other words, it compresses 256 bits into 128 bits: a
compression function. The two 128-bit inputs are cryptographically
combined into one 128-bit result. I can repeat this operation to fold an
arbitrary number of 128-bit inputs into a 128-bit hash result.

uint32_t *input = /* ... */;
uint32_t hash[4] = {0, 0, 0, 0};
xxtea128_encrypt(input +  0, hash);
xxtea128_encrypt(input +  4, hash);
xxtea128_encrypt(input +  8, hash);
xxtea128_encrypt(input + 12, hash);
// ...


Note how the input is the key, not the block. The hash state is repeatedly
encrypted using the hash inputs as the key, mixing hash state and input.
When the input is exhausted, that block is the result. Sort of.

I used zero for the initial hash state in my example, but it will be more
challenging to attack if the starting input is something random. Like
Blowfish, in xxtea I chose the first 128 bits of the fractional part
of pi:

void
xxtea128_hash_init(uint32_t ctx[4])
{
    /* first 32 hexadecimal digits of pi */
    ctx[0] = 0x243f6a88; ctx[1] = 0x85a308d3;
    ctx[2] = 0x13198a2e; ctx[3] = 0x03707344;
}

/* Mix one block into the hash state. */
void
xxtea128_hash_update(uint32_t ctx[4], const uint32_t block[4])
{
    xxtea128_encrypt(block, ctx);
}


There are still a couple of problems. First, what if the input isn’t a
multiple of the block size? This time I do need a padding scheme to fill
out that last block. In this case I pad it with bytes where each byte is
the number of padding bytes. For instance, helloworld becomes, roughly
speaking, helloworld666666.

That creates a different problem: This will have the same hash result as
an input that actually ends with these bytes. So the second rule is that
there is always a padding block, even if that block is 100% padding.
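
The two padding rules can be sketched as a function that splits a message into full blocks plus one final padded block. `final_block` is a hypothetical helper written for this illustration, not part of the tool.

```c
#include <stddef.h>
#include <string.h>

/* Produce the final, padded 16-byte block for a message of len bytes.
 * Each padding byte holds the number of padding bytes (1 to 16).
 * Returns the number of full 16-byte blocks that precede it. */
size_t final_block(unsigned char out[16], const unsigned char *msg, size_t len)
{
    size_t full = len / 16;
    size_t tail = len % 16;  /* leftover message bytes, 0 to 15 */
    memset(out, (int)(16 - tail), 16);
    if (tail) {
        memcpy(out, msg + full * 16, tail);
    }
    return full;
}
```

A 10-byte message yields no full blocks and a final block with six bytes of value 6. A 16-byte message yields one full block followed by a final block made of sixteen bytes of value 16, so the padding is always present and always unambiguous.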

Another problem is that the Merkle–Damgård construction is prone to
length-extension attacks. Anyone can take my hash result and continue
appending additional data without knowing what came before. If I’m using
this hash to authenticate the ciphertext, someone could, for example, use
this attack to append arbitrary data to the end of messages.

Some important hash functions, such as the most common forms of SHA-2, are
vulnerable to length-extension attacks. Keeping this issue in mind, I
could address it later using HMAC, but I have an idea for nipping this in
the bud now. Before mixing the padding block into the hash state, I swap
the two middle words:

/* Append final raw-byte block to hash state. */
void
xxtea128_hash_final(uint32_t ctx[4], const void *buf, int len)
{
    assert(len < 16);
    unsigned char tmp[16];
    memset(tmp, 16-len, 16);
    memcpy(tmp, buf, len);
    uint32_t k[4] = {
        loadu32(tmp +  0), loadu32(tmp +  4),
        loadu32(tmp +  8), loadu32(tmp + 12),
    };
    /* swap middle words to break length extension attacks */
    uint32_t swap = ctx[1];
    ctx[1] = ctx[2];
    ctx[2] = swap;
    xxtea128_encrypt(k, ctx);
}


This operation “ties off” the last block so that the hash can’t be
extended with more input. Or so I hope. This is my own invention, and so
it may not actually work right. Again, this is for fun and learning!

Update: Aristotle Pagaltzis pointed out that when these two words are
identical the hash result will be unchanged, leaving it vulnerable to
length extension attack. This occurs about once every 2³²
messages, which is far too small a security margin.

Caveats

Despite all that care, there are still two more potential weaknesses.

First, XXTEA was never designed to be used with the Merkle–Damgård
construction. I assume attackers can modify files I will decrypt, and so
the hash input is usually and mostly under control of attackers, meaning
they control the cipher key. Ciphers are normally designed assuming the
key is not under hostile control. This might be vulnerable to related-key
attacks.

As will be discussed below, I use this custom hash function in two ways.
In one the input is not controlled by attackers, so this is a non-issue.
In the second, the hash state is completely unknown to the attacker before
they control the input, which I believe mitigates any issues.

Second, a 128-bit hash state is a bit small these days. For very large
inputs, the chance of collision via the birthday paradox is a
practical issue.

In xxtea, digests are only computed over a few megabytes of input at a time
at most, even when encrypting giant files, so a 128-bit state should be
fine.

Key derivation

The user will supply a password and somehow I need to turn that into a
128-bit key.


  What if the password is shorter than 128 bits?
  What if the password is longer than 128 bits?
  It’s safer for the cipher if the raw password isn’t used directly.
  I’d like offline, brute force attacks to be expensive.


The first three can be resolved by running the passphrase through the hash
function, using it as a key derivation function. What about the last item?
Rather than hash the password once, I concatenate it, including null
terminator, repeatedly until it reaches a certain number of bytes
(hardcoded to 64 MiB, see COST), and hash that. That’s a computational
workload that attackers must repeat when guessing passwords.

To avoid timing attacks based on the password length, I precompute all
possible block arrangements before starting the hash — all the different
ways the password might appear concatenated across 16-byte blocks. Blocks
may be redundantly computed if necessary to make this part constant time.
The hash is fed entirely from these precomputed blocks.
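To make the concatenation concrete, here is a minimal sketch of producing one 16-byte block of the repeated password stream. The name kdf_block and its interface are hypothetical and not taken from the xxtea source. It indexes by the password length directly, so it lacks the constant-time property described above, which is exactly what the precomputed arrangements are for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill out[16] with block n of the endless stream made by repeating the
 * password followed by its null terminator. Hypothetical helper: the
 * running time depends on the password length, so it is NOT constant
 * time. */
static void
kdf_block(unsigned char out[16], const char *password, unsigned long long n)
{
    assert(password);
    size_t period = strlen(password) + 1;   /* count the null terminator */
    size_t off = (size_t)(n * 16 % period); /* where block n starts */
    for (int i = 0; i < 16; i++) {
        out[i] = (unsigned char)password[off];
        off = (off + 1) % period;
    }
}
```

With a 64 MiB workload there are 4,194,304 such blocks, numbered from zero, each fed to the hash in order.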

To defend against rainbow tables and the like — as well as make it harder
to attack other parts of the message construction — the initialization
vector is used as a salt, fed into the hash before the password
concatenation.

Unfortunately this KDF isn’t memory-hard, so attackers can use economies
of scale to strengthen their attacks (GPUs, custom hardware). A
memory-hard KDF, by contrast, requires lots of memory to compute the key,
making memory an expensive and limiting factor for attackers. Memory-hard
KDFs are complex and difficult to design, and I made the trade-off for
simplicity.

Authentication

When I say the encryption is authenticated I mean that it should not be
possible for anyone to tamper with the ciphertext undetected without
already knowing the key. This is typically accomplished by computing a
keyed hash digest, called a message authentication code (MAC), and
appending it to the message. Since it’s keyed, only someone who knows the
key can compute the digest, and so attackers can’t spoof the MAC.

This is where length-extension attacks come into play: With an improperly
constructed MAC, an attacker could append input without knowing the key.
Fortunately my hash function isn’t vulnerable to length-extension attacks!

An alternative is to use an authenticated block mode such as GCM,
which is still CTR mode at its core. Unfortunately, this is complicated,
and, unlike plain CTR, it would take me a long time to convince myself I
got it right. So instead I used CTR mode and my hash function in a
straightforward way.

At this point there’s a question of what exactly you input into the hash
function. Do you hash the plaintext or do you hash the ciphertext? It’s
tempting to do the former since it’s (generally) not available to
attackers, and would presumably make it harder to attack. This is a
mistake. Always compute the MAC over the ciphertext, a.k.a. encrypt then
authenticate.

This is called the Doom Principle. Computing the MAC on the
plaintext means that recipients must decrypt untrusted ciphertext before
authenticating it. This is bad because messages should be authenticated
before decryption. So that’s exactly what xxtea does. It also happens to
be the simplest option.

We have a hash function, but to compute a MAC we need a keyed hash
function. Again, I do the simplest thing that I believe isn’t broken:
concatenate the key with the ciphertext. Or more specifically:

MAC = hash(key || ctr || ciphertext)


Update: Dimitrije Erdeljan explains why this is broken and
how to fix it. Given a valid MAC, attackers can forge arbitrary messages.

The counter is because xxtea uses chunked authentication with one megabyte
chunks. It can authenticate a chunk at a time, which allows it to decrypt,
with authentication, arbitrary amounts of ciphertext in a fixed amount of
memory. The worst that can happen is truncation between chunks — an
acceptable trade-off. The counter ensures each chunk MAC is uniquely
keyed and that chunks appear in order.

It’s also important to note that the counter is appended after the key.
The counter is under hostile control — they can choose the IV — and having
the key there first means they have no information about the hash state.

All chunks are one megabyte except for the last chunk, which is always
shorter, signaling the end of the message. It may even be just a MAC and
zero-length ciphertext. This avoids nasty issues with parsing potentially
unauthenticated length fields and whatnot. Just stop successfully at the
first short, authenticated chunk.

Some will likely have spotted it, but a potential weakness is that I’m
using the same key for both encryption and authentication. These are
normally two different keys. This is disastrous in certain cases like
CBC-MAC, but I believe it’s alright here. It would be easy to
compute a separate MAC key, but I opted for simple.

File format

In my usual style, encrypted files have no distinguishing headers or
fields. They just look like a random block of data. A file begins with the
16-byte IV, then a sequence of zero or more one megabyte chunks, ending
with a short chunk. It’s indistinguishable from /dev/random.

[IV][1MiB || MAC][1MiB || MAC][<1MiB || MAC]
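As a worked example of this layout, the sketch below computes how many chunks a message needs and how large the encrypted file ends up. It assumes the MAC is the full 16-byte (128-bit) digest, which the text does not state outright, and the function names are made up for illustration. CTR mode does not pad, so ciphertext length equals plaintext length.

```c
#include <assert.h>
#include <stdint.h>

#define IV_LEN    16                   /* 16-byte IV, per the text */
#define CHUNK_LEN (UINT64_C(1) << 20)  /* one megabyte per full chunk */
#define MAC_LEN   16                   /* assumption: whole 128-bit digest */

/* Chunks needed for len bytes, counting the final short chunk. */
static uint64_t
chunk_count(uint64_t len)
{
    return len/CHUNK_LEN + 1;
}

/* Total encrypted file size under the assumptions above. */
static uint64_t
encrypted_size(uint64_t len)
{
    assert(len <= UINT64_MAX/2);  /* keep the sum far from overflow */
    return IV_LEN + len + chunk_count(len)*MAC_LEN;
}
```

An empty input still produces a 32-byte file (IV plus one MAC over zero-length ciphertext), and an input of exactly one megabyte needs two chunks, since the last chunk must be short.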


If the user types the incorrect password, it will be discovered when
authenticating the first chunk (read: immediately). This saves on a
dedicated check at the beginning of the file, though it means it’s not
possible to distinguish between a bad password and a modified file.

I know my design has weaknesses as a result of artificial, self-imposed
constraints and deliberate trade-offs, but I’m curious if I’ve made any
glaring mistakes with practical consequences.




State machines are wonderful tools
2020-12-31T22:48:13Z
This article was discussed on Hacker News.

I love when my current problem can be solved with a state machine. They’re
fun to design and implement, and I have high confidence about correctness.
They tend to:


  Present minimal, tidy interfaces
  Require few, fixed resources
  Hold no opinions about input and output
  Have a compact, concise implementation
  Be easy to reason about


State machines are perhaps one of those concepts you heard about in
college but never put into practice. Maybe you use them regularly.
Regardless, you certainly run into them regularly, from regular
expressions to traffic lights.



Morse code decoder state machine

Inspired by a puzzle, I came up with this deterministic state
machine for decoding Morse code. It accepts a dot ('.'), dash
('-'), or terminator (0) one at a time, advancing through a state
machine step by step:

int morse_decode(int state, int c)
{
    static const unsigned char t[] = {
        0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
        0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
        0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
        0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
        0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
        0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
        0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
        0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a
    };
    int v = t[-state];
    switch (c) {
    case 0x00: return v >> 2 ? t[(v >> 2) + 63] : 0;
    case 0x2e: return v &  2 ? state*2 - 1 : 0;
    case 0x2d: return v &  1 ? state*2 - 2 : 0;
    default:   return 0;
    }
}


It typically compiles to under 200 bytes (table included), requires only a
few bytes of memory to operate, and will fit on even the smallest of
microcontrollers. The full source listing, documentation, and
comprehensive test suite:

https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c

The state machine is trie-shaped, and the 100-byte table t is the static
encoding of the Morse code trie:



Dots traverse left, dashes right, terminals emit the character at the
current node (terminal state). Stopping on red nodes, or attempting to
take an unlisted edge is an error (invalid input).

Each node in the trie is a byte in the table. Dot and dash each have a bit
indicating if their edge exists. The remaining bits index into a 1-based
character table (at the end of t), and a 0 “index” indicates an empty
(red) node. The nodes themselves are laid out as a binary heap in an
array: the left and right children of the node at i are found at
i*2+1 and i*2+2. No need to waste memory storing edges!

Since C sadly does not have multiple return values, I’m using the sign bit
of the return value to create a kind of sum type. A negative return value
is a state — which is why the state is negated internally before use. A
positive result is a character output. If zero, the input was invalid.
Only the initial state is non-negative (zero), which is fine since it’s,
by definition, not possible to traverse to the initial state. No c input
will produce a bad state.

In the original problem the terminals were missing. Despite being a state
machine, morse_decode is a pure function. The caller can save their
position in the trie by saving the state integer and trying different
inputs from that state.
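For illustration, here is one way a caller might drive the decoder over a whole sequence. The wrapper morse_char is a hypothetical name, and morse_decode is repeated verbatim from the listing above only so that the sketch compiles on its own.

```c
#include <assert.h>

/* Copied from the listing above. */
static int morse_decode(int state, int c)
{
    static const unsigned char t[] = {
        0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
        0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
        0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
        0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
        0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
        0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
        0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
        0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a
    };
    int v = t[-state];
    switch (c) {
    case 0x00: return v >> 2 ? t[(v >> 2) + 63] : 0;
    case 0x2e: return v &  2 ? state*2 - 1 : 0;
    case 0x2d: return v &  1 ? state*2 - 2 : 0;
    default:   return 0;
    }
}

/* Decode one complete sequence such as ".-" into its character.
 * Returns 0 if the sequence is not valid Morse code. */
static int
morse_char(const char *s)
{
    assert(s);
    int state = 0;  /* initial state */
    for (; *s; s++) {
        state = morse_decode(state, *s);
        if (state >= 0) {
            return 0;  /* not a dot or dash, or no such edge */
        }
    }
    return morse_decode(state, 0);  /* feed the terminator */
}
```

The check inside the loop matters: a zero result means invalid input, and continuing from it would silently restart at the root of the trie.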

UTF-8 decoder state machine

The classic UTF-8 decoder state machine is Bjoern Hoehrmann’s Flexible
and Economical UTF-8 Decoder. It packs the entire state machine into
a relatively small table using clever tricks. It’s easily my favorite
UTF-8 decoder.

I wanted to try my own hand at it, so I re-derived the same canonical
UTF-8 automaton:



Then I encoded this diagram directly into a much larger (2,064-byte), less
elegant table, too large to display inline here:

https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c

However, the trade-off is that the executable code is smaller, faster, and
branchless again (by accident, I swear!):

int utf8_decode(int state, long *cp, int byte)
{
    static const signed char table[8][256] = { /* ... */ };
    static const unsigned char masks[2][8] = { /* ... */ };
    int next = table[state][byte];
    *cp = (*cp << 6) | (byte & masks[!state][next&7]);
    return next;
}


Like Bjoern’s decoder, there’s a code point accumulator. The real state
machine has 1,109,950 terminal states, and many more edges and nodes. The
accumulator is an optimization to track exactly which edge was taken to
which node without having to represent such a monstrosity.

Despite the huge table I’m pretty happy with it.

Word count state machine

Here’s another state machine I came up with a while back for counting words
one Unicode code point at a time while accounting for Unicode’s various
kinds of whitespace. If your input is bytes, then plug this into the above
UTF-8 state machine to convert bytes to code points! This one uses a
switch instead of a lookup table since the table would be sparse (i.e.
let the compiler figure it out).

/* State machine counting words in a sequence of code points.
 *
 * The current word count is the absolute value of the state, so
 * the initial state is zero. Code points are fed into the state
 * machine one at a time, each call returning the next state.
 */
long word_count(long state, long codepoint)
{
    switch (codepoint) {
    case 0x0009: case 0x000a: case 0x000b: case 0x000c: case 0x000d:
    case 0x0020: case 0x0085: case 0x00a0: case 0x1680: case 0x2000:
    case 0x2001: case 0x2002: case 0x2003: case 0x2004: case 0x2005:
    case 0x2006: case 0x2007: case 0x2008: case 0x2009: case 0x200a:
    case 0x2028: case 0x2029: case 0x202f: case 0x205f: case 0x3000:
        return state < 0 ? -state : state;
    default:
        return state < 0 ? state : -1 - state;
    }
}


I’m particularly happy with the edge-triggered state transition
mechanism. The sign of the state tracks whether the “signal” is “high”
(inside of a word) or “low” (outside of a word), and so it counts rising
edges.



The counter is not technically part of the state machine — though it
eventually overflows for practical reasons, it isn’t really “finite” — but
is rather an external count of the times the state machine transitions
from low to high, which is the actual, useful output.
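A small driver shows how the sign and magnitude work together. The helper count_words is a hypothetical name for this sketch, and word_count is repeated from the listing above so the block stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Copied from the listing above. */
static long word_count(long state, long codepoint)
{
    switch (codepoint) {
    case 0x0009: case 0x000a: case 0x000b: case 0x000c: case 0x000d:
    case 0x0020: case 0x0085: case 0x00a0: case 0x1680: case 0x2000:
    case 0x2001: case 0x2002: case 0x2003: case 0x2004: case 0x2005:
    case 0x2006: case 0x2007: case 0x2008: case 0x2009: case 0x200a:
    case 0x2028: case 0x2029: case 0x202f: case 0x205f: case 0x3000:
        return state < 0 ? -state : state;
    default:
        return state < 0 ? state : -1 - state;
    }
}

/* Count the words in an array of code points. */
static long
count_words(const long *codepoints, size_t len)
{
    assert(codepoints || !len);
    long state = 0;  /* initial state: zero words, signal low */
    for (size_t i = 0; i < len; i++) {
        state = word_count(state, codepoints[i]);
    }
    return labs(state);  /* the count is the magnitude of the state */
}
```

Feeding it space, A, B, U+3000, C walks the state through 0, -1, -1, 1, -2, so the result is 2.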

Reader challenge: Find a slick, efficient way to encode all those code
points as a table rather than rely on whatever the compiler generates for
the switch (chain of branches, jump table?).

Coroutines and generators as state machines

In languages that support them, state machines can be implemented using
coroutines, including generators. I do particularly like the idea of
compiler-synthesized coroutines as state machines, though this is a
rare treat. The state is implicit in the coroutine at each yield, so the
programmer doesn’t have to manage it explicitly. (Though often that
explicit control is powerful!)

Unfortunately in practice it always feels clunky. The following implements
the word count state machine (albeit in a rather un-Pythonic way). The
generator returns the current count and is continued by sending it another
code point:

WHITESPACE = {
    0x0009, 0x000a, 0x000b, 0x000c, 0x000d,
    0x0020, 0x0085, 0x00a0, 0x1680, 0x2000,
    0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
    0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
}

def wordcount():
    count = 0
    while True:
        while True:
            # low signal
            codepoint = yield count
            if codepoint not in WHITESPACE:
                count += 1
                break
        while True:
            # high signal
            codepoint = yield count
            if codepoint in WHITESPACE:
                break


However, the generator ceremony dominates the interface, so you’d probably
want to wrap it in something nicer — at which point there’s really no
reason to use the generator in the first place:

wc = wordcount()
next(wc)  # prime the generator
wc.send(ord('A'))  # => 1
wc.send(ord(' '))  # => 1
wc.send(ord('B'))  # => 2
wc.send(ord(' '))  # => 2


Same idea in Lua, which famously has full coroutines:

local WHITESPACE = {
    [0x0009]=true,[0x000a]=true,[0x000b]=true,[0x000c]=true,
    [0x000d]=true,[0x0020]=true,[0x0085]=true,[0x00a0]=true,
    [0x1680]=true,[0x2000]=true,[0x2001]=true,[0x2002]=true,
    [0x2003]=true,[0x2004]=true,[0x2005]=true,[0x2006]=true,
    [0x2007]=true,[0x2008]=true,[0x2009]=true,[0x200a]=true,
    [0x2028]=true,[0x2029]=true,[0x202f]=true,[0x205f]=true,
    [0x3000]=true
}

function wordcount()
    local count = 0
    while true do
        while true do
            -- low signal
            local codepoint = coroutine.yield(count)
            if not WHITESPACE[codepoint] then
                count = count + 1
                break
            end
        end
        while true do
            -- high signal
            local codepoint = coroutine.yield(count)
            if WHITESPACE[codepoint] then
                break
            end
        end
    end
end


Except for initially priming the coroutine, at least coroutine.wrap()
hides the fact that it’s a coroutine.

wc = coroutine.wrap(wordcount)
wc()  -- prime the coroutine
wc(string.byte('A'))  -- => 1
wc(string.byte(' '))  -- => 1
wc(string.byte('B'))  -- => 2
wc(string.byte(' '))  -- => 2


Extra examples

Finally, a couple more examples not worth describing in detail here. First
a Unicode case folding state machine:

https://github.com/skeeto/scratch/blob/master/misc/casefold.c

It’s just an interface to do a lookup into the official case folding
table. It was an experiment, and I probably wouldn’t use it in a
real program.

Second, I’ve mentioned my UTF-7 encoder and decoder before. It’s
not obvious from the interface, but internally it’s just a state machine
for both encoder and decoder, which is what allows it to “pause”
between any pair of input/output bytes.




You might not need machine learning
2020-11-24T04:04:36Z
This article was discussed on Hacker News.

Machine learning is a trendy topic, so naturally it’s often used for
inappropriate purposes where a simpler, more efficient, and more reliable
solution suffices. The other day I saw an illustrative and fun example of
this: Neural Network Cars and Genetic Algorithms. The video
demonstrates 2D cars driven by a neural network with weights determined by
a genetic algorithm. However, the entire scheme can be replaced by a
first-degree polynomial without any loss in capability. The machine
learning part is overkill.





Above demonstrates my implementation using a polynomial to drive the cars.
My wife drew the background. There’s no path-finding; these cars are just
feeling their way along the track, “following the rails” so to speak.

My intention is not to pick on this project in particular. The likely
motivation in the first place was a desire to apply a neural network to
something. Many of my own projects are little more than a vehicle to try
something new, so I can sympathize. Though a professional setting is
different, where machine learning should be viewed with a more skeptical
eye than it’s usually given. For instance, don’t use active learning to
select sample distribution when a quasirandom sequence will do.

In the video, the car has a limited turn radius, and minimum and maximum
speeds. (I’ve retained these constraints in my own simulation.) There are
five sensors — forward, forward-diagonals, and sides — each sensing the
distance to the nearest wall. These are fed into a 3-layer neural network,
and the outputs determine throttle and steering. Sounds pretty cool!



A key feature of neural networks is that the outputs are a nonlinear
function of the inputs. However, steering a 2D car is simple enough that
a linear function is more than sufficient, and neural networks are
unnecessary. Here are my equations:

steering = C0*input1 - C0*input3
throttle = C1*input2


I only need three of the original inputs — forward for throttle, and
diagonals for steering — and the driver has just two parameters, C0 and
C1, the polynomial coefficients. Optimal values depend on the track
layout and car configuration, but for my simulation, most values above 0
and below 1 are good enough in most cases. It’s less a matter of crashing
and more about navigating the course quickly.
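Written out as C, the whole driver is a few lines. This is a sketch and not an excerpt from aidrivers.c: the struct and function names are assumptions, and input[0], input[1], input[2] stand in for input1, input2, input3 from the equations.

```c
#include <assert.h>

struct control {
    float steering;
    float throttle;
};

/* Evaluate the two first-degree polynomials for one time step.
 * input[0] and input[2] are the diagonal sensors, input[1] is the
 * forward sensor. */
static struct control
drive(float c0, float c1, const float input[3])
{
    assert(input);
    struct control r;
    r.steering = c0*input[0] - c0*input[2];
    r.throttle = c1*input[1];
    return r;
}
```

When the two diagonal distances are equal the steering term cancels to zero and the car goes straight, which matches the "following the rails" behavior.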

The lengths of the red lines below are the driver’s three inputs:




These polynomials are obviously much faster than a neural network, but
they’re also easy to understand and debug. I can confidently reason about
the entire range of possible inputs rather than worry about a trained
neural network responding strangely to untested inputs.

Instead of doing anything fancy, my program generates the coefficients at
random to explore the space. If I wanted to generate a good driver for a
course, I’d run a few thousand of these and pick the coefficients that
complete the course in the shortest time. For instance, these coefficients
make for a fast, capable driver for the course featured at the top of the
article:

C0 = 0.896336973, C1 = 0.0354805067


Many constants can complete the track, but some will be faster than
others. If I was developing a racing game using this as the AI, I’d not
just pick constants that successfully complete the track, but the ones
that do it quickly. Here’s what the spread can look like:




If you want to play around with this yourself, here’s my C source code
that implements this driving AI and generates the videos and images
above:

aidrivers.c

Racetracks are just images drawn in your favorite image editing program
using the colors documented in the source header.




Improving on QBasic's Random Number Generator
2020-11-17T02:51:23Z
This article was discussed on Hacker News.

Pixelmusement produces videos about MS-DOS games and software.
Each video ends with a short, randomly-selected listing of financial
backers. In ADG Filler #57, Kris revealed the selection process,
and it absolutely fits the channel’s core theme: a QBasic program.
His program relies on QBasic’s built-in pseudo random number generator
(PRNG). Even accounting for the platform’s limitations, the PRNG is much
poorer quality than it could be. Let’s discuss these weaknesses and figure
out how to make the selection more fair.



Kris’s program seeds the PRNG with the system clock (RANDOMIZE TIMER, a
QBasic idiom), populates an array with the backers represented as integers
(indices), continuously shuffles the list until the user presses a key, then
finally prints out a random selection from the array. Here’s a simplified
version of the program (note: QBasic comments start with apostrophe '):

CONST ntickets = 203  ' input parameter
CONST nresults = 12

RANDOMIZE TIMER

DIM tickets(0 TO ntickets - 1) AS LONG
FOR i = 0 TO ntickets - 1
    tickets(i) = i
NEXT

CLS
PRINT "Press any key to stop shuffling..."
DO
    i = INT(RND * ntickets)
    j = INT(RND * ntickets)
    SWAP tickets(i), tickets(j)
LOOP WHILE INKEY$ = ""

FOR i = 0 TO nresults - 1
    PRINT tickets(i)
NEXT


This should be readable even if you don’t know QBasic. Note: In the real
program, backers at higher tiers get multiple tickets in order to weight
the results. This is accounted for in the final loop such that nobody
appears more than once. It’s mostly irrelevant to the discussion here, so
I’ve omitted it.

The final result is ultimately a function of just three inputs:


  The system clock (TIMER)
  The total number of tickets
  The number of loop iterations until a key press


The second item has the nice property that by becoming a backer you influence
the result.

QBasic RND

QBasic’s PRNG is this 24-bit Linear Congruential Generator (LCG):

uint32_t
rnd24(uint32_t *s)
{
    *s = (*s*0xfd43fd + 0xc39ec3) & 0xffffff;
    return *s;
}


The result is the entire 24-bit state. RND divides this by 2^24 and
returns it as a single precision float so that the caller receives a value
between 0 and 1 (exclusive).

Needless to say, this is a very poor PRNG. The LCG constants are
reasonable, but the choice to limit the state to 24 bits is
strange. According to the QBasic 16-bit assembly (note: the LCG
constants listed here are wrong), the implementation is a full
32-bit multiply using 16-bit limbs, and it allocates and writes a full 32
bits when storing the state. As expected for the 8086, there was nothing
gained by using only the lower 24 bits.

To illustrate how poor it is, here’s a randogram for this PRNG,
which shows obvious structure. (This is a small slice of a 4096x4096
randogram where each of the 2^23 24-bit samples is plotted as two 12-bit
coordinates.)



Admittedly this far overtaxes the PRNG. With a 24-bit state, it’s
only good for 4,096 (2^12) outputs, after which it no longer follows the
birthday paradox: No outputs are repeated even though we should
start seeing some. However, as I’ll soon show, this doesn’t actually
matter.
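The absence of repeats can be checked directly. Since the output is the entire state, a repeat would mean the generator has cycled, and the sketch below counts how long that takes. The helper rnd24_period is a hypothetical name, and rnd24 is repeated from above so the block compiles alone.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the listing above. */
static uint32_t
rnd24(uint32_t *s)
{
    *s = (*s*0xfd43fd + 0xc39ec3) & 0xffffff;
    return *s;
}

/* Count how many outputs are produced before the state returns to
 * the given starting point. */
static uint32_t
rnd24_period(uint32_t seed)
{
    assert(seed <= 0xffffff);
    uint32_t s = seed;
    uint32_t n = 0;
    do {
        rnd24(&s);
        n++;
    } while (s != seed);
    return n;
}
```

The increment is odd and the multiplier is one more than a multiple of four, so for a power-of-two modulus the LCG has full period: every one of the 2^24 states is visited exactly once before any output repeats.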

Instead of discarding the high 8 bits — the highest quality output bits —
QBasic’s designers should have discarded the low 8 bits for the output,
turning it into a truncated 32-bit LCG:

uint32_t
rnd32(uint32_t *s)
{
    *s = *s*0xfd43fd + 0xc39ec3;
    return *s >> 8;
}


This LCG would have the same performance, but significantly better
quality. Here’s the randogram for this PRNG, and it is also heavily
overtaxed (more than 65,536, 2^16 outputs).



It’s a solid upgrade, completely for free!

QBasic RANDOMIZE

That’s not the end of our troubles. The RANDOMIZE statement accepts a
double precision (i.e. 64-bit) seed. The high 16 bits of its IEEE 754
binary representation are XORed with the next highest 16 bits. The high 16
bits of the PRNG state are set to this result. The lowest 8 bits are
preserved.

To make this clearer, here’s a C implementation, verified against QBasic
7.1:

uint32_t s;

void
randomize(double seed)
{
    uint64_t x;
    memcpy(&x, &seed, 8);
    s = (x>>24 ^ x>>40) & 0xffff00 | (s & 0xff);
}


In other words, RANDOMIZE only sets the PRNG to one of 65,536 possible
states.
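To see the collapse in action, this sketch performs the same computation as randomize but takes and returns the state instead of touching a global. That interface change and the name randomize_state are mine, purely for demonstration. Like the original it assumes an IEEE 754 binary64 double.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same bit manipulation as randomize() above, as a pure function of
 * the old state and the seed. */
static uint32_t
randomize_state(uint32_t s, double seed)
{
    uint64_t x;
    assert(sizeof(x) == sizeof(seed));
    memcpy(&x, &seed, 8);
    return (uint32_t)(((x>>24 ^ x>>40) & 0xffff00) | (s & 0xff));
}
```

The seed 1.0 has the bit pattern 0x3ff0000000000000, so starting from a zero state it yields 0x3ff000. Adding 2^-40 to the seed only changes the low 32 bits of the representation, which are ignored, so it yields the same state.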

As the final piece, here’s how RND is implemented, also verified against
QBasic 7.1:

float
rnd(float arg)
{
    if (arg < 0) {
        memcpy(&s, &arg, 4);
        s = (s & 0xffffff) + (s >> 24);
    }
    if (arg != 0.0f) {
        s = (s*0xfd43fd + 0xc39ec3) & 0xffffff;
    }
    return s / (float)0x1000000;
}


System clock seed

The TIMER function returns the single precision number of
seconds since midnight with ~55ms precision (i.e. the 18.2Hz timer
interrupt counter). This is strictly time of day, and the current date is
not part of the result, unlike, say, the unix epoch.

This means there are only 1,572,480 distinct values returned by TIMER.
That’s small even before considering that these map onto only 65,536
possible seeds with RANDOMIZE — all of which are fortunately
realizable via TIMER.

Of the three inputs to random selection, this first one is looking pretty
bad.

Loop iterations

Kris’s idea of continuously mixing the array until he presses a key makes
up for much of the QBasic PRNG weaknesses. He lets it run for over 200,000
array swaps — traversing over 2% of the PRNG’s period — and the array
itself acts like an extended PRNG state, supplementing the 24-bit RND
state.

Since iterations fly by quickly, the exact number of iterations becomes
another source of entropy. The results will be quite different if it
runs 214,600 iterations versus 273,500 iterations.

Possible improvement: Only exit the loop when a certain key is pressed. If
any other key is pressed then that input and the TIMER are mixed into
the PRNG state. Mashing the keyboard during the loop introduces more
entropy.

Replacing the PRNG

Since the built-in PRNG is so poor, we could improve the situation by
implementing a new one in QBasic itself. The challenge is that
QBasic has no unsigned integers, not even unsigned integer operators (e.g.
Java and JavaScript’s >>>), and signed overflow is a run-time error. We
can’t even re-implement QBasic’s own LCG without doing long multiplication
in software, since the intermediate result overflows its 32-bit LONG.

Popular choices under these constraints are the Park–Miller generator (as
we saw in Bash) or a lagged Fibonacci generator (as used by
Emacs, which was for a long time constrained to 29-bit integers).
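To illustrate why Park–Miller suits these constraints, here is a sketch of the "minimal standard" variant (multiplier 16807, modulus 2^31 - 1) using Schrage's method, which keeps every intermediate inside a signed 32-bit integer. The choice of this particular variant and the function name are assumptions for illustration. Nothing here claims to match what Bash does internally.

```c
#include <assert.h>
#include <stdint.h>

/* One step of x = 16807*x mod 2147483647 via Schrage's method.
 * With q = m/a = 127773 and r = m%a = 2836, neither product below
 * can exceed the range of a signed 32-bit integer. */
static int32_t
parkmiller(int32_t x)
{
    assert(x > 0 && x < 2147483647);
    int32_t hi = x / 127773;
    int32_t lo = x % 127773;
    x = 16807*lo - 2836*hi;
    if (x <= 0) {
        x += 2147483647;
    }
    return x;
}
```

The same arithmetic translates to QBasic's LONG with integer division and MOD, never triggering the overflow error. Starting from 1, the 10,000th step produces 1043618065, the check value published with the generator.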

However, I have a better idea: a PRNG based on RC4. Specifically,
my own design called Sponge4, a sponge construction
built atop RC4. In short: Mixing in more input is just a matter of running
the key schedule again. Implementing this PRNG requires just two simple
operations: modular addition over 2^8, and array swap. QBasic has a SWAP
statement, so it’s a natural fit!

Sponge4 (RC4) has much higher quality output than the 24-bit LCG, and I
can mix in more sources of entropy. With its 1,700-bit state, it can
absorb quite a bit of entropy without loss.

Learning QBasic

Until this past weekend, I had not touched QBasic for about 23 years and
had to learn it essentially from scratch. Though within a couple of hours
I probably already understood it better than I ever had. That’s in large
part because I’m far more experienced, but also probably because QBasic
tutorials are universally awful. Not surprisingly they’re written for
beginners, but they also seem to be all written by beginners, too. I
soon got the impression that the QBasic community has usually been another
case of the blind leading the blind.

There’s little direct information for experienced programmers, and even
the official documentation tends to be thin in important places. I wanted
documentation that started with the core language semantics:


  The basic types are INTEGER (int16), LONG (int32), SINGLE (float32),
  DOUBLE (float64), and two flavors of STRING, fixed-width and
  variable-width. Late versions also had incomplete support for a 64-bit,
  10,000x fixed-point CURRENCY type.

  Variables are SINGLE by default and do not need to be declared ahead of
  time. Arrays have 11 elements by default.

  Variables, constants, and functions may have a suffix if their type is
  not SINGLE: INTEGER %, LONG &, SINGLE !, DOUBLE #, STRING $,
  and CURRENCY @. For functions, this is the return type.

  Each variable type has its own namespace, i.e. i% is distinct from
  i&. Arrays are also their own namespace, i.e. i% is distinct from
  i%(0) is distinct from i&(0).

  Variables may be declared explicitly with DIM. Declaring a variable
  with DIM allows the suffix to be omitted. It also locks that name out
  of the other type namespaces, i.e. DIM i AS LONG makes any use of i%
  invalid in that scope. Though arrays and scalars can still have the same
  name even with DIM declarations.

  Numeric operations with mixed types implicitly promote like C.

  Functions and subroutines have a single, common namespace regardless of
  function suffix. As a result, the suffix can (usually) be omitted at
  function call sites. Built-in functions are special in this case.

  Despite initial appearances, QBasic is statically-typed.

  The default is pass-by-reference. Use BYVAL to pass by value.

  In array declarations, the parameter is not the size but the largest
  index. Multidimensional arrays are supported. Arrays need not be indexed
  starting at zero (e.g. (x TO y)), though this is the default.

  Strings are not arrays, but their own special thing with special
  accessor statements and functions.

  Scopes are module, subroutine, and function. “Global” variables must be
  declared with SHARED.

  Users can define custom structures with TYPE. Functions cannot return
  user-defined types and instead rely on pass-by-reference.

  A crude kind of dynamic allocation is supported with REDIM to resize
  $DYNAMIC arrays at run-time. ERASE frees allocations.


These are the semantics I wanted to know getting started. Throw in some
illustrative examples, and then it’s a tutorial for experienced
developers. (Future article perhaps?) Anyway, that’s enough to follow
along below.

Implementing Sponge4

Like RC4, I need a 256-element byte array, and two 1-byte indices, i and
j. Sponge4 also keeps a third 1-byte counter, k, to count input.

TYPE sponge4
    i AS INTEGER
    j AS INTEGER
    k AS INTEGER
    s(0 TO 255) AS INTEGER
END TYPE


QBasic doesn’t have a “byte” type. A fixed-size 256-byte string would
normally be a good match here, but since they’re not arrays, strings are
not compatible with SWAP and are not indexed efficiently. So instead I
accept some wasted space and use 16-bit integers for everything.

There are four “methods” for this structure. Three are subroutines since
they don’t return a value, but mutate the sponge. The last, squeeze,
returns the next byte as an INTEGER (%).

DECLARE SUB init (r AS sponge4)
DECLARE SUB absorb (r AS sponge4, b AS INTEGER)
DECLARE SUB absorbstop (r AS sponge4)
DECLARE FUNCTION squeeze% (r AS sponge4)


Initialization follows RC4:

SUB init (r AS sponge4)
    r.i = 0
    r.j = 0
    r.k = 0
    FOR i% = 0 TO 255
        r.s(i%) = i%
    NEXT
END SUB


Absorbing a byte means running the RC4 key schedule one step. Absorbing a
“stop” symbol, for separating inputs, transforms the state in a way that
absorbing a byte cannot.

SUB absorb (r AS sponge4, b AS INTEGER)
    r.j = (r.j + r.s(r.i) + b) MOD 256
    SWAP r.s(r.i), r.s(r.j)
    r.i = (r.i + 1) MOD 256
    r.k = (r.k + 1) MOD 256
END SUB

SUB absorbstop (r AS sponge4)
    r.j = (r.j + 1) MOD 256
END SUB


Squeezing a byte may involve mixing the state first, then it runs the RC4
generator normally.

FUNCTION squeeze% (r AS sponge4)
    IF r.k > 0 THEN
        absorbstop r
        DO WHILE r.k > 0
            absorb r, r.k
        LOOP
    END IF

    r.j = (r.j + r.i) MOD 256
    r.i = (r.i + 1) MOD 256
    SWAP r.s(r.i), r.s(r.j)
    squeeze% = r.s((r.s(r.i) + r.s(r.j)) MOD 256)
END FUNCTION


That’s the entire generator in QBasic! A couple more helper functions will
be useful, though. One absorbs entire strings, and the second emits 24-bit
results.

SUB absorbstr (r AS sponge4, s AS STRING)
    FOR i% = 1 TO LEN(s)
        absorb r, ASC(MID$(s, i%))
    NEXT
END SUB

FUNCTION squeeze24& (r AS sponge4)
    b0& = squeeze%(r)
    b1& = squeeze%(r)
    b2& = squeeze%(r)
    squeeze24& = b2& * &H10000 + b1& * &H100 + b0&
END FUNCTION


QBasic doesn’t have bit-shift operations, so we must make do with
multiplication. The &H is hexadecimal notation.
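
For comparison, here is a rough C sketch of the same sponge. It is a line-by-line translation for illustration, not part of the original program, and the sponge4_ names are invented for this sketch. Unsigned 8-bit fields make the MOD 256 implicit, and real shifts replace the multiplications.

```c
#include <assert.h>
#include <stdint.h>

struct sponge4 {
    uint8_t i, j, k;   /* wrap modulo 256 automatically */
    uint8_t s[256];
};

void sponge4_init(struct sponge4 *r)
{
    r->i = r->j = r->k = 0;
    for (int n = 0; n < 256; n++) {
        r->s[n] = (uint8_t)n;
    }
}

void sponge4_absorb(struct sponge4 *r, uint8_t b)
{
    r->j = (uint8_t)(r->j + r->s[r->i] + b);
    uint8_t t = r->s[r->i];
    r->s[r->i] = r->s[r->j];
    r->s[r->j] = t;
    r->i++;
    r->k++;
}

void sponge4_absorbstop(struct sponge4 *r)
{
    r->j++;
}

void sponge4_absorbstr(struct sponge4 *r, const char *s)
{
    for (; *s; s++) {
        sponge4_absorb(r, (uint8_t)*s);
    }
}

uint8_t sponge4_squeeze(struct sponge4 *r)
{
    if (r->k > 0) {
        sponge4_absorbstop(r);
        while (r->k > 0) {
            sponge4_absorb(r, r->k);
        }
        assert(r->k == 0);  /* input counter has wrapped back to zero */
    }
    r->j = (uint8_t)(r->j + r->i);
    r->i++;
    uint8_t t = r->s[r->i];
    r->s[r->i] = r->s[r->j];
    r->s[r->j] = t;
    return r->s[(uint8_t)(r->s[r->i] + r->s[r->j])];
}

/* Little-endian 24-bit result, assembled with shifts. */
uint32_t sponge4_squeeze24(struct sponge4 *r)
{
    uint32_t b0 = sponge4_squeeze(r);
    uint32_t b1 = sponge4_squeeze(r);
    uint32_t b2 = sponge4_squeeze(r);
    return b2 << 16 | b1 << 8 | b0;
}
```

Two sponges fed the same bytes produce the same output, which is a quick way to compare a port against the QBasic version.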

Putting the sponge to use

One of the problems with the original program was that only the time of
day served as the seed. Even were it mixed better, if we run the program at
exactly the same instant on two different days, we get the same seed. The
DATE$ function returns the current date, which we can absorb into the
sponge to make the whole date part of the input.

DIM sponge AS sponge4
init sponge
absorbstr sponge, DATE$
absorbstr sponge, MKS$(TIMER)
absorbstr sponge, MKI$(ntickets)


I follow this up with the timer. It’s converted to a string with MKS$,
which returns the little-endian, single precision binary representation as
a 4-byte string. MKI$ does the same for INTEGER, as a 2-byte string.
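
To make the byte layout concrete, here is a small C sketch of what those two conversions produce. It assumes IEEE 754 single precision floats, and the function names are hypothetical stand-ins rather than anything from QBasic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 16-bit integer as 2 little-endian bytes, like MKI$. */
void put_i16le(uint8_t out[2], int16_t v)
{
    uint16_t u = (uint16_t)v;
    out[0] = (uint8_t)(u & 0xff);
    out[1] = (uint8_t)(u >> 8);
}

/* Store a single precision float as 4 little-endian bytes, like MKS$. */
void put_f32le(uint8_t out[4], float v)
{
    uint32_t u;
    assert(sizeof(u) == sizeof(v));  /* assumes a 32-bit float */
    memcpy(&u, &v, sizeof(u));       /* copy the bit pattern, no aliasing tricks */
    for (int i = 0; i < 4; i++) {
        out[i] = (uint8_t)(u >> (8 * i));
    }
}
```

For example, 0x1234 becomes the bytes 34 12, and 1.0 becomes 00 00 80 3f.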

Another problem with the original program was bias: Multiplying RND by a
constant, then truncating the result to an integer, is not uniform in
most cases. Some numbers are selected slightly more often than others
because 2^24 inputs cannot map uniformly onto, say, 10 outputs. With all
the shuffling in the original it probably doesn’t make a practical
difference, but I’d like to avoid it.

In my program I account for it by generating another number if it happens
to fall into that extra “tail” part of the input distribution (very
unlikely for small ntickets). The squeezen function uniformly
generates a number from 0 up to, but not including, n.

FUNCTION squeezen% (r AS sponge4, n AS INTEGER)
    DO
        x& = squeeze24&(r) - (&H1000000 MOD n)
    LOOP WHILE x& < 0
    squeezen% = x& MOD n
END FUNCTION
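
The same rejection step can be sketched in C. This is an illustration under assumptions: reduce24 and uniform_below are invented names, and next24 is a hypothetical callback standing in for squeeze24&. The first 2^24 mod n raw values form the short tail and are discarded, so the remaining count is an exact multiple of n.

```c
#include <assert.h>
#include <stdint.h>

#define RANGE24 (1L << 24)  /* number of distinct 24-bit inputs */

/* Map a raw 24-bit value to 0..n-1, or return -1 meaning "draw again". */
long reduce24(long raw, int n)
{
    assert(n > 0 && raw >= 0 && raw < RANGE24);
    long x = raw - RANGE24 % n;  /* % binds tighter than -, as MOD does in QBasic */
    return x < 0 ? -1 : x % n;
}

/* Keep drawing until a value survives the rejection test. */
int uniform_below(long (*next24)(void *), void *ctx, int n)
{
    long x;
    do {
        x = reduce24(next24(ctx), n);
    } while (x < 0);
    return (int)x;
}
```

With n = 10 the tail is 6 values wide, leaving 16,777,210 accepted inputs, or exactly 1,677,721 per output.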


Finally a Fisher–Yates shuffle, then print the first N elements:

FOR i% = ntickets - 1 TO 1 STEP -1
    j% = squeezen%(sponge, i% + 1)
    SWAP tickets(i%), tickets(j%)
NEXT

FOR i% = 1 TO nresults
    PRINT tickets(i%)
NEXT
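
For reference, the same downward Fisher–Yates loop as a C sketch. The below callback is a hypothetical stand-in for squeezen%, and always_zero exists only to make the demonstration deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* Shuffle a[0..len-1] in place. below(n, ctx) must return a uniform
 * integer in 0..n-1. */
void shuffle(int *a, size_t len, size_t (*below)(size_t n, void *ctx), void *ctx)
{
    for (size_t i = len; i > 1; i--) {
        size_t j = below(i, ctx);  /* pick from the not-yet-fixed prefix */
        assert(j < i);
        int t = a[i - 1];
        a[i - 1] = a[j];
        a[j] = t;
    }
}

/* Degenerate "generator" for demonstration: always picks index 0. */
size_t always_zero(size_t n, void *ctx)
{
    (void)n;
    (void)ctx;
    return 0;
}
```

Each element is swapped with a uniformly chosen element at or below its own position, which is what makes every permutation equally likely when below is unbiased.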


Though if you really love Kris’s loop idea:

PRINT "Press Esc to finish, any other key for entropy..."
DO
    c& = c& + 1
    LOCATE 2, 1
    PRINT "cycles ="; c&; "; keys ="; k%

    FOR i% = ntickets - 1 TO 1 STEP -1
        j% = squeezen%(sponge, i% + 1)
        SWAP tickets(i%), tickets(j%)
    NEXT

    k$ = INKEY$
    IF k$ = CHR$(27) THEN
        EXIT DO
    ELSEIF k$ <> "" THEN
        k% = k% + 1
        absorbstr sponge, k$
    END IF
    absorbstr sponge, MKS$(TIMER)
LOOP


If you want to try it out for yourself in, say, DOSBox, here’s the full
source: sponge4.bas




I Solved British Square
2020-10-19T19:32:52Z
Update: I solved another game using essentially the same
technique.

British Square is a 1978 abstract strategy board game which I
recently discovered from a YouTube video. It’s well-suited to play
by pencil-and-paper, so my wife and I played a few rounds to try it out.
Curious about strategies, I searched online for analysis and found
nothing whatsoever, meaning I’d have to discover strategies for myself.
This is exactly the sort of problem that nerd snipes me, and so I
sank a couple of evenings into building an analysis engine in C, enough to
fully solve the game and play perfectly.

Repository: British Square Analysis Engine
(and prebuilt binaries)






The game is played on a 5-by-5 grid with two players taking turns
placing pieces of their color. Pieces may not be placed on tiles
4-adjacent to an opposing piece, and as a special rule, the first player
may not play the center tile on the first turn. Players pass when they
have no legal moves, and the game ends when both players pass. The score
is the difference between the piece counts for each player.

In the default configuration, my engine takes a few seconds to explore
the full game tree, then presents the minimax values for the
current game state along with the list of perfect moves. The UI allows
manually exploring down the game tree. It’s intended for analysis, but
there’s enough UI present to “play” against the AI should you so wish.
For some of my analysis I made small modifications to the program to
print or count game states matching certain conditions.

Game analysis

Not accounting for symmetries, there are 4,233,789,642,926,592 possible
playouts. In these playouts, the first player wins 2,179,847,574,830,592
(~51%), the second player wins 1,174,071,341,606,400 (~28%), and the
remaining 879,870,726,489,600 (~21%) are ties. It’s immediately obvious
the first player has a huge advantage.

Accounting for symmetries, there are 8,659,987 total game states. Of
these, 6,955 are terminal states, of which the first player wins 3,599
(~52%) and the second player wins 2,506 (~36%). This small number of
states is what allows the engine to fully explore the game tree in a few
seconds.

Most importantly: The first player can always win by two points. In
other words, it’s not like Tic-Tac-Toe where perfect play by both
players results in a tie. Due to the two-point margin, the first player
also has more room for mistakes and usually wins even without perfect
play. There are fewer opportunities to blunder, and a single blunder
usually results in a lower win score. The second player has a narrow
lane of perfect play, making it easy to blunder.

Below is the minimax analysis for the first player’s options. The number
is the first player’s score given perfect play from that point — i.e.
perfect play starts on the tiles marked “2”, and the tiles marked “0”
are blunders that lead to ties.

11111
12021
10-01
12021
11111


The special center rule probably exists to reduce the first player’s
obvious advantage, but in practice it makes little difference. Without
the rule, the first player has an additional (fifth) branch for a win by
two points:

11111
12021
10201
12021
11111


Improved alternative special rule: Bias the score by two in favor of
the second player. This fully eliminates the first player’s advantage,
perfect play by both sides results in a tie, and both players have a
narrow lane of perfect play.

The four tie openers are interesting because the reasoning does not
require computer assistance. If the first player opens on any of those
tiles, the second player can mirror each of the first player’s moves,
guaranteeing a tie. Note: The first player can still make mistakes that
result in a second player win if the second player knows when to stop
mirroring.

One of my goals was to develop a heuristic so that even human players
can play perfectly from memory, as in Tic-Tac-Toe. Unfortunately I was
not able to develop any such heuristic, though I was able to prove
that a greedy heuristic — always claim as much territory as possible —
is often incorrect and, in some cases, leads to blunders.

Engine implementation

As I’ve done before, my engine represents the game using
bitboards. Each player has a 25-bit bitboard representing their
pieces. To make move validation more efficient, it also sometimes tracks
a “mask” bitboard where invalid moves have been masked. Updating all
bitboards is cheap (place(), mask()), as is validating moves
against the mask (valid()).
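
The adjacency rule itself is easy to sketch on a single 25-bit board. The following is a simplified illustration, not the engine's place(), mask(), or valid(): neighbors and can_place are hypothetical helpers, tiles are numbered row-major from the top-left, and the first-turn center rule is left out.

```c
#include <assert.h>
#include <stdint.h>

/* 25-bit set of tiles 4-adjacent to tile t. */
uint32_t neighbors(int t)
{
    assert(t >= 0 && t < 25);
    int r = t / 5, c = t % 5;
    uint32_t m = 0;
    if (r > 0) m |= UINT32_C(1) << (t - 5);
    if (r < 4) m |= UINT32_C(1) << (t + 5);
    if (c > 0) m |= UINT32_C(1) << (t - 1);
    if (c < 4) m |= UINT32_C(1) << (t + 1);
    return m;
}

/* May the player owning `own` place on tile t, given the opponent's `opp`? */
int can_place(uint32_t own, uint32_t opp, int t)
{
    assert(t >= 0 && t < 25);
    uint32_t bit = UINT32_C(1) << t;
    if ((own | opp) & bit) return 0;  /* tile already occupied */
    return !(neighbors(t) & opp);     /* not beside an opposing piece */
}
```

A cached mask amounts to OR-ing together the occupied tiles and the neighbors of every opposing piece, so that validation becomes a single AND.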

The longest possible game is 32 moves. This would just fit in 5 bits,
except that I needed a special “invalid” turn, making it a total of 33
values. So I use 6 bits to store the turn counter.

Besides generally being unnecessary, the validation masks can be derived
from the main bitboards, so I don’t need to store them in the game tree.
That means I need 25 bits per player, and 6 bits for the counter: 56
bits total. I pack these into a 64-bit integer. The first player’s
bitboard goes in the bottom 25 bits, the second player in the next 25
bits, and the turn counter in the next 6 bits. The turn counter
starts at 1, so an all zero state is invalid. I exploit this in the hash
table so that zeroed slots are empty (more on this later).

In other words, the empty state is 0x4000000000000 (INIT) and zero
is the null (invalid) state.
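
As a sanity check on that layout, here is a minimal packing sketch. The helper names are invented for illustration; only the bit positions come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define BOARD_BITS 25
#define BOARD_MASK ((UINT64_C(1) << BOARD_BITS) - 1)

/* Pack two 25-bit boards and a 6-bit turn counter into the low 56 bits. */
uint64_t pack(uint32_t p1, uint32_t p2, int turn)
{
    assert(turn >= 0 && turn < 64);
    return (uint64_t)(p1 & BOARD_MASK)
         | (uint64_t)(p2 & BOARD_MASK) << BOARD_BITS
         | (uint64_t)turn << (2 * BOARD_BITS);
}

uint32_t board1(uint64_t s)  { return (uint32_t)(s & BOARD_MASK); }
uint32_t board2(uint64_t s)  { return (uint32_t)(s >> BOARD_BITS & BOARD_MASK); }
int      turn_of(uint64_t s) { return (int)(s >> (2 * BOARD_BITS) & 0x3f); }
```

Packing two empty boards with turn 1 sets only bit 50, which is the INIT constant 0x4000000000000.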

Since the state is so small, rather than passing a pointer to a state to
be acted upon, bitboard functions return a new bitboard with the
requested changes… functional style.

    // Compute bitboard+mask where first play is tile 6
    // -----
    // -X---
    // -----
    // -----
    // -----
    uint64_t b = INIT;
    uint64_t m = INIT;
    b = place(b, 6);
    m = mask(m, 6);


Minimax costs

The engine uses minimax to propagate information up the tree. Since the
search extends to the very bottom of the tree, the minimax “heuristic”
evaluation function is the actual score, not an approximation, which is
why it’s able to play perfectly.

When I’ve used minimax before, I built an actual tree data
structure in memory, linking states by pointer / reference. In this
engine there is no such linkage, and instead the links are computed
dynamically via the validation masks. Storing the pointers is more
expensive than computing their equivalents on the fly, so I don’t store
them. Therefore my game tree only requires 56 bits per node — or 64
bits in practice since I’m using a 64-bit integer. With only 8,659,987
nodes to store, that’s a mere 66MiB of memory! This analysis could have
easily been done on commodity hardware two decades ago.

What about the minimax values? Game scores range from -10 to 11: 22
distinct values. (That the first player can score up to 11 and the
second player at most 10 is another advantage to going first.) That’s 5
bits of information. However, I didn’t have this information up front,
and so I assumed a range from -25 to 25, which requires 6 bits.

There are still 8 spare bits left in the 64-bit integer, so I use 6 of
them for the minimax score. Rather than worry about two’s complement, I
bias the score to eliminate negative values before storing it. So the
minimax score rides along for free above the state bits.
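
A sketch of how such a biased score might be stored and recovered. The bias of 25 and the shift of 56 are assumptions chosen to match the assumed range and the 56 state bits; the engine's actual constants may differ.

```c
#include <assert.h>
#include <stdint.h>

#define STATE_MASK  UINT64_C(0xffffffffffffff)  /* low 56 bits: boards + turn */
#define SCORE_SHIFT 56
#define SCORE_BIAS  25                          /* assumed: maps -25..25 to 0..50 */

/* Attach a minimax score to a state, replacing any previous score. */
uint64_t with_score(uint64_t state, int score)
{
    assert(score >= -SCORE_BIAS && score <= SCORE_BIAS);
    return (state & STATE_MASK) | (uint64_t)(score + SCORE_BIAS) << SCORE_SHIFT;
}

/* Recover the signed score from a stored slot. */
int score_of(uint64_t slot)
{
    return (int)(slot >> SCORE_SHIFT & 0x3f) - SCORE_BIAS;
}
```

Because the stored field is never negative, no sign extension is needed when reading it back.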

Hash table (memoization)

The vast majority of game tree branches are redundant. Even without
taking symmetries into account, nearly all states are reachable from
multiple branches. Exploring all these redundant branches would take
centuries. If I run into a state I’ve seen before, I don’t want to
recompute it.

Once I’ve computed a result, I store it in a hash table so that I can
find it later. Since the state is just a 64-bit integer, I use an
integer hash function to compute a starting index from which to
linearly probe an open addressing hash table. The entire hash table
implementation is literally a dozen lines of code:

uint64_t *
lookup(uint64_t bitboard)
{
    static uint64_t table[N];
    uint64_t mask = 0xffffffffffffff; // sans minimax
    uint64_t hash = bitboard;
    hash *= 0xcca1cee435c5048f;
    hash ^= hash >> 32;
    for (size_t i = hash % N; ; i = (i + 1) % N) {
        if (!table[i] || (table[i] & mask) == bitboard) {
            return &table[i];
        }
    }
}


If the bitboard is not found, it returns a pointer to the (zero-valued)
slot where it should go so that the caller can fill it in.
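
Here is a sketch of how a caller might use that pointer for memoization. The probing loop mirrors lookup() above; note the parentheses around the mask test, since == binds tighter than & in C. The table size, recall, and remember are assumptions made for this demonstration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 1024                                 /* demo size only */
#define STATE_MASK UINT64_C(0xffffffffffffff)  /* low 56 bits, sans minimax */

uint64_t *lookup(uint64_t bitboard)
{
    static uint64_t table[N];
    uint64_t hash = bitboard;
    hash *= UINT64_C(0xcca1cee435c5048f);
    hash ^= hash >> 32;
    for (size_t i = hash % N; ; i = (i + 1) % N) {
        if (!table[i] || (table[i] & STATE_MASK) == bitboard) {
            return &table[i];
        }
    }
}

/* Return 1 and the stored high bits if the state is memoized, else 0. */
int recall(uint64_t state, unsigned *extra)
{
    uint64_t *slot = lookup(state);
    if (!*slot) return 0;  /* empty slot: never seen */
    *extra = (unsigned)(*slot >> 56);
    return 1;
}

/* Record a state with 8 bits of payload above the state bits. */
void remember(uint64_t state, unsigned extra)
{
    assert(state != 0 && (state & ~STATE_MASK) == 0 && extra < 256);
    *lookup(state) = state | (uint64_t)extra << 56;
}
```

This only works because a valid state is never zero, so a zeroed slot unambiguously means "empty". The probe loop also assumes the table never fills up.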

Canonicalization

Memoization eliminates nearly all redundancy, but there’s still a major
optimization left. Many states are equivalent by rotation or reflection.
Taking that into account, about 7/8ths of the remaining work can still be
eliminated.

Multiple different states that are identical by symmetry must be
somehow “folded” into a single, canonical state to represent them all.
I do this by visiting all 8 rotations and reflections and choosing the
one with the smallest 64-bit integer representation.

I only need two operations to visit all 8 symmetries, and I chose
transpose (flip around the diagonal) and vertical flip. Alternating
between these operations visits each symmetry. Since they’re bitboards,
transforms can be implemented using fancy bit-twiddling hacks.
Chess boards, with their power-of-two dimensions, have useful properties
which these British Square boards lack, so this is the best I could come
up with:

// Transpose a board or mask (flip along the diagonal).
uint64_t
transpose(uint64_t b)
{
    return ((b >> 16) & 0x00000020000010) |
           ((b >> 12) & 0x00000410000208) |
           ((b >>  8) & 0x00008208004104) |
           ((b >>  4) & 0x00104104082082) |
           ((b >>  0) & 0xfe082083041041) |
           ((b <<  4) & 0x01041040820820) |
           ((b <<  8) & 0x00820800410400) |
           ((b << 12) & 0x00410000208000) |
           ((b << 16) & 0x00200000100000);
}

// Flip a board or mask vertically.
uint64_t
flipv(uint64_t b)
{
    return ((b >> 20) & 0x0000003e00001f) |
           ((b >> 10) & 0x000007c00003e0) |
           ((b >>  0) & 0xfc00f800007c00) |
           ((b << 10) & 0x001f00000f8000) |
           ((b << 20) & 0x03e00001f00000);
}


These transform both players’ bitboards in parallel while leaving the
turn counter intact. The logic here is quite simple: Shift the bitboard
a little bit at a time while using a mask to deposit bits in their new
home once they’re lined up. It’s like a coin sorter. Vertical flip is
analogous to byte-swapping, though with 5-bit “bytes”.

Canonicalizing a bitboard now looks like this:

uint64_t
canonicalize(uint64_t b)
{
    uint64_t c = b;
    b = transpose(b); c = c < b ? c : b;
    b = flipv(b);     c = c < b ? c : b;
    b = transpose(b); c = c < b ? c : b;
    b = flipv(b);     c = c < b ? c : b;
    b = transpose(b); c = c < b ? c : b;
    b = flipv(b);     c = c < b ? c : b;
    b = transpose(b); c = c < b ? c : b;
    return c;
}


Callers need only use canonicalize() on values they pass to lookup()
or store in the table (via the returned pointer).
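
When writing bit tricks like these, a slow loop-based version is useful for cross-checking. The sketch below handles a single 25-bit board only, whereas the engine transforms both boards at once inside the 64-bit word; the _ref names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Row-major board: bit index = 5*row + col. */
uint32_t transpose_ref(uint32_t b)
{
    uint32_t out = 0;
    for (int r = 0; r < 5; r++)
        for (int c = 0; c < 5; c++)
            if (b >> (5 * r + c) & 1)
                out |= UINT32_C(1) << (5 * c + r);
    return out;
}

uint32_t flipv_ref(uint32_t b)
{
    uint32_t out = 0;
    for (int r = 0; r < 5; r++)
        out |= (b >> (5 * r) & 0x1f) << (5 * (4 - r));
    return out;
}

/* Smallest of the 8 symmetric images, visited by alternating the two. */
uint32_t canonical_ref(uint32_t b)
{
    assert(b < (UINT32_C(1) << 25));
    uint32_t best = b;
    for (int n = 0; n < 7; n++) {
        b = n % 2 ? flipv_ref(b) : transpose_ref(b);
        if (b < best) best = b;
    }
    return best;
}
```

All four corner tiles fold to the same canonical board, as do the four tiles diagonally adjacent to the center.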

Developing a heuristic

If you can come up with a perfect play heuristic, especially one that
can be reasonably performed by humans, I’d like to hear it. My engine
has a built-in heuristic tester, so I can test it against perfect play
at all possible game positions to check that it actually works. It’s
currently programmed to test the greedy heuristic and print out the
millions of cases where it fails. Even a heuristic that fails in only a
small number of cases would be pretty reasonable.




w64devkit: (Almost) Everything You Need
2020-09-25T00:04:11Z
This article was discussed on Hacker News.

This past May I put together my own C and C++ development
distribution for Windows called w64devkit. The entire
release weighs under 80MB and requires no installation. Unzip and run it
in-place anywhere. It’s also entirely offline. It will never
automatically update, or even touch the network. In mere seconds any
Windows system can become a reliable development machine. (To further
increase reliability, disconnect it from the internet.) Despite
its simple nature and small packaging, w64devkit is almost everything
you need to develop any professional desktop application, from a
command line utility to a AAA game.



I don’t mean this in some useless Turing-complete sense, but in
a practical, get-stuff-done sense. It’s much more a matter of
know-how than of tools or libraries. So then what is this “almost”
about?


  
    The distribution does not have WinAPI documentation. It’s notoriously
difficult to obtain and, besides, unfriendly to redistribution.
It’s essential for interfacing with the operating system and difficult
to work without. Even a dead tree reference book would suffice.
  
  
    Depending on what you’re building, you may still need specialized
tools. For instance, game development requires tools for editing art
assets.
  
  
    There is no formal source control system. Git is excluded per the
issues noted in the announcement, and my next option, Quilt,
has similar limitations. However, diff and patch are included,
and are sufficient for a kind of old-school, patch-based source
control. I’ve used it successfully when dogfooding w64devkit in a
fresh Windows installation.
  


Everything else

As I said in my announcement, w64devkit includes a powerful text editor
that fulfills all text editing needs, from code to documentation. The
editor includes a tutorial (vimtutor) and complete, built-in manual
(:help) in case you’re not yet familiar with it.

What about navigation? Use the included ctags to generate a
tags database (ctags -R), then jump instantly to any
definition at any time. No need for that Language Server Protocol
rubbish. This does not mean you must laboriously type identifiers
as you work. Use built-in completion!

Build system? That’s also covered, via a Windows-aware unix-like
environment that includes make. Learning how to use it is a
breeze. Software is by its nature unavoidably complicated, so don’t
make it more complicated than necessary.

What about debugging? Use the debugger, GDB. Performance problems? Use
the profiler, gprof. Inspect compiler output either by asking for it
(-S) or via the disassembler (objdump -d). No need to go online for
the Godbolt Compiler Explorer, as slick as it is. If the compiler
output is insufficient, use SIMD intrinsics. In the worst case
there are two different assemblers available. Real time graphics? Use an
operating system API like OpenGL, DirectX, or Vulkan.
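
As a small illustration of the intrinsics route, here is a sketch that adds two float arrays four lanes at a time with SSE, falling back to plain C elsewhere. The function name is made up for this example. Compiling it with -S or running objdump -d on the result shows whether the packed addps instruction was emitted.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* dst[i] = a[i] + b[i] for i in 0..n-1. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst && a && b);
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned loads are fine here */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; i++) {  /* scalar tail, or everything without SSE */
        dst[i] = a[i] + b[i];
    }
}
```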

w64devkit really is nearly everything you need in a single, no
nonsense, fully-offline package! It’s difficult to emphasize this
point as much as I’d like. When interacting with the broader software
ecosystem, I often despair that software development has lost its
way. This distribution is my way of carving out an escape from some
of the insanity. As a C and C++ toolchain, w64devkit by default produces
lean, sane, trivially-distributable, offline-friendly artifacts. All
runtime components in the distribution are static link only,
so no need to distribute DLLs with your application either.

Customize the distribution, own the toolchain

While most users would likely stick to my published releases, building
w64devkit is a two-step process with a single build dependency, Docker.
Anyone can easily customize it for their own needs. Don’t care about
C++? Toss it to shave 20% off the distribution. Need to tune the runtime
for a specific microarchitecture? Tweak the compiler flags.

One of the intended strengths of open source is users can modify
software to suit their needs. With w64devkit, you own the toolchain
itself. It is one of your dependencies after all. Unfortunately
the build initially requires an internet connection even when working
from source tarballs, but at least it’s a one-time event.

If you choose to take on dependencies, and you build those
dependencies using w64devkit, all the better! You can tweak them to your
needs and choose precisely how they’re built. You won’t be relying on
the goodwill of internet randos nor the generosity of a free package
registry.

Customization examples

Building existing software using w64devkit is probably easier than
expected, particularly since much of it has already been “ported” to
MinGW and Mingw-w64. Just don’t bother with GNU Autoconf configure
scripts. They never work in w64devkit despite having everything they
technically need. So other than that, here’s a demonstration of building
some popular software.

One of my coworkers uses his own version of PuTTY
patched to play more nicely with Emacs. If you wanted to do the same,
grab the source tarball, unpack it using the provided tools, then in the
unpacked source:

$ make -C windows -f Makefile.mgw


You’ll have a custom-built putty.exe, as well as the other tools. If you
have any patches, apply those first!

Would you like to embed an extension language in your application? Lua
is a solid choice, in part because it’s such a well-behaved dependency.
After unpacking the source tarball:

$ make PLAT=mingw


This produces a complete Lua compiler, runtime, and library. It’s not
even necessary to use the Makefile, as it’s nearly as simple as “cc
*.c” — painless to integrate or embed into any project.

Do you enjoy NetHack? Perhaps you’d like to try a few of the custom
patches. This one is a little more complicated, but I was able to
build NetHack 3.6.6 like so:

$ sys/winnt/nhsetup.bat
$ make -C src -f Makefile.gcc cc="cc -fcommon" link="cc"


NetHack has a bug necessitating -fcommon. If you have any
patches, apply them with patch before the last step. I won’t belabor it
here, but with just a little more effort I was also able to produce a
NetHack binary with curses support via PDCurses — statically-linked
of course.

How about my archive encryption tool, Enchive? The one that
even works with 16-bit DOS compilers. It requires nothing special
at all!

$ make


w64devkit can also host parts of itself: Universal Ctags, Vim, and NASM.
This means you can modify and recompile these tools without going
through the Docker build. Sadly busybox-w32 cannot host itself,
though it’s close. I’d love it if w64devkit could fully host itself, and
so Docker — and therefore an internet connection and such — would only
be needed to bootstrap, but unfortunately that’s not realistic given the
state of the GNU components.

Offline and reliable

Software development has increasingly become dependent on a constant
internet connection. Robust, offline tooling and development is
undervalued.

Consider: Does your current project depend on an external service? Do
you pay for this service to ensure that it remains up? If you pull your
dependencies from a repository, how much do you trust those who maintain
the packages? Do you even know their names? What would be your
project’s fate if that service went down permanently? It will someday,
though hopefully only after your project is dead and forgotten. If you
have the ability to work permanently offline, then you already have
happy answers to all these questions.