null program

Concurrent, atomic MSI hash tables

2026-05-06T02:01:17Z

Readers will be familiar with Mask-Step-Index (MSI) hash tables, a technique for building fast, open-addressed hash tables in a dozen lines of code. If multiple threads or processes access an MSI table with at least one still inserting elements, care must be taken to avoid data races. This article will show how to add atomic operations to MSI tables in order to support different concurrency constraints.

Let’s begin with the simplest case: An integer hash set, no deletions, only one insert thread (single producer), and consumers do not care about insert order. That is, the producer inserts A then B, but consumers may observe B in the table before A. Suppose this is the hash table in the single-threaded case:

int32_t *lookup(int32_t key, int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        if (!table[index] || table[index]==key) {
            return table + index;
        }
    }
}

Keys must be non-zero, and tables are zero-initialized. Usage example:

    // Initialization
    enum { exp = 8 };
    int32_t table[1<<8] = {};

    // Producer
    for (int i = 0; i < nkeys; i++) {
        *lookup(keys[i], table, exp) = keys[i];
    }

    // Consumer
    int32_t key = 1234;
    bool present = *lookup(key, table, exp);

The only problem is the data race on table slots. Since consumers can tolerate out-of-order insertions, ordering does not matter and relaxed atomics eliminate the data race. Insert and query now have different requirements, so it makes sense to distinguish them. Starting with the latter:

bool contains(int32_t key, int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        int32_t k = __atomic_load_n(table+index, __ATOMIC_RELAXED);
        if (!k) {
            return false;
        } else if (k == key) {
            return true;
        }
    }
}

Note how all elements are accessed by atomic loads, as a producer may store to any slot at any time. Now producers:

bool insert(int32_t key, int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        if (!table[index]) {
            __atomic_store_n(table+index, key, __ATOMIC_RELAXED);
            return true;
        } else if (table[index] == key) {
            return false;
        }
    }
}

This function may load elements non-atomically because there’s only one producer: the current thread. This idea could not be expressed were the type system involved, e.g. _Atomic, but GCC atomics do not require such special qualifiers. Stores on the other hand are concurrent with consumers, requiring an atomic store. Single-producer, multiple-consumer (SPMC) usage is nearly identical to the single-threaded case:

    // Producer
    for (int i = 0; i < nkeys; i++) {
        insert(keys[i], table, exp);
    }

    // Consumer
    int32_t key = 1234;
    bool present = contains(key, table, exp);
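To see the SPMC pieces working together, here is a minimal, self-contained sketch. It assumes POSIX threads, and the names spmc_table, spmc_producer, and spmc_demo, along with the table size and key range, are made up for illustration. The insert and contains functions are the ones above, with an assertion added for the non-zero key requirement.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

enum { SPMC_EXP = 10, SPMC_NKEYS = 500 };  // illustrative sizes
static int32_t spmc_table[1<<SPMC_EXP];    // zero-initialized

// Consumer side: every slot is read with a relaxed atomic load.
bool contains(int32_t key, int32_t *table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        int32_t k = __atomic_load_n(table+index, __ATOMIC_RELAXED);
        if (!k) {
            return false;
        } else if (k == key) {
            return true;
        }
    }
}

// Producer side: plain loads (sole writer), relaxed atomic store.
bool insert(int32_t key, int32_t *table, int exp)
{
    assert(key != 0);  // zero marks an empty slot
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        if (!table[index]) {
            __atomic_store_n(table+index, key, __ATOMIC_RELAXED);
            return true;
        } else if (table[index] == key) {
            return false;
        }
    }
}

static void *spmc_producer(void *arg)
{
    (void)arg;
    for (int32_t key = 1; key <= SPMC_NKEYS; key++) {
        insert(key, spmc_table, SPMC_EXP);
    }
    return 0;
}

// Runs one producer while the calling thread acts as a consumer.
// Returns the number of keys visible after the producer finishes,
// or -1 if something went wrong.
int spmc_demo(void)
{
    pthread_t thr;
    if (pthread_create(&thr, 0, spmc_producer, 0)) {
        return -1;
    }
    int early = 0;  // concurrent queries may see any subset of keys
    for (int32_t key = 1; key <= SPMC_NKEYS; key++) {
        early += contains(key, spmc_table, SPMC_EXP);
    }
    pthread_join(thr, 0);
    int final = 0;
    for (int32_t key = 1; key <= SPMC_NKEYS; key++) {
        final += contains(key, spmc_table, SPMC_EXP);
    }
    return early <= final ? final : -1;
}
```

The concurrent pass may observe anywhere from none to all of the keys, in no particular order. Only after joining the producer is the count guaranteed to be complete.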

A concurrent integer hash table is contrived and unrealistic. In a real program a key likely carries some broader semantic meaning. For example, if that “integer” is actually a memory offset known as a pointer, then it points at some object, and it is important that stores to that object happen before consumers observe the pointer in the table:

bool   insert(Thing *thing, Thing **table, int exp)
Thing *lookup(Key key, Thing **table, int exp)

Where usage might look like:

    // Producer
    for (int i = 0; i < nthings; i++) {
        things[i].key = ...;  // update/init object
        insert(things+i, table, exp);
    }

    // Consumer
    bool present = !!lookup((Key){...}, table, exp);

In this case relaxed atomics are insufficient. Updates to the inserted object may be reordered after the insertion, and consumers will race on those updates. In this case we upgrade to acquire-release:

Thing *lookup(Key key, Thing **table, int exp)
{
    // ...
    for (...) {
        // ...
        Thing *thing = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);
        if (!thing || thing->key==key) {
            return thing;
        }
    }
}

bool insert(Thing *thing, Thing **table, int exp)
{
    // ...
    for (...) {
        // ...
        if (!table[index]) {
            __atomic_store_n(table+index, thing, __ATOMIC_RELEASE);
            return true;
        } else if (table[index]->key == thing->key) {
            return false;
        }
    }
}

In this case producer and consumer synchronize on the atomics. Producer stores are ordered before the release, and consumer loads are ordered after the acquire. Objects are not modified once in the table, so atomics are not required for their fields. On some architectures, including x86, there will be no indication at the ISA level that atomics are in use — i.e. this likely generates the same code as the single-threaded version — and these atomics merely constrain the compiler’s instruction scheduling.
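The listings above elide the probe setup, so here is one way the two functions might look when filled in. This is a sketch under assumptions: Key is taken to be a 32-bit integer, Thing carries a single hypothetical value field, and the hash is borrowed from the integer table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t Key;                            // assumption: integer keys
typedef struct { Key key; int value; } Thing;   // hypothetical object

Thing *lookup(Key key, Thing **table, int exp)
{
    uint64_t hash = ((uint64_t)key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        // Acquire pairs with the producer's release store, so the
        // object's fields are visible before thing->key is read.
        Thing *thing = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);
        if (!thing || thing->key == key) {
            return thing;
        }
    }
}

bool insert(Thing *thing, Thing **table, int exp)
{
    assert(thing != NULL);
    uint64_t hash = ((uint64_t)thing->key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        if (!table[index]) {
            // Release publishes all earlier writes to *thing.
            __atomic_store_n(table+index, thing, __ATOMIC_RELEASE);
            return true;
        } else if (table[index]->key == thing->key) {
            return false;
        }
    }
}
```

As before, the producer's own loads stay non-atomic because it is the only writer. A null result from lookup means the key is absent.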

As a side effect of synchronizing, consumers will now observe insertions in the same order as the producer. This is a more realistic and practical situation than an integer hash table.

Multiple producers

The multiple-producer case (MPMC) is more complicated for producers, but consumers are unaffected, so we need only modify insertion. Still without any locks, we will optimistically update the table. We look at the current slot item, and if nothing is present compare-and-swap the new element in place. On failure we acquire the element that won the race, continuing as though it’s what we saw in the first place.

bool insert(Thing *thing, Thing **table, int exp)
{
    // ...
    for (...) {
        // ...
        Thing *current = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);
        if (!current) {
            int pass = __ATOMIC_RELEASE;
            int fail = __ATOMIC_ACQUIRE;
            if (__atomic_compare_exchange_n(
                    table+index, &current, thing, 0, pass, fail)) {
                return true;
            }
        }
        if (current->key == thing->key) {
            return false;
        }
    }
}
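A quick way to exercise the compare-and-swap path is to have several producers race to insert the same keys, then check that each key was won exactly once. The sketch below assumes POSIX threads. The Thing layout, the sizes, and the names pool, slots, wins, producer, and mpmc_demo are all invented for the example. The insert function is the one above with the hash from the integer table filled in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t key; int32_t payload; } Thing;  // hypothetical

enum { EXP = 10, NKEYS = 400, NPRODUCERS = 4 };  // illustrative sizes
static Thing  pool[NPRODUCERS][NKEYS];  // each producer's own objects
static Thing *slots[1<<EXP];            // shared table, zero-initialized
static int    wins;                     // total successful inserts

bool insert(Thing *thing, Thing **table, int exp)
{
    assert(thing != NULL);
    uint64_t hash = ((uint64_t)thing->key * 1111111111111111111u) >> 32;
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (32 - exp)) | 1;
    for (uint32_t index = hash;;) {
        index = (index + step) & mask;
        Thing *current = __atomic_load_n(table+index, __ATOMIC_ACQUIRE);
        if (!current) {
            int pass = __ATOMIC_RELEASE;
            int fail = __ATOMIC_ACQUIRE;
            if (__atomic_compare_exchange_n(
                    table+index, &current, thing, 0, pass, fail)) {
                return true;
            }
            // Lost the race: current now points at the winner.
        }
        if (current->key == thing->key) {
            return false;
        }
    }
}

static void *producer(void *arg)
{
    Thing *mine = arg;
    int n = 0;
    for (int i = 0; i < NKEYS; i++) {
        mine[i].key     = i + 1;         // same keys in every producer
        mine[i].payload = (i + 1) * 2;   // initialized before insertion
        n += insert(mine+i, slots, EXP);
    }
    __atomic_fetch_add(&wins, n, __ATOMIC_RELAXED);
    return 0;
}

// Returns the total number of successful inserts, or -1 if a thread
// could not be started.
int mpmc_demo(void)
{
    pthread_t thr[NPRODUCERS];
    int started = 0;
    while (started < NPRODUCERS &&
           !pthread_create(&thr[started], 0, producer, pool[started])) {
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(thr[i], 0);
    }
    return started == NPRODUCERS ? wins : -1;
}
```

Whichever producer wins a given key is nondeterministic, but the total number of wins should always equal the number of distinct keys.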

This is quite similar to my hash trie concurrency enhancement a few years ago.

I have officially retired from Emacs

2026-04-26T00:00:00Z

This article was discussed on reddit and on Hacker News.

This past Tuesday I typed C-x C-c in Emacs for the last time after 20 years of daily use. Though nearly half that time was gradually retiring it, switching to modal editing, then to Vim. Emacs is a platform, and I’d grown accustomed to its applications, especially those I built myself. There was no particular hurry, so replacements came slowly. With my newly-acquired superpowers I could knock out the last two pieces in a few days’ work, namely M-x calc with stackcalc and Elfeed with Elfeed2. I’m especially excited about the latter because it already exceeds the original. Both are multi-platform, native C++ GUI applications using native UI components.

These actively-in-use packages require new maintainers (apply on the project’s issues/discussion):

No wonder it took so long for me to move on! I’m not handing these off to just anyone, and you’ll need to establish your reputation. Having already made contributions is a good sign, even if never merged. I’m willing to transfer them off my namespace, though you’ll need to manage the Melpa hand-off (on which I’ll sign-off). If there are no takers, these projects will be archived but not deleted.

Trying out wxWidgets

The Emacs Calculator is amazing and the best calculator I’ve ever used, which is why nothing I could find was going to replace it. My clone uses GMP and MPFR for multi-precision, so it’s far faster, as to be expected, but it’s not nearly at feature parity. It’s missing esoteric features including symbolic processing. Though it’s enough to cover all of my own usage. I can add more features later. The Emacs Calculator manual served as a specification when building stackcalc.

Elfeed has been a cornerstone of my daily routines for the past 13 years. Nothing else I’ve found scratches that itch for me, so I’ve always known it would require a rewrite someday. Knowing it would take a few weeks of work, and that I already had the feed reader I wanted, made motivation difficult to find. Though now that I can accomplish ~3 weeks of old-way work in a new-way day, this sort of project becomes that much easier to start and finish. Though it’s not yet at a 1.0 release, after a couple days Elfeed2 was working well enough to replace the original Elfeed.

While Dear ImGui was the right choice for dcmake, it would not be so for these two applications. Active rendering doesn’t suit a feed reader left running all day, and I needed a richer toolkit. Professionally I work in Qt, but I wanted something lighter-weight for my projects, accessible via CMake FetchContent. That naturally led to wxWidgets. While it has issues — mitigatable character encoding problems, accidental quadratic time in many places — it’s worked better than I anticipated, letting me rapidly produce native-looking applications on Windows, macOS, and Linux.

Unlike Dear ImGui, wxWidgets is a platform, including sane I/O and path handling. I mostly don’t need platform layers when building applications like these. I can simply rely on wxWidgets’ utilities.

Both of these projects build out-of-the-box on w64devkit thanks to the dependencies being FetchContent-compatible. On all platforms you just need a C++ toolchain and CMake:

$ cmake -B build
$ cmake --build build

Now that I have experience with wxWidgets, learning its limitations and capabilities, it’s likely to be a foundation of most of my GUI projects to come, except where something like Dear ImGui is a better fit.

My brave new code-signing world

2026-04-25T18:12:29Z

The new w64devkit release two weeks ago is the first to be code-signed with my identity, verified by Microsoft’s certificate chain. Currently only the release packaging is signed — the self-extracting archive and its payload — but I will soon code-sign individual EXEs and DLLs within the distribution. In fact, all Windows builds of my project releases have been code-signed the past two weeks, including dcmake, and so should everything going forward. My signing identity builds reputation with each download, so users will have an easier time with SmartScreen, and security software generally. Azure Artifact Signing creates the actual signature, but the rest is done with new infrastructure I built myself, aas-sign. As is often the case, the existing options were deficient for my needs, so I had to build it myself.

This code-signing is not free, and simply having aas-sign on hand, or using the GitHub Actions action, is insufficient. You must be serious enough to spend US$10/month for the Azure subscription. After that you are subjected to the labyrinth that is the Azure portal, the most confusing UI I’ve ever used. Luckily we live in an age of wonders, and I could describe to Claude in Chrome what I wanted and it would happen (Sonnet works better than Opus for this). It took as much time to figure out Azure as I spent creating a fully-functional, native debugger front-end. Clear your schedule if you’re going to try it yourself. If it weren’t for AI assistance I would have given up.

The one-time setup process is only open to North America, and involves sharing identity documents (i.e. driver’s license) with Microsoft. Unlike the rest of Azure, that part was streamlined and fairly painless. Between the cost and this requirement, this is a niche space.

However, if this is your niche, aas-sign is currently the best software available. It’s the tool Microsoft should have written, but didn’t due to ongoing institutional failures. The alternatives are a pair of tools: Azure CLI (Python) combined with either Jsign (Java) or SignTool.exe (Windows only). All impose artificial runtime constraints hostile to build pipeline composability. Poor engineering. In contrast, aas-sign is a native, multi-platform, single-file application.

If you know this space, osslsigncode probably comes to mind, but it produces signatures itself. It doesn’t interface with Azure and so has no role here aside from semi-reliable validation. The most popular use case is code-signing with self-signed certificates, but that actually makes everything worse.

There are two modes for aas-sign: Laptop and Action. Laptop mode is the most compelling, so we’ll start with that, but Action mode is the most useful in practice.

Laptop/desktop mode

Suppose you built an EXE or DLL, and would like to code-sign and publish it. Typically that looks like this:

$ aas-sign sign myapp.exe myapp.dll

It computes an Authenticode for each (concurrently), sends it off to Azure, gets back a signature, then a countersignature, and embeds the signatures in the images. If you have multiple signing identities then you might use --as (“sign as”):

$ aas-sign sign --as eus:contoso:jdoe myapp.exe myapp.dll

The colon-delimited triple is my own invention to combine region (East US), tenant (Contoso), and profile (J. Doe) into one string. The first time you use it, and every ~90 days thereafter, you’ll need to authenticate with Azure first:

$ aas-sign login

This will open a browser (just like az login) to log in, from which it will obtain a token that can be used to obtain signing tokens. (Yes, a token to get tokens; I’m concealing as much complexity as possible.) You might also want to establish a default identity, as typically you’d only have one:

$ aas-sign config eus:contoso:jdoe

Or all at once:

$ aas-sign login eus:contoso:jdoe

My goal was, after enduring the Azure portal sign-up, to maximally streamline code-signing.

Action mode

Manually building, signing, and publishing releases is easy and might be fine if you’re not releasing too frequently — or so infrequently that you forget how to do it — but likely you’d want to automate this process. I was stubborn about it myself, until Peter0x44 pushed me hard enough to take it seriously, for which I’m grateful. There’s an official GitHub Action to code-sign with Azure, but it requires a Windows runner, fatally limiting for my own needs. So aas-sign also defines a code-signing action. The previous example would have this in its own action:

  - name: Sign
    uses: skeeto/aas-sign@v1.0.0
    with:
      endpoint:  ${{ secrets.TRUSTED_SIGNING_ENDPOINT }}
      account:   ${{ secrets.TRUSTED_SIGNING_ACCOUNT }}
      profile:   ${{ secrets.CERTIFICATE_PROFILE }}
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      files: |
        myapp.exe
        myapp.dll

The secrets are a bunch of strings you (or your AI agent) retrieve from the Azure portal. You also need to create a Federated Identity Credential (FIC) for each repository, which I suggest triggering on an environment. (This all may sound like a joke but it’s real.) Again, just ask an AI to do all this stuff. The mandatory Azure interfacing limits how much I can streamline this process. Then aas-sign combines these with per-job tokens GitHub injects into the runner to authenticate (via the FIC) and sign.

I’ve gone through this a number of times, and the AI breezes through the GitHub UI, but struggles through the Azure portal — objective evidence of how awful it is. Idea for a UI benchmark: How many AI tokens does it take to accomplish typical activities?

For w64devkit, my plan is to run aas-sign inside the Docker build and sign executables in the container before it’s SFX-packaged. This is impossible with SignTool.exe and needlessly frictional with Jsign (requires at least a JRE if not a JDK). The easiest path forward was to literally build my own tool from scratch.

I’m considering aas-sign as a new w64devkit command, but it’s so niche that I’m likely to be its sole user. On the other hand, those already running w64devkit in GitHub Actions could use it in Action mode to code-sign their builds without any additional tools.

dcmake: a new CMake debugger UI

2026-04-07T03:04:02Z

CMake has a --debugger mode since 3.27 (July 2023), allowing software to manipulate it interactively through the Debug Adapter Protocol (DAP), an HTTP-like protocol passing JSON messages. Debugger front-ends can start, stop, step, breakpoint, query variables, etc. a live CMake. When I came across this mode, I immediately conceived a project putting it to use. Thanks to recent leaps in software engineering productivity, I had a working prototype in 30 minutes, and by the end of that same day, a complete, multi-platform, native, GUI application. I named it dcmake (“debugger for CMake”). I’ve tested it on macOS, Windows, and Linux. Despite only being a couple days old, it’s one of the coolest things I’ve ever built. Prior to 2026, I estimate it would have taken me a month to get the tool to this point.
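For a sense of what “HTTP-like” means here: each DAP message is a Content-Length header, a blank line, then a JSON body of exactly that many bytes. A minimal framing sketch follows. The dap_frame helper is hypothetical and not taken from CMake or dcmake, and the request in the comment is only an example payload.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Frame a JSON body as a DAP message: header, blank line, body.
// Writes a null-terminated message into dst and returns its length,
// or -1 if dst is too small. Example body (a "step over" request):
//   {"seq":1,"type":"request","command":"next","arguments":{"threadId":1}}
int dap_frame(char *dst, size_t cap, const char *json)
{
    assert(dst != NULL && json != NULL);
    int n = snprintf(dst, cap, "Content-Length: %zu\r\n\r\n%s",
                     strlen(json), json);
    return n >= 0 && (size_t)n < cap ? n : -1;
}
```

A front-end would write the framed bytes to whatever pipe or socket connects it to the debuggee, then read responses and events framed the same way.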

It has a Dear ImGui interface, which I’ve experienced as a user but never built on myself before. Specifically the docking branch. In a sense it’s a toolkit for building debuggers, so it’s playing an enormous role in how quickly I put this project together. All of the “windows” tear out and may be free-floating or docked wherever you like, closely matching the classic Visual Studio UI. I borrowed all the same keybindings: F10 to step over, F11 to step in, F5 to start/continue, shift+F5 to stop. Click on line numbers to toggle breakpoints, right click to run-to-line, hover over variables with the mouse to see their values. Nearly every UI state persists across sessions, and it opens nearly instantly.

This is just one of many situations I’ve used AI the past month for UI development, and it’s been shockingly effective. I can describe roughly the interface I want, and the AI makes it happen in a matter of minutes. It understands what I mean, filling in the details, sometimes anticipating what I’ll ask for next. If I’m unsure how I want a UI to work, it also offers good advice. If I need simple icons and such, it can draw those, too. It’s all incredibly empowering.

On macOS and Linux it runs on top of GLFW with OpenGL 3 rendering, and on Windows it uses native Win32 windowing and DirectX 11 rendering.

Program arguments given to dcmake populate the top-left arguments text input, which go straight into CMake on start. So you can prepend d to your CMake configuration command to run it inside the debugger. Passing no arguments sets it up for “standard” -B build configuration.

In general, if you don’t have anywhere in particular to look, likely the first thing to do after starting dcmake (in a project) is press F10. It starts CMake paused on the first line of CMakeLists.txt, or whatever script you’re debugging. If you’re trying out dcmake for the first time, that’s a good place to start. Keep pressing F10 to step through that script, watching it run through its configuration. If you F11 through the script then you’ll dive deeper and deeper into CMake itself, which can be insightful.

There is no point in trying to debug --build invocations. It’s just a uniform interface to the underlying build tool, and there is no CMake left to debug at that point. However, it does work with -P script mode invocations. CMake can operate as a platform-agnostic shell script-like tool, but unlike shell scripts you can step through them with a debugger like dcmake.

On Windows it supports Unicode paths all the way through, without a UTF-8 manifest. This took some special care, in particular avoiding any C++ standard library I/O functionality. Current frontier AI cannot handle this detail on its own. The macOS platform required a bit of Objective-C, as it often does, and I’m happy I didn’t have to figure that part out myself.

The next release of w64devkit will include dcmake, complementing its recent addition of CMake. This new tool has already proven useful in its own development.

2026 has been the most pivotal year in my career… and it's only March

2026-03-29T21:38:22Z

In February I left my employer after nearly two decades of service. In the moment I was optimistic, yet unsure I made the right choice. Dust settled, I’m now absolutely sure I chose correctly. I’m happier and better for it. There were multiple factors, but it’s not mere chance it coincides with these early months of the automation of software engineering. I left an employer that is years behind adopting AI to one actively supporting and encouraging it. As of March, in my professional capacity I no longer write code myself. My current situation was unimaginable to me only a year ago. Like it or not, this is the future of software engineering. Turns out I like it, and having tasted the future I don’t want to go back to the old ways.

In case you’re worried, this is still me. These are my own words. Writing is thinking, and it would defeat the purpose for an AI to write in my place on my personal blog. That’s not going to change.

I still spend much time reading and understanding code, and using most of the same development tools. It’s more like being a manager, orchestrating a nebulous team of inhumanly-fast, nameless assistants. Instead of dicing the vegetables, I conjure a helper to do it while I continue to run the kitchen. I haven’t managed people in some 20 years now, but I can feel those old muscles being put to use again as I improve at this new role. Will these kitchens still need human chefs like me by the end of the decade? Unclear, and it’s something we all need to prepare for.

My situation gave me an experience onboarding with AI assistance — a fast process given a near-instant, infinitely-patient helper answering any question about the code. By the second week I was making substantial, wide contributions to the large C++ code base. It’s difficult to attach a quantifiable factor like 2x, 5x, 10x, etc. faster, but I can say for certain this wouldn’t have been possible without AI. The bottlenecks have shifted from producing code, which now takes relatively no time at all, to other points, and we’re all still trying to figure it out.

My personal programming has transformed as well. Everything I said about AI in late 2024 is, as I predicted, utterly obsolete. There’s a huge, growing gap between open weight models and the frontier. Models you can run yourself are toys. In general, almost any AI product or service worth your attention costs money. The free stuff is, at minimum, months behind. Most people only use limited, free services, so there’s a broad unawareness of just how far AI has advanced. AI is now highly skilled at programming, and better than me at almost every programming task, with inhumanly-low defect rates. The remaining issues are mainly steering problems: If AI code doesn’t do what I need, likely the AI writing it didn’t understand what I needed.

I’ll still write code myself from time to time for fun — minimalist, with my style and techniques — the same way I play shogi on the weekends for fun. However, artisan production is uneconomical in the presence of industrialization. AI makes programming so cheap that only the rich will write code by hand.

A small part of me is sad at what is lost. A bigger part is excited about the possibilities of the future. I’ve always had more ideas than time or energy to pursue them. With AI at my command, the problem changes shape. I can comfortably take on complexity from which I previously shied away, and I can take a shot at any idea sufficiently formed in my mind to prompt an AI — a whole skill of its own that I’m actively developing.

For instance, a couple weeks ago I put AI to work on a problem, and it produced a working solution for me after ~12 hours of continuous, autonomous work, literally while I slept. The past month w64devkit has burst with activity, almost entirely AI-driven. Some of it architectural changes I’ve wanted for years, but would require hours of tedious work, and so I never got around to it. AI knocked it out in minutes, with the new architecture opening new opportunities. It’s also taken on most of the cognitive load of maintenance.

Quilt.cpp

So far my biggest successful undertaking is Quilt.cpp, a C++ clone of Quilt, an early, actively-used source control system for patch management. Git is a glaring omission from the almost complete w64devkit, due to platform and build issues. I’ve thought Quilt could fill some of that source control hole, except the original is written in Bash, Perl, and GNU Coreutils — even more of a challenge than Git. Since Quilt is conceptually simple, and I could lean on busybox-w32 diff and patch, I’ve considered writing my own implementation, just as I did pkg-config, but I never found the energy to do it.

Then I got good enough with AI to knock out a near feature-complete clone in about four days, including a built-in diff and patch so it doesn’t actually depend on external tools (except invoking $EDITOR). On Windows it’s a ~1.6MB standalone EXE, to be included in future w64devkit releases. The source is distributed as an amalgamation, a single file quilt.cpp per its namesake:

$ c++ -std=c++20 -O2 -s -o quilt.exe quilt.cpp
$ ./quilt.exe --help
Usage: quilt [--quiltrc file] <command> [options] [args]

Commands:
  new        Create a new empty patch
  add        Add files to the topmost patch
  push       Apply patches to the source tree
  pop        Remove applied patches from the stack
  refresh    Regenerate a patch from working tree changes
  diff       Show the diff of the topmost or a specified patch
  series     List all patches in the series
  applied    List applied patches
  unapplied  List patches not yet applied
  top        Show the topmost applied patch
  next       Show the next patch after the top or a given patch
  previous   Show the patch before the top or a given patch
  delete     Remove a patch from the series
  rename     Rename a patch
  import     Import an external patch into the series
  header     Print or modify a patch header
  files      List files modified by a patch
  patches    List patches that modify a given file
  edit       Add files to the topmost patch and open an editor
  revert     Discard working tree changes to files in a patch
  remove     Remove files from the topmost patch
  fold       Fold a diff from stdin into the topmost patch
  fork       Create a copy of the topmost patch under a new name
  annotate   Show which patch modified each line of a file
  graph      Print a dot dependency graph of applied patches
  mail       Generate an mbox file from a range of patches
  grep       Search source files (not implemented)
  setup      Set up a source tree from a series file (not implemented)
  shell      Open a subshell (not implemented)
  snapshot   Save a snapshot of the working tree for later diff
  upgrade    Upgrade quilt metadata to the current format
  init       Initialize quilt metadata in the current directory

Use "quilt <command> --help" for details on a specific command.

It supports Windows and POSIX, and runs ~5x faster than the original. AI developed it on Windows, Linux, and macOS: It’s best when the AI can close the debug loop and tackle problems autonomously without involving a human slowpoke. The handful of “not implemented” parts aren’t because they’re too hard — each would probably take an AI ~10 minutes — but deliberate decisions of taste.

There’s an irony that the reason I could produce Quilt.cpp with such ease is also a reason I don’t really need it anymore.

I changed the output of quilt mail to be more Git-compatible. The mbox produced by Quilt.cpp can be imported into Git with a plain git am:

$ quilt mail --mbox feature-branch.mbox
$ git am feature-branch.mbox

The idea being that I could work on a machine without Git (e.g. Windows XP), and copy/mail the mbox to another machine where Git can absorb it as though it were in Git the whole time. git format-patch to quilt import sends commits in the opposite direction, useful for manually testing Quilt.cpp on real change sets.

To be clear, I could not have done this if the original Quilt did not exist as a working program. I began with an AI generating a conformance suite based on the original, its documentation, and other online documentation, validating that suite against the original implementation (see -DQUILT_TEST_EXECUTABLE). Then had another AI code to the tests, on architectural guidance from me, with -D_GLIBCXX_DEBUG and sanitizers as guardrails. That was day one. The next three days were lots of refining and iteration as I discovered the gaps in the test suite. I’d prompt AI to compare Quilt.cpp to the original Quilt man page, add tests for missing features, validate the new tests against the original Quilt, then run several agents to fix the tests. While they worked I’d try the latest build and note any bugs. As of this writing, the result is about equal parts test and non-test, ~9KLoC each.

I’m likely to use this technique to clone other tools with implementations unsuitable for my purposes. I learned quite a bit from this first attempt.

Why C++ instead of my usual choice of C? As we know, conventional C is highly error-prone. Even AI has trouble with it. In the ~9k lines of C++ that is Quilt.cpp, I am only aware of three memory safety errors by the AI. Two were null-terminated string issues with strtol, where the AI was essentially writing C instead of C++, after which I directed the AI to use std::from_chars and drop as much direct libc use as possible. (The other was an unlikely branch with std::vector::back on an empty vector.) We can rescue C with better techniques like arena allocation, counted strings, and slices, but while (current) state of the art AI understands these things, it cannot work effectively with them in C. I’ve tried. So I picked C++, and from my professional work I know AI is better at C++ than me.
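To illustrate the counted-string idea in C: strtol needs a null terminator and will happily read past the end of a slice, while a parser that takes a pointer and a length cannot. A minimal sketch follows; the Str type and parse_i32 are hypothetical names, not taken from Quilt.cpp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *data;
    ptrdiff_t   len;   // no null terminator required
} Str;

// Parse a non-negative decimal integer that spans the whole string.
// Returns false on empty input, a non-digit, or int32 overflow, in
// which case *out is left untouched.
bool parse_i32(Str s, int32_t *out)
{
    assert(s.len >= 0 && out != NULL);
    if (s.len < 1) {
        return false;
    }
    int32_t r = 0;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        int d = s.data[i] - '0';
        if (d < 0 || d > 9) {
            return false;
        }
        if (r > (INT32_MAX - d) / 10) {
            return false;  // r*10 + d would overflow
        }
        r = r*10 + d;
    }
    *out = r;
    return true;
}
```

Because the length travels with the pointer, a slice of a larger buffer parses without copying or temporarily writing a terminator into it.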

Also like a manager, I have not read most of the code, and instead focused on results, so you might say this was “vibe-coded.” It is thoroughly tested, though I’m sure there are still bugs to be ironed out, especially on the more esoteric features I haven’t tried by hand yet.

Let’s discuss tools

After opposing CMake for years, you may have noticed the latest w64devkit now includes CMake and Ninja. What happened? Preparing for my anticipated employment change, this past December I read Professional CMake. I realized that my practical problems with CMake were that nearly everyone uses it incorrectly. Most CMake builds are a disaster, but my new-found knowledge allows me to navigate the common mistakes. Only high profile open source projects manage to put together proper CMake builds. Otherwise the internet is loaded with CMake misinformation. Similar to AI, if you’re not paying for CMake knowledge then it’s likely wrong or misleading. So I highly recommend that book!

Frontier AI is very good with CMake. When a project has a CMake build that isn’t too badly broken, just tell AI to fix it, without any specifics, and build problems disappear in mere minutes without having to think about it. It’s awesome. Combine it with the previous discussion about tests making AI so much more effective, and that it also knows CTest well, and you’ve got a killer formula. I’m more effective with CTest myself merely from observing how AI uses it. AI (currently) cannot use debuggers, so putting powerful, familiar testing tools in its hands helps a lot, versus the usual bespoke, debugger-friendly solutions I prefer.

Similar to solving CMake problems: Have a hairy merge conflict? Just ask AI to resolve it. It’s like magic. I no longer fear merge conflicts.

So part of my motivation for adding CMake to w64devkit was anticipation of projects like Quilt.cpp, where they’d be available to AI, or at least so I could use the tools the AI used to build/test myself. It’s already paid for itself, and there’s more to come.

For agent software, on personal projects I’m using Claude Code. It’s a great value, cheaper than paying API rates but requires working around 5-hour limit windows. I started with Pro (US$20/mo), but I’m getting so much out of it that as of this writing I’m on 5x Max (US$100/mo) simply to have enough to explore all my ideas. Be warned: Anthropic software is quite buggy, more so than industry average, and it’s obvious that they never even start, let alone test, some of their released software on disfavored platforms (Windows, Android). Don’t expect to use Claude Code effectively for native Windows platform development, which sadly includes w64devkit. Hopefully that’s fixed someday. I suspect Anthropic hit a bottleneck on QA, and unable to fit AI in that role they don’t bother. You can theoretically report bugs on GitHub, but they’re just ignored and closed. (Why don’t they have AI agents jumping on this wealth of bug reports?)

At work I’m using Cursor where I get a choice of models. My favorite for March has been GPT-5.4, which in my experience beats Opus 4.6 on Claude Code by a small margin. It’s immediately obvious that Cursor is better agent software than Claude Code. It’s more robust, more featureful, and with a clearer UI than Claude Code. It has no trouble on Windows and can drive w64devkit flawlessly. It’s also more expensive than Claude Code. My employer currently spends ~US$250/mo on my AI tokens, dirt cheap considering what they’re getting out of it. I have bottlenecks elsewhere that keep me from spending even more.

As a general rule, for software engineering always use the smartest model available. The cheaper, dumber models cost more in the long run. It takes more tokens to achieve worse results, which costs more human time to sort out.

Neither Cursor nor Claude Code is open source, so what are the purists to do, even if they’re willing to pay API rates for tokens? Sadly I have no answers for you. I haven’t gotten any open source agent software actually working, and it seems they may lack the necessary secret sauce.

Update: Several folks suggested I give OpenCode another shot, and this time I got over the configuration hurdle. Single executable, slick interface, and unlike Claude Code, I observed no bugs in my brief trial. Give that a shot if you’re looking for an open source client.

The future is going to be weird. My experience is only a peek at what’s to come, and my head is still spinning. However, the more I adapt to the changes, the better I feel. If you’re feeling anxious like I was, don’t flinch from improving your own AI knowledge and experience.

Frankenwine: Multiple personas in a Wine process

2026-01-19T21:51:38Z

I came across a recent article on making Linux system calls from a Wine process. Windows programs running under Wine are still normal Linux processes and may interact with the Linux kernel like any other process. None of this was surprising, and the demonstration works just as I expect. Still, it got the wheels spinning and I realized an almost practical application: build my pkg-config implementation such that on Windows pkg-config.exe behaves as a native pkg-config, but when run under Wine this same binary takes the persona of a Linux program and becomes a cross toolchain pkg-config, bypassing Win32 and talking directly with the Linux kernel. Cosmopolitan Libc cleverly does this out-of-the-box, but in this article we’ll mash together a couple of existing sources with a bit of glue.

The results are in the merge-demo branch of u-config, and took hardly any work:

$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)

A platform layer, main_wine.c, is a merge of two existing platform layers, one of which required unavoidable tweaks. We’ll get to those details in a moment. First we’ll need to detect if we’re running under Wine, and the best solution I found was to locate ntdll!wine_get_version. If this function exists, we’re in Wine. That works out to a pretty one-liner because ntdll.dll is already loaded:

bool running_on_wine()
{
    return GetProcAddress(GetModuleHandleA("ntdll"), "wine_get_version");
}

An x86-64 Linux syscall wrapper with thorough inline assembly:

ptrdiff_t syscall3(int n, ptrdiff_t a, ptrdiff_t b, ptrdiff_t c)
{
    ptrdiff_t r;
    asm volatile (
        "syscall"
        : "=a"(r)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return r;
}

ptrdiff_t write(int fd, void *buf, ptrdiff_t len)
{
    return syscall3(SYS_write, fd, (ptrdiff_t)buf, len);
}

I’d normally use long for all these integers because Linux is LP64 (long is pointer-sized), but Windows is LLP64 (only long long is 64 bits). It’s so bizarre to interface with Linux from LLP64, and this will have consequences later. With these pieces we can see the basic shape of a split personality program:

    if (running_on_wine()) {
        write(1, "hello, wine\n", 12);
    } else {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        WriteFile(h, "hello, windows\n", 15, 0, 0);
    }

We can cram two programs into this binary and select which program at run time depending on what we see. In typical programs locating and calling into glibc would be a challenge, particularly with the incompatible ABIs involved. We’re avoiding it here by interfacing directly with the kernel.

Application to u-config

Luckily u-config has completely-optional platform layers implemented with Linux system calls. The POSIX platform layer works fine, and that’s what distributions should generally use, but these bonus platforms are unhosted and do not require libc. That means we can shove it into a Windows build with relatively little trouble.

Before we do that, let’s think about what we’re doing. Debian has great cross toolchain support, including Mingw-w64. There are even a few Windows libraries in the Debian package repository, such as zlib, and we can build Windows programs against them. If you’re cross-building and using pkg-config, you ought to use the cross toolchain pkg-config, which in GNU ecosystems gets an architecture prefix like the other cross tools. Debian cross toolchains each include a cross pkg-config, and it sometimes almost works correctly! Here’s what I get on Debian 13:

$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz

Note the architecture in the -I and -L options. It really is querying the cross sysroot. However, these paths are in the cross sysroot, and so should not be listed by pkg-config. It’s suboptimal and indicates this pkg-config is probably misconfigured. In other cases it’s far from correct:

$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...

A tool prefixed x86_64-w64-mingw32- should not produce paths containing x86_64-linux-gnu (the host architecture in this case). Our version won’t have these issues.

The u-config platform interface is five functions:

filemap os_mapfile(os *, arena *, s8 path);  // read whole files
s8node *os_listing(os *, arena *, s8 path);  // list directories
void    os_write(os *, i32 fd, s8);          // standard out/err
void    os_fail(os *);                       // non-zero exit

void uconfig(config *);

Platforms implement the first four functions, and call uconfig() with the platform’s configuration, context pointer (os *), command line arguments, environment, and some memory (all in the config object). My strategy is to link two platforms into the binary, and the first challenge is they both define os_write, etc. I did not plan nor intend for one binary to contain more than one platform layer. Unity builds offer a fix without changing a single line of code:

#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include "main_windows.c"
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail

#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include "main_linux_amd64.c"
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail

This dirty but effective trick may look familiar. It also doesn’t interfere with the other builds. Next I define the real platform functions as a dispatch based on our run-time situation:

b32 wine_detected;

filemap os_mapfile(os *ctx, arena *a, s8 path)
{
    if (wine_detected) {
        return linux_mapfile(ctx, a, path);
    } else {
        return win32_mapfile(ctx, a, path);
    }
}

If I were serious about keeping this experiment, I’d lift os as I did the functions (as win32_os, linux_os) and include wine_detected in the context, eliminating this global variable. That cannot be done with simple hacks and macros.

The next challenge is that I wrote the Linux platform layer assuming LP64, and so it uses long instead of an equivalent platform-agnostic type like ptrdiff_t. I never thought this would be an issue because this source literally contains asm blocks and no conditional compilation, yet here we are. Lesson learned. I wanted to try an extremely janky #define on long to fix it, but this source file has a couple long long that won’t play along. These multi-token type names of C are antithetical to its preprocessor! So I adjusted the source manually instead.

The Windows and Linux platform entry points are completely different, both in name and form, and so co-exist naturally. The merged platform layer is a new entry point that will pass control to the appropriate entry point:

void entrypoint(ptrdiff_t *stack);  // Linux
void __stdcall mainCRTStartup();    // Windows

On Linux stack is the initial value of the stack pointer, which points to argc, argv, envp, and auxv. We’ll need to construct an artificial “stack” for the Linux platform layer to harvest. On Windows this is the process entry point, and it will find the rest on its own as a normal Windows process. Ultimately this ended up simpler than I expected:

void __stdcall merge_entrypoint()
{
    wine_detected = running_on_wine();
    if (wine_detected) {
        u8 *fakestack[CMDLINE_ARGV_MAX+1];
        c16 *cmd = GetCommandLineW();
        fakestack[0] = (u8 *)(iz)cmdline_to_argv8(cmd, fakestack+1);
        // TODO: append envp to the fake stack
        entrypoint((iz *)fakestack);
    } else {
        mainCRTStartup();
    }
}

Where cmdline_to_argv8 is my Windows argument parser, already used by u-config, and I reserve one element at the front to store argc. Since this is just a proof-of-concept I didn’t bother fabricating and pushing envp onto the fake stack. The Linux entry point doesn’t need auxv, so it can be omitted. Once in the Linux entry point it’s essentially a Linux process from then on, except that the x64 calling convention is still in use internally.

Finally, I configure the Linux platform layer for Debian’s cross sysroot:

#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "/usr/x86_64-w64-mingw32/lib"

And that’s it! We have our platform merge. Build (w64devkit):

$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c

On Debian use x86_64-w64-mingw32-gcc for cc. The -e linker option selects the new, higher level entry point. After installing Wine binfmt, here’s how it looks on Debian:

$ ./pkg-config.exe --cflags --libs zlib
-lz

That’s the correct output, but is it using the cross sysroot? Ask it to include the -I argument despite it being in the cross sysroot:

$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz

Looking good! It passes the pc_path test, too:

$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig

Running this same binary on Windows after installing zlib in w64devkit:

$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz

Also:

$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig

My Frankenwine is a success!

WebAssembly as a Python extension platform

2026-01-01T21:21:19Z

Software above some complexity level tends to sport an extension language, becoming a kind of software platform itself. Lua fills this role well, and of course there’s JavaScript for web technologies. WebAssembly generalizes this, and any Wasm-targeting programming language can extend a Wasm-hosting application. It has more friction than supplying a script in a text file, but extension authors can write in their language of choice, and use more polished development tools (debugging, testing, etc.) than are typically available for an extension language. Python is traditionally extended through native code behind a C interface, but it’s recently become practical to extend Python with Wasm. That is, we can ship an architecture-independent Wasm blob inside a Python library, and use it without requiring a native toolchain on the host system. Let’s discuss two different use cases and their pitfalls.

Normally we’d extend Python in order to access an external interface that Python cannot access on its own. Wasm runs in a sandbox with no access to the outside world whatsoever, so it obviously isn’t useful for that case. Extensions may also grant Python more speed, which is one of Wasm’s main selling points. We can also use Wasm to access embeddable capabilities written in a different programming language which do not require external access.

My preferred non-WASI Wasm runtime is Volodymyr Shymanskyy’s wasm3. It’s plain old C and very friendly to embedding in the same way as, say, SQLite. Performance is middling, though a C program running on wasm3 is still quite a bit faster than an equivalent Python program. It has Python bindings, pywasm3, but it’s distributed only in source code form. That is, the host machine must have a C toolchain in order to use pywasm3, which defeats my purposes here. If there’s a C toolchain, I might as well just use that instead of going through Wasm.

For the use cases in this article, the best option is wasmtime-py. The distribution includes binaries for Windows, macOS, and Linux on x86-64 and ARM64, which covers nearly all Python installations. Hosts require nothing more than a Python interpreter, no native toolchains. It’s almost as good as having Wasm built into Python itself. In my tests it’s 3x–10x faster than wasm3, so for my first use case the situation is even better. The catch is that it currently weighs ~18MiB (installed), and in the future will likely rival the Python interpreter itself. The API also breaks on a monthly basis, so you’re signing up for the upgrade treadmill lest your own program perishes to bitrot after a couple of years. This article is about version 40.

Usage examples and gotchas

The official examples don’t do anything non-trivial or interesting, and so to figure things out I had to study the documentation, which does not offer many hints. Basic setup looks like this:

import functools
import wasmtime

store    = wasmtime.Store()
module   = wasmtime.Module.from_file(store.engine, "example.wasm")
instance = wasmtime.Instance(store, module, ())
exports  = instance.exports(store)

memory = exports["memory"].get_buffer_ptr(store)
func1  = functools.partial(exports["func1"], store)
func2  = functools.partial(exports["func2"], store)
func3  = functools.partial(exports["func3"], store)

A store is an allocation region from which we allocate all Wasm objects. It is not possible to free individual objects except to discard the whole store. Quite sensible, honestly. What’s not sensible is how often I have to repeat myself, passing the store back into every object in order to use it. These objects are associated with exactly one store and cannot be used with different stores. Use the wrong store and it panics: It’s already keeping track internally! I do not understand why the interface works this way. So to make things simpler, I use functools.partial to bind the store parameter and so get the interface I expect.

The get_buffer_ptr object is a buffer protocol object, and if you’re moving anything other than bytes that’s probably what you want to use to access memory. The usual caveats apply for this object: If you change the memory size you probably want to grab a fresh buffer object. For bytes (e.g. buffers and strings) I prefer the read and write methods.

Because multi-value is still in an experimental state in the Wasm ecosystem, you will likely not pass structs with Wasm. Anything more complicated than scalars will require pointers and copying data in and out of Wasm linear memory. This involves the usual trap that catches nearly everyone: Wasm interfaces make no distinction between pointers and integers, and Wasm runtimes generally interpret all integers as signed. What that means is your pointers are signed unless you take action. Addresses start at 0, so this is bad, bad news.

malloc = functools.partial(exports["malloc"], store)

hello = b"hello"
pointer = malloc(len(hello))
assert pointer
memory = exports["memory"].write(store, hello, pointer)  # WRONG!

To make matters worse, wasmtime-py adds its own footgun: The read and write methods adopt the questionable Python convention of negative indices acting from the end. If malloc returns a pointer in the upper half of memory, the negative pointer will pass the bounds check inside write because negative is valid, then quietly store to the wrong address! Doh!

I wondered how common this error is, so I searched online. I could find only one non-trivial wasmtime-py use in the wild, in a sandboxed PDF reader. It falls into the negative pointer trap as I expected. Not only that, it’s a buffer overflow into Python’s memory space:

            buf_ptr = malloc(store, len(pdf_data))
            mem_data = memory.data_ptr(store)

            for i, byte in enumerate(pdf_data):
                mem_data[buf_ptr + i] = byte

The data_ptr method returns a non-bounds-checked raw ctypes pointer, so this is actually a double mistake. First, it shouldn’t trust pointers coming out of Wasm if it cares at all about sandboxing. The second is the potential negative pointer, which in this case would write outside of the Wasm memory and in Python’s memory, hopefully seg-faulting.

What’s one to do? Every pointer coming out of Wasm must be truncated with a mask:

pointer = malloc(...) & 0xffffffff   # correct for wasm32!

This interprets the result as unsigned. 64-bit Wasm needs a 64-bit mask, though in practice you will never get a valid negative pointer from 64-bit Wasm. This rule applies to JavaScript as well, where the idiom is:

let pointer = malloc(...) >>> 0

Wasm runtimes cannot help — they lack the necessary information — and this is perhaps a fundamental flaw in Wasm’s design. Once you know about it you see this mistake happening everywhere.
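To make the sign problem concrete, here is a small sketch in plain Python with no Wasm runtime involved. It assumes a wasm32 address in the upper half of the address space and a 3GiB linear memory. The helper as_signed_i32 is a hypothetical stand-in for how a runtime hands an i32 result to the host, and the memory_size + raw line only models the negative-index convention, not wasmtime-py's actual code.

```python
import ctypes

def as_signed_i32(address):
    # Hypothetical stand-in: reinterpret a 32-bit address as a signed i32,
    # the way a Wasm runtime typically returns it to the host.
    return ctypes.c_int32(address).value

memory_size = 3 * 2**30          # assumed 3GiB linear memory
address = 0x80000010             # assumed pointer in the upper half

raw = as_signed_i32(address)     # negative on the host side
wrapped = memory_size + raw      # where a Python-style negative index lands
fixed = raw & 0xffffffff         # masked back to the unsigned address

print(raw, hex(wrapped), hex(fixed))
```

Under these assumptions the unmasked value would quietly resolve to an address 1GiB below the intended one, while the masked value matches the original address.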

Now that you have a proper address, you can apply it to a buffer protocol view of memory. If you’re using NumPy there are various ways to interact with this memory by wrapping it in NumPy types, though only if you’re on a little endian host. (If you’re on a big endian machine, just give up on running Wasm anyway.) The first use case I have in mind typically involves copying plain Python values in and out. The struct package is quite handy here:

vec2   = malloc(...) & 0xffffffff
memory = exports["memory"].get_buffer_ptr(store)
struct.pack_into("<ii", memory, vec2, x, y)

It fills a similar role to JavaScript DataView. If you’re copying lots of numbers, with CPython it’s faster to construct a custom format string rather than use a loop:

nums: list[int] = ...
struct.pack_into(f"<{len(nums)}i", memory, buf, *nums)

To copy structures back out, use struct.unpack_from. If you’re moving strings, you’ll need to .encode() and .decode() to convert to and from bytes, which are well-suited to read and write.
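As a rough illustration of the round trip, the following sketch uses a plain bytearray as a stand-in for the buffer protocol view of Wasm memory, so it runs without any Wasm runtime. The addresses (16, 32, 0) and the two-int vec2 layout are assumptions made up for the example.

```python
import struct

memory = bytearray(64)   # stand-in for the Wasm memory view (assumption)

# Pack a pair of little endian 32-bit integers, then read them back.
vec2 = 16                # pretend address, already masked
struct.pack_into("<ii", memory, vec2, 3, -4)
x, y = struct.unpack_from("<ii", memory, vec2)

# Pack a whole list with one format string instead of a loop.
nums = [10, 20, 30]
buf = 32
struct.pack_into(f"<{len(nums)}i", memory, buf, *nums)
out = list(struct.unpack_from(f"<{len(nums)}i", memory, buf))

# Strings travel as bytes: encode going in, decode coming out.
text = "héllo"
raw = text.encode()
memory[0:len(raw)] = raw
back = bytes(memory[0:len(raw)]).decode()
```

With a real instance the same struct calls apply to the object returned by get_buffer_ptr, and the byte copies map onto the read and write methods.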

In practice with real Wasm programs you’re going to be interacting with the “guest” allocator from the outside, to request memory into which you copy inputs for a function. In my examples I’ve used malloc because it requires no elaboration, but as usual a bump allocator solves this so much better, especially because it doesn’t require stuffing a whole general purpose allocator inside the Wasm program. Have one global arena (no other threads will be sharing that Wasm instance), rapid fire a bunch of allocations as needed without any concern for memory management in the “host”, call the function, which might allocate a result from that arena, then reset the arena to clean up. In essence a stack for passing values in and out.
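The allocate, call, reset rhythm can be modeled in a few lines of host-side Python. This is only a toy model of the idea, not the guest allocator itself: the Arena class, its 8-byte alignment, its base offset, and the use of 0 as a failure value are all assumptions for illustration.

```python
class Arena:
    """Toy model of a guest bump allocator over linear memory."""

    def __init__(self, size, base=8):
        self.memory = bytearray(size)
        self.base = base      # first usable address (0 is kept as "null")
        self.used = base      # next free address

    def alloc(self, n, align=8):
        start = -(-self.used // align) * align   # round up to alignment
        if n < 0 or start + n > len(self.memory):
            return 0          # out of memory
        self.used = start + n
        return start

    def reset(self):
        # Wipe everything handed out so far, then rewind.
        self.memory[self.base:self.used] = bytes(self.used - self.base)
        self.used = self.base

arena = Arena(64)
key = arena.alloc(5)                  # input goes in
arena.memory[key:key+5] = b"hello"
out = arena.alloc(4)                  # the "function" allocates a result
arena.reset()                         # one call cleans up everything
```

The host never frees individual allocations. It only pushes inputs, copies out results, and rewinds.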

WebAssembly as faster Python

Suppose we noticed a computational hot spot in our Python program in a pure Python function (e.g. not calling out to an extension). Optimizing this function would be wise. Based on my experiments if I re-implement that function in C, compile it to Wasm, then run that bit of Wasm in place of the original function, I can expect around a 10x speed-up. In general C is more like 100x faster than Python, and the overhead of interfacing with Wasm — copying stuff in and out, etc. — can be high, but not so high as to not be profitable. This improves further if I can change the interface, e.g. require callers to use the buffer protocol.

Thanks to wasmtime-py, I could introduce this change without fussing with cross-compilers to build distribution binaries, nor require a toolchain on the target, just a hefty Python package. Might be worth it.

My main experimental benchmark is a variation on my solution to the “Two Sum” problem, which I originally wrote for JavaScript, then extended to pywasm3 and later wasmtime-py. It’s simple, just interesting enough, and representative of the sort of Wasm drop-in I have in mind. It has the same interface, but implements it with Wasm.

# Original Pythonic interface
def twosum(nums: list[int], target: int) -> tuple[int, int] | None:
    ...

# Stateful Wasm interface
class TwoSumWasm():
    def __init__(self):
        store    = wasmtime.Store()
        module   = wasmtime.Module.from_file(store.engine, ...)
        instance = wasmtime.Instance(store, module, ())
        ...

    def twosum(self, nums, target):
        # ... use wasm instance ...

There’s some state to it with the Wasm instance in tow. If you hide that by making it global you’ll need to synchronize your threads around it. In a multi-threaded program perhaps these would be lazily-constructed thread locals. I haven’t had to solve this yet.
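One possible shape for the lazily-constructed thread locals, sketched with only the standard library. PerThread is a hypothetical helper, and object is used as a placeholder factory so the example runs without wasmtime. Real code would pass something like the TwoSumWasm class instead.

```python
import threading

class PerThread:
    """Lazily builds one instance per thread using the given factory."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        try:
            return self._local.instance
        except AttributeError:
            # First use on this thread: construct and remember it.
            self._local.instance = self._factory()
            return self._local.instance

instances = PerThread(object)   # placeholder factory (assumption)

mine = instances.get()          # built on first use
again = instances.get()         # same object on the same thread

seen = []
worker = threading.Thread(target=lambda: seen.append(instances.get()))
worker.start()
worker.join()
```

Each thread pays the construction cost once, and no locking is needed around calls because instances are never shared between threads.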

However, the weakness of the wasmtime “store” really shows: Notice how compilation and instantiation are bound together in one store? ~~I cannot compile once and then create disposable instances on the fly~~, e.g. as required for each run of a WASI program. Every instance permanently extends the compilation store. In practice we must wastefully re-compile the Wasm program for each disposable instance. Despite appearances, compilation and instantiation are not actually distinct steps, as they are in JavaScript’s Wasm API. wasmtime.Instance accepts a store as its first argument, suggesting use of a different store for instantiation. That would solve this problem, but as of this writing it must be the same store used to compile the module. ~~This is a fatal flaw for certain real use cases, particularly WASI.~~

Update: Wolfgang Meier points out the serialize and deserialize methods, which detach a compiled module from its store, allowing for independent instantiations. I tried it, and it’s a practical workaround. Overhead is low; no validation when deserializing. My benchmark now does it for future reference, as I expect it to be my typical use case.

WebAssembly as embedded capabilities

Loup Vaillant’s Monocypher is a wonderful cryptography library. Lean, efficient, and embedding-friendly, so much so it’s distributed in amalgamated form. It requires no libc or runtime, so we can compile it straight to Wasm with almost any Clang toolchain:

$ clang --target=wasm32 -nostdlib -O2 -Wl,--no-entry -Wl,--export-all \
        -o monocypher.wasm monocypher.c

It’s not “Wasm-aware” so I need --export-all to expose the interface. This is swell because, as a single translation unit, anything with external linkage is the interface. Though remember what I said about interacting with the guest allocator? This has no allocator, nor should it. It’s not so usable in this form because we’d need to manage memory from the outside. Do-able, but it’s easy to improve by adding a couple more functions, sticking to a single translation unit:

#include "monocypher.c"

extern char  __heap_base[];
static char *heap_used;
static char *heap_high;

void *bump_alloc(ptrdiff_t size)
{
    // ...
}

void bump_reset()
{
    ptrdiff_t len = heap_used - __heap_base;
    __builtin_memset(__heap_base, 0, len);  // wipe keys, etc.
    heap_used = __heap_base;
}

I’ve discussed __heap_base before, which is part of the ABI. We’ll push keys, inputs, etc. onto this “stack”, run our cryptography routine, copy out the result, then reset the bump allocator, which wipes out all sensitive data. Often memset is insufficient (typically it’s zero-then-free, and compilers see the lifetime about to end), but no lifetime ends here, and stores to this “heap” memory are externally observable as far as the abstract machine can tell. (Otherwise we couldn’t reliably copy out our results!)

There’s a lot to this API, but I’m only going to look at the AEAD interface. We “lock” up some data in an encrypted box, write any unencrypted label we’d like on the outside. Then later we can unlock the box, which will only open for us if neither the contents of the box nor the label were tampered with. That’s some solid API design:

void crypto_aead_lock(uint8_t       *cipher_text,
                      uint8_t        mac  [16],
                      const uint8_t  key  [32],
                      const uint8_t  nonce[24],
                      const uint8_t *ad,         size_t ad_size,
                      const uint8_t *plain_text, size_t text_size);
int crypto_aead_unlock(uint8_t       *plain_text,
                       const uint8_t  mac  [16],
                       const uint8_t  key  [32],
                       const uint8_t  nonce[24],
                       const uint8_t *ad,          size_t ad_size,
                       const uint8_t *cipher_text, size_t text_size);

By compiling to Wasm we can access this functionality from Python almost like it was pure Python, and interact with other systems using Monocypher.

Since Monocypher does not interact with the outside world on its own, it relies on callers to use their system’s CSPRNG to create those nonces and keys, which we’ll do using the secrets built-in package:

class Monocypher:
    def __init__(self):
        ...
        self._read   = functools.partial(memory.read, store)
        self._write  = functools.partial(memory.write, store)
        self.__alloc = functools.partial(exports["bump_alloc"], store)
        self._reset  = functools.partial(exports["bump_reset"], store)
        self._lock   = functools.partial(exports["crypto_aead_lock"], store)
        self._unlock = functools.partial(exports["crypto_aead_unlock"], store)
        self._csprng = secrets.SystemRandom()

    def _alloc(self, n):
        return self.__alloc(n) & 0xffffffff

    def generate_key(self):
        return self._csprng.randbytes(32)

    def generate_nonce(self):
        return self._csprng.randbytes(24)

    ...

With a solid foundation, all that follows comes easily. A finally guarantees secrets are always removed from Wasm memory, and the rest is just about copying bytes around:

    def aead_lock(self, text, key, ad = b""):
        assert len(key) == 32
        try:
            macptr   = self._alloc(16)
            keyptr   = self._alloc(32)
            nonceptr = self._alloc(24)
            adptr    = self._alloc(len(ad))
            textptr  = self._alloc(len(text))

            self._write(key, keyptr)
            nonce = self.generate_nonce()
            self._write(nonce, nonceptr)
            self._write(ad,    adptr)
            self._write(text,  textptr)

            self._lock(
                textptr,
                macptr,
                keyptr,
                nonceptr,
                adptr, len(ad),
                textptr, len(text),
            )
            return (
                self._read(macptr, macptr+16),
                nonce,
                self._read(textptr, textptr+len(text)),
            )
        finally:
            self._reset()

And aead_unlock is basically the same in reverse, but raises an exception if the box fails to unlock, perhaps due to tampering:

    def aead_unlock(self, text, mac, key, nonce, ad = b""):
        assert len(mac) == 16
        assert len(key) == 32
        assert len(nonce) == 24
        try:
            macptr   = self._alloc(16)
            keyptr   = self._alloc(32)
            nonceptr = self._alloc(24)
            adptr    = self._alloc(len(ad))
            textptr  = self._alloc(len(text))

            self._write(mac, macptr)
            self._write(key, keyptr)
            self._write(nonce, nonceptr)
            self._write(ad, adptr)
            self._write(text, textptr)

            if self._unlock(
                textptr,
                macptr,
                keyptr,
                nonceptr,
                adptr, len(ad),
                textptr, len(text),
            ):
                raise ValueError("AEAD mismatch")
            return self._read(textptr, textptr+len(text))
        finally:
            self._reset()

Usage:

mc = Monocypher()
key = mc.generate_key()
message = "Hello, world!"
mac, nonce, encrypted = mc.aead_lock(message.encode(), key)

Transmit mac, nonce, and encrypted to the other party (or your future self), who already has the key:

decrypted = mc.aead_unlock(encrypted, mac, key, nonce)

Find the complete source in my scratch repository.

While I have a few reservations about wasmtime-py, it fascinates me how well this all works. It’s been my hammer in search of a nail for some time now.

Freestyle linked lists tricks

2025-12-31T11:59:59Z

Linked lists are a basic data structure building block, with especially flexible allocation behavior. They’re not just a useful starting point, but sometimes a sound foundation for future growth. I’m going to start with the beginner stuff, then without disrupting the original linked list, enhance it with new capabilities.

Linked list basics

For the sake of an interesting example, I will demonstrate with the same concept as last time I talked about data structures: a collection of key/value strings, in the form of environment variables. This time in linked list form:

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

uint64_t hash64(Str);
bool     equals(Str, Str);

typedef struct Env Env;
struct Env {
    Env *next;
    Str  key;
    Str  value;
};

It will be sourced from some string, formatted like the env program:

    Str input = S(
        "EDITOR=vim\n"
        "HOME=/home/user\n"
        "PATH=/bin:/usr/bin\n"
        "SHELL=/bin/bash\n"
        "TERM=xterm-256color\n"
        "USER=user\n"
        "SHELL=/bin/sh\n"   // <- repeated entry
    );

And all the parser heavy lifting will be done by our ever-handy cut function:

typedef struct {
    Str tail;
    Str head;
} Cut;

Cut cut(Str, char);

The simplest way to build up a linked list is like a stack, pushing objects into the front. Zero-initialized head pointer, point the new node at it, then make that node the new head element:

Env *parse_reversed(Str s, Arena *a)
{
    Env *head = 0;  // 1
    for (Cut line = {s}; line.tail.len;) {
        line = cut(line.tail, '\n');
        Cut  pair  = cut(line.head, '=');
        Env *env   = new(a, 1, Env);
        env->key   = pair.head;
        env->value = pair.tail;
        env->next  = head;  // 2
        head = env;  // 3
    }
    return head;
}

That’s it, a complete linked list implementation in three lines of code. No big deal. Because of the bump allocator, nodes are packed in order in memory, so the usual cache objections for linked lists do not apply. LIFO semantics mean the linked list is in reverse order from the source order. If we’re doing a linear scan through the linked list, the last entry in the source wins, which may be what you wanted:

Str lookup_linear(Env *env, Str key)
{
    for (Env *var = env; var; var = var->next) {
        if (equals(key, var->key)) {
            return var->value;
        }
    }
    return (Str){};
}

    // ...
    Env *env  = parse_reversed(input, &scratch);
    Str value = lookup_linear(env, S("SHELL"));  // <- "/bin/sh"
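The Arena type and new macro used by the parser are not defined in this article. A minimal bump-allocator sketch follows, assuming a two-pointer arena, zero-initialized allocations, and abort on exhaustion. The field names and the alloc helper are my assumptions, not necessarily what list.c does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Assumed layout: the unused region lies between beg and end.
typedef struct {
    char *beg;
    char *end;
} Arena;

#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))

// Bump allocation: align the front of the free region, check that
// count objects fit, advance, and hand back zeroed memory. Assumes
// align is a power of two and size is non-zero.
static void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (count > (a->end - a->beg - pad) / size) {
        abort();  // out of memory
    }
    char *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}
```

Consecutive allocations of the same type land next to each other, which is why the list nodes end up packed in allocation order.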

It’s just one more line of code to maintain the original order, using a very simple double-pointer technique:

Env *parse_ordered(Str s, Arena *a)
{
    Env  *head = 0;  // 1
    Env **tail = &head;  // 2
    for (Cut line = {s}; line.tail.len;) {
        // ...
        *tail = env;  // 3
        tail = &env->next;  // 4
    }
    return head;
}

No branches necessary, nor dummy nodes. A pointer to the last pointer in the list works even for empty lists. The tail pointer is unneeded once the list is complete. This form has queue behavior.
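To see the double-pointer technique in isolation, here is a reduced sketch that links up a plain array of nodes in order. The Item type and link_in_order function are hypothetical names for illustration only.

```c
#include <stddef.h>

typedef struct Item Item;
struct Item {
    Item *next;
    int   value;
};

// Link n items in array order. The tail always points at the pointer
// that should receive the next node: first the head itself, then the
// next field of the most recently appended node.
static Item *link_in_order(Item *items, int n)
{
    Item  *head = 0;
    Item **tail = &head;
    for (int i = 0; i < n; i++) {
        items[i].next = 0;
        *tail = &items[i];
        tail = &items[i].next;
    }
    return head;
}
```

The empty case needs no special handling: with zero items the loop never runs and the head stays null.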

Faster look-up with a tree

If you’re doing many look-ups, or if the list is long, those linear scans to find items in the list are not ideal. We can introduce an intrusive hash map, in the form of a hash trie, by adding two more pointers to the linked list:

typedef struct Env Env;
struct Env {
    Env *next;
    Env *child[2];  // <- hash map linkage
    Str  key;
    Str  value;
};

I’ve found it’s simplest to construct a node into the hash map, then link it onto the list tail. That constructor looks like this:

Env *new_env(Arena *a, Env **env, Str key, Str value)
{
    for (uint64_t h = hash64(key); *env; h <<= 1) {
        env = &(*env)->child[h>>63];
    }
    *env = new(a, 1, Env);
    (*env)->key = key;
    (*env)->value = value;
    return *env;
}

Then we swap that into the head/tail version in place of the original new macro call:

Env *parse_mapped(Str s, Arena *a)
{
    Env  *head = 0;
    Env **tail = &head;
    for (Cut line = {s}; line.tail.len;) {
        // ...
        Env *env = new_env(a, &head, pair.head, pair.tail);
        *tail = env;
        tail = &env->next;
    }
    return head;
}

This is now a linked list and a hash map at the same time, built-up piece by piece without any resizing. We still have the original linked list, but we can now search it in log time. The look-up function resembles the constructor:

Str lookup_logn(Env *env, Str key)
{
    for (uint64_t h = hash64(key); env; h <<= 1) {
        if (equals(key, env->key)) {
            return env->value;
        }
        env = env->child[h>>63];
    }
    return (Str){};
}

Because of the FIFO semantics, it finds the first match in the source:

    Env *env   = parse_mapped(input, &scratch);
    Str  value = lookup_logn(env, S("SHELL"));  // <- /bin/bash
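The first-match behavior can be checked with a reduced sketch using integer keys and caller-provided nodes. Node, mix64, insert, and find are hypothetical names, and mix64 is only a stand-in for hash64.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Node Node;
struct Node {
    Node    *child[2];
    uint64_t key;
    int      value;
};

// Stand-in integer hash so that the top bits vary between keys.
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 32;
    x *= 1111111111111111111u;
    x ^= x >> 32;
    return x;
}

// Descend by the top hash bit until reaching an empty link, then
// attach the node there. Existing nodes never move.
static Node *insert(Node **root, Node *node)
{
    for (uint64_t h = mix64(node->key); *root; h <<= 1) {
        root = &(*root)->child[h>>63];
    }
    *root = node;
    return node;
}

// Follow the same path, stopping at the first node with a matching
// key. Earlier insertions sit higher on the path, so they win.
static Node *find(Node *root, uint64_t key)
{
    for (uint64_t h = mix64(key); root; h <<= 1) {
        if (root->key == key) {
            return root;
        }
        root = root->child[h>>63];
    }
    return 0;
}
```

A duplicate key follows the same path as the original and lands somewhere below it, so a search from the root always meets the earliest insertion first.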

The other matches are also in the tree, and we can find those as well by continuing traversal. That is, it’s already a multi-map. This particular interface can’t pick up where it left off, but we can build one that does using an iterator/cursor:

typedef struct {
    uint64_t hash;
    Str      key;
    Env     *env;
} EnvIter;

EnvIter new_enviter(Env *env, Str key)
{
    return (EnvIter){hash64(key), key, env};
}

Str enviter_next(EnvIter *it)
{
    while (it->env) {
        Env *cur = it->env;
        it->env = it->env->child[it->hash>>63];
        it->hash <<= 1;
        if (equals(it->key, cur->key)) {
            return cur->value;
        }
    }
    return (Str){};
}

Update: Thanks to Daniel Kareh for a correction.

Then we can use a loop to visit every match in source order:

    Env *env = parse_mapped(input, &scratch);
    for (EnvIter it = new_enviter(env, S("SHELL"));;) {
        Str value = enviter_next(&it);
        if (!value.data) break;
        // ...
    }

Faster look-up with an index table

If the list is static once constructed, or if look-ups happen much more frequently than the list grows, we can find list items even faster by constructing an index table over the list: an MSI hash table. This table avoids redundancy by sharing structure with the list. Because it’s a flat table, if we keep adding to the list then eventually we’ll need to reconstruct a larger table when it becomes overloaded.

The table itself has a very simple structure, just an array and its size, expressed as a power-of-two exponent:

typedef struct {
    Env **slots;
    int   exp;
} EnvTable;

We do not need the child nodes, and so linked list nodes are untouched. That is, it’s not intrusive. In fact, we can build any arbitrary number of tables over a list, perhaps indexing different properties for different sorts of queries. The idea is that we build the list first, then create the table:

EnvTable new_table(Arena *a, Env *env)
{
    // Compute list length
    ptrdiff_t len = 0;
    for (Env *var = env; var; var = var->next) {
        len++;
    }

    // Then compute an appropriate table size
    EnvTable table = {};
    table.exp = 3;
    ptrdiff_t one = 1;
    for (; (one<<table.exp) - (one<<(table.exp-3)) < len; table.exp++) {}
    table.slots = new(a, one<<table.exp, Env *);

    // Then insert linked list items into the table
    for (Env *var = env; var; var = var->next) {
        uint64_t hash = hash64(var->key);
        size_t   mask = ((size_t)1 << table.exp) - 1;
        size_t   step = (size_t)(hash >> (64 - table.exp)) | 1;
        for (size_t i = (size_t)hash;;) {
            i = (i + step) & mask;
            if (!table.slots[i]) {
                table.slots[i] = var;
                break;
            }
        }
    }

    return table;
}
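The sizing loop grows the exponent until the list fits within 7/8 of the table, since (1<<exp) - (1<<(exp-3)) is the capacity minus one eighth of it. Pulled out into a standalone function, with table_exp being a hypothetical name of my own, the arithmetic can be checked directly:

```c
#include <stddef.h>

// Smallest exponent, starting at 3, such that len entries occupy no
// more than 7/8 of a table with 1<<exp slots.
static int table_exp(ptrdiff_t len)
{
    int       exp = 3;
    ptrdiff_t one = 1;
    for (; (one<<exp) - (one<<(exp-3)) < len; exp++) {}
    return exp;
}
```

For example, up to 7 entries fit in 8 slots, while 8 entries require 16 slots. Because the limit is always below the full capacity, at least one slot stays empty, which is what lets a search for a missing key terminate.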

Note how new_table only searches for an empty slot, not for a matching entry. That’s because this too is a multi-map, also with elements in insertion order. Look-ups are constant time:

Str lookup_constant(EnvTable table, Str key)
{
    uint64_t hash = hash64(key);
    size_t   mask = ((size_t)1 << table.exp) - 1;
    size_t   step = (size_t)(hash >> (64 - table.exp)) | 1;
    for (size_t i = (size_t)hash;;) {
        i = (i + step) & mask;
        if (!table.slots[i]) {
            return (Str){};
        } else if (equals(table.slots[i]->key, key)) {
            return table.slots[i]->value;
        }
    }
}

It finds the earliest match in the list, meaning an index over the “reverse” list will find the last entry in the source. The indexed-over property is the input to hash64 and equals. By using a different input to these functions we could build another table on, say, value length if that’s a property on which we needed to find elements efficiently. Again, for multi-map iteration we need some kind of iterator or cursor:

typedef struct {
    EnvTable table;
    Str      key;
    size_t   step;
    size_t   i;
} TableIter;

TableIter new_tableiter(EnvTable table, Str key)
{
    uint64_t hash = hash64(key);
    size_t   step = (size_t)(hash >> (64 - table.exp)) | 1;
    size_t   idx  = (size_t)hash;
    return (TableIter){table, key, step, idx};
}

Str table_next(TableIter *it)
{
    size_t mask  = ((size_t)1 << it->table.exp) - 1;
    Env  **slots = it->table.slots;
    for (;;) {
        it->i = (it->i + it->step) & mask;
        if (!slots[it->i]) {
            return (Str){};
        } else if (equals(slots[it->i]->key, it->key)) {
            return slots[it->i]->value;
        }
    }
}

Its usage looks just like the other multi-map:

    Env *env = parse_ordered(input, &scratch);
    EnvTable table = new_table(&scratch, env);
    for (TableIter it = new_tableiter(table, S("SHELL"));;) {
        Str value = table_next(&it);
        if (!value.data) break;
        // ...
    }
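The probing in new_table, lookup_constant, and table_next depends on the step being odd: an odd step is coprime with the power-of-two table size, so the sequence reaches every slot before repeating. A small sketch to check that property, where probe_visits_all is a hypothetical helper limited to exponents from 1 to 16:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Returns true if the probe sequence derived from hash visits all
// 1<<exp slots exactly once in its first 1<<exp steps.
// Assumes 1 <= exp <= 16 so the scratch array stays small.
static bool probe_visits_all(uint64_t hash, int exp)
{
    static bool seen[1<<16];
    size_t n = (size_t)1 << exp;
    for (size_t k = 0; k < n; k++) {
        seen[k] = false;
    }

    size_t mask = n - 1;
    size_t step = (size_t)(hash >> (64 - exp)) | 1;
    size_t i    = (size_t)hash;
    for (size_t k = 0; k < n; k++) {
        i = (i + step) & mask;
        if (seen[i]) {
            return false;  // revisited a slot too early
        }
        seen[i] = true;
    }
    return true;
}
```

Without the | 1, an even step could cycle through only a fraction of the table and spin forever when that fraction is full.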

With these techniques at hand, I can start with linked lists when they are convenient, and later add needed features without fundamentally changing the underlying data structure. None of this requires runtime support, and so it fits comfortably on embedded systems, tiny WebAssembly programs, etc. All the above code is available ready to run: list.c.