<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>null program</title>
  <link rel="alternate" type="text/html" href="https://nullprogram.com"/>
  <link rel="self" type="application/atom+xml" href="https://nullprogram.com/feed/"/>
  <updated>2026-04-09T13:25:45Z</updated>
  <id>urn:uuid:f8b65823-4ec5-3a70-efc8-2b713aa63091</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com/</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  
  <entry>
    <title>dcmake: a new CMake debugger UI</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/04/07/"/>
    <id>urn:uuid:eb448519-0a55-4c1c-bc55-17a65634224f</id>
    <updated>2026-04-07T03:04:02Z</updated>
    <category term="cpp"/>
    <content type="html">
      <![CDATA[<p>CMake has had a <a href="https://cmake.org/cmake/help/latest/manual/cmake.1.html#cmdoption-cmake-debugger"><code class="language-plaintext highlighter-rouge">--debugger</code> mode</a> since <a href="https://cmake.org/cmake/help/latest/release/3.27.html#debugger">3.27</a> (July 2023),
allowing software to manipulate it interactively through the <a href="https://microsoft.github.io/debug-adapter-protocol/">Debug
Adapter Protocol</a> (DAP), an HTTP-like protocol passing JSON messages.
Debugger front-ends can start, stop, step, breakpoint, query variables,
etc. a live CMake. When I came across this mode, I immediately conceived a
project putting it to use. Thanks to <a href="/blog/2026/03/29/">recent leaps in software engineering
productivity</a>, I had a working prototype in 30 minutes, and by the
end of that same day, a complete, multi-platform, native, GUI application.
I named it <strong><a href="https://github.com/skeeto/dcmake">dcmake</a></strong> (“debugger for CMake”). I’ve tested it on macOS,
Windows, and Linux. Despite being only a couple of days old, it’s one of the
coolest things I’ve ever built. Prior to 2026, I estimate it would have
taken me a month to get the tool to this point.</p>
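<p>To make “HTTP-like” concrete: in the DAP base protocol, each JSON message is preceded by a <code class="language-plaintext highlighter-rouge">Content-Length</code> header giving the body size in bytes. A blank line (<code class="language-plaintext highlighter-rouge">\r\n\r\n</code>) separates the header from the body. Below is a minimal C sketch of that framing. <code class="language-plaintext highlighter-rouge">frame_dap_message</code> is a hypothetical helper for illustration, not code from dcmake.</p>

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from dcmake): write the DAP base-protocol
 * framing of a JSON body into dst. The header counts body bytes, and
 * a blank line ("\r\n\r\n") separates the header from the body.
 * Returns the framed length, or -1 if dst is too small. */
static int frame_dap_message(char *dst, size_t cap, const char *json)
{
    size_t len = strlen(json);
    int n = snprintf(dst, cap, "Content-Length: %zu\r\n\r\n%s", len, json);
    if (n < 0 || (size_t)n >= cap) {
        return -1;  /* output was truncated, or snprintf failed */
    }
    return n;
}
```

<p>A real client also needs the reverse direction. It reads header lines up to the blank line, then reads exactly <code class="language-plaintext highlighter-rouge">Content-Length</code> bytes of JSON.</p>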

<p><a href="/img/dcmake/dcmake.png"><img src="/img/dcmake/dcmake-thumb.png" alt="" /></a></p>

<p>It has a <a href="https://github.com/ocornut/imgui">Dear ImGui</a> interface, which I’ve experienced as a user but
never built on myself before. Specifically the <a href="https://github.com/ocornut/imgui/wiki/Docking">docking branch</a>. In a
sense it’s a toolkit for building debuggers, so it’s playing an enormous
role in how quickly I put this project together. All of the “windows” tear
out and may be free-floating or docked wherever you like, closely matching
the classic Visual Studio UI. I borrowed all the same keybindings: F10 to
step over, F11 to step in, F5 to start/continue, shift+F5 to stop. Click
on line numbers to toggle breakpoints, right click to run-to-line, hover
over variables with the mouse to see their values. Nearly every UI
state persists across sessions, and it opens nearly instantly.</p>

<video src="/vid/dcmake.mp4" loop="" muted="" autoplay=""></video>

<p>This is just one of many situations in which I’ve used AI over the past month for UI
development, and it’s been shockingly effective. I can describe roughly
the interface I want, and the AI makes it happen in a matter of minutes.
It understands what I mean, filling in the details, sometimes anticipating
what I’ll ask for next. If I’m unsure how I want a UI to work, it also
offers good advice. If I need simple icons and such, it can draw those,
too. It’s all incredibly empowering.</p>

<p>On macOS and Linux it runs on top of GLFW with OpenGL 3 rendering, and on
Windows it uses native Win32 windowing and DirectX 11 rendering.</p>

<p>Program arguments given to dcmake populate the top-left arguments text
input, which go straight into CMake on start. So you can prepend <code class="language-plaintext highlighter-rouge">d</code> to
your CMake configuration command to run it inside the debugger. Passing no
arguments sets it up for “standard” <code class="language-plaintext highlighter-rouge">-B build</code> configuration.</p>

<p>In general, if you don’t have anywhere in particular to look, likely the
first thing to do after starting dcmake (in a project) is press F10. It
starts CMake paused on the first line of <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>, or whatever
script you’re debugging. If you’re trying out dcmake for the first time,
that’s a good place to start. Keep pressing F10 to step through that
script, watching it run through its configuration. If you F11 through the
script then you’ll dive deeper and deeper into CMake itself, which can be
insightful.</p>

<p>There is no point in trying to debug <code class="language-plaintext highlighter-rouge">--build</code> invocations. It’s just a
uniform interface to the underlying build tool, and there is no CMake left
to debug at that point. However, it <em>does</em> work with <code class="language-plaintext highlighter-rouge">-P</code> <a href="https://cmake.org/cmake/help/latest/manual/cmake.1.html#cmdoption-cmake-P">script mode</a>
invocations. CMake can operate as a <a href="https://claude.ai/public/artifacts/06b50c8f-ff71-4562-8ab5-80adaddff9b7">platform-agnostic shell script-like
tool</a>, but unlike shell scripts you can step through them with a
debugger like dcmake.</p>

<p>On Windows it supports Unicode paths all the way through, without <a href="/blog/2021/12/30/">a UTF-8
manifest</a>. This took some <a href="/blog/2022/02/18/">special care</a>, in particular
avoiding any C++ standard library I/O functionality. Current frontier AI
cannot handle this detail on its own. The macOS platform required a bit
of Objective-C, as it often does, and I’m happy I didn’t have to figure
that part out myself.</p>

<p>The next release of <a href="https://github.com/skeeto/w64devkit">w64devkit</a> will include dcmake, complementing its
recent addition of CMake. This new tool has already proven useful in its
own development.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>2026 has been the most pivotal year in my career… and it's only March</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/03/29/"/>
    <id>urn:uuid:91d679b3-4f07-4b61-b359-5890695ad621</id>
    <updated>2026-03-29T21:38:22Z</updated>
    <category term="ai"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>In February I left my employer after nearly two decades of service. In the
moment I was optimistic, yet unsure I had made the right choice. Now that
the dust has settled, I’m absolutely sure I chose correctly. I’m happier and better for it.
There were multiple factors, but it’s not mere chance it coincides with
these early months of <a href="https://shumer.dev/something-big-is-happening">the automation of software engineering</a>. I
left an employer that is <em>years behind</em> adopting AI to one actively
supporting and encouraging it. As of March, in my professional capacity
<strong>I no longer write code myself</strong>. My current situation was unimaginable
to me only a year ago. Like it or not, this is the future of software
engineering. Turns out I like it, and having tasted the future I don’t
want to go back to the old ways.</p>

<p>In case you’re worried, this is still me. These are my own words. <a href="https://paulgraham.com/writes.html">Writing
is thinking</a>, and it would defeat the purpose for an AI to write
in my place on my personal blog. That’s not going to change.</p>

<p>I still spend much time reading and understanding code, and using most of
the same development tools. It’s more like being a manager, orchestrating
a nebulous team of inhumanly-fast, nameless assistants. Instead of dicing
the vegetables, I conjure a helper to do it while I continue to run the
kitchen. I haven’t managed people in some 20 years now, but I can feel
those old muscles being put to use again as I improve at this new role.
Will these kitchens still need human chefs like me by the end of the
decade? Unclear, and it’s something we all need to prepare for.</p>

<p>My situation gave me the experience of onboarding with AI assistance: a fast
process given a near-instant, infinitely-patient helper answering any
question about the code. By the second week I was making substantial, wide
contributions to the large C++ code base. It’s difficult to attach a
quantifiable factor like 2x, 5x, 10x, etc. faster, but I can say for
certain this wouldn’t have been possible without AI. The bottlenecks have
shifted from producing code, which now takes almost no time at all, to
other points, and we’re all still trying to figure it out.</p>

<p>My personal programming has transformed as well. Everything <a href="/blog/2024/11/10/">I said about
AI in late 2024</a> is, as I predicted, utterly obsolete. There’s a
huge, growing gap between open weight models and the frontier. Models you
can run yourself are toys. In general, almost any AI product or service
worth your attention costs money. The free stuff is, at minimum, months
behind. Most people only use limited, free services, so there’s a broad
unawareness of just how far AI has advanced. AI is <em>now highly skilled at
programming</em>, and better than me at almost every programming task, with
inhumanly-low defect rates. The remaining issues are mainly steering
problems: If AI code doesn’t do what I need, likely the AI writing it
didn’t understand what I needed.</p>

<p>I’ll still write code myself from time to time for fun — <a href="/blog/2018/06/10/">minimalist</a>,
with my <a href="/blog/2023/10/08/">style</a> and <a href="/blog/2025/01/19/">techniques</a> — the same way I play <a href="https://en.wikipedia.org/wiki/Shogi">shogi</a> on
the weekends for fun. However, artisan production is uneconomical in the
presence of industrialization. AI makes programming so cheap that only the
rich will write code by hand.</p>

<p>A small part of me is sad at what is lost. A bigger part is excited about
the possibilities of the future. I’ve always had more ideas than time or
energy to pursue them. With AI at my command, the problem changes shape. I
can comfortably take on complexity from which I previously shied away, and
I can take a shot at any idea sufficiently formed in my mind to prompt an
AI — a whole skill of its own that I’m actively developing.</p>

<p>For instance, a couple weeks ago I <a href="https://github.com/skeeto/w64devkit/pull/357">put AI to work on a problem</a>,
and it produced a working solution for me after ~12 hours of continuous,
autonomous work, literally while I slept. The past month <a href="https://github.com/skeeto/w64devkit">w64devkit</a> has
burst with activity, almost entirely AI-driven. Some of it is architectural
changes I’ve wanted for years that would have required hours of tedious
work, and so I never got around to them. AI knocked them out in minutes, with the
new architecture opening new opportunities. It’s also taken on most of the
cognitive load of maintenance.</p>

<h3 id="quiltcpp">Quilt.cpp</h3>

<p>So far my biggest successful undertaking is <strong><a href="https://github.com/skeeto/quilt.cpp">Quilt.cpp</a></strong>, a C++
clone of <a href="https://savannah.nongnu.org/projects/quilt">Quilt</a>, an early, actively-used source control system for
patch management. Git is a glaring omission from the <a href="/blog/2020/09/25/">almost</a> complete
w64devkit, due to platform and build issues. I’ve thought Quilt could fill
<em>some</em> of that source control hole, except the original is written in
Bash, Perl, and GNU Coreutils — even more of a challenge than Git. Since
Quilt is conceptually simple, and I could lean on <a href="https://frippery.org/busybox/">busybox-w32</a> <code class="language-plaintext highlighter-rouge">diff</code>
and <code class="language-plaintext highlighter-rouge">patch</code>, I’ve considered writing my own implementation, just <a href="/blog/2023/01/18/">as I did
pkg-config</a>, but I never found the energy to do it.</p>

<p>Then I got good enough with AI to knock out a near feature-complete clone
in about four days, including a built-in <code class="language-plaintext highlighter-rouge">diff</code> and <code class="language-plaintext highlighter-rouge">patch</code> so it doesn’t
actually depend on external tools (except invoking <code class="language-plaintext highlighter-rouge">$EDITOR</code>). On Windows
it’s a ~1.6MB standalone EXE, to be included in future w64devkit releases.
The source is distributed as an amalgamation, a single file <code class="language-plaintext highlighter-rouge">quilt.cpp</code>
per its namesake:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ c++ -std=c++20 -O2 -s -o quilt.exe quilt.cpp
$ ./quilt.exe --help
Usage: quilt [--quiltrc file] &lt;command&gt; [options] [args]

Commands:
  new        Create a new empty patch
  add        Add files to the topmost patch
  push       Apply patches to the source tree
  pop        Remove applied patches from the stack
  refresh    Regenerate a patch from working tree changes
  diff       Show the diff of the topmost or a specified patch
  series     List all patches in the series
  applied    List applied patches
  unapplied  List patches not yet applied
  top        Show the topmost applied patch
  next       Show the next patch after the top or a given patch
  previous   Show the patch before the top or a given patch
  delete     Remove a patch from the series
  rename     Rename a patch
  import     Import an external patch into the series
  header     Print or modify a patch header
  files      List files modified by a patch
  patches    List patches that modify a given file
  edit       Add files to the topmost patch and open an editor
  revert     Discard working tree changes to files in a patch
  remove     Remove files from the topmost patch
  fold       Fold a diff from stdin into the topmost patch
  fork       Create a copy of the topmost patch under a new name
  annotate   Show which patch modified each line of a file
  graph      Print a dot dependency graph of applied patches
  mail       Generate an mbox file from a range of patches
  grep       Search source files (not implemented)
  setup      Set up a source tree from a series file (not implemented)
  shell      Open a subshell (not implemented)
  snapshot   Save a snapshot of the working tree for later diff
  upgrade    Upgrade quilt metadata to the current format
  init       Initialize quilt metadata in the current directory

Use "quilt &lt;command&gt; --help" for details on a specific command.
</code></pre></div></div>
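<p>Under that command list, the model is a stack. The series is an ordered list of patches, and the applied patches always form a prefix of it. <code class="language-plaintext highlighter-rouge">push</code> and <code class="language-plaintext highlighter-rouge">pop</code> move the boundary, and <code class="language-plaintext highlighter-rouge">top</code> and <code class="language-plaintext highlighter-rouge">next</code> read the patch on either side of it. Below is a minimal C sketch of just that bookkeeping. The <code class="language-plaintext highlighter-rouge">series</code> struct and its functions are hypothetical, not Quilt.cpp’s code, and actually applying or reverting diffs is left out.</p>

```c
#include <stdbool.h>

/* Hypothetical model of a quilt series: an ordered list of patch
 * names, of which the first `applied` entries are currently applied.
 * Applied patches always form a prefix of the series. */
typedef struct {
    const char **names;
    int          count;    /* patches in the series */
    int          applied;  /* how many are applied, 0..count */
} series;

/* "top": the topmost applied patch, or 0 if none are applied */
static const char *series_top(const series *s)
{
    return s->applied ? s->names[s->applied - 1] : 0;
}

/* "next": the first unapplied patch, or 0 if all are applied */
static const char *series_next(const series *s)
{
    return s->applied < s->count ? s->names[s->applied] : 0;
}

/* "push": apply the next patch (applying the diff is elided) */
static bool series_push(series *s)
{
    if (s->applied == s->count) {
        return false;
    }
    s->applied++;
    return true;
}

/* "pop": unapply the topmost patch (reverting the diff is elided) */
static bool series_pop(series *s)
{
    if (s->applied == 0) {
        return false;
    }
    s->applied--;
    return true;
}
```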

<p>It supports Windows and POSIX, and runs ~5x faster than the original. AI
developed it on Windows, Linux, and macOS: It’s best when the AI can close
the debug loop and tackle problems autonomously without involving a human
slowpoke. The handful of “not implemented” parts aren’t because they’re
too hard — each would probably take an AI ~10 minutes — but deliberate
decisions of taste.</p>

<p>There’s an irony that the reason I could produce Quilt.cpp with such ease
is also a reason I don’t really need it anymore.</p>

<p>I changed the output of <code class="language-plaintext highlighter-rouge">quilt mail</code> to be more Git-compatible. The mbox
produced by Quilt.cpp can be imported into Git with a plain <code class="language-plaintext highlighter-rouge">git am</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ quilt mail --mbox feature-branch.mbox
$ git am feature-branch.mbox
</code></pre></div></div>

<p>The idea being that I could work on a machine without Git (e.g. Windows
XP), and copy/mail the mbox to another machine where Git can absorb it as
though it were in Git the whole time. <code class="language-plaintext highlighter-rouge">git format-patch</code> to <code class="language-plaintext highlighter-rouge">quilt import</code>
sends commits in the opposite direction, useful for manually testing
Quilt.cpp on real change sets.</p>

<p>To be clear, I could not have done this if the original Quilt did not
exist as a working program. I began with an AI generating a <a href="https://eli.thegreenplace.net/2026/rewriting-pycparser-with-the-help-of-an-llm/">conformance
suite</a> based on the original, its documentation, and other online
documentation, validating that suite against the original implementation
(see <code class="language-plaintext highlighter-rouge">-DQUILT_TEST_EXECUTABLE</code>). Then I had another AI write code to pass the tests, on
architectural guidance from me, with <code class="language-plaintext highlighter-rouge">-D_GLIBCXX_DEBUG</code> and sanitizers as
guardrails. That was day one. The next three days were lots of refining
and iteration as I discovered the gaps in the test suite. I’d prompt AI to
compare Quilt.cpp to the original Quilt man page, add tests for missing
features, validate the new tests against the original Quilt, then run
several agents to fix the tests. While they worked I’d try the latest
build and note any bugs. As of this writing, the result is about equal
parts test and non-test, ~9KLoC each.</p>

<p>I’m likely to use this technique to clone other tools with implementations
unsuitable for my purposes. I learned quite a bit from this first attempt.</p>

<p>Why C++ instead of my usual choice of C? As we know, <a href="/blog/2023/02/11/">conventional C is
highly error-prone</a>. Even AI has trouble with it. In the ~9k lines
of C++ that is Quilt.cpp, I am only aware of three memory safety errors by
the AI. Two were null-terminated string issues with <code class="language-plaintext highlighter-rouge">strtol</code>, where the AI
was essentially writing C instead of C++, after which I directed the AI to
use <code class="language-plaintext highlighter-rouge">std::from_chars</code> and drop as much direct libc use as possible. (The
other was an unlikely branch with <code class="language-plaintext highlighter-rouge">std::vector::back</code> on an empty vector.)
We can rescue C with better techniques like arena allocation, counted
strings, and slices, but while (current) state of the art AI understands
these things, it cannot work effectively with them in C. I’ve tried. So I
picked C++, and from my professional work I know AI is better at C++ than
me.</p>
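<p>The <code class="language-plaintext highlighter-rouge">strtol</code> pitfall comes from its interface. It takes no length and scans until it reaches a character that can’t be part of the number. On a counted string that isn’t null-terminated, it can read past the end: a two-byte slice of <code class="language-plaintext highlighter-rouge">"12345"</code> parses as 12345. The counted-string alternative bounds every access by the length, much as <code class="language-plaintext highlighter-rouge">std::from_chars</code> does with its <code class="language-plaintext highlighter-rouge">first</code>/<code class="language-plaintext highlighter-rouge">last</code> pair. Below is a minimal C sketch. The <code class="language-plaintext highlighter-rouge">str</code> type and <code class="language-plaintext highlighter-rouge">parse_i64</code> are hypothetical, not from Quilt.cpp or any library.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counted string: pointer plus length, no terminator */
typedef struct {
    const char *data;
    ptrdiff_t   len;
} str;

/* Parse a decimal integer occupying exactly s.len bytes. Every read
 * stays inside [data, data+len). Rejects empty input, stray
 * characters, and overflow instead of silently stopping early. */
static bool parse_i64(str s, int64_t *out)
{
    ptrdiff_t i = 0;
    bool neg = false;
    if (i < s.len && (s.data[i] == '-' || s.data[i] == '+')) {
        neg = s.data[i] == '-';
        i++;
    }
    if (i == s.len) {
        return false;  /* no digits */
    }
    /* Accumulate as a negative value so INT64_MIN is representable */
    int64_t r = 0;
    for (; i < s.len; i++) {
        int d = s.data[i] - '0';
        if (d < 0 || d > 9) {
            return false;  /* not a digit */
        }
        if (r < (INT64_MIN + d) / 10) {
            return false;  /* r*10 - d would overflow */
        }
        r = r*10 - d;
    }
    if (!neg) {
        if (r == INT64_MIN) {
            return false;  /* +9223372036854775808 does not fit */
        }
        r = -r;
    }
    *out = r;
    return true;
}
```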

<p>Also like a manager, I have not read most of the code, and instead focused
on results, so you might say this was “vibe-coded.” It <em>is</em> thoroughly
tested, though I’m sure there are still bugs to be ironed out, especially
on the more esoteric features I haven’t tried by hand yet.</p>

<h3 id="lets-discuss-tools">Let’s discuss tools</h3>

<p>After opposing CMake for years, you may have noticed the latest w64devkit
now includes CMake and Ninja. What happened? Preparing for my anticipated
employment change, this past December I read <a href="https://crascit.com/professional-cmake/"><em>Professional CMake</em></a>.
I realized that my practical problems with CMake stemmed from nearly everyone
using it incorrectly. Most CMake builds are a disaster, but my new-found
knowledge allows me to navigate the common mistakes. Only high profile
open source projects manage to put together proper CMake builds. Otherwise
the internet is loaded with CMake misinformation. Similar to AI, if you’re
not paying for CMake knowledge then it’s likely wrong or misleading. So I
highly recommend that book!</p>

<p>Frontier AI is <em>very good</em> with CMake. When a project has a CMake build
that isn’t <em>too</em> badly broken, just tell AI to fix it, <em>without any
specifics</em>, and build problems disappear in mere minutes without having to
think about it. It’s awesome. Combine it with the previous discussion
about tests making AI so much more effective, and that it <em>also</em> knows
CTest well, and you’ve got a killer formula. I’m more effective with CTest
myself merely from observing how AI uses it. AI (currently) cannot use
debuggers, so putting powerful, familiar testing tools in its hands helps
a lot, versus the usual bespoke, debugger-friendly solutions I prefer.</p>

<p>Similar to solving CMake problems: Have a hairy merge conflict? Just ask
AI to resolve it. It’s like magic. I no longer fear merge conflicts.</p>

<p>So part of my motivation for adding CMake to w64devkit was anticipation of
projects like Quilt.cpp, where they’d be available to AI, or at least so I
could use the tools the AI used to build/test myself. It’s already paid
for itself, and there’s more to come.</p>

<p>For agent software, on personal projects I’m using Claude Code. It’s a
great value, cheaper than paying API rates but requires working around
5-hour limit windows. I started with Pro (US$20/mo), but I’m getting so
much out of it that as of this writing I’m on 5x Max (US$100/mo) simply to
have enough to explore all my ideas. Be warned: <strong>Anthropic software is
quite buggy, more so than industry average</strong>, and it’s obvious that they
never even <em>start</em>, let alone test, some of their released software on
disfavored platforms (Windows, Android). Don’t expect to use Claude Code
effectively for native Windows platform development, which sadly includes
w64devkit. Hopefully that’s fixed someday. I suspect Anthropic hit a
bottleneck on QA, and unable to fit AI in that role they don’t bother. You
can theoretically report bugs on GitHub, but they’re just ignored and
closed. (Why don’t they have AI agents jumping on this wealth of bug
reports?)</p>

<p>At work I’m using Cursor where I get a choice of models. My favorite for
March has been GPT-5.4, which in my experience beats Opus 4.6 on Claude
Code by a small margin. It’s immediately obvious that Cursor is better
agent software than Claude Code. It’s more robust, more featureful, and
with a clearer UI than Claude Code. It has no trouble on Windows and can
drive w64devkit flawlessly. It’s also more expensive than Claude Code. My
employer currently spends ~US$250/mo on my AI tokens, dirt cheap
considering what they’re getting out of it. I have bottlenecks elsewhere
that keep me from spending even more.</p>

<p>As a general rule, for software engineering always use the smartest model
available. The cheaper, dumber models cost more in the long run. It takes
more tokens to achieve worse results, which costs more human time to sort
out.</p>

<p>Neither Cursor nor Claude Code is open source, so what are the purists to
do, even if they’re willing to pay API rates for tokens? Sadly I have no
answers for you. I haven’t gotten any open source agent software actually
working, and it seems they may lack the necessary secret sauce.</p>

<p>Update: Several folks suggested I give <a href="https://opencode.ai/">OpenCode</a> another shot, and this
time I got over the configuration hurdle. Single executable, slick
interface, and unlike Claude Code, I observed no bugs in my brief trial.
Give that a shot if you’re looking for an open source client.</p>

<p>The future is going to be weird. My experience is only a peek at what’s to
come, and my head is still spinning. However, the more I adapt to the
changes, the better I feel. If you’re feeling anxious like I was, don’t
flinch from improving your own AI knowledge and experience.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>Frankenwine: Multiple personas in a Wine process</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/01/19/"/>
    <id>urn:uuid:d2b53f8d-88a6-400b-a748-693a758741c5</id>
    <updated>2026-01-19T21:51:38Z</updated>
    <category term="c"/><category term="win32"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>I came across a recent article on <a href="https://gpfault.net/posts/drunk-exe.html">making Linux system calls from a Wine
process</a>. Windows programs running under Wine are still normal Linux
processes and may interact with the Linux kernel like any other process.
None of this was surprising, and the demonstration works just as I expect.
Still, it got the wheels spinning and I realized an <em>almost</em> practical
application: build <a href="/blog/2023/01/18/">my pkg-config implementation</a> such that on Windows
<code class="language-plaintext highlighter-rouge">pkg-config.exe</code> behaves as a native pkg-config, but when run under Wine
this same binary takes the persona of a Linux program and becomes a cross
toolchain pkg-config, bypassing Win32 and talking directly with the Linux
kernel. <a href="https://justine.lol/cosmopolitan/">Cosmopolitan Libc</a> cleverly does this out-of-the-box, but
in this article we’ll mash together a couple existing sources with a bit
of glue.</p>

<p>The results are in <a href="https://github.com/skeeto/u-config/commit/e0008d7e">the merge-demo branch</a> of u-config, and took
hardly any work:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)
</code></pre></div></div>

<p>A platform layer, <code class="language-plaintext highlighter-rouge">main_wine.c</code>, is a merge of two existing platform
layers, one of which required unavoidable tweaks. We’ll get to those
details in a moment. First we’ll need to detect if we’re running under
Wine, and <a href="https://web.archive.org/web/20250923061634/https://stackoverflow.com/questions/7372388/determine-whether-a-program-is-running-under-wine-at-runtime/42333249#42333249">the best solution I found</a> was to locate
<code class="language-plaintext highlighter-rouge">ntdll!wine_get_version</code>. If this function exists, we’re in Wine. That
works out to a pretty one-liner because <code class="language-plaintext highlighter-rouge">ntdll.dll</code> is already loaded:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bool</span> <span class="nf">running_on_wine</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">GetProcAddress</span><span class="p">(</span><span class="n">GetModuleHandleA</span><span class="p">(</span><span class="s">"ntdll"</span><span class="p">),</span> <span class="s">"wine_get_version"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>An x86-64 Linux syscall wrapper with <a href="/blog/2024/12/20/">thorough inline assembly</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ptrdiff_t</span> <span class="nf">syscall3</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">b</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">r</span><span class="p">;</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">ptrdiff_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_write</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="p">(</span><span class="kt">ptrdiff_t</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
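<p>Bypassing libc also means bypassing <code class="language-plaintext highlighter-rouge">errno</code>. A raw Linux system call reports failure by returning a negated error code. Only values from -4095 through -1 are errors, because the kernel’s <code class="language-plaintext highlighter-rouge">MAX_ERRNO</code> is 4095. Below is a minimal sketch of checking a result from a wrapper like <code class="language-plaintext highlighter-rouge">syscall3</code>. <code class="language-plaintext highlighter-rouge">sys_failed</code> and <code class="language-plaintext highlighter-rouge">sys_errno</code> are hypothetical helpers, not part of u-config.</p>

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from u-config). A raw Linux syscall
 * returns -errno on failure, and the kernel caps error codes at
 * 4095 (MAX_ERRNO), so only -4095..-1 signals an error. */
static bool sys_failed(ptrdiff_t r)
{
    return r < 0 && r >= -4095;
}

/* The positive error code of a failed result, or 0 on success. */
static int sys_errno(ptrdiff_t r)
{
    return sys_failed(r) ? (int)-r : 0;
}
```

<p>For example, calling <code class="language-plaintext highlighter-rouge">write</code> on a descriptor that isn’t open should return -9, which is <code class="language-plaintext highlighter-rouge">-EBADF</code> on Linux.</p>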

<p>I’d normally use <code class="language-plaintext highlighter-rouge">long</code> for all these integers because Linux is <a href="https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models">LP64</a>
(<code class="language-plaintext highlighter-rouge">long</code> is pointer-sized), but Windows is LLP64 (only <code class="language-plaintext highlighter-rouge">long long</code> is 64
bits). It’s so bizarre to interface with Linux from LLP64, and this will
have consequences later. With these pieces we can see the basic shape of a
split personality program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">running_on_wine</span><span class="p">())</span> <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"hello, wine</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">12</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
        <span class="n">WriteFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"hello, windows</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>We can cram two programs into this binary and select which program at run
time depending on what we see. In typical programs locating and calling
into glibc would be a challenge, particularly with the incompatible ABIs
involved. We’re avoiding it here by interfacing directly with the kernel.</p>
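<p>Back to integer widths: the syscall wrapper uses <code class="language-plaintext highlighter-rouge">ptrdiff_t</code> because it is pointer-sized under both data models. <code class="language-plaintext highlighter-rouge">long</code> is 64 bits on x86-64 Linux (LP64) but only 32 bits in a 64-bit Windows build (LLP64), which is what runs under Wine. Below is a small C sketch that checks this assumption at compile time and reports the model. <code class="language-plaintext highlighter-rouge">data_model</code> is a hypothetical helper.</p>

```c
#include <stddef.h>

/* Syscall arguments are register-sized, so the wrappers need an
 * integer type as wide as a pointer under both data models. */
_Static_assert(sizeof(ptrdiff_t) == sizeof(void *),
               "ptrdiff_t must be pointer-sized");

/* Hypothetical helper: name the 64-bit data model of this build.
 * LP64: long is 64 bits (Linux). LLP64: long is 32 bits (Windows). */
static const char *data_model(void)
{
    if (sizeof(void *) != 8) {
        return "not 64-bit";
    }
    return sizeof(long) == 8 ? "LP64" : "LLP64";
}
```

<p>A Mingw-w64 build reports LLP64 even while running under Wine on a Linux kernel. The data model is fixed at compile time, so the Linux persona has to live with it.</p>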

<h3 id="application-to-u-config">Application to u-config</h3>

<p>Luckily u-config has completely optional platform layers implemented with
Linux system calls. The POSIX platform layer works fine, and that’s what
distributions should generally use, but these bonus platforms are unhosted
and do not require libc. That means we can shove them into a Windows build
with relatively little trouble.</p>

<p>Before we do that, let’s think about what we’re doing. <a href="/blog/2021/08/21/">Debian has great
cross toolchain support</a>, including Mingw-w64. There are even a few
Windows libraries in the Debian package repository, <a href="https://packages.debian.org/trixie/x32/libz-mingw-w64">such as zlib</a>, and
we can build Windows programs against them. If you’re cross-building and
using pkg-config, you ought to use the cross toolchain pkg-config, which
in GNU ecosystems gets an architecture prefix like the other cross tools.
Debian cross toolchains each include a cross pkg-config, and it sometimes
<em>almost</em> works correctly! Here’s what I get on Debian 13:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz
</code></pre></div></div>

<p>Note the architecture in the <code class="language-plaintext highlighter-rouge">-I</code> and <code class="language-plaintext highlighter-rouge">-L</code> options. It really is querying
the <a href="https://peter0x44.github.io/posts/cross-compilers/">cross sysroot</a>. However, these are the cross sysroot’s
own system directories, so pkg-config should not list them. That’s suboptimal and indicates
this pkg-config is probably misconfigured. In other cases it’s far from
correct:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...
</code></pre></div></div>

<p>A tool prefixed <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-</code> should not produce paths containing
<code class="language-plaintext highlighter-rouge">x86_64-linux-gnu</code> (the host architecture in this case). Our version won’t
have these issues.</p>
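<p>For comparison, here is a rough Python sketch of the filtering a correctly configured cross pkg-config performs. It drops flags that name the cross sysroot’s system directories unless the caller explicitly asks to keep them. The directory lists and <code class="language-plaintext highlighter-rouge">filter_system_flags</code> are hypothetical and only for illustration. Real implementations also normalize paths and handle more flag forms.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical system directories of the Debian Mingw-w64 cross sysroot.
SYSTEM_INCLUDE = ["/usr/x86_64-w64-mingw32/include"]
SYSTEM_LIBRARY = ["/usr/x86_64-w64-mingw32/lib"]

def filter_system_flags(flags, keep_cflags=False, keep_libs=False):
    """Drop -I/-L flags that point at system directories."""
    result = []
    for flag in flags:
        if flag.startswith("-I") and not keep_cflags and flag[2:] in SYSTEM_INCLUDE:
            continue  # system include dir: the compiler already searches it
        if flag.startswith("-L") and not keep_libs and flag[2:] in SYSTEM_LIBRARY:
            continue  # system library dir: the linker already searches it
        result.append(flag)
    return result

flags = ["-I/usr/x86_64-w64-mingw32/include",
         "-L/usr/x86_64-w64-mingw32/lib", "-lz"]
print(" ".join(filter_system_flags(flags)))                    # -lz
print(" ".join(filter_system_flags(flags, keep_cflags=True)))
# -I/usr/x86_64-w64-mingw32/include -lz
</code></pre></div></div>

<p>The second call corresponds roughly to passing <code class="language-plaintext highlighter-rouge">--keep-system-cflags</code>.</p>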

<p>The u-config platform interface is five functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// read whole files</span>
<span class="n">s8node</span> <span class="o">*</span><span class="nf">os_listing</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// list directories</span>
<span class="kt">void</span>    <span class="nf">os_write</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>          <span class="c1">// standard out/err</span>
<span class="kt">void</span>    <span class="nf">os_fail</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">);</span>                       <span class="c1">// non-zero exit</span>

<span class="kt">void</span> <span class="nf">uconfig</span><span class="p">(</span><span class="n">config</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Platforms implement the first four functions and call <code class="language-plaintext highlighter-rouge">uconfig()</code> with
the platform’s configuration, context pointer (<code class="language-plaintext highlighter-rouge">os *</code>), command line
arguments, environment, and some memory (all in the <code class="language-plaintext highlighter-rouge">config</code> object). My
strategy is to link two platforms into the binary, and the first challenge
is that they both define <code class="language-plaintext highlighter-rouge">os_write</code>, etc. I neither planned nor intended for one
binary to contain more than one platform layer. Unity builds offer a fix
without changing a single line of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include</span> <span class="cpf">"main_windows.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span>
<span class="cp">#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include</span> <span class="cpf">"main_linux_amd64.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span></code></pre></div></div>

<p>This dirty but effective trick <a href="/blog/2025/02/05/">may look familiar</a>. It also doesn’t
interfere with the other builds. Next I define the real platform functions
as a dispatch based on our run-time situation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b32</span> <span class="n">wine_detected</span><span class="p">;</span>

<span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">linux_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">win32_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I were serious about keeping this experiment, I’d lift <code class="language-plaintext highlighter-rouge">os</code> as I did
the functions (as <code class="language-plaintext highlighter-rouge">win32_os</code>, <code class="language-plaintext highlighter-rouge">linux_os</code>) and include <code class="language-plaintext highlighter-rouge">wine_detected</code> in
the context, eliminating this global variable. That cannot be done with
simple hacks and macros.</p>

<p>The next challenge is that I wrote the Linux platform layer assuming LP64,
and so it uses <code class="language-plaintext highlighter-rouge">long</code> instead of an equivalent platform-agnostic type like
<code class="language-plaintext highlighter-rouge">ptrdiff_t</code>. Windows is LLP64, where <code class="language-plaintext highlighter-rouge">long</code> is only 32 bits, so that
assumption breaks here. I never thought this would be an issue because this source
literally contains <code class="language-plaintext highlighter-rouge">asm</code> blocks and no conditional compilation, yet here
we are. Lesson learned. I wanted to try an extremely janky <code class="language-plaintext highlighter-rouge">#define</code> on
<code class="language-plaintext highlighter-rouge">long</code> to fix it, but this source file has a couple of <code class="language-plaintext highlighter-rouge">long long</code> uses that won’t
play along. These multi-token type names of C are antithetical to its
preprocessor! So I adjusted the source manually instead.</p>

<p>The Windows and Linux platform entry points are completely different, both
in name and form, and so co-exist naturally. The merged platform layer adds
a new entry point that passes control to the appropriate one:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">entrypoint</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>  <span class="c1">// Linux</span>
<span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">mainCRTStartup</span><span class="p">();</span>    <span class="c1">// Windows</span>
</code></pre></div></div>

<p>On Linux, <code class="language-plaintext highlighter-rouge">stack</code> is <a href="/blog/2025/03/06/">the initial value of the stack pointer</a>, which
<a href="https://articles.manugarg.com/aboutelfauxiliaryvectors">points to <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, <code class="language-plaintext highlighter-rouge">envp</code>, and <code class="language-plaintext highlighter-rouge">auxv</code></a>. We’ll need to construct
an artificial “stack” for the Linux platform layer to harvest. On Windows,
<code class="language-plaintext highlighter-rouge">mainCRTStartup</code> is <a href="/blog/2023/02/15/">the process entry point</a>, and it will find the rest on its
own as a normal Windows process. Ultimately this ended up simpler than I
expected:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">merge_entrypoint</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">wine_detected</span> <span class="o">=</span> <span class="n">running_on_wine</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">u8</span> <span class="o">*</span><span class="n">fakestack</span><span class="p">[</span><span class="n">CMDLINE_ARGV_MAX</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">c16</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="n">GetCommandLineW</span><span class="p">();</span>
        <span class="n">fakestack</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="p">)(</span><span class="n">iz</span><span class="p">)</span><span class="n">cmdline_to_argv8</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">fakestack</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="c1">// TODO: append envp to the fake stack</span>
        <span class="n">entrypoint</span><span class="p">((</span><span class="n">iz</span> <span class="o">*</span><span class="p">)</span><span class="n">fakestack</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">mainCRTStartup</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here <a href="/blog/2022/02/18/"><code class="language-plaintext highlighter-rouge">cmdline_to_argv8</code> is my Windows argument parser</a>, already
used by u-config, and I reserve one element at the front to store <code class="language-plaintext highlighter-rouge">argc</code>.
Since this is just a proof of concept, I didn’t bother fabricating and
pushing <code class="language-plaintext highlighter-rouge">envp</code> onto the fake stack. The Linux entry point doesn’t need
<code class="language-plaintext highlighter-rouge">auxv</code>, so it can be omitted. Once in the Linux entry point, it’s essentially
a Linux process from then on, except that the Windows x64 calling convention is still in
use internally.</p>
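<p>To make that layout concrete, here is a small Python model in which a list stands in for the array of machine words. The list holds <code class="language-plaintext highlighter-rouge">argc</code>, the <code class="language-plaintext highlighter-rouge">argv</code> pointers, a null terminator, the <code class="language-plaintext highlighter-rouge">envp</code> pointers, and another null terminator. <code class="language-plaintext highlighter-rouge">auxv</code> is omitted. The function names are hypothetical. In this model, even an empty environment still needs its terminator, because the code walking <code class="language-plaintext highlighter-rouge">envp</code> stops only at a null.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Model of the initial Linux process stack: argc, argv..., NULL, envp..., NULL.
# None plays the role of a NULL pointer; auxv is left out.

def build_fake_stack(argv, envp=()):
    """Lay out argc, argv, and envp the way the kernel would."""
    return [len(argv), *argv, None, *envp, None]

def harvest(stack):
    """Recover argc, argv, and envp as a startup routine would."""
    argc = stack[0]
    argv = stack[1:1 + argc]
    envp = []
    i = 1 + argc + 1          # skip past argv's NULL terminator
    while stack[i] is not None:
        envp.append(stack[i])
        i += 1
    return argc, argv, envp

stack = build_fake_stack(["pkg-config.exe", "--libs", "zlib"])
print(harvest(stack))   # (3, ['pkg-config.exe', '--libs', 'zlib'], [])
</code></pre></div></div>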

<p>Finally, I configure the Linux platform layer for Debian’s cross sysroot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "/usr/x86_64-w64-mingw32/lib"
</span></code></pre></div></div>

<p>And that’s it! We have our platform merge. Build (<a href="https://github.com/skeeto/w64devkit">w64devkit</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c
</code></pre></div></div>

<p>On Debian, use <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-gcc</code> for <code class="language-plaintext highlighter-rouge">cc</code>. The <code class="language-plaintext highlighter-rouge">-e</code> linker option
selects the new, higher-level entry point. After installing <a href="https://packages.debian.org/trixie/wine-binfmt">Wine
binfmt</a>, here’s how it looks on Debian:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs zlib
-lz
</code></pre></div></div>

<p>That’s the correct output, but is it using the cross sysroot? Ask it to
include the <code class="language-plaintext highlighter-rouge">-I</code> argument despite it being in the cross sysroot:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz
</code></pre></div></div>

<p>Looking good! It passes the <code class="language-plaintext highlighter-rouge">pc_path</code> test, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig
</code></pre></div></div>

<p>Running <em>this same binary</em> on Windows after installing zlib in w64devkit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz
</code></pre></div></div>

<p>Also:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig
</code></pre></div></div>

<p>My Frankenwine is a success!</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>WebAssembly as a Python extension platform</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/01/01/"/>
    <id>urn:uuid:91e7555d-950f-47c6-84b8-bee0070f61a9</id>
    <updated>2026-01-01T21:21:19Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>Software above some complexity level tends to sport an extension language,
becoming a kind of software platform itself. Lua fills this role well, and
of course there’s JavaScript for web technologies. <a href="/blog/2025/04/04/">WebAssembly</a>
generalizes this, and any Wasm-targeting programming language can extend a
Wasm-hosting application. It has more friction than supplying a script in
a text file, but extension authors can write in their language of choice,
and use more polished development tools (debugging, <a href="/blog/2025/02/05/">testing</a>, etc.)
than are usually available for a typical extension language. Python is
traditionally extended through native code behind a C interface, but it’s
recently become practical to extend Python with Wasm. That is, we can ship
an architecture-independent Wasm blob inside a Python library, and use it
without requiring a native toolchain on the host system. Let’s discuss two
different use cases and their pitfalls.</p>

<p>Normally we’d extend Python in order to access an external interface that
Python cannot access on its own. Wasm runs in a sandbox with no access to
the outside world whatsoever, so it obviously isn’t useful for that case.
Extensions may also grant Python more speed, which is one of Wasm’s main
selling points. We can also use Wasm to access <em>embeddable capabilities</em>
written in a different programming language which do not require external
access.</p>

<p>My preferred non-WASI Wasm runtime is Volodymyr Shymanskyy’s <a href="https://github.com/wasm3/wasm3">wasm3</a>.
It’s plain old C and very friendly to embedding in the same way as, say,
SQLite. Performance is middling, though a C program running on wasm3 is
still quite a bit faster than an equivalent Python program. It has Python
bindings, <a href="https://github.com/wasm3/pywasm3">pywasm3</a>, but it’s distributed only in source code form. That
is, the host machine must have a C toolchain in order to use pywasm3,
which defeats my purposes here. If there’s a C toolchain, I might as well
just use that instead of going through Wasm.</p>

<p>For the use cases in this article, the best option is <a href="https://github.com/bytecodealliance/wasmtime-py">wasmtime-py</a>. The
distribution includes binaries for Windows, macOS, and Linux on x86-64 and
ARM64, which covers nearly all Python installations. Hosts require nothing
more than a Python interpreter, no native toolchains. It’s almost as good
as having Wasm built into Python itself. In my tests it’s 3x–10x faster
than wasm3, so for my first use case the situation is even better. The
catch is that it currently weighs ~18MiB (installed), and in the future
will likely rival the Python interpreter itself. The API also breaks on a
monthly basis, so you’re signing up for the upgrade treadmill lest your
own program perish from bitrot after a couple of years. This article is
about version 40.</p>

<h3 id="usage-examples-and-gotchas">Usage examples and gotchas</h3>

<p>The <a href="https://github.com/bytecodealliance/wasmtime-py/tree/main/examples">official examples</a> don’t do anything non-trivial or interesting,
and so to figure things out I had to study <a href="https://bytecodealliance.github.io/wasmtime-py/">the documentation</a>,
which does not offer many hints. Basic setup looks like this:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">functools</span>
<span class="kn">import</span> <span class="nn">wasmtime</span>

<span class="n">store</span>    <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Store</span><span class="p">()</span>
<span class="n">module</span>   <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Module</span><span class="p">.</span><span class="n">from_file</span><span class="p">(</span><span class="n">store</span><span class="p">.</span><span class="n">engine</span><span class="p">,</span> <span class="s">"example.wasm"</span><span class="p">)</span>
<span class="n">instance</span> <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Instance</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">module</span><span class="p">,</span> <span class="p">())</span>
<span class="n">exports</span>  <span class="o">=</span> <span class="n">instance</span><span class="p">.</span><span class="n">exports</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>

<span class="n">memory</span> <span class="o">=</span> <span class="n">exports</span><span class="p">[</span><span class="s">"memory"</span><span class="p">].</span><span class="n">get_buffer_ptr</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>
<span class="n">func1</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"func1"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
<span class="n">func2</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"func2"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
<span class="n">func3</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"func3"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
</code></pre></div></div>

<p>A store is an allocation region from which we allocate all Wasm objects.
It is not possible to free individual objects except to discard the whole
store. Quite sensible, honestly. What’s <em>not</em> sensible is how often I have
to repeat myself, passing the store back into every object in order to use
it. These objects are associated with exactly one store and cannot be used
with different stores. <a href="https://docs.wasmtime.dev/api/wasmtime/struct.Store.html#cross-store-usage-of-items">Use the wrong store and it panics</a>: It’s
already keeping track internally! I do not understand why the interface
works this way. So to make things simpler, I use <code class="language-plaintext highlighter-rouge">functools.partial</code> to
bind the <code class="language-plaintext highlighter-rouge">store</code> parameter and so get the interface I expect.</p>

<p>The object returned by <code class="language-plaintext highlighter-rouge">get_buffer_ptr</code> is a buffer protocol object, and if you’re
moving anything other than bytes, that’s probably what you want to use to
access memory. The usual caveats apply for this object: If you <a href="/blog/2025/04/19/">change the
memory size</a>, you probably want to grab a fresh buffer object. For
bytes (e.g. buffers and strings) I prefer the <code class="language-plaintext highlighter-rouge">read</code> and <code class="language-plaintext highlighter-rouge">write</code> methods.</p>

<p>Because <a href="https://github.com/WebAssembly/multi-value/blob/master/proposals/multi-value/Overview.md">multi-value</a> is still in an experimental state in the Wasm
ecosystem, you will likely not pass structs by value across the Wasm boundary.
Anything more complicated than scalars will require pointers and copying data in and out
of Wasm linear memory. This involves the usual trap that catches nearly
everyone: Wasm interfaces make no distinction between pointers and
integers, and Wasm runtimes generally interpret all integers as
signed. What that means is <strong>your pointers are signed unless you take
action</strong>. Addresses start at 0, so any wasm32 address at or above 2 GiB
(0x80000000) comes back as a negative number. This is bad, bad news.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">malloc</span> <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"malloc"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>

<span class="n">hello</span> <span class="o">=</span> <span class="sa">b</span><span class="s">"hello"</span>
<span class="n">pointer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">hello</span><span class="p">))</span>
<span class="k">assert</span> <span class="n">pointer</span>
<span class="n">exports</span><span class="p">[</span><span class="s">"memory"</span><span class="p">].</span><span class="n">write</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">hello</span><span class="p">,</span> <span class="n">pointer</span><span class="p">)</span>  <span class="c1"># WRONG!
</span></code></pre></div></div>

<p>To make matters worse, wasmtime-py adds its own footgun: The <code class="language-plaintext highlighter-rouge">read</code> and
<code class="language-plaintext highlighter-rouge">write</code> methods adopt the questionable Python convention of negative
indices acting from the end. If <code class="language-plaintext highlighter-rouge">malloc</code> returns a pointer in the upper
half of memory, the negative pointer will pass the bounds check inside
<code class="language-plaintext highlighter-rouge">write</code> because negative is valid, then quietly store to the wrong
address! Doh!</p>
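<p>The effect is easy to reproduce in plain Python, without wasmtime at all. In this sketch, a 16-byte <code class="language-plaintext highlighter-rouge">bytearray</code> stands in for linear memory (an assumption for illustration). The address <code class="language-plaintext highlighter-rouge">0xfffffffc</code>, which lies far outside that memory, arrives as a signed 32-bit integer:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>memory = bytearray(16)   # stand-in for Wasm linear memory

# 0xfffffffc returned through a signed i32 interface becomes -4.
raw = 0xfffffffc - 2**32
print(raw)               # -4

# No IndexError: Python treats -4 as "4 from the end", so the bytes land at
# offsets 12 through 15 even though the real address is out of bounds.
for i, byte in enumerate(b"oops"):
    memory[raw + i] = byte
print(memory.index(b"oops"))   # 12
</code></pre></div></div>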

<p>I wondered how common this error is, so I searched online. I could find only
one non-trivial wasmtime-py use in the wild, in a sandboxed PDF reader. It
falls into the negative pointer trap as I expected. Not only that, it’s <a href="https://github.com/paulocoutinhox/pdfium-lib/blob/139d5037/modules/wasm.py#L601-L606">a
buffer overflow into Python’s memory space</a>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            <span class="n">buf_ptr</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pdf_data</span><span class="p">))</span>
            <span class="n">mem_data</span> <span class="o">=</span> <span class="n">memory</span><span class="p">.</span><span class="n">data_ptr</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>

            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">byte</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">pdf_data</span><span class="p">):</span>
                <span class="n">mem_data</span><span class="p">[</span><span class="n">buf_ptr</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">byte</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">data_ptr</code> method returns a non-bounds-checked raw <code class="language-plaintext highlighter-rouge">ctypes</code> pointer,
so this is actually a double mistake. First, it shouldn’t trust pointers
coming out of Wasm if it cares at all about sandboxing. Second, there’s the
potential negative pointer, which in this case would write outside of the
Wasm memory and into Python’s memory, hopefully seg-faulting.</p>

<p>What’s one to do? <strong>Every pointer coming out of Wasm must be truncated</strong>
with a mask:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pointer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(...)</span> <span class="o">&amp;</span> <span class="mh">0xffffffff</span>   <span class="c1"># correct for wasm32!
</span></code></pre></div></div>

<p>This interprets the result as unsigned. 64-bit Wasm needs a 64-bit mask,
though in practice you will never get a valid negative pointer from 64-bit
Wasm. This rule applies to JavaScript as well, where the idiom is:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">pointer</span> <span class="o">=</span> <span class="nx">malloc</span><span class="p">(...)</span> <span class="o">&gt;&gt;&gt;</span> <span class="mi">0</span>
</code></pre></div></div>

<p>Wasm runtimes cannot help — they lack the necessary information — and this
is perhaps a fundamental flaw in Wasm’s design. Once you know about it you
see this mistake happening everywhere.</p>
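<p>Combining both rules gives a small host-side helper. The helper masks the pointer first. It then checks the range against the memory’s actual size, because the guest should never be trusted. The names <code class="language-plaintext highlighter-rouge">to_unsigned32</code> and <code class="language-plaintext highlighter-rouge">checked_write</code> are hypothetical. A <code class="language-plaintext highlighter-rouge">bytearray</code> stands in for the <code class="language-plaintext highlighter-rouge">get_buffer_ptr</code> buffer, and wasmtime-py’s own <code class="language-plaintext highlighter-rouge">write</code> is not involved. This is a sketch, not a drop-in implementation:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def to_unsigned32(value):
    """Reinterpret a signed i32 result as an unsigned wasm32 address."""
    return value &amp; 0xffffffff

def checked_write(memory, raw_pointer, data):
    """Copy bytes into a writable buffer, rejecting out-of-range addresses."""
    address = to_unsigned32(raw_pointer)
    view = memoryview(memory).cast("B")   # flat view of unsigned bytes
    if address + len(data) &gt; len(view):
        raise IndexError("address out of bounds")
    view[address:address + len(data)] = data
    return address

memory = bytearray(16)   # stand-in for the get_buffer_ptr buffer
print(checked_write(memory, 4, b"hi"))   # 4
try:
    checked_write(memory, -4, b"oops")    # i.e. 0xfffffffc
except IndexError as e:
    print(e)                              # address out of bounds
</code></pre></div></div>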

<p>Now that you have a proper address, you can apply it to a buffer protocol
view of memory. If you’re using NumPy, there are various ways to interact
with this memory by wrapping it in NumPy types, though only if you’re on a
little-endian host. (If you’re on a big-endian machine, just give up on
running Wasm anyway.) The first use case I have in mind typically involves
copying plain Python values in and out. The <a href="https://docs.python.org/3/library/struct.html"><code class="language-plaintext highlighter-rouge">struct</code> package</a> is
quite handy here:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vec2</span>   <span class="o">=</span> <span class="n">malloc</span><span class="p">(...)</span> <span class="o">&amp;</span> <span class="mh">0xffffffff</span>
<span class="n">memory</span> <span class="o">=</span> <span class="n">exports</span><span class="p">[</span><span class="s">"memory"</span><span class="p">].</span><span class="n">get_buffer_ptr</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>
<span class="n">struct</span><span class="p">.</span><span class="n">pack_into</span><span class="p">(</span><span class="s">"&lt;ii"</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">vec2</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>

<p>It fills a similar role to <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/DataView">JavaScript <code class="language-plaintext highlighter-rouge">DataView</code></a>. If you’re copying
lots of numbers, with CPython it’s faster to construct a custom format
string rather than use a loop:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nums</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="p">...</span>
<span class="n">struct</span><span class="p">.</span><span class="n">pack_into</span><span class="p">(</span><span class="sa">f</span><span class="s">"&lt;</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">nums</span><span class="p">)</span><span class="si">}</span><span class="s">i"</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="o">*</span><span class="n">nums</span><span class="p">)</span>
</code></pre></div></div>

<p>To copy structures back out, use <code class="language-plaintext highlighter-rouge">struct.unpack_from</code>. If you’re moving
strings, you’ll need to <code class="language-plaintext highlighter-rouge">.encode()</code> and <code class="language-plaintext highlighter-rouge">.decode()</code> to convert to and from
<code class="language-plaintext highlighter-rouge">bytes</code>, which are well-suited to <code class="language-plaintext highlighter-rouge">read</code> and <code class="language-plaintext highlighter-rouge">write</code>.</p>
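<p>Here is a minimal round trip using the same <code class="language-plaintext highlighter-rouge">struct</code> calls. A <code class="language-plaintext highlighter-rouge">bytearray</code> again stands in for the memory buffer, and the addresses are arbitrary. With wasmtime-py, they would be masked pointers from the guest. For strings, the byte length (not the character count) must travel alongside the pointer:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import struct

memory = bytearray(64)   # stand-in for get_buffer_ptr(store)

# A vec2 of two little-endian i32 values at a hypothetical address 8.
vec2 = 8
struct.pack_into("&lt;ii", memory, vec2, 3, -7)
x, y = struct.unpack_from("&lt;ii", memory, vec2)

# Many numbers with one computed format string, and back out again.
nums = [1, 2, 3, 4]
buf = 16
struct.pack_into(f"&lt;{len(nums)}i", memory, buf, *nums)
back = list(struct.unpack_from(f"&lt;{len(nums)}i", memory, buf))

# Strings travel as UTF-8 bytes: len(encoded) is 6 here, not len(text) == 5.
text = "héllo"
encoded = text.encode()
start = 40
memory[start:start + len(encoded)] = encoded
decoded = bytes(memory[start:start + len(encoded)]).decode()

print(x, y, back, decoded)   # 3 -7 [1, 2, 3, 4] héllo
</code></pre></div></div>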

<p>In practice with real Wasm programs you’re going to be interacting with
the “guest” allocator from the outside, to request memory into which you
copy inputs for a function. In my examples I’ve used <code class="language-plaintext highlighter-rouge">malloc</code> because it
requires no elaboration, but as usual <a href="/blog/2023/09/27/">a bump allocator</a> solves
this so much better, especially because it doesn’t require stuffing a
whole general-purpose allocator inside the Wasm program. Have one global
arena (no other threads will be sharing that Wasm instance), rapid-fire a
bunch of allocations as needed without any concern for memory management
in the “host”, call the function, which might allocate a result from that
arena, then reset the arena to clean up. In essence a stack for passing
values in and out.</p>
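<p>To illustrate the pattern, here is a minimal model of such an arena written in Python. In a real module, the arena would live inside the Wasm program, and the host would call exported functions such as a hypothetical <code class="language-plaintext highlighter-rouge">alloc</code> and <code class="language-plaintext highlighter-rouge">reset</code> through wasmtime. The class below only demonstrates the bookkeeping: bump, align, and reset in one step.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Arena:
    """Model of a guest-side bump allocator over a region of linear memory."""

    def __init__(self, base, size):
        self.base = base
        self.end = base + size
        self.top = base

    def alloc(self, size, align=8):
        # Round the top up to the (power-of-two) alignment, then bump it.
        start = (self.top + align - 1) &amp; ~(align - 1)
        if start + size &gt; self.end:
            raise MemoryError("arena exhausted")
        self.top = start + size
        return start

    def reset(self):
        # Freeing everything at once is just moving the top back.
        self.top = self.base

arena = Arena(base=1024, size=256)
a = arena.alloc(5)      # 1024
b = arena.alloc(8)      # 1032: rounded up past the 5-byte allocation
arena.reset()
c = arena.alloc(4)      # 1024 again after the reset
</code></pre></div></div>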

<h3 id="webassembly-as-faster-python">WebAssembly as faster Python</h3>

<p>Suppose we noticed a computational hot spot in our Python program in a
pure Python function (i.e. not calling out to an extension). Optimizing
this function would be wise. Based on my experiments, if I re-implement
that function in C, compile it to Wasm, then run that bit of Wasm in place
of the original function, I can expect around a 10x speed-up. In general,
C is more like 100x faster than Python, but the overhead of interfacing
with Wasm (copying data in and out, etc.) eats into that. It’s high, but
not so high as to be unprofitable. This improves further if I can change
the interface, e.g. require callers to use the buffer protocol.</p>

<p>Thanks to wasmtime-py, I could introduce this change without fussing
with cross-compilers to build distribution binaries, and without requiring
a toolchain on the target. The only cost is a hefty Python package. Might
be worth it.</p>

<p>My <a href="https://github.com/skeeto/scratch/tree/master/wasm-bench">main experimental benchmark</a> is a variation on <a href="/blog/2023/06/26/">my solution to
the “Two Sum” problem</a>, which I originally wrote for JavaScript, then
extended to pywasm3 and later wasmtime-py. It’s simple, just interesting
enough, and representative of the sort of Wasm drop-in I have in mind. It
has the same interface, but implements it with Wasm.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Original Pythonic interface
</span><span class="k">def</span> <span class="nf">twosum</span><span class="p">(</span><span class="n">nums</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">target</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="nb">int</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="p">...</span>

<span class="c1"># Stateful Wasm interface
</span><span class="k">class</span> <span class="nc">TwoSumWasm</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">store</span>    <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Store</span><span class="p">()</span>
        <span class="n">module</span>   <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Module</span><span class="p">.</span><span class="n">from_file</span><span class="p">(</span><span class="n">store</span><span class="p">.</span><span class="n">engine</span><span class="p">,</span> <span class="p">...)</span>
        <span class="n">instance</span> <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Instance</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">module</span><span class="p">,</span> <span class="p">())</span>
        <span class="p">...</span>

    <span class="k">def</span> <span class="nf">twosum</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">nums</span><span class="p">,</span> <span class="n">target</span><span class="p">):</span>
        <span class="c1"># ... use wasm instance ...
</span>        <span class="p">...</span>
</code></pre></div></div>
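<p>For comparison, here’s one conventional way a pure-Python function with the original signature might look: a one-pass hash map. This is a sketch assuming the classic problem statement (return the indices of two distinct elements summing to <code class="language-plaintext highlighter-rouge">target</code>), not necessarily the exact variation the benchmark uses:</p>

```python
from typing import Optional


def twosum(nums: list[int], target: int) -> Optional[tuple[int, int]]:
    # Remember the index where each value was seen. Before recording the
    # current value, check whether its complement appeared earlier.
    seen: dict[int, int] = {}
    for i, n in enumerate(nums):
        j = seen.get(target - n)
        if j is not None:
            return (j, i)
        seen[n] = i
    return None
```

<p>This is the kind of tight, allocation-heavy loop where interpreter overhead dominates, which is what makes it a reasonable candidate for a Wasm drop-in.</p>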

<p>There’s some state to the Wasm version, with the instance in tow. If you
hide that by making it global, you’ll need to synchronize your threads
around it. In a multi-threaded program, perhaps these would be lazily
constructed thread locals. I haven’t had to solve this yet.</p>
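<p>If it does come to that, <code class="language-plaintext highlighter-rouge">threading.local</code> handles lazy, per-thread construction. A minimal sketch, where <code class="language-plaintext highlighter-rouge">PerThread</code> is a hypothetical helper and the factory would be something like <code class="language-plaintext highlighter-rouge">TwoSumWasm</code> (plain <code class="language-plaintext highlighter-rouge">object</code> stands in below so it runs without wasmtime):</p>

```python
import threading


class PerThread:
    """Hypothetical helper: build one object per thread, on first use."""

    def __init__(self, factory):
        self._factory = factory          # e.g. TwoSumWasm in real use
        self._local = threading.local()  # separate attributes per thread

    def get(self):
        inst = getattr(self._local, "inst", None)
        if inst is None:
            inst = self._local.inst = self._factory()
        return inst


# Demo with object() as the factory, so it runs without wasmtime.
per = PerThread(object)
instances = []
lock = threading.Lock()


def worker():
    first, second = per.get(), per.get()
    with lock:
        instances.append((first, second))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

<p>Each thread pays the construction cost once, and no locking is needed around the instance itself since no two threads ever share one.</p>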

<p>However, the weakness of the wasmtime “store” really shows: Notice how
compilation and instantiation are bound together in one store? <del>I cannot
compile once and then create disposable instances on the fly</del>, e.g. as
required for each run of a WASI program. Every instance permanently
extends the compilation store. In practice we must wastefully re-compile
the Wasm program for each disposable instance. Despite appearances,
compilation and instantiation are not actually distinct steps, unlike in
JavaScript’s Wasm API. <code class="language-plaintext highlighter-rouge">wasmtime.Instance</code> accepts a store as its first
argument, <em>suggesting</em> use of a different store for instantiation. That
would solve this problem, but as of this writing it <em>must</em> be the same
store used to compile the module. <del>This is a fatal flaw for certain real
use cases, particularly WASI.</del></p>

<p><strong>Update</strong>: Wolfgang Meier points out the <code class="language-plaintext highlighter-rouge">serialize</code> and <code class="language-plaintext highlighter-rouge">deserialize</code>
methods, which detach a compiled module from its store, allowing for
independent instantiations. I tried it, and it’s a practical workaround.
Overhead is low: there’s no validation when deserializing. My benchmark
now does it for future reference, as I expect it to be my typical use
case.</p>

<h3 id="webassembly-as-embedded-capabilities">WebAssembly as embedded capabilities</h3>

<p>Loup Vaillant’s <a href="https://monocypher.org/">Monocypher</a> is a wonderful cryptography library.
Lean, efficient, and embedding-friendly, so much so it’s distributed in
amalgamated form. It requires no libc or runtime, so we can compile it
straight to Wasm with almost any Clang toolchain:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang --target=wasm32 -nostdlib -O2 -Wl,--no-entry -Wl,--export-all \
        -o monocypher.wasm monocypher.c
</code></pre></div></div>

<p>It’s not “Wasm-aware” so I need <code class="language-plaintext highlighter-rouge">--export-all</code> to expose the interface.
This is swell because, as a single translation unit, anything with external
linkage is the interface. Though remember what I said about interacting
with the guest allocator? This has no allocator, nor should it. It’s not
so usable in this form because we’d need to manage memory from the
outside. Doable, but it’s easy to improve by adding a couple more
functions, sticking to a single translation unit:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"monocypher.c"</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">char</span>  <span class="n">__heap_base</span><span class="p">[];</span>
<span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">heap_used</span><span class="p">;</span>
<span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">heap_high</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">bump_alloc</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">bump_reset</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">heap_used</span> <span class="o">-</span> <span class="n">__heap_base</span><span class="p">;</span>
    <span class="n">__builtin_memset</span><span class="p">(</span><span class="n">__heap_base</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>  <span class="c1">// wipe keys, etc.</span>
    <span class="n">heap_used</span> <span class="o">=</span> <span class="n">__heap_base</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’ve <a href="/blog/2025/04/19/">discussed <code class="language-plaintext highlighter-rouge">__heap_base</code> before</a>, which is part of the ABI.
We’ll push keys, inputs, etc. onto this “stack”, run our cryptography
routine, copy out the result, then reset the bump allocator, which wipes
out all sensitive data. Often <code class="language-plaintext highlighter-rouge">memset</code> is insufficient: typically it’s
zero-then-free, and compilers see the <a href="/blog/2025/09/30/">lifetime</a> about to end. But no
lifetime ends here, and stores to this “heap” memory are externally
observable as far as the abstract machine can tell. (Otherwise we couldn’t
reliably copy out our results!)</p>

<p>There’s a lot to this API, but I’m only going to look at <a href="https://monocypher.org/manual/aead">the AEAD
interface</a>. We “lock” up some data in an encrypted box and write any
unencrypted label we’d like on the outside. Later we can unlock the box,
which will only open for us if neither the contents of the box nor the
label were tampered with. That’s some solid API design:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">crypto_aead_lock</span><span class="p">(</span><span class="kt">uint8_t</span>       <span class="o">*</span><span class="n">cipher_text</span><span class="p">,</span>
                      <span class="kt">uint8_t</span>        <span class="n">mac</span>  <span class="p">[</span><span class="mi">16</span><span class="p">],</span>
                      <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">key</span>  <span class="p">[</span><span class="mi">32</span><span class="p">],</span>
                      <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">nonce</span><span class="p">[</span><span class="mi">24</span><span class="p">],</span>
                      <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">ad</span><span class="p">,</span>         <span class="kt">size_t</span> <span class="n">ad_size</span><span class="p">,</span>
                      <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">plain_text</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">text_size</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">crypto_aead_unlock</span><span class="p">(</span><span class="kt">uint8_t</span>       <span class="o">*</span><span class="n">plain_text</span><span class="p">,</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">mac</span>  <span class="p">[</span><span class="mi">16</span><span class="p">],</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">key</span>  <span class="p">[</span><span class="mi">32</span><span class="p">],</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">nonce</span><span class="p">[</span><span class="mi">24</span><span class="p">],</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">ad</span><span class="p">,</span>          <span class="kt">size_t</span> <span class="n">ad_size</span><span class="p">,</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">cipher_text</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">text_size</span><span class="p">);</span>
</code></pre></div></div>

<p>By compiling to Wasm we can access this functionality from Python almost
like it was pure Python, and interact with other systems using Monocypher.</p>

<p>Since Monocypher does not interact with the outside world on its own, it
relies on callers to use their system’s CSPRNG to create those nonces and
keys, which we’ll do using <a href="https://docs.python.org/3/library/secrets.html">the <code class="language-plaintext highlighter-rouge">secrets</code> built-in package</a>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Monocypher</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="p">...</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_read</span>   <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">memory</span><span class="p">.</span><span class="n">read</span><span class="p">,</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_write</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">memory</span><span class="p">.</span><span class="n">write</span><span class="p">,</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">__alloc</span> <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"bump_alloc"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_reset</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"bump_reset"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span>   <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"crypto_aead_lock"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_unlock</span> <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"crypto_aead_unlock"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_csprng</span> <span class="o">=</span> <span class="n">secrets</span><span class="p">.</span><span class="n">SystemRandom</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_alloc</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">__alloc</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffffff</span>

    <span class="k">def</span> <span class="nf">generate_key</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_csprng</span><span class="p">.</span><span class="n">randbytes</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">generate_nonce</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_csprng</span><span class="p">.</span><span class="n">randbytes</span><span class="p">(</span><span class="mi">24</span><span class="p">)</span>

    <span class="p">...</span>
</code></pre></div></div>
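<p>The <code class="language-plaintext highlighter-rouge">&amp; 0xffffffff</code> in <code class="language-plaintext highlighter-rouge">_alloc</code> is presumably there because a Wasm <code class="language-plaintext highlighter-rouge">i32</code> result arrives in Python as a signed integer, so an address at or above 2 GiB would come back negative. Masking reinterprets it as an unsigned 32-bit address. A standalone sketch of that conversion, with hypothetical function names:</p>

```python
def to_u32(x: int) -> int:
    # Keep the low 32 bits: the two's-complement value read as unsigned.
    return x & 0xffffffff


def to_u32_bytes(x: int) -> int:
    # Same result via a byte round trip, for comparison.
    return int.from_bytes(x.to_bytes(4, "little", signed=True), "little")
```

<p>For example, a signed <code class="language-plaintext highlighter-rouge">-16</code> maps to the address 4294967280, i.e. 16 bytes below the top of a full 4 GiB memory.</p>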

<p>With a solid foundation, all that follows comes easily. A <code class="language-plaintext highlighter-rouge">finally</code>
guarantees secrets are always removed from Wasm memory, and the rest is
just about copying bytes around:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">aead_lock</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">ad</span> <span class="o">=</span> <span class="sa">b</span><span class="s">""</span><span class="p">):</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">==</span> <span class="mi">32</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">macptr</span>   <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
            <span class="n">keyptr</span>   <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">nonceptr</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">24</span><span class="p">)</span>
            <span class="n">adptr</span>    <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ad</span><span class="p">))</span>
            <span class="n">textptr</span>  <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>

            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">keyptr</span><span class="p">)</span>
            <span class="n">nonce</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate_nonce</span><span class="p">()</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">nonce</span><span class="p">,</span> <span class="n">nonceptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">ad</span><span class="p">,</span>    <span class="n">adptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">text</span><span class="p">,</span>  <span class="n">textptr</span><span class="p">)</span>

            <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span><span class="p">(</span>
                <span class="n">textptr</span><span class="p">,</span>
                <span class="n">macptr</span><span class="p">,</span>
                <span class="n">keyptr</span><span class="p">,</span>
                <span class="n">nonceptr</span><span class="p">,</span>
                <span class="n">adptr</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ad</span><span class="p">),</span>
                <span class="n">textptr</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">),</span>
            <span class="p">)</span>
            <span class="k">return</span> <span class="p">(</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_read</span><span class="p">(</span><span class="n">macptr</span><span class="p">,</span> <span class="n">macptr</span><span class="o">+</span><span class="mi">16</span><span class="p">),</span>
                <span class="n">nonce</span><span class="p">,</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_read</span><span class="p">(</span><span class="n">textptr</span><span class="p">,</span> <span class="n">textptr</span><span class="o">+</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)),</span>
            <span class="p">)</span>
        <span class="k">finally</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_reset</span><span class="p">()</span>
</code></pre></div></div>
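<p>The <code class="language-plaintext highlighter-rouge">try</code>/<code class="language-plaintext highlighter-rouge">finally</code> pair could also be packaged as a context manager, so each method body reads as a single <code class="language-plaintext highlighter-rouge">with</code> block. This is an alternative sketch, not taken from the repository; <code class="language-plaintext highlighter-rouge">scratch</code> is a hypothetical name, and a lambda stands in for the bound <code class="language-plaintext highlighter-rouge">bump_reset</code> export:</p>

```python
import contextlib


@contextlib.contextmanager
def scratch(reset):
    """Hypothetical helper: run the body, then always call reset().

    In the Monocypher class, reset would be the bound bump_reset export.
    """
    try:
        yield
    finally:
        reset()


# Demo: reset runs on normal exit and also when the body raises.
calls = []
with scratch(lambda: calls.append("reset")):
    pass
try:
    with scratch(lambda: calls.append("reset")):
        raise ValueError("AEAD mismatch")
except ValueError:
    pass
```

<p>The guarantee is the same as before: secrets are wiped from Wasm memory whether the call returns normally or raises.</p>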

<p>And <code class="language-plaintext highlighter-rouge">aead_unlock</code> is basically the same in reverse, but raises an exception if the box
fails to unlock, perhaps due to tampering:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">aead_unlock</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">mac</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">nonce</span><span class="p">,</span> <span class="n">ad</span> <span class="o">=</span> <span class="sa">b</span><span class="s">""</span><span class="p">):</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">mac</span><span class="p">)</span> <span class="o">==</span> <span class="mi">16</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">==</span> <span class="mi">32</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">nonce</span><span class="p">)</span> <span class="o">==</span> <span class="mi">24</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">macptr</span>   <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
            <span class="n">keyptr</span>   <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">nonceptr</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">24</span><span class="p">)</span>
            <span class="n">adptr</span>    <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ad</span><span class="p">))</span>
            <span class="n">textptr</span>  <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>

            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">mac</span><span class="p">,</span> <span class="n">macptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">keyptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">nonce</span><span class="p">,</span> <span class="n">nonceptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">ad</span><span class="p">,</span> <span class="n">adptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">textptr</span><span class="p">)</span>

            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_unlock</span><span class="p">(</span>
                <span class="n">textptr</span><span class="p">,</span>
                <span class="n">macptr</span><span class="p">,</span>
                <span class="n">keyptr</span><span class="p">,</span>
                <span class="n">nonceptr</span><span class="p">,</span>
                <span class="n">adptr</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ad</span><span class="p">),</span>
                <span class="n">textptr</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">),</span>
            <span class="p">):</span>
                <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"AEAD mismatch"</span><span class="p">)</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_read</span><span class="p">(</span><span class="n">textptr</span><span class="p">,</span> <span class="n">textptr</span><span class="o">+</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>
        <span class="k">finally</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_reset</span><span class="p">()</span>
</code></pre></div></div>

<p>Usage:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mc</span> <span class="o">=</span> <span class="n">Monocypher</span><span class="p">()</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">generate_key</span><span class="p">()</span>
<span class="n">message</span> <span class="o">=</span> <span class="s">"Hello, world!"</span>
<span class="n">mac</span><span class="p">,</span> <span class="n">nonce</span><span class="p">,</span> <span class="n">encrypted</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">aead_lock</span><span class="p">(</span><span class="n">message</span><span class="p">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<p>Transmit <code class="language-plaintext highlighter-rouge">mac</code>, <code class="language-plaintext highlighter-rouge">nonce</code>, and <code class="language-plaintext highlighter-rouge">encrypted</code> to the other party (or your
future self), who already has the <code class="language-plaintext highlighter-rouge">key</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">decrypted</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">aead_unlock</span><span class="p">(</span><span class="n">encrypted</span><span class="p">,</span> <span class="n">mac</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">nonce</span><span class="p">)</span>
</code></pre></div></div>

<p>Find the <strong>complete source <a href="https://github.com/skeeto/scratch/tree/master/wasm-monocypher">in my scratch repository</a></strong>.</p>

<p>While I have a few reservations about wasmtime-py, it fascinates me how
well this all works. It’s been my hammer in search of a nail for some time
now.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>Freestyle linked lists tricks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/12/31/"/>
    <id>urn:uuid:355dfc03-0e7c-4bae-92fe-5b52174de325</id>
    <updated>2025-12-31T11:59:59Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>Linked lists are a basic data structure building block, with especially
flexible allocation behavior. They’re not just a useful starting point,
but sometimes a sound foundation for future growth. I’m going to start
with the beginner stuff, then <em>without disrupting the original linked
list</em>, enhance it with new capabilities.</p>

<h3 id="linked-list-basics">Linked list basics</h3>

<p>For the sake of an interesting example, I will demonstrate with the same
concept as <a href="/blog/2025/01/19/">last time I talked about data structures</a>: a collection
of key/value strings, in the form of environment variables. This time
in linked list form:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>

<span class="kt">uint64_t</span> <span class="nf">hash64</span><span class="p">(</span><span class="n">Str</span><span class="p">);</span>
<span class="n">bool</span>     <span class="nf">equals</span><span class="p">(</span><span class="n">Str</span><span class="p">,</span> <span class="n">Str</span><span class="p">);</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">Env</span> <span class="n">Env</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">Env</span> <span class="p">{</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="n">Str</span>  <span class="n">key</span><span class="p">;</span>
    <span class="n">Str</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>It will be sourced from some string, formatted like the output of the <code class="language-plaintext highlighter-rouge">env</code> program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Str</span> <span class="n">input</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span>
        <span class="s">"EDITOR=vim</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"HOME=/home/user</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"PATH=/bin:/usr/bin</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"SHELL=/bin/bash</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"TERM=xterm-256color</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"USER=user</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"SHELL=/bin/sh</span><span class="se">\n</span><span class="s">"</span>   <span class="c1">// &lt;- repeated entry</span>
    <span class="p">);</span>
</code></pre></div></div>

<p>And all the parser heavy lifting will be done by <a href="/blog/2025/03/02/">our ever-handy <code class="language-plaintext highlighter-rouge">cut</code>
function</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Str</span> <span class="n">tail</span><span class="p">;</span>
    <span class="n">Str</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Cut</span><span class="p">;</span>

<span class="n">Cut</span> <span class="nf">cut</span><span class="p">(</span><span class="n">Str</span><span class="p">,</span> <span class="kt">char</span><span class="p">);</span>
</code></pre></div></div>
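<p>The <code class="language-plaintext highlighter-rouge">cut</code> definition lives in the linked article. Here’s a minimal sketch that is consistent with the declaration above. It assumes <code class="language-plaintext highlighter-rouge">tail</code> comes back empty when the separator is absent, because an empty tail is what ends the <code class="language-plaintext highlighter-rouge">line.tail.len</code> loops below. Note the field order. With <code class="language-plaintext highlighter-rouge">tail</code> first, <code class="language-plaintext highlighter-rouge">Cut line = {s}</code> seeds the loop with the whole input as the remaining tail.</p>

```c
#include <stddef.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct {
    Str tail;  // first, so {s} initializes the tail
    Str head;
} Cut;

// Split s at the first c: head is everything before it, tail is
// everything after it. Without a c, head is all of s and tail is empty.
Cut cut(Str s, char c)
{
    Cut r = {0};
    ptrdiff_t i = 0;
    while (i < s.len && s.data[i] != c) {
        i++;
    }
    r.head = (Str){s.data, i};
    if (i < s.len) {
        r.tail = (Str){s.data + i + 1, s.len - i - 1};
    }
    return r;
}
```

<p>A trailing newline, as in the example input, also leaves an empty tail. The final line therefore doesn’t produce a spurious empty entry.</p>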

<p>The simplest way to build up a linked list is like a stack, pushing
objects onto the front. Start with a zero-initialized <code class="language-plaintext highlighter-rouge">head</code> pointer, point each new
node at it, then make that node the new <code class="language-plaintext highlighter-rouge">head</code> element:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Env</span> <span class="o">*</span><span class="nf">parse_reversed</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// 1</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Cut</span> <span class="n">line</span> <span class="o">=</span> <span class="p">{</span><span class="n">s</span><span class="p">};</span> <span class="n">line</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="n">line</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">line</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">);</span>
        <span class="n">Cut</span>  <span class="n">pair</span>  <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">line</span><span class="p">.</span><span class="n">head</span><span class="p">,</span> <span class="sc">'='</span><span class="p">);</span>
        <span class="n">Env</span> <span class="o">*</span><span class="n">env</span>   <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">Env</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">key</span>   <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">head</span><span class="p">;</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">value</span> <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">tail</span><span class="p">;</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">next</span>  <span class="o">=</span> <span class="n">head</span><span class="p">;</span>  <span class="c1">// 2</span>
        <span class="n">head</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span>  <span class="c1">// 3</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That’s it, a complete linked list implementation in three lines of code.
No big deal. Because of the bump allocator, nodes are packed in order in
memory, so the usual cache objections to linked lists do not apply. LIFO
semantics put the list in the reverse of source order. If we’re doing a
linear scan through the linked list, the last entry in the source wins,
which may be what you want:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">lookup_linear</span><span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="n">var</span><span class="p">;</span> <span class="n">var</span> <span class="o">=</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
<span class="p">}</span>

    <span class="c1">// ...</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span>  <span class="o">=</span> <span class="n">parse_reversed</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="n">Str</span> <span class="n">value</span> <span class="o">=</span> <span class="n">lookup_linear</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"SHELL"</span><span class="p">));</span>  <span class="c1">// &lt;- "/bin/sh"</span>
</code></pre></div></div>
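<p>This excerpt does not define <code class="language-plaintext highlighter-rouge">Arena</code> or the <code class="language-plaintext highlighter-rouge">new</code> macro. The sketch below assumes a hypothetical bump allocator that returns zeroed memory, which may differ from the article’s own. It bundles that allocator with <code class="language-plaintext highlighter-rouge">parse_reversed</code>, <code class="language-plaintext highlighter-rouge">lookup_linear</code>, and assumed versions of <code class="language-plaintext highlighter-rouge">cut</code> and <code class="language-plaintext highlighter-rouge">equals</code>, so it compiles on its own:</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *data; ptrdiff_t len; } Str;
typedef struct { Str tail; Str head; } Cut;
#define S(s) (Str){s, sizeof(s)-1}

// Hypothetical bump allocator returning zeroed memory
typedef struct { char *beg, *end; } Arena;
#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))

static void *alloc(Arena *a, ptrdiff_t n, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (n > (a->end - a->beg - pad)/size) abort();  // out of memory
    char *p = a->beg + pad;
    a->beg += pad + n*size;
    return memset(p, 0, (size_t)(n*size));  // zeroed, as the list code expects
}

// Assumed helpers: byte-wise equality and split at a separator
bool equals(Str a, Str b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, (size_t)a.len));
}

Cut cut(Str s, char c)
{
    Cut r = {0};
    ptrdiff_t i = 0;
    while (i < s.len && s.data[i] != c) i++;
    r.head = (Str){s.data, i};
    if (i < s.len) r.tail = (Str){s.data + i + 1, s.len - i - 1};
    return r;
}

typedef struct Env Env;
struct Env {
    Env *next;
    Str  key;
    Str  value;
};

Env *parse_reversed(Str s, Arena *a)
{
    Env *head = 0;
    for (Cut line = {s}; line.tail.len;) {
        line = cut(line.tail, '\n');
        Cut  pair  = cut(line.head, '=');
        Env *env   = new(a, 1, Env);
        env->key   = pair.head;
        env->value = pair.tail;
        env->next  = head;  // push onto the front
        head = env;
    }
    return head;
}

Str lookup_linear(Env *env, Str key)
{
    for (Env *var = env; var; var = var->next) {
        if (equals(key, var->key)) {
            return var->value;
        }
    }
    return (Str){0};  // null data: not found
}
```

<p>With this allocator, each node lands immediately after the one allocated before it. On the parsed list, <code class="language-plaintext highlighter-rouge">(char *)env - (char *)env-&gt;next</code> therefore equals <code class="language-plaintext highlighter-rouge">sizeof(Env)</code>. That adjacency is the packing that lets the list sidestep the usual cache objection.</p>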

<p>It’s just one more line of code to maintain the original order, using a
very simple double-pointer technique:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Env</span> <span class="o">*</span><span class="nf">parse_ordered</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Env</span>  <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// 1</span>
    <span class="n">Env</span> <span class="o">**</span><span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">head</span><span class="p">;</span>  <span class="c1">// 2</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Cut</span> <span class="n">line</span> <span class="o">=</span> <span class="p">{</span><span class="n">s</span><span class="p">};</span> <span class="n">line</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
        <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span>  <span class="c1">// 3</span>
        <span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>  <span class="c1">// 4</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>No branches or dummy nodes are necessary. A pointer to the last pointer
in the list works even for an empty list, where it points at <code class="language-plaintext highlighter-rouge">head</code>
itself. The <code class="language-plaintext highlighter-rouge">tail</code> pointer is unneeded once the list is complete.
This form has queue (FIFO) behavior.</p>
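<p>Restoring the elided loop body from <code class="language-plaintext highlighter-rouge">parse_reversed</code> gives a complete <code class="language-plaintext highlighter-rouge">parse_ordered</code>. This sketch relies on a hypothetical zeroing bump allocator and on assumed <code class="language-plaintext highlighter-rouge">cut</code> and <code class="language-plaintext highlighter-rouge">equals</code> helpers. Walking the result yields the entries in source order. An empty input yields a null list with no special case:</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *data; ptrdiff_t len; } Str;
typedef struct { Str tail; Str head; } Cut;
#define S(s) (Str){s, sizeof(s)-1}

// Hypothetical bump allocator returning zeroed memory
typedef struct { char *beg, *end; } Arena;
#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))

static void *alloc(Arena *a, ptrdiff_t n, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (n > (a->end - a->beg - pad)/size) abort();  // out of memory
    char *p = a->beg + pad;
    a->beg += pad + n*size;
    return memset(p, 0, (size_t)(n*size));
}

// Assumed helpers: byte-wise equality and split at a separator
bool equals(Str a, Str b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, (size_t)a.len));
}

Cut cut(Str s, char c)
{
    Cut r = {0};
    ptrdiff_t i = 0;
    while (i < s.len && s.data[i] != c) i++;
    r.head = (Str){s.data, i};
    if (i < s.len) r.tail = (Str){s.data + i + 1, s.len - i - 1};
    return r;
}

typedef struct Env Env;
struct Env {
    Env *next;
    Str  key;
    Str  value;
};

Env *parse_ordered(Str s, Arena *a)
{
    Env  *head = 0;
    Env **tail = &head;  // the link the next node goes into
    for (Cut line = {s}; line.tail.len;) {
        line = cut(line.tail, '\n');
        Cut  pair  = cut(line.head, '=');
        Env *env   = new(a, 1, Env);
        env->key   = pair.head;
        env->value = pair.tail;
        *tail = env;         // link the node in at the end
        tail  = &env->next;  // its next field becomes the new end
    }
    return head;  // still null if the loop never ran
}
```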

<h3 id="faster-look-up-with-a-tree">Faster look-up with a tree</h3>

<p>If you’re doing many look-ups, or if the list is long, those linear scans
to find items in the list are not ideal. We can introduce an intrusive
hash map, in the form of <a href="/blog/2023/09/30/">a hash trie</a>, by adding two more pointers
to the linked list:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">Env</span> <span class="n">Env</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">Env</span> <span class="p">{</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>  <span class="c1">// &lt;- hash map linkage</span>
    <span class="n">Str</span>  <span class="n">key</span><span class="p">;</span>
    <span class="n">Str</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>I’ve found it’s simplest to construct a node into the hash map, then link
it onto the list tail. That constructor looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Env</span> <span class="o">*</span><span class="nf">new_env</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Env</span> <span class="o">**</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">,</span> <span class="n">Str</span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">env</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">env</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">63</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">Env</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">env</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then we swap that into the <code class="language-plaintext highlighter-rouge">head</code>/<code class="language-plaintext highlighter-rouge">tail</code> version in place of the original
<code class="language-plaintext highlighter-rouge">new</code> macro call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Env</span> <span class="o">*</span><span class="nf">parse_mapped</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Env</span>  <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">Env</span> <span class="o">**</span><span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">head</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Cut</span> <span class="n">line</span> <span class="o">=</span> <span class="p">{</span><span class="n">s</span><span class="p">};</span> <span class="n">line</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
        <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">new_env</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">head</span><span class="p">,</span> <span class="n">pair</span><span class="p">.</span><span class="n">head</span><span class="p">,</span> <span class="n">pair</span><span class="p">.</span><span class="n">tail</span><span class="p">);</span>
        <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span>
        <span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is now a linked list and a hash map at the same time, built up piece
by piece without any resizing. We still have the original linked list, but
we can now search it in logarithmic time. The look-up function resembles the
constructor:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">lookup_logn</span><span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="n">env</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">env</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">63</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
<span class="p">}</span>
</code></pre></div></div>
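<p>Here is a self-contained sketch that assembles the hash trie version in one place. The arena, the FNV-1a <code class="language-plaintext highlighter-rouge">hash64</code>, <code class="language-plaintext highlighter-rouge">cut</code>, and <code class="language-plaintext highlighter-rouge">equals</code> are stand-ins, not the article’s definitions. Each trie level consumes the top bit of the remaining hash. All entries with the same key therefore follow the same path, and the first one inserted sits closest to the root:</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *data; ptrdiff_t len; } Str;
typedef struct { Str tail; Str head; } Cut;
#define S(s) (Str){s, sizeof(s)-1}

// Hypothetical bump allocator returning zeroed memory
typedef struct { char *beg, *end; } Arena;
#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))

static void *alloc(Arena *a, ptrdiff_t n, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (n > (a->end - a->beg - pad)/size) abort();  // out of memory
    char *p = a->beg + pad;
    a->beg += pad + n*size;
    return memset(p, 0, (size_t)(n*size));
}

// Assumed helpers: FNV-1a hash, byte-wise equality, split at a separator
uint64_t hash64(Str s)
{
    uint64_t h = 0xcbf29ce484222325;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= (unsigned char)s.data[i];
        h *= 0x100000001b3;
    }
    return h;
}

bool equals(Str a, Str b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, (size_t)a.len));
}

Cut cut(Str s, char c)
{
    Cut r = {0};
    ptrdiff_t i = 0;
    while (i < s.len && s.data[i] != c) i++;
    r.head = (Str){s.data, i};
    if (i < s.len) r.tail = (Str){s.data + i + 1, s.len - i - 1};
    return r;
}

typedef struct Env Env;
struct Env {
    Env *next;
    Env *child[2];  // hash trie linkage
    Str  key;
    Str  value;
};

// Descend one hash bit per level (top bit first) to an empty child
Env *new_env(Arena *a, Env **env, Str key, Str value)
{
    for (uint64_t h = hash64(key); *env; h <<= 1) {
        env = &(*env)->child[h>>63];
    }
    *env = new(a, 1, Env);
    (*env)->key   = key;
    (*env)->value = value;
    return *env;
}

Env *parse_mapped(Str s, Arena *a)
{
    Env  *head = 0;  // list head doubles as the trie root
    Env **tail = &head;
    for (Cut line = {s}; line.tail.len;) {
        line = cut(line.tail, '\n');
        Cut  pair = cut(line.head, '=');
        Env *env  = new_env(a, &head, pair.head, pair.tail);
        *tail = env;
        tail  = &env->next;
    }
    return head;
}

// Same walk as new_env, checking each node on the way down
Str lookup_logn(Env *env, Str key)
{
    for (uint64_t h = hash64(key); env; h <<= 1) {
        if (equals(key, env->key)) {
            return env->value;
        }
        env = env->child[h>>63];
    }
    return (Str){0};
}
```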

<p>Because of the FIFO semantics, it finds the first match in the source:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span>   <span class="o">=</span> <span class="n">parse_mapped</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="n">Str</span>  <span class="n">value</span> <span class="o">=</span> <span class="n">lookup_logn</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"SHELL"</span><span class="p">));</span>  <span class="c1">// &lt;- "/bin/bash"</span>
</code></pre></div></div>

<p>The other matches are also in the tree, and we can find those as well by
continuing traversal. That is, it’s already a multi-map. This particular
interface can’t pick up where it left off, but we can build one that does
using an iterator/cursor:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span><span class="p">;</span>
    <span class="n">Str</span>      <span class="n">key</span><span class="p">;</span>
    <span class="n">Env</span>     <span class="o">*</span><span class="n">env</span><span class="p">;</span>
<span class="p">}</span> <span class="n">EnvIter</span><span class="p">;</span>

<span class="n">EnvIter</span> <span class="nf">new_enviter</span><span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">EnvIter</span><span class="p">){</span><span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">),</span> <span class="n">key</span><span class="p">,</span> <span class="n">env</span><span class="p">};</span>
<span class="p">}</span>

<span class="n">Str</span> <span class="nf">enviter_next</span><span class="p">(</span><span class="n">EnvIter</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">env</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">Env</span> <span class="o">*</span><span class="n">cur</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">env</span><span class="p">;</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">env</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">63</span><span class="p">];</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">hash</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">cur</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">cur</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Update</strong>: Thanks to <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CSJ2PR12MB79208563F4485DCAA27D5776A2BAA@SJ2PR12MB7920.namprd12.prod.outlook.com%3E">Daniel Kareh for a correction</a>.</p>

<p>Then we can use a loop to visit every match in source order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">parse_mapped</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">EnvIter</span> <span class="n">it</span> <span class="o">=</span> <span class="n">new_enviter</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"SHELL"</span><span class="p">));;)</span> <span class="p">{</span>
        <span class="n">Str</span> <span class="n">value</span> <span class="o">=</span> <span class="n">enviter_next</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">value</span><span class="p">.</span><span class="n">data</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>
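<p>This sketch shows the multi-map behavior in isolation. It skips the parser, inserts three entries directly with <code class="language-plaintext highlighter-rouge">new_env</code>, and then drains an iterator. The <code class="language-plaintext highlighter-rouge">env_count</code> helper is hypothetical and exists only to show the iterator driving a loop. The arena, the FNV-1a <code class="language-plaintext highlighter-rouge">hash64</code>, and <code class="language-plaintext highlighter-rouge">equals</code> are stand-ins for the article’s definitions:</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *data; ptrdiff_t len; } Str;
#define S(s) (Str){s, sizeof(s)-1}

// Hypothetical bump allocator returning zeroed memory
typedef struct { char *beg, *end; } Arena;
#define new(a, n, t) (t *)alloc(a, n, sizeof(t), _Alignof(t))

static void *alloc(Arena *a, ptrdiff_t n, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (n > (a->end - a->beg - pad)/size) abort();  // out of memory
    char *p = a->beg + pad;
    a->beg += pad + n*size;
    return memset(p, 0, (size_t)(n*size));
}

// Assumed helpers: FNV-1a hash and byte-wise equality
uint64_t hash64(Str s)
{
    uint64_t h = 0xcbf29ce484222325;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= (unsigned char)s.data[i];
        h *= 0x100000001b3;
    }
    return h;
}

bool equals(Str a, Str b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, (size_t)a.len));
}

typedef struct Env Env;
struct Env {
    Env *next;
    Env *child[2];
    Str  key;
    Str  value;
};

Env *new_env(Arena *a, Env **env, Str key, Str value)
{
    for (uint64_t h = hash64(key); *env; h <<= 1) {
        env = &(*env)->child[h>>63];
    }
    *env = new(a, 1, Env);
    (*env)->key   = key;
    (*env)->value = value;
    return *env;
}

typedef struct {
    uint64_t hash;  // remaining hash bits; top bit picks the next child
    Str      key;
    Env     *env;   // next node to examine
} EnvIter;

EnvIter new_enviter(Env *env, Str key)
{
    return (EnvIter){hash64(key), key, env};
}

// Resume the walk where the previous call stopped; null data means done
Str enviter_next(EnvIter *it)
{
    while (it->env) {
        Env *cur = it->env;
        it->env = it->env->child[it->hash>>63];
        it->hash <<= 1;
        if (equals(it->key, cur->key)) {
            return cur->value;
        }
    }
    return (Str){0};
}

// Hypothetical helper: how many entries share this key
ptrdiff_t env_count(Env *env, Str key)
{
    ptrdiff_t n = 0;
    for (EnvIter it = new_enviter(env, key); enviter_next(&it).data;) {
        n++;
    }
    return n;
}
```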

<h3 id="faster-look-up-with-an-index-table">Faster look-up with an index table</h3>

<p>If the list is static once constructed, or if look-ups happen much more
frequently than the list grows, we can find list items even faster by
constructing an index table over the list: <a href="/blog/2022/08/08/">an MSI hash table</a>. This
table avoids redundancy by <em>sharing structure with the list</em>. Because it’s
a flat table, if we keep adding to the list then eventually we’ll need to
reconstruct a larger table when it becomes overloaded.</p>

<p>The table itself has a very simple structure, just an array and its size,
expressed as a power-of-two exponent:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Env</span> <span class="o">**</span><span class="n">slots</span><span class="p">;</span>
    <span class="kt">int</span>   <span class="n">exp</span><span class="p">;</span>
<span class="p">}</span> <span class="n">EnvTable</span><span class="p">;</span>
</code></pre></div></div>

<p>We don’t need the <code class="language-plaintext highlighter-rouge">child</code> pointers, so the linked list nodes are untouched.
That is, the table is not intrusive. In fact, we can build any number of
tables over a list, perhaps indexing different properties for different
sorts of queries. The idea is that we build the list first, then create
the table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">EnvTable</span> <span class="nf">new_table</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// Compute list length</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="n">var</span><span class="p">;</span> <span class="n">var</span> <span class="o">=</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">len</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Then compute an appropriate table size</span>
    <span class="n">EnvTable</span> <span class="n">table</span> <span class="o">=</span> <span class="p">{};</span>
    <span class="n">table</span><span class="p">.</span><span class="n">exp</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">one</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="p">(</span><span class="n">one</span><span class="o">&lt;&lt;</span><span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">one</span><span class="o">&lt;&lt;</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="o">-</span><span class="mi">3</span><span class="p">))</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="o">++</span><span class="p">)</span> <span class="p">{}</span>
    <span class="n">table</span><span class="p">.</span><span class="n">slots</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">one</span><span class="o">&lt;&lt;</span><span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">Env</span> <span class="o">*</span><span class="p">);</span>

    <span class="c1">// Then insert linked list items into the table</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="n">var</span><span class="p">;</span> <span class="n">var</span> <span class="o">=</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">var</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">);</span>
        <span class="kt">size_t</span>   <span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">size_t</span>   <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
                <span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">var</span><span class="p">;</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">table</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
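<p>The sizing loop picks the smallest exponent, starting from 3, for which <code class="language-plaintext highlighter-rouge">len</code> fits in 7/8 of the 2<sup>exp</sup> slots. This guarantees at least one empty slot. The step is odd and the table size is a power of two, so every probe sequence visits every slot and eventually reaches an empty one. Both the insertion loop and the look-up loop therefore terminate. Here the rule is pulled out into a standalone function with the hypothetical name <code class="language-plaintext highlighter-rouge">table_exp</code>, which makes it easy to check:</p>

```c
#include <stddef.h>

// Hypothetical extraction of new_table's sizing loop: the smallest
// exp >= 3 such that len <= 2^exp - 2^(exp-3), a 7/8 maximum load factor.
int table_exp(ptrdiff_t len)
{
    int exp = 3;
    ptrdiff_t one = 1;
    while ((one<<exp) - (one<<(exp-3)) < len) {
        exp++;
    }
    return exp;
}
```

<p>The seven-entry example environment, for instance, gets an 8-slot table (exp = 3) with one slot left empty.</p>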

<p>Note how it only searches for an empty slot, not for a matching entry.
That’s because this, too, is a multi-map, also with elements in insertion
order. Look-ups take expected constant time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">lookup_constant</span><span class="p">(</span><span class="n">EnvTable</span> <span class="n">table</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span>
    <span class="kt">size_t</span>   <span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span>   <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It finds the earliest match in the list, meaning an index over the
“reverse” list will find the last entry in the source. The indexed-over
property is the input to <code class="language-plaintext highlighter-rouge">hash64</code> and <code class="language-plaintext highlighter-rouge">equals</code>. By using a different input
to these functions we could build another table on, say, value length if
that’s a property on which we needed to find elements efficiently. Again,
for multi-map iteration we need some kind of iterator or cursor:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">EnvTable</span> <span class="n">table</span><span class="p">;</span>
    <span class="n">Str</span>      <span class="n">key</span><span class="p">;</span>
    <span class="kt">size_t</span>   <span class="n">step</span><span class="p">;</span>
    <span class="kt">size_t</span>   <span class="n">i</span><span class="p">;</span>
<span class="p">}</span> <span class="n">TableIter</span><span class="p">;</span>

<span class="n">TableIter</span> <span class="nf">new_tableiter</span><span class="p">(</span><span class="n">EnvTable</span> <span class="n">table</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span>
    <span class="kt">size_t</span>   <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span>   <span class="n">idx</span>  <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">hash</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">TableIter</span><span class="p">){</span><span class="n">table</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">step</span><span class="p">,</span> <span class="n">idx</span><span class="p">};</span>
<span class="p">}</span>

<span class="n">Str</span> <span class="nf">table_next</span><span class="p">(</span><span class="n">TableIter</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">mask</span>  <span class="o">=</span> <span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">Env</span>  <span class="o">**</span><span class="n">slots</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span> <span class="o">+</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">slots</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">slots</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span><span class="p">]</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">slots</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span><span class="p">]</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Its usage looks just like the other multi-map:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">parse_ordered</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="n">EnvTable</span> <span class="n">table</span> <span class="o">=</span> <span class="n">new_table</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">env</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">TableIter</span> <span class="n">it</span> <span class="o">=</span> <span class="n">new_tableiter</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"SHELL"</span><span class="p">));;)</span> <span class="p">{</span>
        <span class="n">Str</span> <span class="n">value</span> <span class="o">=</span> <span class="n">table_next</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">value</span><span class="p">.</span><span class="n">data</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>
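<p>To make the “another table on, say, value length” idea concrete, here’s a minimal, self-contained sketch. It assumes simplified stand-ins for the article’s types: a <code class="language-plaintext highlighter-rouge">Str</code> view and a singly linked <code class="language-plaintext highlighter-rouge">Env</code> list with <code class="language-plaintext highlighter-rouge">next</code>, <code class="language-plaintext highlighter-rouge">key</code>, and <code class="language-plaintext highlighter-rouge">value</code> fields. The names <code class="language-plaintext highlighter-rouge">LenTable</code>, <code class="language-plaintext highlighter-rouge">hash_len</code>, <code class="language-plaintext highlighter-rouge">new_lentable</code>, and <code class="language-plaintext highlighter-rouge">lookup_by_len</code> are hypothetical. Only the hash input and the equality test change. The probing is the same as in <code class="language-plaintext highlighter-rouge">lookup_constant</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins for the article's types.
typedef struct { char *data; ptrdiff_t len; } Str;
typedef struct Env Env;
struct Env { Env *next; Str key; Str value; };

// Secondary index over value length: same probing as lookup_constant,
// but hashing and comparing a different property of each entry.
typedef struct { Env **slots; int exp; } LenTable;

static uint64_t hash_len(ptrdiff_t len)
{
    uint64_t h = (uint64_t)len * 1111111111111111111u;  // multiplicative mix
    return h ^ (h >> 32);
}

// slots must be zeroed and hold 1<<exp entries, with room to spare so at
// least one slot always stays empty (otherwise probing never terminates).
static LenTable new_lentable(Env **slots, int exp, Env *head)
{
    assert(exp >= 1 && exp < (int)(sizeof(size_t) * 8));
    LenTable t    = {slots, exp};
    size_t   mask = ((size_t)1 << exp) - 1;
    for (Env *e = head; e; e = e->next) {
        uint64_t hash = hash_len(e->value.len);
        size_t   step = (size_t)(hash >> (64 - exp)) | 1;
        for (size_t i = (size_t)hash;;) {
            i = (i + step) & mask;
            if (!t.slots[i]) {  // multi-map: claim the first empty slot
                t.slots[i] = e;
                break;
            }
        }
    }
    return t;
}

// Earliest entry in list order whose value has length len, or null.
static Env *lookup_by_len(LenTable t, ptrdiff_t len)
{
    uint64_t hash = hash_len(len);
    size_t   mask = ((size_t)1 << t.exp) - 1;
    size_t   step = (size_t)(hash >> (64 - t.exp)) | 1;
    for (size_t i = (size_t)hash;;) {
        i = (i + step) & mask;
        if (!t.slots[i]) {
            return 0;
        } else if (t.slots[i]->value.len == len) {
            return t.slots[i];
        }
    }
}
```

<p>Given entries A, B, C (in list order) whose values have lengths 2, 2, and 3, <code class="language-plaintext highlighter-rouge">lookup_by_len(t, 2)</code> returns A, the earlier of the two. Equal lengths share a probe sequence, and A claimed its slot first.</p>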

<p>With these techniques at hand, I can start with linked lists when they are
convenient, and later add needed features without fundamentally changing
the underlying data structure. None of this requires runtime support, and
so it fits comfortably on embedded systems, tiny WebAssembly programs,
etc. All the above code is available ready to run: <a href="https://gist.github.com/skeeto/493823d5956dfdc1d95d8c390c2b0e1d"><code class="language-plaintext highlighter-rouge">list.c</code></a>.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>Unix "find" expressions compiled to bytecode</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/12/23/"/>
    <id>urn:uuid:bbe2671b-378d-40b1-9564-c3a3b798dfb4</id>
    <updated>2025-12-23T04:20:22Z</updated>
    <category term="c"/><category term="compsci"/>
    <content type="html">
      <![CDATA[<p>In preparation for a future project, I was thinking about the <a href="https://pubs.opengroup.org/onlinepubs/9799919799/utilities/find.html">unix
<code class="language-plaintext highlighter-rouge">find</code> utility</a>. It operates on file system hierarchies, with basic
operations selected and filtered using a specialized expression language.
Users compose operations using unary and binary operators, grouping with
parentheses for precedence. <code class="language-plaintext highlighter-rouge">find</code> may apply the expression to a great
many files, so compiling it into a bytecode, resolving as much as possible
ahead of time, and minimizing the per-element work, seems like a prudent
implementation strategy. With some thought, I worked out a technique to do
so, which was simpler than I expected, and I’m pleased with the results. I
was later surprised that all the real-world <code class="language-plaintext highlighter-rouge">find</code> implementations I examined
use <a href="https://craftinginterpreters.com/a-tree-walk-interpreter.html">tree-walk interpreters</a> instead. This article describes how my
compiler works, with a runnable example, and lists ideas for improvements.</p>

<p>For a quick overview, the syntax looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find [-H|-L] path... [expression...]
</code></pre></div></div>

<p>Technically at least one path is required, but most implementations imply
<code class="language-plaintext highlighter-rouge">.</code> when none are provided. If no expression is supplied, the default is
<code class="language-plaintext highlighter-rouge">-print</code>, i.e. print everything under each listed path. This prints the
whole tree, including directories, under the current directory:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find .
</code></pre></div></div>

<p>To only print files, we could use <code class="language-plaintext highlighter-rouge">-type f</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f -a -print
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">-a</code> is the logical AND binary operator. <code class="language-plaintext highlighter-rouge">-print</code> always evaluates
to true. It’s never necessary to write <code class="language-plaintext highlighter-rouge">-a</code>, since adjacent operations are
implicitly joined with it. We can keep chaining them, such as finding
all executable files:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f -executable -print
</code></pre></div></div>

<p>If no <code class="language-plaintext highlighter-rouge">-exec</code>, <code class="language-plaintext highlighter-rouge">-ok</code>, or <code class="language-plaintext highlighter-rouge">-print</code> (or similar side-effect extensions like
<code class="language-plaintext highlighter-rouge">-print0</code> or <code class="language-plaintext highlighter-rouge">-delete</code>) are present, the whole expression is wrapped in an
implicit <code class="language-plaintext highlighter-rouge">( expr ) -print</code>. So we could also write this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f -executable
</code></pre></div></div>

<p>Use <code class="language-plaintext highlighter-rouge">-o</code> for logical OR. To print all files with the executable bit <em>or</em>
with a <code class="language-plaintext highlighter-rouge">.exe</code> extension:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f \( -executable -o -name '*.exe' \)
</code></pre></div></div>

<p>I needed parentheses because <code class="language-plaintext highlighter-rouge">-o</code> has lower precedence than <code class="language-plaintext highlighter-rouge">-a</code>, and
because parentheses are shell metacharacters I also needed to escape them
for the shell. It’s a shame <code class="language-plaintext highlighter-rouge">find</code> didn’t use <code class="language-plaintext highlighter-rouge">[</code> and <code class="language-plaintext highlighter-rouge">]</code> instead! There’s
also a unary logical NOT operator, <code class="language-plaintext highlighter-rouge">!</code>. To print all non-executable files:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f ! -executable
</code></pre></div></div>

<p>Binary operators are short-circuiting, so this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find -type d -a -exec du -sh {} +
</code></pre></div></div>

<p>Only lists the sizes of directories: for anything else <code class="language-plaintext highlighter-rouge">-type d</code> fails,
causing the whole expression to evaluate to false without evaluating <code class="language-plaintext highlighter-rouge">-exec</code>. Or
equivalently with <code class="language-plaintext highlighter-rouge">-o</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find ! -type d -o -exec du -sh {} +
</code></pre></div></div>

<p>If it’s not a directory then the left-hand side evaluates to true, and the
right-hand side is not evaluated. All three implementations I examined
(GNU, BSD, BusyBox) have a <code class="language-plaintext highlighter-rouge">-regex</code> extension, and eagerly compile the
regular expression even if the operation is never evaluated:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -print -o -regex [
find: bad regex '[': Invalid regular expression
</code></pre></div></div>

<p>I was surprised by this because it doesn’t seem to be in the spirit of the
original utility (“The second expression shall not be evaluated if the
first expression is true.”), and I’m used to the idea of short-circuit
validation for the right-hand side of a logical expression. Recompiling
for each evaluation would be unwise, but it could happen lazily such that
an invalid regular expression only causes an error if it’s actually used.
No big deal, just a curiosity.</p>

<h3 id="bytecode-design">Bytecode design</h3>

<p>A bytecode interpreter needs to track just one result at a time, making it
a single register machine, with a 1-bit register at that. I came up with
these five opcodes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>halt
not
braf   LABEL
brat   LABEL
action NAME [ARGS...]
</code></pre></div></div>

<p>Obviously <code class="language-plaintext highlighter-rouge">halt</code> stops the program. While I could just let it “run off the
end”, it’s useful to have an actual instruction so that I can attach a
label and jump to it. The <code class="language-plaintext highlighter-rouge">not</code> opcode negates the register. <code class="language-plaintext highlighter-rouge">braf</code> is
“branch if false”, jumping to the target instruction if the register is
false; the target is a relative immediate, shown as a label in printed
form. <code class="language-plaintext highlighter-rouge">brat</code> is “branch if
true”. Together they implement the <code class="language-plaintext highlighter-rouge">-a</code> and <code class="language-plaintext highlighter-rouge">-o</code> operators. In practice
there are no loops and jumps are always forward: <code class="language-plaintext highlighter-rouge">find</code> is <a href="/blog/2016/04/30/">not Turing
complete</a>.</p>
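<p>To show how little an interpreter for this instruction set has to do, here’s a minimal sketch of an evaluation loop for the five opcodes. The encoding is an assumption for illustration, not necessarily what <code class="language-plaintext highlighter-rouge">findc</code> uses: each instruction is an opcode plus one integer, branch offsets are relative to the instruction following the branch, and an action’s operand is an identifier handed to a caller-supplied callback. <code class="language-plaintext highlighter-rouge">Insn</code>, <code class="language-plaintext highlighter-rouge">run</code>, and the demo <code class="language-plaintext highlighter-rouge">Trace</code> callback are hypothetical names.</p>

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical encoding: an opcode plus one integer operand. For branches
// the operand is an offset from the next instruction; for actions it is
// an identifier passed to the caller's callback.
typedef enum { OP_HALT, OP_NOT, OP_BRAF, OP_BRAT, OP_ACTION } Op;
typedef struct { Op op; int arg; } Insn;
typedef bool (*Action)(int id, void *ctx);

// Evaluate the program for one file, returning the final register value.
static bool run(const Insn *code, Action action, void *ctx)
{
    bool reg = false;  // the single 1-bit register
    for (int pc = 0;;) {
        Insn in = code[pc++];
        switch (in.op) {
        case OP_HALT:   return reg;
        case OP_NOT:    reg = !reg;                break;
        case OP_BRAF:   if (!reg) pc += in.arg;    break;
        case OP_BRAT:   if (reg)  pc += in.arg;    break;
        case OP_ACTION: reg = action(in.arg, ctx); break;
        default:        assert(!"bad opcode");     return false;
        }
    }
}

// Demo callback: canned results per action, plus a log of what ran.
typedef struct { const bool *result; int ran[16]; int nran; } Trace;

static bool trace_action(int id, void *ctx)
{
    Trace *t = ctx;
    if (t->nran < 16) t->ran[t->nran++] = id;
    return t->result[id];
}
```

<p>Under this encoding, the <code class="language-plaintext highlighter-rouge">-type f -executable</code> program from the next section becomes <code class="language-plaintext highlighter-rouge">{OP_ACTION,0}, {OP_BRAF,1}, {OP_ACTION,1}, {OP_BRAF,1}, {OP_ACTION,2}, {OP_HALT,0}</code>. For a path that isn’t a file, only action 0 runs before the two branches carry it to <code class="language-plaintext highlighter-rouge">halt</code>.</p>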

<p>In a real implementation each possible action (<code class="language-plaintext highlighter-rouge">-name</code>, <code class="language-plaintext highlighter-rouge">-ok</code>, <code class="language-plaintext highlighter-rouge">-print</code>,
<code class="language-plaintext highlighter-rouge">-type</code>, etc.) would get a dedicated opcode. This requires implementing
each operator, at least in part, in order to correctly parse the whole
<code class="language-plaintext highlighter-rouge">find</code> expression. For now I’m just focused on the bytecode compiler, so
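the generic <code class="language-plaintext highlighter-rouge">action</code> opcode is a stand-in that pretends to handle any action, going only by how its tokens look. Each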
action sets the register, and actions like <code class="language-plaintext highlighter-rouge">-print</code> always set it to true.
My compiler is <a href="https://github.com/skeeto/scratch/blob/c142e729/parsers/findc.c">called <strong><code class="language-plaintext highlighter-rouge">findc</code> (“find compiler”)</strong></a>.</p>

<p><strong>Update</strong>: Or try <a href="https://nullprogram.com/scratch/findc/">the <strong>online demo</strong></a> via Wasm! This version
includes a <a href="https://github.com/skeeto/scratch/commit/2c0a4b8f">peephole optimizer</a> I wrote after publishing this
article.</p>

<p>I assume readers of this program’s source are familiar with my <a href="/blog/2025/01/19/"><code class="language-plaintext highlighter-rouge">push</code> macro</a>
and <a href="/blog/2025/06/26/"><code class="language-plaintext highlighter-rouge">Slice</code> macro</a>. Because of the latter, it requires a very
recent C compiler, like GCC 15 (e.g. via <a href="https://github.com/skeeto/w64devkit">w64devkit</a>) or Clang 22. Try
out some <code class="language-plaintext highlighter-rouge">find</code> commands and see how they appear as bytecode. The simplest
case is also optimal:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc
// path: .
        action  -print
        halt
</code></pre></div></div>

<p>Print the path then halt. Simple. Stepping it up:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc -type f -executable
// path: .
        action  -type f
        braf    L1
        action  -executable
L1:     braf    L2
        action  -print
L2:     halt
</code></pre></div></div>

<p>If the path is not a file, it skips over the rest of the program by way of
the second branch instruction. It’s correct, but already we can see room
for improvement. This would be better:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        action  -type f
        braf    L1
        action  -executable
        braf    L1
        action  -print
L1:     halt
</code></pre></div></div>

<p>More complex still:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc -type f \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L1
        action  -name *.exe
L1:     braf    L2
        action  -print
L2:     halt
</code></pre></div></div>

<p>Inside the parentheses, if <code class="language-plaintext highlighter-rouge">-executable</code> succeeds, the right-hand side is
skipped. However, the <code class="language-plaintext highlighter-rouge">brat</code> jumps straight to a <code class="language-plaintext highlighter-rouge">braf</code> that can never
be taken, since the register is known to be true. It would be better to
jump ahead one more instruction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        action  -type f
        braf    L2
        action  -executable
        brat    L1
        action  -name *.exe
        braf    L2
L1:     action  -print
L2:     halt
</code></pre></div></div>

<p>Silly things aren’t optimized either:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc ! ! -executable
// path: .
        action  -executable
        not
        not
        braf    L1
        action  -print
L1:     halt
</code></pre></div></div>

<p>Two <code class="language-plaintext highlighter-rouge">not</code> in a row cancel out, and so these instructions could be
eliminated. Overall this compiler could benefit from a <a href="https://en.wikipedia.org/wiki/Peephole_optimization">peephole
optimizer</a>, scanning over the program repeatedly, making small
improvements until no more can be made:</p>

<ul>
  <li>Delete <code class="language-plaintext highlighter-rouge">not</code>-<code class="language-plaintext highlighter-rouge">not</code>.</li>
  <li>A <code class="language-plaintext highlighter-rouge">brat</code> to a <code class="language-plaintext highlighter-rouge">braf</code> re-targets ahead one instruction, and vice versa.</li>
  <li>Jumping onto an identical jump adopts its target for itself.</li>
  <li>A <code class="language-plaintext highlighter-rouge">not</code>-<code class="language-plaintext highlighter-rouge">braf</code> might convert to a <code class="language-plaintext highlighter-rouge">brat</code>, and vice versa.</li>
  <li>Delete side-effect-free instructions before <code class="language-plaintext highlighter-rouge">halt</code> (e.g. <code class="language-plaintext highlighter-rouge">not</code>-<code class="language-plaintext highlighter-rouge">halt</code>).</li>
  <li>Exploit always-true actions, e.g. <code class="language-plaintext highlighter-rouge">-print</code>-<code class="language-plaintext highlighter-rouge">braf</code> can drop the branch.</li>
</ul>
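<p>As a taste of what one of these rules involves, here’s a sketch of the first one, deleting <code class="language-plaintext highlighter-rouge">not</code>-<code class="language-plaintext highlighter-rouge">not</code>. It assumes a hypothetical richer representation where branch operands are absolute instruction indices, so a deletion only needs a renumbering pass. One subtlety: the pair is only removable if nothing branches to the second <code class="language-plaintext highlighter-rouge">not</code>, since that path sees a single negation. A branch to the first <code class="language-plaintext highlighter-rouge">not</code> is fine. In this representation it keeps its index, which afterward names whatever followed the pair.</p>

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical optimizer form: branch operands are absolute indices.
typedef enum { OP_HALT, OP_NOT, OP_BRAF, OP_BRAT, OP_ACTION } Op;
typedef struct { Op op; int arg; } Insn;

static bool is_branch(Op op)
{
    return op == OP_BRAF || op == OP_BRAT;
}

// Does any branch in code[0..len) target index?
static bool is_target(const Insn *code, int len, int index)
{
    for (int i = 0; i < len; i++) {
        if (is_branch(code[i].op) && code[i].arg == index) return true;
    }
    return false;
}

// Remove code[at] and code[at+1], then renumber branch targets. Targets
// past the pair move down by two; a target of `at` keeps its index, which
// now names the instruction that followed the pair. The caller ensures
// nothing targets at+1.
static int delete_pair(Insn *code, int len, int at)
{
    assert(at >= 0 && at + 1 < len);
    for (int i = at; i + 2 < len; i++) {
        code[i] = code[i + 2];
    }
    len -= 2;
    for (int i = 0; i < len; i++) {
        if (is_branch(code[i].op) && code[i].arg > at + 1) code[i].arg -= 2;
    }
    return len;
}

// Peephole rule: "not; not" is a no-op unless a branch lands between the
// two instructions. Rescan after each deletion until nothing changes.
static int drop_double_not(Insn *code, int len)
{
    for (int i = 0; i + 1 < len;) {
        if (code[i].op == OP_NOT && code[i + 1].op == OP_NOT &&
            !is_target(code, len, i + 1)) {
            len = delete_pair(code, len, i);
            i = 0;  // deletions can expose new pairs
        } else {
            i++;
        }
    }
    return len;
}
```

<p>Applied to the <code class="language-plaintext highlighter-rouge">! ! -executable</code> program above, it leaves <code class="language-plaintext highlighter-rouge">action</code>, <code class="language-plaintext highlighter-rouge">braf</code>, <code class="language-plaintext highlighter-rouge">action</code>, <code class="language-plaintext highlighter-rouge">halt</code>, with the <code class="language-plaintext highlighter-rouge">braf</code> retargeted from index 5 to 3.</p>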

<p>Writing a bunch of peephole pattern matchers sounds kind of fun. Though my
compiler would first need a slightly richer representation in order to
detect and fix up changes to branches. One more for the road:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc -type f ! \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L2
        action  -name *.exe
L2:     not
L1:     braf    L3
        action  -print
L3:     halt
</code></pre></div></div>

<p>The suboptimal jumps hint at my compiler’s structure. If you’re feeling up
for a challenge, pause here to consider how you’d build this compiler, and
how it might produce these particular artifacts.</p>

<h3 id="parsing-and-compiling">Parsing and compiling</h3>

<p>Before I even considered the shape of the bytecode I knew I needed to
convert <code class="language-plaintext highlighter-rouge">find</code>’s infix expressions into compiler-friendly postfix. That is, this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-type f -a ! ( -executable -o -name *.exe )
</code></pre></div></div>

<p>Becomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-type f -executable -name *.exe -o ! -a
</code></pre></div></div>

<p>Which, importantly, erases the parentheses. This comes in as an <code class="language-plaintext highlighter-rouge">argv</code>
array, so it’s already tokenized for us by the shell <a href="/blog/2022/02/18/">or runtime</a>. The
classic <a href="https://en.wikipedia.org/wiki/Shunting_yard_algorithm">shunting-yard algorithm</a> solves this problem easily enough.
We have an output queue that goes into the compiler, and a token stack for
tracking <code class="language-plaintext highlighter-rouge">-a</code>, <code class="language-plaintext highlighter-rouge">-o</code>, <code class="language-plaintext highlighter-rouge">!</code>, and <code class="language-plaintext highlighter-rouge">(</code>. Then we walk <code class="language-plaintext highlighter-rouge">argv</code> in order:</p>

<ul>
  <li>
    <p>Actions go straight into the output queue.</p>
  </li>
  <li>
    <p>If we see one of the special stack tokens we push it onto the stack,
first popping operators with greater precedence into the queue, stopping
at <code class="language-plaintext highlighter-rouge">(</code>.</p>
  </li>
  <li>
    <p>If we see <code class="language-plaintext highlighter-rouge">)</code> we pop the stack into the output queue until we see <code class="language-plaintext highlighter-rouge">(</code>.</p>
  </li>
</ul>

<p>When we’re out of tokens, pop the remaining stack into the queue. My
parser synthesizes <code class="language-plaintext highlighter-rouge">-a</code> where it’s implied, so the compiler always sees
logical AND. If the expression contains no <code class="language-plaintext highlighter-rouge">-exec</code>, <code class="language-plaintext highlighter-rouge">-ok</code>, or <code class="language-plaintext highlighter-rouge">-print</code>,
after processing is complete the parser puts <code class="language-plaintext highlighter-rouge">-print</code> then <code class="language-plaintext highlighter-rouge">-a</code> into the
queue, which effectively wraps the whole expression in <code class="language-plaintext highlighter-rouge">( expr ) -print</code>.
By clearing the stack first, the real expression is effectively wrapped in
parentheses, so no parenthesis tokens need to be synthesized.</p>

<p>I’ve used the shunting-yard algorithm many times before, so this part was
easy. The new part was coming up with an algorithm to convert a series of
postfix tokens into bytecode. My solution is the compiler <strong>maintains a
stack of bytecode fragments</strong>. That is, each stack element is a sequence
of one or more bytecode instructions. Branches use relative addresses, so
they’re position-independent, and I can concatenate code fragments without
any branch fix-ups. It takes the following actions from queue tokens:</p>

<ul>
  <li>
    <p>For an action token, create an <code class="language-plaintext highlighter-rouge">action</code> instruction, and push it onto
the fragment stack as a new fragment.</p>
  </li>
  <li>
    <p>For a <code class="language-plaintext highlighter-rouge">!</code> token, pop the top fragment, append a <code class="language-plaintext highlighter-rouge">not</code> instruction, and
push it back onto the stack.</p>
  </li>
  <li>
    <p>For a <code class="language-plaintext highlighter-rouge">-a</code> token, pop the top two fragments, join them with a <code class="language-plaintext highlighter-rouge">braf</code> in
the middle which jumps just beyond the second fragment. That is, if the
first fragment evaluates to false, skip over the second fragment into
whatever follows.</p>
  </li>
  <li>
    <p>For a <code class="language-plaintext highlighter-rouge">-o</code> token, just like <code class="language-plaintext highlighter-rouge">-a</code> but use <code class="language-plaintext highlighter-rouge">brat</code>. If the first fragment
is true, we skip over the second fragment.</p>
  </li>
</ul>

<p>If the expression is valid, at the end of this process the stack contains
exactly one fragment. Append a <code class="language-plaintext highlighter-rouge">halt</code> instruction to this fragment, and
that’s our program! If the final fragment contained a branch just beyond
its end, this <code class="language-plaintext highlighter-rouge">halt</code> is that branch target. With a few peephole
optimizations, it could probably be an optimal program for this instruction set.</p>
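<p>The fragment stack can be sketched compactly. This version is an illustration under assumptions, not necessarily how <code class="language-plaintext highlighter-rouge">findc</code> does it. It reuses a hypothetical one-integer instruction encoding: branch offsets are relative to the following instruction, and an action’s operand is its token index. Rather than keeping each fragment in its own buffer, it lays the fragments out back to back in the output. That works because the two fragments being joined are always the last two. Joining shifts the second fragment up one slot to make room for the branch, and its internal relative branches are unaffected by the move.</p>

```c
#include <assert.h>
#include <string.h>

// Hypothetical encoding: opcode plus one integer. Branch offsets are
// relative to the instruction after the branch; an action's operand is
// the index of its token.
typedef enum { OP_HALT, OP_NOT, OP_BRAF, OP_BRAT, OP_ACTION } Op;
typedef struct { Op op; int arg; } Insn;

enum { MAX_DEPTH = 64 };

// Compile postfix tokens into code, which needs room for n + 1
// instructions. Fragments sit back to back in code, so the fragment
// stack only records where each one begins.
static int compile(const char **tok, int n, Insn *code)
{
    int start[MAX_DEPTH];
    int depth = 0, len = 0;
    for (int i = 0; i < n; i++) {
        const char *t = tok[i];
        if (!strcmp(t, "!")) {
            assert(depth >= 1);
            code[len++] = (Insn){OP_NOT, 0};  // append to the top fragment
        } else if (!strcmp(t, "-a") || !strcmp(t, "-o")) {
            assert(depth >= 2);
            int second = start[--depth];  // pop; the first stays on top
            int size   = len - second;
            // Open a one-slot gap before the second fragment. Its relative
            // branches still line up after the move.
            memmove(code + second + 1, code + second, (size_t)size * sizeof(*code));
            Op op = strcmp(t, "-a") ? OP_BRAT : OP_BRAF;
            code[second] = (Insn){op, size};  // skip over the second fragment
            len++;
        } else {
            assert(depth < MAX_DEPTH);
            start[depth++] = len;  // new single-action fragment
            code[len++] = (Insn){OP_ACTION, i};
        }
    }
    assert(depth == 1);  // a valid expression leaves exactly one fragment
    code[len++] = (Insn){OP_HALT, 0};
    return len;
}
```

<p>For the postfix <code class="language-plaintext highlighter-rouge">-type f -executable -name *.exe -o ! -a -print -a</code>, this sketch emits the same nine instructions as the last <code class="language-plaintext highlighter-rouge">findc</code> example in the bytecode section, including the <code class="language-plaintext highlighter-rouge">brat</code> onto the <code class="language-plaintext highlighter-rouge">not</code> and the <code class="language-plaintext highlighter-rouge">braf</code> onto another <code class="language-plaintext highlighter-rouge">braf</code>. Those artifacts fall directly out of the fragment joins.</p>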

]]>
    </content>
  </entry>
  
  <entry>
    <title>Closures as Win32 window procedures</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/12/12/"/>
    <id>urn:uuid:7bf46ec6-a8b2-4ffa-857a-86c040357702</id>
    <updated>2025-12-12T19:52:10Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Back in 2017 I wrote <a href="/blog/2017/01/08/">about a technique for creating closures in C</a>
using a <a href="/blog/2015/03/19/">JIT-compiled</a> wrapper. It’s neat, though rarely necessary in
real programs, so I don’t think about it often. I applied it to <code class="language-plaintext highlighter-rouge">qsort</code>,
which <a href="/blog/2023/02/11/">sadly</a> accepts no context pointer. More practical would be
working around <a href="/blog/2023/12/17/">insufficient custom allocator interfaces</a>, to
create allocation functions at run-time bound to a particular allocation
region. I’ve learned a lot since I last wrote about this subject, and <a href="https://lowkpro.com/blog/creating-c-closures-from-lua-closures.html">a
recent article</a> had me thinking about it again, and how I could do
better than before. In this article I will enhance Win32 window procedure
callbacks with a fifth argument, allowing us to more directly pass extra
context. I’m using <a href="https://github.com/skeeto/w64devkit">w64devkit</a> on x64, but everything here should
work out-of-the-box with any x64 toolchain that speaks GNU assembly.</p>

<p>A <a href="https://learn.microsoft.com/en-us/windows/win32/api/winuser/nc-winuser-wndproc">window procedure</a> has this prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LRESULT</span> <span class="nf">Wndproc</span><span class="p">(</span>
  <span class="n">HWND</span> <span class="n">hWnd</span><span class="p">,</span>
  <span class="n">UINT</span> <span class="n">Msg</span><span class="p">,</span>
  <span class="n">WPARAM</span> <span class="n">wParam</span><span class="p">,</span>
  <span class="n">LPARAM</span> <span class="n">lParam</span>
<span class="p">);</span>
</code></pre></div></div>

<p>To create a window we must first register a class with <code class="language-plaintext highlighter-rouge">RegisterClass</code>,
which accepts a set of properties describing a window class, including a
pointer to one of these functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="p">...;</span>

    <span class="n">RegisterClassA</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">WNDCLASSA</span><span class="p">){</span>
        <span class="c1">// ...</span>
        <span class="p">.</span><span class="n">lpfnWndProc</span>   <span class="o">=</span> <span class="n">my_wndproc</span><span class="p">,</span>
        <span class="p">.</span><span class="n">lpszClassName</span> <span class="o">=</span> <span class="s">"my_class"</span><span class="p">,</span>
        <span class="c1">// ...</span>
    <span class="p">});</span>

    <span class="n">HWND</span> <span class="n">hwnd</span> <span class="o">=</span> <span class="n">CreateWindowExA</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s">"my_class"</span><span class="p">,</span> <span class="p">...,</span> <span class="n">state</span><span class="p">);</span>
</code></pre></div></div>

<p>The thread drives a message pump with events from the operating system,
dispatching them to this procedure, which then manipulates the program
state in response:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="n">MSG</span> <span class="n">msg</span><span class="p">;</span> <span class="n">GetMessageW</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;)</span> <span class="p">{</span>
        <span class="n">TranslateMessage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">);</span>
        <span class="n">DispatchMessageW</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">);</span>  <span class="c1">// calls the window procedure</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>All four <code class="language-plaintext highlighter-rouge">WNDPROC</code> parameters are determined by Win32. There is no context
pointer argument. So how does this procedure access the program state? We
generally have two options:</p>

<ol>
  <li>Global variables. Yucky but easy. Frequently seen in tutorials.</li>
  <li>A <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code> pointer attached to the window.</li>
</ol>

<p>The second option takes some setup. When the window is created, Win32
passes the last <code class="language-plaintext highlighter-rouge">CreateWindowEx</code> argument to the window procedure with
the <code class="language-plaintext highlighter-rouge">WM_CREATE</code> message, indirectly through a <code class="language-plaintext highlighter-rouge">CREATESTRUCT</code>. The
procedure then attaches the pointer to its window as <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code>. So
ultimately it looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">case</span> <span class="n">WM_CREATE</span><span class="p">:</span> <span class="p">{</span>
        <span class="n">CREATESTRUCT</span> <span class="o">*</span><span class="n">cs</span> <span class="o">=</span> <span class="p">(</span><span class="n">CREATESTRUCT</span> <span class="o">*</span><span class="p">)</span><span class="n">lParam</span><span class="p">;</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="n">cs</span><span class="o">-&gt;</span><span class="n">lpCreateParams</span><span class="p">;</span>
        <span class="n">SetWindowLongPtr</span><span class="p">(</span><span class="n">hwnd</span><span class="p">,</span> <span class="n">GWLP_USERDATA</span><span class="p">,</span> <span class="p">(</span><span class="n">LONG_PTR</span><span class="p">)</span><span class="n">arg</span><span class="p">);</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>In future messages we can retrieve it with <code class="language-plaintext highlighter-rouge">GetWindowLongPtr</code>. Every time
I go through this I wish there were a better way. What if there were a fifth
window procedure parameter through which we could pass a context?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef LRESULT Wndproc5(HWND, UINT, WPARAM, LPARAM, void *);
</code></pre></div></div>

<p>We’ll build exactly this with a trampoline. The <a href="https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention">x64 calling convention</a>
passes the first four arguments in registers, and the rest, including this
new parameter, go on the stack. Our trampoline cannot just stuff the extra
parameter into a register, but will actually have to build a stack frame.
Slightly more complicated, but barely so.</p>
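<p>The frame size follows from a few Win64 rules: the stack is 16-byte
aligned at each <code class="language-plaintext highlighter-rouge">call</code>, the call itself pushes an 8-byte return address,
and the caller reserves 32 bytes of shadow space ahead of any stack
arguments. Here’s that bookkeeping as a small C sketch. The helper names
are hypothetical, and nothing here is needed at run time; it only checks
the arithmetic.</p>

```c
#include <assert.h>

// Win64 rules assumed here: rsp is 16-byte aligned at every call
// instruction, call pushes an 8-byte return address, and the caller
// reserves 32 bytes of shadow space for the four register arguments.
enum { SHADOW = 32, RETADDR = 8, ALIGN = 16 };

// Smallest frame holding the shadow space plus `nstack` 8-byte stack
// arguments that also restores 16-byte alignment before the next call.
static int frame_size(int nstack)
{
    int size = SHADOW + 8*nstack;
    // On entry rsp % 16 == 8 (the return address was just pushed), so
    // RETADDR + size must be a multiple of 16 for rsp to be aligned.
    while ((RETADDR + size) % ALIGN) {
        size += 8;
    }
    return size;
}

// Offset from rsp, at the call site, of stack argument n (n >= 5).
static int stack_arg_offset(int n)
{
    return SHADOW + 8*(n - 5);
}
```

<p>With one stack argument this gives a 40-byte frame with the fifth
argument at <code class="language-plaintext highlighter-rouge">32(%rsp)</code>, the numbers the trampoline’s machine code uses.</p>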

<h3 id="allocating-executable-memory">Allocating executable memory</h3>

<p>In previous articles, and in the programs where I’ve applied techniques
like this, I’ve allocated executable memory with <code class="language-plaintext highlighter-rouge">VirtualAlloc</code> (or <code class="language-plaintext highlighter-rouge">mmap</code>
elsewhere). This introduces a small challenge for solving the problem
generally: Allocations may be arbitrarily far from our code and data, out
of reach of relative addressing. If they’re more than 2G apart, we need
to encode absolute addresses, and the simple approach is to assume they’re
always too far apart.</p>
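<p>The reach test itself is just arithmetic. An x64 <code class="language-plaintext highlighter-rouge">rel32</code> operand is a
signed 32-bit displacement measured from the end of the instruction, so a
target is reachable only if the difference fits in an <code class="language-plaintext highlighter-rouge">int32_t</code>. A
minimal sketch, with a hypothetical <code class="language-plaintext highlighter-rouge">rel32_reaches</code> helper taking
addresses as integers, and assuming the usual two’s complement conversion
of every mainstream x64 compiler:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// True if a rel32 operand in an instruction ending at `next` can reach
// `target`. Addresses are plain integers, so no pointers are formed.
static bool rel32_reaches(uint64_t next, uint64_t target)
{
    // Unsigned subtraction wraps; converting back to signed yields the
    // displacement the CPU would add to rip (two's complement assumed).
    int64_t disp = (int64_t)(target - next);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

<p>For example, with hypothetical addresses, code in an image loaded near
<code class="language-plaintext highlighter-rouge">0x140000000</code> cannot reach a block allocated up near <code class="language-plaintext highlighter-rouge">0x7ff600000000</code>,
which is why such allocations fall back to absolute addresses.</p>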

<p>These days I have more experience with executable formats and allocation,
and I immediately see a better solution: Request a block of writable,
executable memory from the loader, then allocate our trampolines from it.
Other than being executable, this memory isn’t special, and <a href="/blog/2025/01/19/">allocation
works the usual way</a>, using functions unaware it’s executable. By
allocating through the loader, this memory will be part of our loaded
image, guaranteed to be close to our other code and data, allowing our JIT
compiler to assume <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models#small-code-model">a small code model</a>.</p>

<p>There are a number of ways to do this. Here’s one for GNU-style toolchains
targeting COFF:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="nf">.section</span> <span class="nv">.exebuf</span><span class="p">,</span><span class="s">"bwx"</span>
        <span class="nf">.globl</span> <span class="nv">exebuf</span>
<span class="nl">exebuf:</span>	<span class="nf">.space</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">21</span>
</code></pre></div></div>

<p>This assembly program defines a new section named <code class="language-plaintext highlighter-rouge">.exebuf</code> containing 2M
of writable (<code class="language-plaintext highlighter-rouge">"w"</code>), executable (<code class="language-plaintext highlighter-rouge">"x"</code>) memory, zero-filled by the loader
at load time just like <code class="language-plaintext highlighter-rouge">.bss</code> (<code class="language-plaintext highlighter-rouge">"b"</code>). We’ll treat this like an arena out of which we can
allocate all the trampolines we’ll probably ever need. With careful use of
<code class="language-plaintext highlighter-rouge">.pushsection</code> this could be basic inline assembly, but I’ve left it as a
separate source file. On the C side I retrieve it like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="n">Arena</span> <span class="nf">get_exebuf</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">extern</span> <span class="kt">char</span> <span class="n">exebuf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">21</span><span class="p">];</span>
    <span class="n">Arena</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="n">exebuf</span><span class="p">,</span> <span class="n">exebuf</span><span class="o">+</span><span class="k">sizeof</span><span class="p">(</span><span class="n">exebuf</span><span class="p">)};</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately I have to repeat myself on the size. There are different
ways to deal with this, but this is simple enough for now. I would have
loved to define the array in C with the GCC <a href="https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variable-Attributes.html"><code class="language-plaintext highlighter-rouge">section</code> attribute</a>,
but as is usually the case with this attribute, it’s not up to the task,
lacking the ability to set section flags. Besides, by not relying on the
attribute, any C compiler could compile this source, and we only need a
GNU-style toolchain to create the tiny COFF object containing <code class="language-plaintext highlighter-rouge">exebuf</code>.</p>

<p>While we’re at it, a reminder of some other basic definitions we’ll need:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define S(s)            (Str){s, sizeof(s)-1}
#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>

<span class="n">Str</span> <span class="nf">clone</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="kt">char</span><span class="p">);</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These have been discussed at length in previous articles.</p>
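<p>The <code class="language-plaintext highlighter-rouge">new</code> macro relies on an <code class="language-plaintext highlighter-rouge">alloc</code> function not repeated here. For
readers who haven’t seen those articles, here’s a minimal sketch of a
compatible bump allocator, assuming zero-initialized allocations and
<code class="language-plaintext highlighter-rouge">abort()</code> as the out-of-memory policy. Details may differ from the
version in the earlier articles.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

// Bump-allocate `count` zeroed objects of `size` bytes with power-of-two
// alignment `align`. Out-of-memory policy in this sketch: abort().
void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    assert(count >= 0 && size > 0 && align > 0 && !(align & (align - 1)));
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);  // bytes to align
    ptrdiff_t available = a->end - a->beg - pad;
    if (available < 0 || count > available/size) {
        abort();
    }
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}
```

<p>Nothing in it cares whether the memory is executable, so the same
function carves trampolines out of <code class="language-plaintext highlighter-rouge">exebuf</code>.</p>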

<h3 id="trampoline-compiler">Trampoline compiler</h3>

<p>From here the plan is to create a function that accepts a <code class="language-plaintext highlighter-rouge">Wndproc5</code> and a
context pointer to bind, and returns a classic <code class="language-plaintext highlighter-rouge">WNDPROC</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WNDPROC</span> <span class="nf">make_wndproc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">Wndproc5</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>Our window procedure now gets a fifth argument with the program state:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LRESULT</span> <span class="nf">my_wndproc</span><span class="p">(</span><span class="n">HWND</span><span class="p">,</span> <span class="n">UINT</span><span class="p">,</span> <span class="n">WPARAM</span><span class="p">,</span> <span class="n">LPARAM</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When registering the class we wrap it in a trampoline compatible with
<code class="language-plaintext highlighter-rouge">RegisterClass</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">RegisterClassA</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">WNDCLASSA</span><span class="p">){</span>
        <span class="c1">// ...</span>
        <span class="p">.</span><span class="n">lpfnWndProc</span>   <span class="o">=</span> <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">),</span>
        <span class="p">.</span><span class="n">lpszClassName</span> <span class="o">=</span> <span class="s">"my_class"</span><span class="p">,</span>
        <span class="c1">// ...</span>
    <span class="p">});</span>
</code></pre></div></div>

<p>All windows using this class will readily have access to this state object
through their fifth parameter. It turns out setting up <code class="language-plaintext highlighter-rouge">exebuf</code> was the
more complicated part, and <code class="language-plaintext highlighter-rouge">make_wndproc</code> is quite simple!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WNDPROC</span> <span class="nf">make_wndproc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Wndproc5</span> <span class="n">proc</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">thunk</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span>
        <span class="s">"</span><span class="se">\x48\x83\xec\x28</span><span class="s">"</span>      <span class="c1">// sub   $40, %rsp</span>
        <span class="s">"</span><span class="se">\x48\xb8</span><span class="s">........"</span>      <span class="c1">// movq  $arg, %rax</span>
        <span class="s">"</span><span class="se">\x48\x89\x44\x24\x20</span><span class="s">"</span>  <span class="c1">// mov   %rax, 32(%rsp)</span>
        <span class="s">"</span><span class="se">\xe8</span><span class="s">...."</span>              <span class="c1">// call  proc</span>
        <span class="s">"</span><span class="se">\x48\x83\xc4\x28</span><span class="s">"</span>      <span class="c1">// add   $40, %rsp</span>
        <span class="s">"</span><span class="se">\xc3</span><span class="s">"</span>                  <span class="c1">// ret</span>
    <span class="p">);</span>
    <span class="n">Str</span> <span class="n">r</span>   <span class="o">=</span> <span class="n">clone</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">thunk</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">proc</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="mi">24</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span> <span class="mi">6</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">arg</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="mi">20</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rel</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">rel</span><span class="p">));</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">WNDPROC</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The assembly allocates a new 40-byte stack frame: 32 bytes of shadow space
for the callee plus 8 bytes for the new argument, which also happens to
restore 16-byte stack alignment. It stores the new argument for the
<code class="language-plaintext highlighter-rouge">Wndproc5</code> just above the shadow space, then calls into the <code class="language-plaintext highlighter-rouge">Wndproc5</code>
without touching the other four parameters, which are still in their
registers. There are two “patches” to fill out, which I’ve initially
filled with dots: the context pointer itself, and a 32-bit signed relative
address for the call. Because <code class="language-plaintext highlighter-rouge">exebuf</code> is part of our loaded image, the
trampoline sits very near the callee, well within reach of a 32-bit
displacement. The only thing I don’t like about this function is that I’ve
manually worked out the patch offsets.</p>
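<p>One way to avoid hand-counting patch offsets is to emit the thunk through
a tiny byte writer that records each patch position as it goes. This is a
hypothetical alternative, not the code above: <code class="language-plaintext highlighter-rouge">emit</code>, <code class="language-plaintext highlighter-rouge">emit_thunk</code>,
and <code class="language-plaintext highlighter-rouge">patch_thunk</code> are made-up names. It writes the same 29 bytes into an
ordinary buffer, with addresses passed as integers, so the layout can be
checked on any host without executing anything.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned char *buf;
    ptrdiff_t      len;
} Emitter;

// Append raw bytes and return the offset at which they were written.
static ptrdiff_t emit(Emitter *e, const void *bytes, ptrdiff_t n)
{
    ptrdiff_t at = e->len;
    memcpy(e->buf + at, bytes, (size_t)n);
    e->len += n;
    return at;
}

typedef struct {
    ptrdiff_t arg;   // offset of the 8-byte context pointer
    ptrdiff_t rel;   // offset of the 4-byte call displacement
    ptrdiff_t next;  // offset the displacement is relative to
    ptrdiff_t len;   // total thunk size
} ThunkLayout;

// Emit the make_wndproc thunk with zeroed patch fields, recording where
// each patch lives. buf must hold at least 29 bytes.
static ThunkLayout emit_thunk(unsigned char *buf)
{
    static const unsigned char zero[8];
    Emitter e = {buf, 0};
    ThunkLayout t;
    emit(&e, "\x48\x83\xec\x28", 4);      // sub    $40, %rsp
    emit(&e, "\x48\xb8", 2);              // movabs $arg, %rax
    t.arg  = emit(&e, zero, 8);
    emit(&e, "\x48\x89\x44\x24\x20", 5);  // mov    %rax, 32(%rsp)
    emit(&e, "\xe8", 1);                  // call   proc
    t.rel  = emit(&e, zero, 4);
    t.next = e.len;                       // rel32 counts from here
    emit(&e, "\x48\x83\xc4\x28", 4);      // add    $40, %rsp
    emit(&e, "\xc3", 1);                  // ret
    t.len  = e.len;
    return t;
}

// Fill in both patches for a thunk that will live at address `thunk`,
// assuming two's complement conversion to int32_t as on x64 compilers.
static void patch_thunk(unsigned char *buf, ThunkLayout t,
                        uint64_t arg, uint64_t thunk, uint64_t proc)
{
    int32_t rel = (int32_t)(proc - (thunk + (uint64_t)t.next));
    memcpy(buf + t.arg, &arg, 8);
    memcpy(buf + t.rel, &rel, 4);
}
```

<p>The recorded offsets agree with the hand-worked 6, 20, and 24, and the
<code class="language-plaintext highlighter-rouge">arg</code> offset could also stand in for the literal 6 in <code class="language-plaintext highlighter-rouge">set_wndproc_arg</code>.</p>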

<p>It’s probably not useful, but it’s easy to update the context pointer at
any time if you hold onto the trampoline pointer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">set_wndproc_arg</span><span class="p">(</span><span class="n">WNDPROC</span> <span class="n">p</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">memcpy</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="o">+</span><span class="mi">6</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">arg</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So, for instance:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">...;</span>  <span class="c1">// multiple states</span>
    <span class="n">WNDPROC</span> <span class="n">proc</span> <span class="o">=</span> <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="c1">// ...</span>
    <span class="n">set_wndproc_arg</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>  <span class="c1">// switch states</span>
</code></pre></div></div>

<p>Though I expect the most common case is just creating multiple procedures:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">WNDPROC</span> <span class="n">procs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
        <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span>
    <span class="p">};</span>
</code></pre></div></div>

<p>To my slight surprise these trampolines still work with an active <a href="https://learn.microsoft.com/en-us/windows/win32/secbp/control-flow-guard">Control
Flow Guard</a> system policy. Trampolines do not have stack unwind
entries, and I thought Windows might refuse to pass control to them.</p>

<p>Here’s a complete, runnable example if you’d like to try it yourself:
<a href="https://gist.github.com/skeeto/13363b78489b26bed7485ec0d6b2c7f8"><code class="language-plaintext highlighter-rouge">main.c</code> and <code class="language-plaintext highlighter-rouge">exebuf.s</code></a></p>

<h3 id="better-cases">Better cases</h3>

<p>This is more work than going through <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code>, and real programs
have a small, fixed number of window procedures — typically one — so this
isn’t the best example, but I wanted to illustrate with a real interface.
Again, perhaps the best real use is a library with a weak custom allocator
interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">malloc</span><span class="p">)(</span><span class="kt">size_t</span><span class="p">);</span>   <span class="c1">// no context pointer!</span>
    <span class="kt">void</span>  <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>     <span class="c1">// "</span>
<span class="p">}</span> <span class="n">Allocator</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">arena_malloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>

<span class="c1">// ...</span>

    <span class="n">Allocator</span> <span class="n">perm_allocator</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">make_trampoline</span><span class="p">(</span><span class="n">exearena</span><span class="p">,</span> <span class="n">arena_malloc</span><span class="p">,</span> <span class="n">perm</span><span class="p">),</span>
        <span class="p">.</span><span class="n">free</span>   <span class="o">=</span> <span class="n">noop_free</span><span class="p">,</span>
    <span class="p">};</span>
    <span class="n">Allocator</span> <span class="n">scratch_allocator</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">make_trampoline</span><span class="p">(</span><span class="n">exearena</span><span class="p">,</span> <span class="n">arena_malloc</span><span class="p">,</span> <span class="n">scratch</span><span class="p">),</span>
        <span class="p">.</span><span class="n">free</span>   <span class="o">=</span> <span class="n">noop_free</span><span class="p">,</span>
    <span class="p">};</span>
</code></pre></div></div>
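<p><code class="language-plaintext highlighter-rouge">make_trampoline</code> isn’t defined here, and it can’t simply be
<code class="language-plaintext highlighter-rouge">make_wndproc</code>: <code class="language-plaintext highlighter-rouge">arena_malloc</code> takes its context as the second
argument, which Win64 passes in <code class="language-plaintext highlighter-rouge">rdx</code>. That actually makes it easier,
because no stack frame is needed. The trampoline can load the context into
<code class="language-plaintext highlighter-rouge">rdx</code> and tail-jump to the target, leaving the first argument in <code class="language-plaintext highlighter-rouge">rcx</code>
and the return address untouched. A sketch of just the byte layout,
assuming the Win64 convention (System V would use <code class="language-plaintext highlighter-rouge">rsi</code> instead) and a
target within <code class="language-plaintext highlighter-rouge">rel32</code> reach; <code class="language-plaintext highlighter-rouge">write_trampoline</code> and the <code class="language-plaintext highlighter-rouge">TRAMP_*</code>
constants are hypothetical:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    TRAMP_CTX = 2,   // offset of the 8-byte context pointer
    TRAMP_REL = 11,  // offset of the 4-byte jump displacement
    TRAMP_LEN = 15,  // total size; rel32 is relative to this offset
};

// Write a register-only trampoline into buf (at least TRAMP_LEN bytes)
// that will live at address `self`, binding `ctx` as the second Win64
// argument and tail-jumping to `target`:
//     movabs $ctx, %rdx     48 ba imm64
//     jmp    target         e9 rel32
// The caller must ensure target is within rel32 reach of self.
static void write_trampoline(unsigned char *buf, uint64_t self,
                             uint64_t target, uint64_t ctx)
{
    // Two's complement conversion to int32_t assumed, as on x64.
    int32_t rel = (int32_t)(target - (self + TRAMP_LEN));
    memcpy(buf, "\x48\xba", 2);
    memcpy(buf + TRAMP_CTX, &ctx, 8);
    buf[10] = 0xe9;
    memcpy(buf + TRAMP_REL, &rel, 4);
}
```

<p>In a real program the bytes would go into an allocation from <code class="language-plaintext highlighter-rouge">exebuf</code>,
with <code class="language-plaintext highlighter-rouge">self</code> being that allocation’s address, and the result cast to the
function pointer type the library expects.</p>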

<p>Something to keep in my back pocket for the future.</p>

]]>
    </content>
  </entry>
  
  <entry>
    <title>Speculations on arenas and non-trivial destructors</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/10/16/"/>
    <id>urn:uuid:102e0e39-0078-4698-b2d2-b9454dfe5545</id>
    <updated>2025-10-16T20:11:22Z</updated>
    <category term="cpp"/>
    <content type="html">
      <![CDATA[<p>As I <a href="/blog/2025/09/30/">continue to reflect</a> on arenas and lifetimes in C++, I realized
that dealing with destructors is not so onerous. In fact, it does not even
impact <a href="/blog/2025/01/19/">my established arena usage</a>! That is, implicit RAII-style
deallocation at scope termination, which works even in plain old C. With a
small change we can safely place resource-managing objects in arenas, such
as those owning file handles, sockets, threads, etc. (Though the ideal
remains <a href="/blog/2024/10/03/">resource management avoidance</a> when possible.) We can also
place traditional, memory-managing C++ objects in arenas, too. Their own
allocations won’t come from the arena — either because they <a href="/blog/2024/09/04/">lack the
interfaces</a> to do so, or they’re simply ineffective at it (<a href="https://en.cppreference.com/w/cpp/memory/polymorphic.html">pmr</a>) —
but they will reliably clean up after themselves. It’s all exception-safe,
too. In this article I’ll update my arena allocator with this new feature.
The change requires one additional arena pointer member, a bit of overhead
for objects with non-trivial destructors, and no impact on other objects.</p>

<p>I continue to title this “speculations” because, unlike arenas in C, I
have not (yet?) put these C++ techniques into practice in real software. I
haven’t refined them through use. Even ignoring its standard library as I
do here, C++ is an enormously complex programming language — far more so
than C — and I’m less confident that I’m not breaking a rule by accident.
I only want to break rules with intention!</p>

<p>As a reminder here’s where we left things off:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Arena</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="n">T</span> <span class="o">*</span><span class="n">raw_alloc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">);</span>
    <span class="kt">ptrdiff_t</span> <span class="n">pad</span>  <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">&amp;</span> <span class="p">(</span><span class="k">alignof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">count</span> <span class="o">&gt;=</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">-</span> <span class="n">pad</span><span class="p">)</span><span class="o">/</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">bad_alloc</span><span class="p">{};</span>  <span class="c1">// OOM policy</span>
    <span class="p">}</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">+</span> <span class="n">pad</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">+=</span> <span class="n">pad</span> <span class="o">+</span> <span class="n">count</span><span class="o">*</span><span class="n">size</span><span class="p">;</span>
    <span class="k">return</span> <span class="k">new</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="n">T</span><span class="p">[</span><span class="n">count</span><span class="p">]{};</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I used <code class="language-plaintext highlighter-rouge">throw</code> when out of memory mainly to emphasize that this works, but
you’re free to pick whatever is appropriate for your program. Remember,
that’s the entire allocator, including implicit deallocation, sufficient
to fulfill the allocation needs for most programs, though they must be
designed for it. Also note that it’s now <code class="language-plaintext highlighter-rouge">raw_alloc</code>, as we’ll be writing
a new, enhanced <code class="language-plaintext highlighter-rouge">alloc</code> that builds upon this one.</p>

<p>Also, as a reminder on usage, here’s an old example, updated for C++:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">wchar_t</span>   <span class="o">*</span><span class="nf">towidechar</span><span class="p">(</span><span class="n">Str</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>       <span class="c1">// convert to UTF-16</span>
<span class="n">Str</span>        <span class="nf">slurpfile</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// read an entire file</span>
<span class="n">Slice</span><span class="o">&lt;</span><span class="n">Str</span><span class="o">&gt;</span> <span class="n">split</span><span class="p">(</span><span class="n">Str</span><span class="p">,</span> <span class="kt">char</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>      <span class="c1">// split on delimiter</span>

<span class="n">Slice</span><span class="o">&lt;</span><span class="n">Str</span><span class="o">&gt;</span> <span class="n">getlines</span><span class="p">(</span><span class="n">Str</span> <span class="n">path</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">,</span> <span class="n">Arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// Use scratch for path conversion, auto-free on return</span>
    <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">wpath</span> <span class="o">=</span> <span class="n">towidechar</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>

    <span class="c1">// Use perm for file contents, which are returned</span>
    <span class="n">Str</span> <span class="n">buf</span> <span class="o">=</span> <span class="n">slurpfile</span><span class="p">(</span><span class="n">wpath</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>

    <span class="c1">// Use perm for the slice, pointing into buf</span>
    <span class="k">return</span> <span class="n">split</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
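<p>A minimal standalone sketch of this by-value scratch behavior follows. It assumes stand-in definitions for the simple two-pointer <code class="language-plaintext highlighter-rouge">Arena</code> and for <code class="language-plaintext highlighter-rouge">raw_alloc</code> (aligned bump allocation that throws when full), which may differ in detail from the versions defined earlier; <code class="language-plaintext highlighter-rouge">OutOfMemory</code> and <code class="language-plaintext highlighter-rouge">sum_squares</code> are hypothetical names. The callee receives the arena by value, so advancing <code class="language-plaintext highlighter-rouge">beg</code> only changes its own copy:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>

// Hypothetical stand-ins; the article's own definitions may differ.
struct OutOfMemory {};

struct Arena {
    char *beg;
    char *end;
};

// Bump-allocate count value-initialized T objects, throwing when full.
template<typename T>
T *raw_alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    if (count > (a->end - a->beg - pad)/size) {
        throw OutOfMemory{};
    }
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    return new (r) T[count]{};
}

// Hypothetical callee: its temporary array lives only in the scratch copy.
int sum_squares(int n, Arena scratch)
{
    int *tmp = raw_alloc<int>(&scratch, n);
    int  sum = 0;
    for (int i = 0; i < n; i++) {
        tmp[i] = i*i;
        sum += tmp[i];
    }
    return sum;  // scratch (and tmp with it) is discarded here
}
```

<p>Here <code class="language-plaintext highlighter-rouge">sum_squares(10, perm)</code> returns 285 and leaves <code class="language-plaintext highlighter-rouge">perm.beg</code> where it was, while allocating through <code class="language-plaintext highlighter-rouge">&amp;perm</code> does advance it.</p>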

<p>Changes to <code class="language-plaintext highlighter-rouge">scratch</code> do not persist after <code class="language-plaintext highlighter-rouge">getlines</code> returns, so objects
allocated from that arena are automatically freed on return. So far this
doesn’t rely on C++ RAII features, just simple value semantics. It works
well because all the objects in question have trivial destructors. But
suppose there’s a resource to manage:</p>

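<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;sys/socket.h&gt;</span>  <span class="c1">// socket, AF_INET, SOCK_STREAM</span>
<span class="cp">#include</span> <span class="cpf">&lt;unistd.h&gt;</span>      <span class="c1">// close</span>

<span class="k">struct</span> <span class="nc">TcpSocket</span> <span class="p">{</span>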
    <span class="kt">int</span> <span class="n">socket</span> <span class="o">=</span> <span class="o">::</span><span class="n">socket</span><span class="p">(</span><span class="n">AF_INET</span><span class="p">,</span> <span class="n">SOCK_STREAM</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">TcpSocket</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>
    <span class="n">TcpSocket</span><span class="p">(</span><span class="n">TcpSocket</span> <span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>
    <span class="kt">void</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="n">TcpSocket</span> <span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>
    <span class="c1">// TODO: move ctor/operator</span>
    <span class="o">~</span><span class="n">TcpSocket</span><span class="p">()</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">socket</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span> <span class="n">close</span><span class="p">(</span><span class="n">socket</span><span class="p">);</span> <span class="p">}</span>
    <span class="k">operator</span> <span class="kt">int</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">socket</span><span class="p">;</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
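<p>The problem is easy to observe without real sockets. This sketch assumes stand-ins for the simple <code class="language-plaintext highlighter-rouge">Arena</code> and <code class="language-plaintext highlighter-rouge">raw_alloc</code> (which may differ from the article’s), and <code class="language-plaintext highlighter-rouge">Counted</code> and <code class="language-plaintext highlighter-rouge">leaky</code> are hypothetical: a type that counts its destructor calls, and a function that allocates three of them from a scratch arena:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>

// Hypothetical stand-ins; the article's own definitions may differ.
struct OutOfMemory {};

struct Arena {
    char *beg;
    char *end;
};

template<typename T>
T *raw_alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    if (count > (a->end - a->beg - pad)/size) {
        throw OutOfMemory{};
    }
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    return new (r) T[count]{};  // constructs, but nothing will ever destroy
}

// Hypothetical resource stand-in: counts how many times it is destroyed.
static int destroyed = 0;

struct Counted {
    ~Counted() { destroyed++; }
};

void leaky(Arena scratch)
{
    raw_alloc<Counted>(&scratch, 3);  // three objects constructed...
}                                     // ...and silently abandoned here
```

<p>After <code class="language-plaintext highlighter-rouge">leaky</code> returns, the counter is still zero: the arena memory was reclaimed, but no destructor ran.</p>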

<p>If we allocate a TcpSocket in an arena, whether directly or as a member
of another object, its destructor never runs unless we call it manually. To
deal with this we’ll keep track of objects requiring destruction using a
linked list of destructors, forming a LIFO stack:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Dtor</span> <span class="p">{</span>
    <span class="n">Dtor</span>     <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="kt">void</span>     <span class="o">*</span><span class="n">objects</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">void</span>     <span class="p">(</span><span class="o">*</span><span class="n">dtor</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="n">objects</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">count</span><span class="p">);</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Each Dtor holds a pointer to a homogeneous array, its element count
(typically one), and a pointer to a function that knows how to destroy
those objects. The linked list itself is heterogeneous, with dynamic type:
the function pointer acts as a kind of type tag. The <code class="language-plaintext highlighter-rouge">dtor</code> functions will be
generated from a function template:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="kt">void</span> <span class="nf">destroy</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">count</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">T</span> <span class="o">*</span><span class="n">objects</span> <span class="o">=</span> <span class="p">(</span><span class="n">T</span> <span class="o">*</span><span class="p">)</span><span class="n">ptr</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">count</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">objects</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="o">~</span><span class="n">T</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice it destroys objects end-to-beginning, the reverse of the order in
which placement <code class="language-plaintext highlighter-rouge">new[]</code> constructs them. It’s essentially a placement
<code class="language-plaintext highlighter-rouge">delete[]</code>. The arena gains the Dtor list as a new member, initialized
empty:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Arena</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
    <span class="n">Dtor</span> <span class="o">*</span><span class="n">dtors</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// ...</span>

<span class="p">};</span>
</code></pre></div></div>

<p>There are two different ways to construct an arena: over a block of raw
memory (unowned), or from an existing arena to borrow a scratch arena over
its free space. So that’s two constructors:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Arena</span> <span class="p">{</span>
    <span class="c1">// ...</span>

    <span class="n">Arena</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">mem</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">)</span> <span class="o">:</span> <span class="n">beg</span><span class="p">{</span><span class="n">mem</span><span class="p">},</span> <span class="n">end</span><span class="p">{</span><span class="n">mem</span><span class="o">+</span><span class="n">len</span><span class="p">}</span> <span class="p">{}</span>
    <span class="n">Arena</span><span class="p">(</span><span class="n">Arena</span> <span class="o">&amp;</span><span class="n">a</span><span class="p">)</span> <span class="o">:</span> <span class="n">beg</span><span class="p">{</span><span class="n">a</span><span class="p">.</span><span class="n">beg</span><span class="p">},</span> <span class="n">end</span><span class="p">{</span><span class="n">a</span><span class="p">.</span><span class="n">end</span><span class="p">}</span> <span class="p">{}</span>

    <span class="c1">// ...</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Finally, a destructor pops the Dtor list until it’s empty, running the
registered destructors in reverse order when the arena is destroyed:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Arena</span> <span class="p">{</span>
    <span class="c1">// ...</span>

    <span class="kt">void</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="n">Arena</span> <span class="o">&amp;</span><span class="p">)</span> <span class="o">=</span> <span class="k">delete</span><span class="p">;</span>  <span class="c1">// rule of three</span>

    <span class="o">~</span><span class="n">Arena</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="n">dtors</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">Dtor</span> <span class="o">*</span><span class="n">dead</span> <span class="o">=</span> <span class="n">dtors</span><span class="p">;</span>
            <span class="n">dtors</span> <span class="o">=</span> <span class="n">dead</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
            <span class="n">dead</span><span class="o">-&gt;</span><span class="n">dtor</span><span class="p">(</span><span class="n">dead</span><span class="o">-&gt;</span><span class="n">objects</span><span class="p">,</span> <span class="n">dead</span><span class="o">-&gt;</span><span class="n">count</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>(Note: This should probably use a local variable instead of manipulating
the <code class="language-plaintext highlighter-rouge">dtors</code> member directly. Updates to <code class="language-plaintext highlighter-rouge">dtors</code> are potentially visible to
destructors, inhibiting optimization.) The new, enhanced <code class="language-plaintext highlighter-rouge">alloc</code> building
upon <code class="language-plaintext highlighter-rouge">raw_alloc</code>:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="n">T</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">__has_trivial_destructor</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o">||</span> <span class="o">!</span><span class="n">count</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">raw_alloc</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="n">Dtor</span> <span class="o">*</span><span class="n">dtor</span>    <span class="o">=</span> <span class="n">raw_alloc</span><span class="o">&lt;</span><span class="n">Dtor</span><span class="o">&gt;</span><span class="p">(</span><span class="n">a</span><span class="p">);</span>  <span class="c1">// allocate first</span>
    <span class="n">T</span>    <span class="o">*</span><span class="n">r</span>       <span class="o">=</span> <span class="n">raw_alloc</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
    <span class="n">dtor</span><span class="o">-&gt;</span><span class="n">next</span>    <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">dtors</span><span class="p">;</span>
    <span class="n">dtor</span><span class="o">-&gt;</span><span class="n">objects</span> <span class="o">=</span> <span class="n">r</span><span class="p">;</span>
    <span class="n">dtor</span><span class="o">-&gt;</span><span class="n">count</span>   <span class="o">=</span> <span class="n">count</span><span class="p">;</span>
    <span class="n">dtor</span><span class="o">-&gt;</span><span class="n">dtor</span>    <span class="o">=</span> <span class="n">destroy</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">;</span>

    <span class="n">a</span><span class="o">-&gt;</span><span class="n">dtors</span> <span class="o">=</span> <span class="n">dtor</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’m using the non-standard <code class="language-plaintext highlighter-rouge">__has_trivial_destructor</code> built-in, supported
by all major C++ implementations, so we still don’t need the C++ standard
library, though <a href="https://en.cppreference.com/w/cpp/types/is_destructible.html"><code class="language-plaintext highlighter-rouge">std::is_trivially_destructible</code></a> is the usual
tool here. <a href="https://clang.llvm.org/docs/LanguageExtensions.html#:~:text=__has_trivial_destructor">LLVM is pushing <code class="language-plaintext highlighter-rouge">__is_trivially_destructible</code></a> instead,
but it’s not supported by GCC <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107600">until GCC 16</a>.</p>

<p>Since it’s so simple to do, a zero count skips the Dtor entirely, as
there’s nothing to destroy. Things get more interesting for a non-zero
number of non-trivially destructible objects. It allocates the Dtor first,
and the order matters: if the array were allocated first and the Dtor
allocation then failed, the freshly constructed objects would have no Dtor
entry and would leak. Then it allocates the array, attaches it to the Dtor,
and pushes the Dtor onto the arena’s list, registering the objects for
cleanup.</p>
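<p>A standalone sketch of the ordering, under some assumptions: <code class="language-plaintext highlighter-rouge">raw_alloc</code> is a stand-in that constructs with placement <code class="language-plaintext highlighter-rouge">new[]</code>; <code class="language-plaintext highlighter-rouge">OutOfMemory</code>, <code class="language-plaintext highlighter-rouge">Tracked</code>, and <code class="language-plaintext highlighter-rouge">order_demo</code> are hypothetical; and <code class="language-plaintext highlighter-rouge">std::is_trivially_destructible</code> takes the place of the <code class="language-plaintext highlighter-rouge">__has_trivial_destructor</code> built-in. It registers two allocations on a scratch arena, plus a trivially destructible array that gets no Dtor:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>
#include <type_traits>

struct OutOfMemory {};  // hypothetical

struct Dtor {
    Dtor     *next;
    void     *objects;
    ptrdiff_t count;
    void     (*dtor)(void *objects, ptrdiff_t count);
};

template<class T>
void destroy(void *ptr, ptrdiff_t count)
{
    T *objects = (T *)ptr;
    for (ptrdiff_t i = count-1; i >= 0; i--) {
        objects[i].~T();
    }
}

struct Arena {
    char *beg;
    char *end;
    Dtor *dtors = 0;

    Arena(char *mem, ptrdiff_t len) : beg{mem}, end{mem+len} {}
    Arena(Arena &a) : beg{a.beg}, end{a.end} {}
    void operator=(Arena &) = delete;

    ~Arena()
    {
        while (dtors) {
            Dtor *dead = dtors;
            dtors = dead->next;
            dead->dtor(dead->objects, dead->count);
        }
    }
};

// Stand-in raw_alloc: aligned bump allocation plus placement new[]
template<typename T>
T *raw_alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    if (count > (a->end - a->beg - pad)/size) {
        throw OutOfMemory{};
    }
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    return new (r) T[count]{};
}

template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    // Standard trait standing in for __has_trivial_destructor(T)
    if (std::is_trivially_destructible<T>::value || !count) {
        return raw_alloc<T>(a, count);
    }
    Dtor *dtor    = raw_alloc<Dtor>(a);  // allocate first
    T    *r       = raw_alloc<T>(a, count);
    dtor->next    = a->dtors;
    dtor->objects = r;
    dtor->count   = count;
    dtor->dtor    = destroy<T>;
    a->dtors = dtor;
    return r;
}

// Hypothetical type that logs its id when destroyed
static int log_ids[16];
static int log_len = 0;

struct Tracked {
    int id = 0;
    ~Tracked() { log_ids[log_len++] = id; }
};

void order_demo(Arena scratch)
{
    Tracked *pair = alloc<Tracked>(&scratch, 2);  // one Dtor, count 2
    pair[0].id = 1;
    pair[1].id = 2;
    alloc<Tracked>(&scratch)->id = 3;             // pushed last, runs first
    alloc<int>(&scratch, 8);                      // trivial: no Dtor entry
}
```

<p>After <code class="language-plaintext highlighter-rouge">order_demo</code> returns, the log reads 3, 2, 1: the most recently pushed Dtor runs first, and within the two-element array <code class="language-plaintext highlighter-rouge">destroy&lt;T&gt;</code> goes back to front. The <code class="language-plaintext highlighter-rouge">int</code> array never appears on the list.</p>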

<p>If a constructor throws, placement <code class="language-plaintext highlighter-rouge">new[]</code> automatically destroys the
objects constructed so far (i.e. the real placement <code class="language-plaintext highlighter-rouge">delete[]</code>) before the
exception propagates. Since the Dtor isn’t linked into the arena until the
array is fully constructed, nothing is destroyed twice, so that case is
already covered.</p>
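<p>That path can be exercised with a sketch like the following. It assumes a stand-in <code class="language-plaintext highlighter-rouge">raw_alloc</code> that constructs with placement <code class="language-plaintext highlighter-rouge">new[]</code>, uses <code class="language-plaintext highlighter-rouge">std::is_trivially_destructible</code> in place of the built-in, and introduces hypothetical <code class="language-plaintext highlighter-rouge">OutOfMemory</code>, <code class="language-plaintext highlighter-rouge">Fragile</code>, and <code class="language-plaintext highlighter-rouge">try_fragile</code> names. The third <code class="language-plaintext highlighter-rouge">Fragile</code> constructor call throws:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>
#include <type_traits>

struct OutOfMemory {};  // hypothetical

struct Dtor {
    Dtor     *next;
    void     *objects;
    ptrdiff_t count;
    void     (*dtor)(void *objects, ptrdiff_t count);
};

template<class T>
void destroy(void *ptr, ptrdiff_t count)
{
    T *objects = (T *)ptr;
    for (ptrdiff_t i = count-1; i >= 0; i--) {
        objects[i].~T();
    }
}

struct Arena {
    char *beg;
    char *end;
    Dtor *dtors = 0;

    Arena(char *mem, ptrdiff_t len) : beg{mem}, end{mem+len} {}
    Arena(Arena &a) : beg{a.beg}, end{a.end} {}
    void operator=(Arena &) = delete;

    ~Arena()
    {
        while (dtors) {
            Dtor *dead = dtors;
            dtors = dead->next;
            dead->dtor(dead->objects, dead->count);
        }
    }
};

// Stand-in raw_alloc: aligned bump allocation plus placement new[]
template<typename T>
T *raw_alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    if (count > (a->end - a->beg - pad)/size) {
        throw OutOfMemory{};
    }
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    return new (r) T[count]{};
}

template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    if (std::is_trivially_destructible<T>::value || !count) {
        return raw_alloc<T>(a, count);
    }
    Dtor *dtor    = raw_alloc<Dtor>(a);      // allocate first
    T    *r       = raw_alloc<T>(a, count);  // may throw mid-construction
    dtor->next    = a->dtors;                // only reached on success
    dtor->objects = r;
    dtor->count   = count;
    dtor->dtor    = destroy<T>;
    a->dtors = dtor;
    return r;
}

// Hypothetical type whose third construction throws
static int constructed = 0;
static int destroyed   = 0;

struct Fragile {
    Fragile() { if (constructed == 2) throw 42; constructed++; }
    ~Fragile() { destroyed++; }
};

bool try_fragile(Arena *a)
{
    try {
        alloc<Fragile>(a, 5);  // elements 0 and 1 built, then 2 throws
    } catch (int) {
        return true;           // new[] already destroyed elements 1 and 0
    }
    return false;
}
```

<p>Two objects are constructed, both are destroyed by <code class="language-plaintext highlighter-rouge">new[]</code> during unwinding, and the arena’s <code class="language-plaintext highlighter-rouge">dtors</code> list is still empty afterward, so the arena’s destructor won’t touch them again.</p>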

<p>With a little more cleverness we could omit the <code class="language-plaintext highlighter-rouge">objects</code> pointer and
discover the array using pointer arithmetic off the Dtor object itself.
That’s tricky (consider alignment), and generally unnecessary, so I didn’t
worry about it. With arenas, allocator overhead is already well below that
of conventional allocation, so slack is plentiful. Chances are we will
also never need an <em>array</em> of non-trivially destructible objects, and so
we could probably omit <code class="language-plaintext highlighter-rouge">count</code>, then write a single-object allocator that
forwards constructor arguments (e.g. a handle to the resource to be
managed). That involves no new concepts, and I leave it as an exercise for
the reader.</p>
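<p>One possible take on that exercise, offered as a sketch rather than the article’s own design: each Dtor entry drops <code class="language-plaintext highlighter-rouge">count</code>, and a forwarding placement <code class="language-plaintext highlighter-rouge">new</code> builds the object, so it doesn’t need a default constructor. The names <code class="language-plaintext highlighter-rouge">make</code>, <code class="language-plaintext highlighter-rouge">raw_bytes</code>, <code class="language-plaintext highlighter-rouge">destroy_one</code>, <code class="language-plaintext highlighter-rouge">Handle</code>, <code class="language-plaintext highlighter-rouge">handle_demo</code>, and <code class="language-plaintext highlighter-rouge">OutOfMemory</code> are all hypothetical, and <code class="language-plaintext highlighter-rouge">Handle</code> records its descriptor instead of closing it:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>
#include <type_traits>
#include <utility>

struct OutOfMemory {};  // hypothetical

// Single-object Dtor: no count, one object per entry
struct Dtor {
    Dtor *next;
    void *object;
    void (*dtor)(void *object);
};

template<typename T>
void destroy_one(void *p) { ((T *)p)->~T(); }

struct Arena {
    char *beg;
    char *end;
    Dtor *dtors = 0;

    Arena(char *mem, ptrdiff_t len) : beg{mem}, end{mem+len} {}
    Arena(Arena &a) : beg{a.beg}, end{a.end} {}
    void operator=(Arena &) = delete;

    ~Arena()
    {
        while (dtors) {
            Dtor *dead = dtors;
            dtors = dead->next;
            dead->dtor(dead->object);
        }
    }
};

// Hypothetical helper: aligned, uninitialized bytes from the arena
void *raw_bytes(Arena *a, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    if (size > a->end - a->beg - pad) {
        throw OutOfMemory{};
    }
    char *r = a->beg + pad;
    a->beg += pad + size;
    return r;
}

// Hypothetical single-object allocator forwarding constructor arguments
template<typename T, typename... Args>
T *make(Arena *a, Args &&...args)
{
    if (std::is_trivially_destructible<T>::value) {
        return new (raw_bytes(a, sizeof(T), alignof(T))) T(std::forward<Args>(args)...);
    }
    Dtor *d = new (raw_bytes(a, sizeof(Dtor), alignof(Dtor))) Dtor{};  // first
    T    *r = new (raw_bytes(a, sizeof(T), alignof(T))) T(std::forward<Args>(args)...);
    d->next   = a->dtors;
    d->object = r;
    d->dtor   = destroy_one<T>;
    a->dtors  = d;
    return r;
}

// Hypothetical handle: logs the descriptor in place of close(fd)
static int closed[8];
static int nclosed = 0;

struct Handle {
    int fd;
    explicit Handle(int fd) : fd{fd} {}
    ~Handle() { closed[nclosed++] = fd; }
};

void handle_demo(Arena scratch)
{
    make<Handle>(&scratch, 10);
    make<Handle>(&scratch, 20);  // released first, LIFO
}
```

<p>Trivially destructible types such as <code class="language-plaintext highlighter-rouge">int</code> skip the Dtor entirely, so <code class="language-plaintext highlighter-rouge">make&lt;int&gt;(&amp;perm, 7)</code> costs only the object itself.</p>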

<p>With that in place, we could now allocate an array of TcpSockets:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="n">Arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TcpSocket</span> <span class="o">*</span><span class="n">sockets</span> <span class="o">=</span> <span class="n">alloc</span><span class="o">&lt;</span><span class="n">TcpSocket</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="mi">100</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These sockets will all be closed when <code class="language-plaintext highlighter-rouge">example</code> returns, via their single
Dtor entry on <code class="language-plaintext highlighter-rouge">scratch</code>. Now consider calling <code class="language-plaintext highlighter-rouge">example</code> with an arena:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">caller</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">example</span><span class="p">(</span><span class="o">*</span><span class="n">perm</span><span class="p">);</span>  <span class="c1">// creates a scratch arena</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This invokes the copy constructor, creating a scratch arena with an empty
<code class="language-plaintext highlighter-rouge">dtors</code> list to be passed into <code class="language-plaintext highlighter-rouge">example</code>. Objects already registered in <code class="language-plaintext highlighter-rouge">*perm</code> will
not be destroyed by <code class="language-plaintext highlighter-rouge">example</code> because the copy doesn’t inherit the <code class="language-plaintext highlighter-rouge">dtors</code> list. Had we
passed a <em>pointer to an arena</em> instead, no Arena constructor would run, so
the callee would use the caller’s arena, pushing its Dtors onto the caller’s
list, and those objects would live until the caller’s arena is destroyed.</p>
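<p>The two calling styles can be compared in a sketch. It assumes a stand-in <code class="language-plaintext highlighter-rouge">raw_alloc</code> built on placement <code class="language-plaintext highlighter-rouge">new[]</code>, uses <code class="language-plaintext highlighter-rouge">std::is_trivially_destructible</code> in place of the built-in, and introduces hypothetical <code class="language-plaintext highlighter-rouge">OutOfMemory</code>, <code class="language-plaintext highlighter-rouge">Tracked</code>, <code class="language-plaintext highlighter-rouge">by_copy</code>, and <code class="language-plaintext highlighter-rouge">by_pointer</code> names:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>
#include <type_traits>

struct OutOfMemory {};  // hypothetical

struct Dtor {
    Dtor     *next;
    void     *objects;
    ptrdiff_t count;
    void     (*dtor)(void *objects, ptrdiff_t count);
};

template<class T>
void destroy(void *ptr, ptrdiff_t count)
{
    T *objects = (T *)ptr;
    for (ptrdiff_t i = count-1; i >= 0; i--) {
        objects[i].~T();
    }
}

struct Arena {
    char *beg;
    char *end;
    Dtor *dtors = 0;

    Arena(char *mem, ptrdiff_t len) : beg{mem}, end{mem+len} {}
    Arena(Arena &a) : beg{a.beg}, end{a.end} {}  // dtors starts empty
    void operator=(Arena &) = delete;

    ~Arena()
    {
        while (dtors) {
            Dtor *dead = dtors;
            dtors = dead->next;
            dead->dtor(dead->objects, dead->count);
        }
    }
};

// Stand-in raw_alloc: aligned bump allocation plus placement new[]
template<typename T>
T *raw_alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    if (count > (a->end - a->beg - pad)/size) {
        throw OutOfMemory{};
    }
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    return new (r) T[count]{};
}

template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    if (std::is_trivially_destructible<T>::value || !count) {
        return raw_alloc<T>(a, count);
    }
    Dtor *dtor    = raw_alloc<Dtor>(a);  // allocate first
    T    *r       = raw_alloc<T>(a, count);
    dtor->next    = a->dtors;
    dtor->objects = r;
    dtor->count   = count;
    dtor->dtor    = destroy<T>;
    a->dtors = dtor;
    return r;
}

// Hypothetical type that logs its id when destroyed
static int log_ids[16];
static int log_len = 0;

struct Tracked {
    int id = 0;
    ~Tracked() { log_ids[log_len++] = id; }
};

void by_copy(Arena scratch)  // copy constructor: fresh, empty Dtor list
{
    alloc<Tracked>(&scratch)->id = 1;
}   // scratch is destroyed when the call completes, destroying Tracked 1

void by_pointer(Arena *perm)  // no copy: the Dtor lands on the caller's list
{
    alloc<Tracked>(perm)->id = 2;
}
```

<p>The object allocated through <code class="language-plaintext highlighter-rouge">by_copy</code> is destroyed as soon as that call completes, while the one allocated through <code class="language-plaintext highlighter-rouge">by_pointer</code> sits on <code class="language-plaintext highlighter-rouge">perm</code>’s list until <code class="language-plaintext highlighter-rouge">perm</code> itself goes out of scope.</p>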

<p>In other words, the interface hasn’t changed! That’s the most exciting
part for me. This by-copy, by-pointer interfacing has really grown on me
over the past two years.</p>

]]>
    </content>
  </entry>
  

</feed>
