<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged c at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/c/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/c/feed/"/>
  <updated>2026-04-07T03:24:16Z</updated>
  <id>urn:uuid:0ba7b921-5597-4fbc-8ceb-88afb378c637</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  
    
  
    
  <entry>
    <title>2026 has been the most pivotal year in my career… and it's only March</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/03/29/"/>
    <id>urn:uuid:91d679b3-4f07-4b61-b359-5890695ad621</id>
    <updated>2026-03-29T21:38:22Z</updated>
    <category term="ai"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>In February I left my employer after nearly two decades of service. In the
moment I was optimistic, yet unsure I made the right choice. Dust settled,
I’m now absolutely sure I chose correctly. I’m happier and better for it.
There were multiple factors, but it’s not mere chance it coincides with
these early months of <a href="https://shumer.dev/something-big-is-happening">the automation of software engineering</a>. I
left an employer that is <em>years behind</em> adopting AI to one actively
supporting and encouraging it. As of March, in my professional capacity
<strong>I no longer write code myself</strong>. My current situation was unimaginable
to me only a year ago. Like it or not, this is the future of software
engineering. Turns out I like it, and having tasted the future I don’t
want to go back to the old ways.</p>

<p>In case you’re worried, this is still me. These are my own words. <a href="https://paulgraham.com/writes.html">Writing
is thinking</a>, and it would defeat the purpose for an AI to write
in my place on my personal blog. That’s not going to change.</p>

<p>I still spend much time reading and understanding code, and using most of
the same development tools. It’s more like being a manager, orchestrating
a nebulous team of inhumanly-fast, nameless assistants. Instead of dicing
the vegetables, I conjure a helper to do it while I continue to run the
kitchen. I haven’t managed people in some 20 years now, but I can feel
those old muscles being put to use again as I improve at this new role.
Will these kitchens still need human chefs like me by the end of the
decade? Unclear, and it’s something we all need to prepare for.</p>

<p>My situation gave me an experience onboarding with AI assistance — a fast
process given a near-instant, infinitely-patient helper answering any
question about the code. By my second week I was making substantial, wide
contributions to the large C++ code base. It’s difficult to attach a
quantifiable factor like 2x, 5x, 10x, etc. faster, but I can say for
certain this wouldn’t have been possible without AI. The bottlenecks have
shifted from producing code, which now takes relatively no time at all, to
other points, and we’re all still trying to figure it out.</p>

<p>My personal programming has transformed as well. Everything <a href="/blog/2024/11/10/">I said about
AI in late 2024</a> is, as I predicted, utterly obsolete. There’s a
huge, growing gap between open weight models and the frontier. Models you
can run yourself are toys. In general, almost any AI product or service
worth your attention costs money. The free stuff is, at minimum, months
behind. Most people only use limited, free services, so there’s a broad
unawareness of just how far AI has advanced. AI is <em>now highly skilled at
programming</em>, and better than me at almost every programming task, with
inhumanly-low defect rates. The remaining issues are mainly steering
problems: If AI code doesn’t do what I need, likely the AI writing it
didn’t understand what I needed.</p>

<p>I’ll still write code myself from time to time for fun — <a href="/blog/2018/06/10/">minimalist</a>,
with my <a href="/blog/2023/10/08/">style</a> and <a href="/blog/2025/01/19/">techniques</a> — the same way I play <a href="https://en.wikipedia.org/wiki/Shogi">shogi</a> on
the weekends for fun. However, artisan production is uneconomical in the
presence of industrialization. AI makes programming so cheap that only the
rich will write code by hand.</p>

<p>A small part of me is sad at what is lost. A bigger part is excited about
the possibilities of the future. I’ve always had more ideas than time or
energy to pursue them. With AI at my command, the problem changes shape. I
can comfortably take on complexity from which I previously shied away, and
I can take a shot at any idea sufficiently formed in my mind to prompt an
AI — a whole skill of its own that I’m actively developing.</p>

<p>For instance, a couple weeks ago I <a href="https://github.com/skeeto/w64devkit/pull/357">put AI to work on a problem</a>,
and it produced a working solution for me after ~12 hours of continuous,
autonomous work, literally while I slept. The past month <a href="https://github.com/skeeto/w64devkit">w64devkit</a> has
burst with activity, almost entirely AI-driven. Some of it architectural
changes I’ve wanted for years, but would require hours of tedious work,
and so I never got around to it. AI knocked it out in minutes, with the
new architecture opening new opportunities. It’s also taken on most of the
cognitive load of maintenance.</p>

<h3 id="quiltcpp">Quilt.cpp</h3>

<p>So far my biggest successful undertaking is <strong><a href="https://github.com/skeeto/quilt.cpp">Quilt.cpp</a></strong>, a C++
clone of <a href="https://savannah.nongnu.org/projects/quilt">Quilt</a>, an early, actively-used source control system for
patch management. Git is a glaring omission from the <a href="/blog/2020/09/25/">almost</a> complete
w64devkit, due to platform and build issues. I’ve thought Quilt could fill
<em>some</em> of that source control hole, except the original is written in
Bash, Perl, and GNU Coreutils — even more of a challenge than Git. Since
Quilt is conceptually simple, and I could lean on <a href="https://frippery.org/busybox/">busybox-w32</a> <code class="language-plaintext highlighter-rouge">diff</code>
and <code class="language-plaintext highlighter-rouge">patch</code>, I’ve considered writing my own implementation, just <a href="/blog/2023/01/18/">as I did
pkg-config</a>, but I never found the energy to do it.</p>

<p>Then I got good enough with AI to knock out a near feature-complete clone
in about four days, including a built-in <code class="language-plaintext highlighter-rouge">diff</code> and <code class="language-plaintext highlighter-rouge">patch</code> so it doesn’t
actually depend on external tools (except invoking <code class="language-plaintext highlighter-rouge">$EDITOR</code>). On Windows
it’s a ~1.6MB standalone EXE, to be included in future w64devkit releases.
The source is distributed as an amalgamation, a single file <code class="language-plaintext highlighter-rouge">quilt.cpp</code>
per its namesake:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ c++ -std=c++20 -O2 -s -o quilt.exe quilt.cpp
$ ./quilt.exe --help
Usage: quilt [--quiltrc file] &lt;command&gt; [options] [args]

Commands:
  new        Create a new empty patch
  add        Add files to the topmost patch
  push       Apply patches to the source tree
  pop        Remove applied patches from the stack
  refresh    Regenerate a patch from working tree changes
  diff       Show the diff of the topmost or a specified patch
  series     List all patches in the series
  applied    List applied patches
  unapplied  List patches not yet applied
  top        Show the topmost applied patch
  next       Show the next patch after the top or a given patch
  previous   Show the patch before the top or a given patch
  delete     Remove a patch from the series
  rename     Rename a patch
  import     Import an external patch into the series
  header     Print or modify a patch header
  files      List files modified by a patch
  patches    List patches that modify a given file
  edit       Add files to the topmost patch and open an editor
  revert     Discard working tree changes to files in a patch
  remove     Remove files from the topmost patch
  fold       Fold a diff from stdin into the topmost patch
  fork       Create a copy of the topmost patch under a new name
  annotate   Show which patch modified each line of a file
  graph      Print a dot dependency graph of applied patches
  mail       Generate an mbox file from a range of patches
  grep       Search source files (not implemented)
  setup      Set up a source tree from a series file (not implemented)
  shell      Open a subshell (not implemented)
  snapshot   Save a snapshot of the working tree for later diff
  upgrade    Upgrade quilt metadata to the current format
  init       Initialize quilt metadata in the current directory

Use "quilt &lt;command&gt; --help" for details on a specific command.
</code></pre></div></div>

<p>It supports Windows and POSIX, and runs ~5x faster than the original. AI
developed it on Windows, Linux, and macOS: It’s best when the AI can close
the debug loop and tackle problems autonomously without involving a human
slowpoke. The handful of “not implemented” parts aren’t because they’re
too hard — each would probably take an AI ~10 minutes — but deliberate
decisions of taste.</p>

<p>There’s an irony that the reason I could produce Quilt.cpp with such ease
is also a reason I don’t really need it anymore.</p>

<p>I changed the output of <code class="language-plaintext highlighter-rouge">quilt mail</code> to be more Git-compatible. The mbox
produced by Quilt.cpp can be imported into Git with a plain <code class="language-plaintext highlighter-rouge">git am</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ quilt mail --mbox feature-branch.mbox
$ git am feature-branch.mbox
</code></pre></div></div>

<p>The idea being that I could work on a machine without Git (e.g. Windows
XP), and copy/mail the mbox to another machine where Git can absorb it as
though it were in Git the whole time. <code class="language-plaintext highlighter-rouge">git format-patch</code> to <code class="language-plaintext highlighter-rouge">quilt import</code>
sends commits in the opposite direction, useful for manually testing
Quilt.cpp on real change sets.</p>

<p>To be clear, I could not have done this if the original Quilt did not
exist as a working program. I began with an AI generating a <a href="https://eli.thegreenplace.net/2026/rewriting-pycparser-with-the-help-of-an-llm/">conformance
suite</a> based on the original, its documentation, and other online
documentation, validating that suite against the original implementation
(see <code class="language-plaintext highlighter-rouge">-DQUILT_TEST_EXECUTABLE</code>). Then had another AI code to the tests, on
architectural guidance from me, with <code class="language-plaintext highlighter-rouge">-D_GLIBCXX_DEBUG</code> and sanitizers as
guardrails. That was day one. The next three days were lots of refining
and iteration as I discovered the gaps in the test suite. I’d prompt AI to
compare Quilt.cpp to the original Quilt man page, add tests for missing
features, validate the new tests against the original Quilt, then run
several agents to fix the tests. While they worked I’d try the latest
build and note any bugs. As of this writing, the result is about equal
parts test and non-test, ~9KLoC each.</p>

<p>I’m likely to use this technique to clone other tools with implementations
unsuitable for my purposes. I learned quite a bit from this first attempt.</p>

<p>Why C++ instead of my usual choice of C? As we know, <a href="/blog/2023/02/11/">conventional C is
highly error-prone</a>. Even AI has trouble with it. In the ~9k lines
of C++ that is Quilt.cpp, I am only aware of three memory safety errors by
the AI. Two were null-terminated string issues with <code class="language-plaintext highlighter-rouge">strtol</code>, where the AI
was essentially writing C instead of C++, after which I directed the AI to
use <code class="language-plaintext highlighter-rouge">std::from_chars</code> and drop as much direct libc use as possible. (The
other was an unlikely branch with <code class="language-plaintext highlighter-rouge">std::vector::back</code> on an empty vector.)
We can rescue C with better techniques like arena allocation, counted
strings, and slices, but while (current) state of the art AI understands
these things, it cannot work effectively with them in C. I’ve tried. So I
picked C++, and from my professional work I know AI is better at C++ than
me.</p>

<p>Also like a manager, I have not read most of the code, and instead focused
on results, so you might say this was “vibe-coded.” It <em>is</em> thoroughly
tested, though I’m sure there are still bugs to be ironed out, especially
on the more esoteric features I haven’t tried by hand yet.</p>

<h3 id="lets-discuss-tools">Let’s discuss tools</h3>

<p>I opposed CMake for years, yet you may have noticed the latest w64devkit
now includes CMake and Ninja. What happened? Preparing for my anticipated
employment change, this past December I read <a href="https://crascit.com/professional-cmake/"><em>Professional CMake</em></a>.
I realized that my practical problems with CMake were that nearly everyone
uses it incorrectly. Most CMake builds are a disaster, but my new-found
knowledge allows me to navigate the common mistakes. Only high profile
open source projects manage to put together proper CMake builds. Otherwise
the internet is loaded with CMake misinformation. Similar to AI, if you’re
not paying for CMake knowledge then it’s likely wrong or misleading. So I
highly recommend that book!</p>

<p>Frontier AI is <em>very good</em> with CMake. When a project has a CMake build
that isn’t <em>too</em> badly broken, just tell AI to fix it, <em>without any
specifics</em>, and build problems disappear in mere minutes without having to
think about it. It’s awesome. Combine it with the previous discussion
about tests making AI so much more effective, and that it <em>also</em> knows
CTest well, and you’ve got a killer formula. I’m more effective with CTest
myself merely from observing how AI uses it. AI (currently) cannot use
debuggers, so putting powerful, familiar testing tools in its hands helps
a lot, versus the usual bespoke, debugger-friendly solutions I prefer.</p>

<p>Similar to solving CMake problems: Have a hairy merge conflict? Just ask
AI to resolve it. It’s like magic. I no longer fear merge conflicts.</p>

<p>So part of my motivation for adding CMake to w64devkit was anticipation of
projects like Quilt.cpp, where they’d be available to AI, or at least so I
could use the tools the AI used to build/test myself. It’s already paid
for itself, and there’s more to come.</p>

<p>For agent software, on personal projects I’m using Claude Code. It’s a
great value, cheaper than paying API rates but requires working around
5-hour limit windows. I started with Pro (US$20/mo), but I’m getting so
much out of it that as of this writing I’m on 5x Max (US$100/mo) simply to
have enough to explore all my ideas. Be warned: <strong>Anthropic software is
quite buggy, more so than industry average</strong>, and it’s obvious that they
never even <em>start</em>, let alone test, some of their released software on
disfavored platforms (Windows, Android). Don’t expect to use Claude Code
effectively for native Windows platform development, which sadly includes
w64devkit. Hopefully that’s fixed someday. I suspect Anthropic hit a
bottleneck on QA, and unable to fit AI in that role they don’t bother. You
can theoretically report bugs on GitHub, but they’re just ignored and
closed. (Why don’t they have AI agents jumping on this wealth of bug
reports?)</p>

<p>At work I’m using Cursor where I get a choice of models. My favorite for
March has been GPT-5.4, which in my experience beats Opus 4.6 on Claude
Code by a small margin. It’s immediately obvious that Cursor is better
agent software than Claude Code. It’s more robust, more featureful, and
with a clearer UI than Claude Code. It has no trouble on Windows and can
drive w64devkit flawlessly. It’s also more expensive than Claude Code. My
employer currently spends ~US$250/mo on my AI tokens, dirt cheap
considering what they’re getting out of it. I have bottlenecks elsewhere
that keep me from spending even more.</p>

<p>Neither Cursor nor Claude Code is open source, so what are the purists to
do, even if they’re willing to pay API rates for tokens? Sadly I have no
answers for you. I haven’t gotten any open source agent software actually
working, and it seems they may lack the necessary secret sauce.</p>

<p>Update: Several folks suggested I give <a href="https://opencode.ai/">OpenCode</a> another shot, and this
time I got over the configuration hurdle. Single executable, slick
interface, and unlike Claude Code, I observed no bugs in my brief trial.
Give that a shot if you’re looking for an open source client.</p>

<p>The future is going to be weird. My experience is only a peek at what’s to
come, and my head is still spinning. However, the more I adapt to the
changes, the better I feel. If you’re feeling anxious like I was, don’t
flinch from improving your own AI knowledge and experience.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Frankenwine: Multiple personas in a Wine process</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/01/19/"/>
    <id>urn:uuid:d2b53f8d-88a6-400b-a748-693a758741c5</id>
    <updated>2026-01-19T21:51:38Z</updated>
    <category term="c"/><category term="win32"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>I came across a recent article on <a href="https://gpfault.net/posts/drunk-exe.html">making Linux system calls from a Wine
process</a>. Windows programs running under Wine are still normal Linux
processes and may interact with the Linux kernel like any other process.
None of this was surprising, and the demonstration works just as I expect.
Still, it got the wheels spinning and I realized an <em>almost</em> practical
application: build <a href="/blog/2023/01/18/">my pkg-config implementation</a> such that on Windows
<code class="language-plaintext highlighter-rouge">pkg-config.exe</code> behaves as a native pkg-config, but when run under Wine
this same binary takes the persona of a Linux program and becomes a cross
toolchain pkg-config, bypassing Win32 and talking directly with the Linux
kernel. <a href="https://justine.lol/cosmopolitan/">Cosmopolitan Libc</a> cleverly does this out-of-the-box, but
in this article we’ll mash together a couple existing sources with a bit
of glue.</p>

<p>The results are in <a href="https://github.com/skeeto/u-config/commit/e0008d7e">the merge-demo branch</a> of u-config, and took
hardly any work:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)
</code></pre></div></div>

<p>A platform layer, <code class="language-plaintext highlighter-rouge">main_wine.c</code>, is a merge of two existing platform
layers, one of which required unavoidable tweaks. We’ll get to those
details in a moment. First we’ll need to detect if we’re running under
Wine, and <a href="https://web.archive.org/web/20250923061634/https://stackoverflow.com/questions/7372388/determine-whether-a-program-is-running-under-wine-at-runtime/42333249#42333249">the best solution I found</a> was to locate
<code class="language-plaintext highlighter-rouge">ntdll!wine_get_version</code>. If this function exists, we’re in Wine. That
works out to a pretty one-liner because <code class="language-plaintext highlighter-rouge">ntdll.dll</code> is already loaded:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bool</span> <span class="nf">running_on_wine</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">GetProcAddress</span><span class="p">(</span><span class="n">GetModuleHandleA</span><span class="p">(</span><span class="s">"ntdll"</span><span class="p">),</span> <span class="s">"wine_get_version"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>An x86-64 Linux syscall wrapper with <a href="/blog/2024/12/20/">thorough inline assembly</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ptrdiff_t</span> <span class="nf">syscall3</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">b</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">r</span><span class="p">;</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">ptrdiff_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_write</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="p">(</span><span class="kt">ptrdiff_t</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’d normally use <code class="language-plaintext highlighter-rouge">long</code> for all these integers because Linux is <a href="https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models">LP64</a>
(<code class="language-plaintext highlighter-rouge">long</code> is pointer-sized), but Windows is LLP64 (only <code class="language-plaintext highlighter-rouge">long long</code> is 64
bits). It’s so bizarre to interface with Linux from LLP64, and this will
have consequences later. With these pieces we can see the basic shape of a
split personality program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">running_on_wine</span><span class="p">())</span> <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"hello, wine</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">12</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
        <span class="n">WriteFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"hello, windows</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>We can cram two programs into this binary and select which program at run
time depending on what we see. In typical programs locating and calling
into glibc would be a challenge, particularly with the incompatible ABIs
involved. We’re avoiding it here by interfacing directly with the kernel.</p>

<h3 id="application-to-u-config">Application to u-config</h3>

<p>Luckily u-config has completely-optional platform layers implemented with
Linux system calls. The POSIX platform layer works fine, and that’s what
distributions should generally use, but these bonus platforms are unhosted
and do not require libc. That means we can shove it into a Windows build
with relatively little trouble.</p>

<p>Before we do that, let’s think about what we’re doing. <a href="/blog/2021/08/21/">Debian has great
cross toolchain support</a>, including Mingw-w64. There are even a few
Windows libraries in the Debian package repository, <a href="https://packages.debian.org/trixie/x32/libz-mingw-w64">such as zlib</a>, and
we can build Windows programs against them. If you’re cross-building and
using pkg-config, you ought to use the cross toolchain pkg-config, which
in GNU ecosystems gets an architecture prefix like the other cross tools.
Debian cross toolchains each include a cross pkg-config, and it sometimes
<em>almost</em> works correctly! Here’s what I get on Debian 13:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz
</code></pre></div></div>

<p>Note the architecture in the <code class="language-plaintext highlighter-rouge">-I</code> and <code class="language-plaintext highlighter-rouge">-L</code> options. It really is querying
the <a href="https://peter0x44.github.io/posts/cross-compilers/">cross sysroot</a>. However, because these paths are in the cross
sysroot, pkg-config should not list them. That’s suboptimal and indicates
this pkg-config is probably misconfigured. In other cases it’s far from
correct:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...
</code></pre></div></div>

<p>A tool prefixed <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-</code> should not produce paths containing
<code class="language-plaintext highlighter-rouge">x86_64-linux-gnu</code> (the host architecture in this case). Our version won’t
have these issues.</p>

<p>The u-config platform interface is five functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// read whole files</span>
<span class="n">s8node</span> <span class="o">*</span><span class="nf">os_listing</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">);</span>  <span class="c1">// list directories</span>
<span class="kt">void</span>    <span class="nf">os_write</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>          <span class="c1">// standard out/err</span>
<span class="kt">void</span>    <span class="nf">os_fail</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="p">);</span>                       <span class="c1">// non-zero exit</span>

<span class="kt">void</span> <span class="nf">uconfig</span><span class="p">(</span><span class="n">config</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Platforms implement the first four functions, and call <code class="language-plaintext highlighter-rouge">uconfig()</code> with
the platform’s configuration, context pointer (<code class="language-plaintext highlighter-rouge">os *</code>), command line
arguments, environment, and some memory (all in the <code class="language-plaintext highlighter-rouge">config</code> object). My
strategy is to link two platforms into the binary, and the first challenge
is they both define <code class="language-plaintext highlighter-rouge">os_write</code>, etc. I did not plan nor intend for one
binary to contain more than one platform layer. Unity builds offer a fix
without changing a single line of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include</span> <span class="cpf">"main_windows.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span>
<span class="cp">#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include</span> <span class="cpf">"main_linux_amd64.c"</span><span class="cp">
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail
</span></code></pre></div></div>

<p>This dirty but effective trick <a href="/blog/2025/02/05/">may look familiar</a>. It also doesn’t
interfere with the other builds. Next I define the real platform functions
as a dispatch based on our run-time situation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b32</span> <span class="n">wine_detected</span><span class="p">;</span>

<span class="n">filemap</span> <span class="nf">os_mapfile</span><span class="p">(</span><span class="n">os</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">s8</span> <span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">linux_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">win32_mapfile</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I were serious about keeping this experiment, I’d lift <code class="language-plaintext highlighter-rouge">os</code> as I did
the functions (as <code class="language-plaintext highlighter-rouge">win32_os</code>, <code class="language-plaintext highlighter-rouge">linux_os</code>) and include <code class="language-plaintext highlighter-rouge">wine_detected</code> in
the context, eliminating this global variable. That cannot be done with
simple hacks and macros.</p>

<p>The next challenge is that I wrote the Linux platform layer assuming LP64,
and so it uses <code class="language-plaintext highlighter-rouge">long</code> instead of an equivalent platform-agnostic type like
<code class="language-plaintext highlighter-rouge">ptrdiff_t</code>. I never thought this would be an issue because this source
literally contains <code class="language-plaintext highlighter-rouge">asm</code> blocks and no conditional compilation, yet here
we are. Lesson learned. I wanted to try an extremely janky <code class="language-plaintext highlighter-rouge">#define</code> on
<code class="language-plaintext highlighter-rouge">long</code> to fix it, but this source file has a couple <code class="language-plaintext highlighter-rouge">long long</code> that won’t
play along. These multi-token type names of C are antithetical to its
preprocessor! So I adjusted the source manually instead.</p>

<p>The Windows and Linux platform entry points are completely different, both
in name and form, and so co-exist naturally. The merged platform layer is
a new entry point that will pass control to the appropriate entry point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">entrypoint</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>  <span class="c1">// Linux</span>
<span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">mainCRTStartup</span><span class="p">();</span>    <span class="c1">// Windows</span>
</code></pre></div></div>

<p>On Linux <code class="language-plaintext highlighter-rouge">stack</code> is <a href="/blog/2025/03/06/">the initial value of the stack pointer</a>, which
<a href="https://articles.manugarg.com/aboutelfauxiliaryvectors">points to <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, <code class="language-plaintext highlighter-rouge">envp</code>, and <code class="language-plaintext highlighter-rouge">auxv</code></a>. We’ll need to construct
an artificial “stack” for the Linux platform layer to harvest. On Windows
this is <a href="/blog/2023/02/15/">the process entry point</a>, and it will find the rest on its
own as a normal Windows process. Ultimately this ended up simpler than I
expected:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="kr">__stdcall</span> <span class="nf">merge_entrypoint</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">wine_detected</span> <span class="o">=</span> <span class="n">running_on_wine</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">wine_detected</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">u8</span> <span class="o">*</span><span class="n">fakestack</span><span class="p">[</span><span class="n">CMDLINE_ARGV_MAX</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">c16</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="n">GetCommandLineW</span><span class="p">();</span>
        <span class="n">fakestack</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="p">)(</span><span class="n">iz</span><span class="p">)</span><span class="n">cmdline_to_argv8</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">fakestack</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="c1">// TODO: append envp to the fake stack</span>
        <span class="n">entrypoint</span><span class="p">((</span><span class="n">iz</span> <span class="o">*</span><span class="p">)</span><span class="n">fakestack</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">mainCRTStartup</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Where <a href="/blog/2022/02/18/"><code class="language-plaintext highlighter-rouge">cmdline_to_argv8</code> is my Windows argument parser</a>, already
used by u-config, and I reserve one element at the front to store <code class="language-plaintext highlighter-rouge">argc</code>.
Since this is just a proof-of-concept I didn’t bother fabricating and
pushing <code class="language-plaintext highlighter-rouge">envp</code> onto the fake stack. The Linux entry point doesn’t need
<code class="language-plaintext highlighter-rouge">auxv</code>, so that can be omitted. Once in the Linux entry point it’s essentially
a Linux process from then on, except that the x64 calling convention is
still in use internally.</p>

<p>Finally, I configure the Linux platform layer for Debian’s cross sysroot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "/usr/x86_64-w64-mingw32/lib"
</span></code></pre></div></div>

<p>And that’s it! We have our platform merge. Build (<a href="https://github.com/skeeto/w64devkit">w64devkit</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c
</code></pre></div></div>

<p>On Debian use <code class="language-plaintext highlighter-rouge">x86_64-w64-mingw32-gcc</code> for <code class="language-plaintext highlighter-rouge">cc</code>. The <code class="language-plaintext highlighter-rouge">-e</code> linker option
selects the new, higher level entry point. After installing <a href="https://packages.debian.org/trixie/wine-binfmt">Wine
binfmt</a>, here’s how it looks on Debian:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs zlib
-lz
</code></pre></div></div>

<p>That’s the correct output, but is it using the cross sysroot? Ask it to
include the <code class="language-plaintext highlighter-rouge">-I</code> argument despite it being in the cross sysroot:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz
</code></pre></div></div>

<p>Looking good! It passes the <code class="language-plaintext highlighter-rouge">pc_path</code> test, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig
</code></pre></div></div>

<p>Running <em>this same binary</em> on Windows after installing zlib in w64devkit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz
</code></pre></div></div>

<p>Also:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig
</code></pre></div></div>

<p>My Frankenwine is a success!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>WebAssembly as a Python extension platform</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/01/01/"/>
    <id>urn:uuid:91e7555d-950f-47c6-84b8-bee0070f61a9</id>
    <updated>2026-01-01T21:21:19Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>Software above some complexity level tends to sport an extension language,
becoming a kind of software platform itself. Lua fills this role well, and
of course there’s JavaScript for web technologies. <a href="/blog/2025/04/04/">WebAssembly</a>
generalizes this, and any Wasm-targeting programming language can extend a
Wasm-hosting application. It has more friction than supplying a script in
a text file, but extension authors can write in their language of choice,
and use more polished development tools (debugging, <a href="/blog/2025/02/05/">testing</a>, etc.)
than are typically available for an extension language. Python is
traditionally extended through native code behind a C interface, but it’s
recently become practical to extend Python with Wasm. That is, we can ship
an architecture-independent Wasm blob inside a Python library, and use it
without requiring a native toolchain on the host system. Let’s discuss two
different use cases and their pitfalls.</p>

<p>Normally we’d extend Python in order to access an external interface that
Python cannot access on its own. Wasm runs in a sandbox with no access to
the outside world whatsoever, so it obviously isn’t useful for that case.
Extensions may also grant Python more speed, which is one of Wasm’s main
selling points. We can also use Wasm to access <em>embeddable capabilities</em>
written in a different programming language which do not require external
access.</p>

<p>My preferred non-WASI Wasm runtime is Volodymyr Shymanskyy’s <a href="https://github.com/wasm3/wasm3">wasm3</a>.
It’s plain old C and very friendly to embedding in the same way as, say,
SQLite. Performance is middling, though a C program running on wasm3 is
still quite a bit faster than an equivalent Python program. It has Python
bindings, <a href="https://github.com/wasm3/pywasm3">pywasm3</a>, but it’s distributed only in source code form. That
is, the host machine must have a C toolchain in order to use pywasm3,
which defeats my purposes here. If there’s a C toolchain, I might as well
just use that instead of going through Wasm.</p>

<p>For the use cases in this article, the best option is <a href="https://github.com/bytecodealliance/wasmtime-py">wasmtime-py</a>. The
distribution includes binaries for Windows, macOS, and Linux on x86-64 and
ARM64, which covers nearly all Python installations. Hosts require nothing
more than a Python interpreter, no native toolchains. It’s almost as good
as having Wasm built into Python itself. In my tests it’s 3x–10x faster
than wasm3, so for my first use case the situation is even better. The
catch is that it currently weighs ~18MiB (installed), and in the future
will likely rival the Python interpreter itself. The API also breaks on a
monthly basis, so you’re signing up for the upgrade treadmill lest your
own program perishes to bitrot after a couple of years. This article is
about version 40.</p>

<h3 id="usage-examples-and-gotchas">Usage examples and gotchas</h3>

<p>The <a href="https://github.com/bytecodealliance/wasmtime-py/tree/main/examples">official examples</a> don’t do anything non-trivial or interesting,
and so to figure things out I had to study <a href="https://bytecodealliance.github.io/wasmtime-py/">the documentation</a>,
which does not offer many hints. Basic setup looks like this:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">functools</span>
<span class="kn">import</span> <span class="nn">wasmtime</span>

<span class="n">store</span>    <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Store</span><span class="p">()</span>
<span class="n">module</span>   <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Module</span><span class="p">.</span><span class="n">from_file</span><span class="p">(</span><span class="n">store</span><span class="p">.</span><span class="n">engine</span><span class="p">,</span> <span class="s">"example.wasm"</span><span class="p">)</span>
<span class="n">instance</span> <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Instance</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">module</span><span class="p">,</span> <span class="p">())</span>
<span class="n">exports</span>  <span class="o">=</span> <span class="n">instance</span><span class="p">.</span><span class="n">exports</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>

<span class="n">memory</span> <span class="o">=</span> <span class="n">exports</span><span class="p">[</span><span class="s">"memory"</span><span class="p">].</span><span class="n">get_buffer_ptr</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>
<span class="n">func1</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"func1"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
<span class="n">func2</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"func2"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
<span class="n">func3</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"func3"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
</code></pre></div></div>

<p>A store is an allocation region from which we allocate all Wasm objects.
It is not possible to free individual objects except to discard the whole
store. Quite sensible, honestly. What’s <em>not</em> sensible is how often I have
to repeat myself, passing the store back into every object in order to use
it. These objects are associated with exactly one store and cannot be used
with different stores. <a href="https://docs.wasmtime.dev/api/wasmtime/struct.Store.html#cross-store-usage-of-items">Use the wrong store and it panics</a>: It’s
already keeping track internally! I do not understand why the interface
works this way. So to make things simpler, I use <code class="language-plaintext highlighter-rouge">functools.partial</code> to
bind the <code class="language-plaintext highlighter-rouge">store</code> parameter and so get the interface I expect.</p>

<p>The <code class="language-plaintext highlighter-rouge">get_buffer_ptr</code> method returns a buffer protocol object, and if you’re
moving anything other than bytes that’s probably what you want to use to
access memory. The usual caveats apply for this object: If you <a href="/blog/2025/04/19/">change the
memory size</a> you probably want to grab a fresh buffer object. For
bytes (e.g. buffers and strings) I prefer the <code class="language-plaintext highlighter-rouge">read</code> and <code class="language-plaintext highlighter-rouge">write</code> methods.</p>

<p>Because <a href="https://github.com/WebAssembly/multi-value/blob/master/proposals/multi-value/Overview.md">multi-value</a> is still in an experimental state in the Wasm
ecosystem, you will likely not pass structs with Wasm. Anything more
complicated than scalars will require pointers and copying data in and out
of Wasm linear memory. This involves the usual trap that catches nearly
everyone: Wasm interfaces make no distinction between pointers and
integers, and Wasm runtimes generally interpret all integers as
signed. What that means is <strong>your pointers are signed unless you take
action</strong>. Addresses start at 0, so this is bad, bad news.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">malloc</span> <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"func1"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>

<span class="n">hello</span> <span class="o">=</span> <span class="sa">b</span><span class="s">"hello"</span>
<span class="n">pointer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">hello</span><span class="p">))</span>
<span class="k">assert</span> <span class="n">pointer</span>
<span class="n">memory</span> <span class="o">=</span> <span class="n">exports</span><span class="p">[</span><span class="s">"memory"</span><span class="p">].</span><span class="n">write</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">hello</span><span class="p">,</span> <span class="n">pointer</span><span class="p">)</span>  <span class="c1"># WRONG!
</span></code></pre></div></div>

<p>To make matters worse, wasmtime-py adds its own footgun: The <code class="language-plaintext highlighter-rouge">read</code> and
<code class="language-plaintext highlighter-rouge">write</code> methods adopt the questionable Python convention of negative
indices acting from the end. If <code class="language-plaintext highlighter-rouge">malloc</code> returns a pointer in the upper
half of memory, the negative pointer will pass the bounds check inside
<code class="language-plaintext highlighter-rouge">write</code> because negative is valid, then quietly store to the wrong
address! Doh!</p>
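<p>To see the failure without a 4GiB address space, here’s a scaled-down
model in plain Python. Everything in it is a stand-in of my own invention:
<code class="language-plaintext highlighter-rouge">to_signed</code> mimics a runtime handing back a signed integer, and
<code class="language-plaintext highlighter-rouge">naive_write</code> mimics a write that accepts negative offsets. It is not
wasmtime-py’s actual implementation, just the same convention applied to
8-bit addresses and a 192-byte memory.</p>

```python
def to_signed(value, bits):
    """Reinterpret an unsigned integer as two's complement, as a runtime might."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def naive_write(memory, data, start):
    """Hypothetical write that accepts Python-style negative offsets."""
    if start < 0:
        start += len(memory)  # negative means "from the end"
    if start < 0 or start + len(data) > len(memory):
        raise IndexError("out of bounds")
    memory[start:start + len(data)] = data
    return start


# Scaled-down model: 8-bit addresses, 192 bytes of memory currently mapped.
memory = bytearray(192)
pointer = 0x90                    # 144: valid, and in the upper half
seen = to_signed(pointer, 8)      # what the host receives
landed = naive_write(memory, b"hi", seen)
```

<p>The bounds check passes, yet the bytes land at a different offset than
the guest asked for, and nothing complains.</p>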

<p>I wondered how common this error is, so I searched online. I could find only
one non-trivial wasmtime-py use in the wild, in a sandboxed PDF reader. It
falls into the negative pointer trap as I expected. Not only that, it’s <a href="https://github.com/paulocoutinhox/pdfium-lib/blob/139d5037/modules/wasm.py#L601-L606">a
buffer overflow into Python’s memory space</a>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            <span class="n">buf_ptr</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pdf_data</span><span class="p">))</span>
            <span class="n">mem_data</span> <span class="o">=</span> <span class="n">memory</span><span class="p">.</span><span class="n">data_ptr</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>

            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">byte</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">pdf_data</span><span class="p">):</span>
                <span class="n">mem_data</span><span class="p">[</span><span class="n">buf_ptr</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">byte</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">data_ptr</code> method returns a non-bounds-checked raw <code class="language-plaintext highlighter-rouge">ctypes</code> pointer,
so this is actually a double mistake. First, it shouldn’t trust pointers
coming out of Wasm if it cares at all about sandboxing. The second is the
potential negative pointer, which in this case would write outside of the
Wasm memory and in Python’s memory, hopefully seg-faulting.</p>

<p>What’s one to do? <strong>Every pointer coming out of Wasm must be truncated</strong>
with a mask:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pointer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(...)</span> <span class="o">&amp;</span> <span class="mh">0xffffffff</span>   <span class="c1"># correct for wasm32!
</span></code></pre></div></div>

<p>This interprets the result as unsigned. 64-bit Wasm needs a 64-bit mask,
though in practice you will never get a valid negative pointer from 64-bit
Wasm. This rule applies to JavaScript as well, where the idiom is:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">pointer</span> <span class="o">=</span> <span class="nx">malloc</span><span class="p">(...)</span> <span class="o">&gt;&gt;&gt;</span> <span class="mi">0</span>
</code></pre></div></div>

<p>Wasm runtimes cannot help — they lack the necessary information — and this
is perhaps a fundamental flaw in Wasm’s design. Once you know about it you
see this mistake happening everywhere.</p>
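<p>Since the rule is so easy to forget, it may help to funnel every guest
pointer through one small helper. This is a sketch under my own naming,
with <code class="language-plaintext highlighter-rouge">u32</code> and <code class="language-plaintext highlighter-rouge">checked_range</code> being hypothetical helpers and not
part of any Wasm runtime. The second one also refuses ranges that fall
outside the current memory size, on the theory that a sandbox host
shouldn’t trust what the guest hands back.</p>

```python
def u32(value):
    """Reinterpret a wasm32 i32 result as an unsigned address."""
    return value & 0xffffffff


def checked_range(pointer, length, memory_size):
    """Normalize a guest pointer and reject ranges outside linear memory."""
    start = u32(pointer)
    if length < 0 or start + length > memory_size:
        raise ValueError("guest pointer out of range")
    return start, start + length
```

<p>The caller passes in the current size of linear memory, so a range that
was valid before the memory grew or shrank is re-checked on every use.</p>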

<p>Now that you have a proper address, you can apply it to a buffer protocol
view of memory. If you’re using NumPy there are various ways to interact
with this memory by wrapping it in NumPy types, though only if you’re on a
little endian host. (If you’re on a big endian machine, just give up on
running Wasm anyway.) The first use case I have in mind typically involves
copying plain Python values in and out. The <a href="https://docs.python.org/3/library/struct.html"><code class="language-plaintext highlighter-rouge">struct</code> package</a> is
quite handy here:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vec2</span>   <span class="o">=</span> <span class="n">malloc</span><span class="p">(...)</span> <span class="o">&amp;</span> <span class="mh">0xffffffff</span>
<span class="n">memory</span> <span class="o">=</span> <span class="n">exports</span><span class="p">[</span><span class="s">"memory"</span><span class="p">].</span><span class="n">get_buffer_ptr</span><span class="p">(</span><span class="n">store</span><span class="p">)</span>
<span class="n">struct</span><span class="p">.</span><span class="n">pack_into</span><span class="p">(</span><span class="s">"&lt;ii"</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">vec2</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>

<p>It fills a similar role to <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/DataView">JavaScript <code class="language-plaintext highlighter-rouge">DataView</code></a>. If you’re copying
lots of numbers, with CPython it’s faster to construct a custom format
string rather than use a loop:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nums</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="p">...</span>
<span class="n">struct</span><span class="p">.</span><span class="n">pack_into</span><span class="p">(</span><span class="sa">f</span><span class="s">"&lt;</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">nums</span><span class="p">)</span><span class="si">}</span><span class="s">i"</span><span class="p">,</span> <span class="n">memory</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="o">*</span><span class="n">nums</span><span class="p">)</span>
</code></pre></div></div>

<p>To copy structures back out, use <code class="language-plaintext highlighter-rouge">struct.unpack_from</code>. If you’re moving
strings, you’ll need to <code class="language-plaintext highlighter-rouge">.encode()</code> and <code class="language-plaintext highlighter-rouge">.decode()</code> to convert to and from
<code class="language-plaintext highlighter-rouge">bytes</code>, which are well-suited to <code class="language-plaintext highlighter-rouge">read</code> and <code class="language-plaintext highlighter-rouge">write</code>.</p>
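<p>Here’s a round trip that runs without any Wasm runtime at all, using a
<code class="language-plaintext highlighter-rouge">bytearray</code> as a stand-in for the buffer view of linear memory. The
addresses are made up for illustration; in a real program they’d come
from the guest allocator, already masked.</p>

```python
import struct

memory = bytearray(64)  # stand-in for the buffer view of linear memory
vec2 = 16               # pretend address from the guest, already masked

# Copy a little-endian pair of 32-bit integers in, then read it back out.
struct.pack_into("<ii", memory, vec2, 3, -4)
x, y = struct.unpack_from("<ii", memory, vec2)

# Strings travel as bytes: encode on the way in, decode on the way out.
text = "h\u00e9llo".encode()  # UTF-8 by default, 6 bytes for 5 characters
buf = 32
memory[buf:buf + len(text)] = text
back = bytes(memory[buf:buf + len(text)]).decode()
```

<p>Note that the length to allocate is the encoded length, not the number
of characters in the string.</p>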

<p>In practice with real Wasm programs you’re going to be interacting with
the “guest” allocator from the outside, to request memory into which you
copy inputs for a function. In my examples I’ve used <code class="language-plaintext highlighter-rouge">malloc</code> because it
requires no elaboration, but as usual <a href="/blog/2023/09/27/">a bump allocator</a> solves
this so much better, especially because it doesn’t require stuffing a
whole general purpose allocator inside the Wasm program. Have one global
arena (no other threads will be sharing that Wasm instance), rapid fire a
bunch of allocations as needed without any concern for memory management
in the “host”, call the function, which might allocate a result from that
arena, then reset the arena to clean up. In essence a stack for passing
values in and out.</p>
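<p>To make that calling pattern concrete, here’s a pure-Python model of it.
The real arena would live inside the Wasm program; <code class="language-plaintext highlighter-rouge">ArenaModel</code>,
<code class="language-plaintext highlighter-rouge">call_with_ints</code>, and <code class="language-plaintext highlighter-rouge">guest_sum</code> are hypothetical names standing in
for the guest allocator, the host-side wrapper, and an exported function.
It’s only meant to show the allocate, call, reset rhythm.</p>

```python
import struct


class ArenaModel:
    """Pure-Python model of a guest bump allocator over linear memory."""

    def __init__(self, size, heap_base):
        self.memory = bytearray(size)
        self.heap_base = heap_base
        self.heap_used = heap_base

    def alloc(self, size, align=8):
        start = -(-self.heap_used // align) * align  # round up to alignment
        if size < 0 or start + size > len(self.memory):
            return 0  # null pointer on exhaustion
        self.heap_used = start + size
        return start

    def reset(self):
        self.heap_used = self.heap_base


def guest_sum(memory, ptr, count):
    """Stand-in for an exported Wasm function that reads an i32 array."""
    return sum(struct.unpack_from(f"<{count}i", memory, ptr))


def call_with_ints(arena, nums, func):
    """Copy inputs in, call, then reset: the arena acts like a stack."""
    try:
        buf = arena.alloc(4 * len(nums), align=4)
        if not buf:
            raise MemoryError("guest arena exhausted")
        struct.pack_into(f"<{len(nums)}i", arena.memory, buf, *nums)
        return func(arena.memory, buf, len(nums))
    finally:
        arena.reset()
```

<p>The <code class="language-plaintext highlighter-rouge">finally</code> clause is the whole memory management story on the host
side: whatever happened during the call, the arena is empty afterward.</p>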

<h3 id="webassembly-as-faster-python">WebAssembly as faster Python</h3>

<p>Suppose we noticed a computational hot spot in our Python program in a
pure Python function (i.e. not calling out to an extension). Optimizing
this function would be wise. Based on my experiments if I re-implement
that function in C, compile it to Wasm, then run that bit of Wasm in place
of the original function, I can expect around a 10x speed-up. In general C
is more like 100x faster than Python, and the overhead of interfacing with
Wasm — copying stuff in and out, etc. — can be high, but not so high as to
not be profitable. This improves further if I can change the interface,
e.g. require callers to use the buffer protocol.</p>

<p>Thanks to wasmtime-py, I could introduce this change without fussing with
cross-compilers to build distribution binaries, nor require a toolchain on
the target, just a hefty Python package. Might be worth it.</p>

<p>My <a href="https://github.com/skeeto/scratch/tree/master/wasm-bench">main experimental benchmark</a> is a variation on <a href="/blog/2023/06/26/">my solution to
the “Two Sum” problem</a>, which I originally wrote for JavaScript, then
extended to pywasm3 and later wasmtime-py. It’s simple, just interesting
enough, and representative of the sort of Wasm drop-in I have in mind. It
has the same interface, but implements it with Wasm.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Original Pythonic interface
</span><span class="k">def</span> <span class="nf">twosum</span><span class="p">(</span><span class="n">nums</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">target</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="nb">int</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="p">...</span>

<span class="c1"># Stateful Wasm interface
</span><span class="k">class</span> <span class="nc">TwoSumWasm</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">store</span>    <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Store</span><span class="p">()</span>
        <span class="n">module</span>   <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Module</span><span class="p">.</span><span class="n">from_file</span><span class="p">(</span><span class="n">store</span><span class="p">.</span><span class="n">engine</span><span class="p">,</span> <span class="p">...)</span>
        <span class="n">instance</span> <span class="o">=</span> <span class="n">wasmtime</span><span class="p">.</span><span class="n">Instance</span><span class="p">(</span><span class="n">store</span><span class="p">,</span> <span class="n">module</span><span class="p">,</span> <span class="p">())</span>
        <span class="p">...</span>

    <span class="k">def</span> <span class="nf">twosum</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">nums</span><span class="p">,</span> <span class="n">target</span><span class="p">):</span>
        <span class="c1"># ... use wasm instance ...
</span></code></pre></div></div>

<p>There’s some state to it with the Wasm instance in tow. If you hide that
by making it global you’ll need to synchronize your threads around it. In
a multi-threaded program perhaps these would be lazily-constructed thread
locals. I haven’t had to solve this yet.</p>
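<p>For what it’s worth, the thread-local idea might look something like the
following. This is a guess at the shape, not something I’ve run in a real
program. <code class="language-plaintext highlighter-rouge">make_instance</code> is a hypothetical stand-in for building the
store, module, and instance, and here it just returns a dictionary so the
sketch runs without any Wasm runtime.</p>

```python
import threading

_local = threading.local()


def make_instance():
    """Stand-in for constructing a store, module, and instance."""
    return {"owner": threading.get_ident()}


def get_instance():
    """Return this thread's instance, constructing it on first use."""
    inst = getattr(_local, "instance", None)
    if inst is None:
        inst = _local.instance = make_instance()
    return inst
```

<p>Each thread pays the construction cost once, on its first call, and no
locking is needed because no two threads ever share an instance.</p>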

<p>However, the weakness of the wasmtime “store” really shows: Notice how
compilation and instantiation are bound together in one store? <del>I cannot
compile once and then create disposable instances on the fly</del>, e.g. as
required for each run of a WASI program. Every instance permanently
extends the compilation store. In practice we must wastefully re-compile
the Wasm program for each disposable instance. Despite appearances,
compilation and instantiation are not actually distinct steps, as they are
in JavaScript’s Wasm API. <code class="language-plaintext highlighter-rouge">wasmtime.Instance</code> accepts a store as its first
argument, <em>suggesting</em> use of a different store for instantiation. That
would solve this problem, but as of this writing it <em>must</em> be the same
store used to compile the module. <del>This is a fatal flaw for certain real
use cases, particularly WASI.</del></p>

<p><strong>Update</strong>: Wolfgang Meier points out the <code class="language-plaintext highlighter-rouge">serialize</code> and <code class="language-plaintext highlighter-rouge">deserialize</code>
methods, which detach a compiled module from its store, allowing for
independent instantiations. I tried it, and it’s a practical workaround.
Overhead is low; no validation when deserializing. My benchmark now does
it for future reference, as I expect it to be my typical use case.</p>

<h3 id="webassembly-as-embedded-capabilities">WebAssembly as embedded capabilities</h3>

<p>Loup Vaillant’s <a href="https://monocypher.org/">Monocypher</a> is a wonderful cryptography library.
Lean, efficient, and embedding-friendly, so much so it’s distributed in
amalgamated form. It requires no libc or runtime, so we can compile it
straight to Wasm with almost any Clang toolchain:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang --target=wasm32 -nostdlib -O2 -Wl,--no-entry -Wl,--export-all
        -o monocypher.wasm monocypher.c
</code></pre></div></div>

<p>It’s not “Wasm-aware” so I need <code class="language-plaintext highlighter-rouge">--export-all</code> to expose the interface.
This is swell because, as a single translation unit, anything with external
linkage is the interface. Though remember what I said about interacting
with the guest allocator? This has no allocator, nor should it. It’s not
so usable in this form because we’d need to manage memory from the
outside. Do-able, but it’s easy to improve by adding a couple more
functions, sticking to a single translation unit:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"monocypher.c"</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">char</span>  <span class="n">__heap_base</span><span class="p">[];</span>
<span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">heap_used</span><span class="p">;</span>
<span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">heap_high</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">bump_alloc</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">bump_reset</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">heap_used</span> <span class="o">-</span> <span class="n">__heap_base</span><span class="p">;</span>
    <span class="n">__builtin_memset</span><span class="p">(</span><span class="n">__heap_base</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>  <span class="c1">// wipe keys, etc.</span>
    <span class="n">heap_used</span> <span class="o">=</span> <span class="n">__heap_base</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
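
<p>The body of <code class="language-plaintext highlighter-rouge">bump_alloc</code> is elided above, so here is a minimal sketch of the bump-and-wipe idea, not the implementation from the repository. To run natively it makes assumptions of its own: a fixed static array (<code class="language-plaintext highlighter-rouge">heap</code>, <code class="language-plaintext highlighter-rouge">HEAP_CAP</code>) stands in for the linear memory above <code class="language-plaintext highlighter-rouge">__heap_base</code>, the names <code class="language-plaintext highlighter-rouge">sketch_alloc</code> and <code class="language-plaintext highlighter-rouge">sketch_reset</code> are hypothetical, and the 16-byte alignment is an arbitrary choice. A real Wasm build would more likely grow memory on demand than fail at a fixed capacity.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Assumption: a fixed static array stands in for Wasm linear memory
// above __heap_base so that this sketch can run natively.
enum { HEAP_CAP = 1 << 16 };
static _Alignas(16) char heap[HEAP_CAP];
static ptrdiff_t heap_len;  // bytes handed out so far, padding included

// Hypothetical bump allocator: returns 16-byte aligned storage, or
// null when the request does not fit in the remaining capacity.
void *sketch_alloc(ptrdiff_t size)
{
    assert(size >= 0);
    ptrdiff_t pad = (16 - heap_len % 16) % 16;  // round up to alignment
    if (size > HEAP_CAP - heap_len - pad) {
        return 0;  // out of memory
    }
    char *r = heap + heap_len + pad;
    heap_len += pad + size;
    return r;
}

// Zero everything handed out since the last reset, then start over.
void sketch_reset(void)
{
    memset(heap, 0, (size_t)heap_len);
    heap_len = 0;
}
```

<p>Allocation only ever advances an offset, so the reset knows exactly how many bytes were in use and zeroes just those.</p>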

<p>I’ve <a href="/blog/2025/04/19/">discussed <code class="language-plaintext highlighter-rouge">__heap_base</code> before</a>, which is part of the ABI.
We’ll push keys, inputs, etc. onto this “stack”, run our cryptography
routine, copy out the result, then reset the bump allocator, which wipes
out all sensitive data. Often <code class="language-plaintext highlighter-rouge">memset</code> is insufficient — typically it’s
zero-then-free, and compilers see the <a href="/blog/2025/09/30/">lifetime</a> about to end — but no
lifetime ends here, and stores to this “heap” memory are externally observable
as far as the abstract machine can tell. (Otherwise we couldn’t reliably
copy out our results!)</p>

<p>There’s a lot to this API, but I’m only going to look at <a href="https://monocypher.org/manual/aead">the AEAD
interface</a>. We “lock” up some data in an encrypted box, write any
unencrypted label we’d like on the outside. Then later we can unlock the
box, which will only open for us if neither the contents of the box nor
the label were tampered with. That’s some solid API design:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">crypto_aead_lock</span><span class="p">(</span><span class="kt">uint8_t</span>       <span class="o">*</span><span class="n">cipher_text</span><span class="p">,</span>
                      <span class="kt">uint8_t</span>        <span class="n">mac</span>  <span class="p">[</span><span class="mi">16</span><span class="p">],</span>
                      <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">key</span>  <span class="p">[</span><span class="mi">32</span><span class="p">],</span>
                      <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">nonce</span><span class="p">[</span><span class="mi">24</span><span class="p">],</span>
                      <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">ad</span><span class="p">,</span>         <span class="kt">size_t</span> <span class="n">ad_size</span><span class="p">,</span>
                      <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">plain_text</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">text_size</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">crypto_aead_unlock</span><span class="p">(</span><span class="kt">uint8_t</span>       <span class="o">*</span><span class="n">plain_text</span><span class="p">,</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">mac</span>  <span class="p">[</span><span class="mi">16</span><span class="p">],</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">key</span>  <span class="p">[</span><span class="mi">32</span><span class="p">],</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span>  <span class="n">nonce</span><span class="p">[</span><span class="mi">24</span><span class="p">],</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">ad</span><span class="p">,</span>          <span class="kt">size_t</span> <span class="n">ad_size</span><span class="p">,</span>
                       <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">cipher_text</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">text_size</span><span class="p">);</span>
</code></pre></div></div>

<p>By compiling to Wasm we can access this functionality from Python almost
as if it were pure Python, and interact with other systems using Monocypher.</p>

<p>Since Monocypher does not interact with the outside world on its own, it
relies on callers to use their system’s CSPRNG to create those nonces and
keys, which we’ll do using <a href="https://docs.python.org/3/library/secrets.html">the <code class="language-plaintext highlighter-rouge">secrets</code> standard library module</a>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Monocypher</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="p">...</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_read</span>   <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">memory</span><span class="p">.</span><span class="n">read</span><span class="p">,</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_write</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">memory</span><span class="p">.</span><span class="n">write</span><span class="p">,</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">__alloc</span> <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"bump_alloc"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_reset</span>  <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"bump_reset"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span>   <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"crypto_aead_lock"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_unlock</span> <span class="o">=</span> <span class="n">functools</span><span class="p">.</span><span class="n">partial</span><span class="p">(</span><span class="n">exports</span><span class="p">[</span><span class="s">"crypto_aead_unlock"</span><span class="p">],</span> <span class="n">store</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_csprng</span> <span class="o">=</span> <span class="n">secrets</span><span class="p">.</span><span class="n">SystemRandom</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_alloc</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">__alloc</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffffff</span>

    <span class="k">def</span> <span class="nf">generate_key</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_csprng</span><span class="p">.</span><span class="n">randbytes</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">generate_nonce</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_csprng</span><span class="p">.</span><span class="n">randbytes</span><span class="p">(</span><span class="mi">24</span><span class="p">)</span>

    <span class="p">...</span>
</code></pre></div></div>

<p>With a solid foundation, all that follows comes easily. A <code class="language-plaintext highlighter-rouge">finally</code>
guarantees secrets are always removed from Wasm memory, and the rest is
just about copying bytes around:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">aead_lock</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">ad</span> <span class="o">=</span> <span class="sa">b</span><span class="s">""</span><span class="p">):</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">==</span> <span class="mi">32</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">macptr</span>   <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
            <span class="n">keyptr</span>   <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">nonceptr</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">24</span><span class="p">)</span>
            <span class="n">adptr</span>    <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ad</span><span class="p">))</span>
            <span class="n">textptr</span>  <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>

            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">keyptr</span><span class="p">)</span>
            <span class="n">nonce</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate_nonce</span><span class="p">()</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">nonce</span><span class="p">,</span> <span class="n">nonceptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">ad</span><span class="p">,</span>    <span class="n">adptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">text</span><span class="p">,</span>  <span class="n">textptr</span><span class="p">)</span>

            <span class="bp">self</span><span class="p">.</span><span class="n">_lock</span><span class="p">(</span>
                <span class="n">textptr</span><span class="p">,</span>
                <span class="n">macptr</span><span class="p">,</span>
                <span class="n">keyptr</span><span class="p">,</span>
                <span class="n">nonceptr</span><span class="p">,</span>
                <span class="n">adptr</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ad</span><span class="p">),</span>
                <span class="n">textptr</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">),</span>
            <span class="p">)</span>
            <span class="k">return</span> <span class="p">(</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_read</span><span class="p">(</span><span class="n">macptr</span><span class="p">,</span> <span class="n">macptr</span><span class="o">+</span><span class="mi">16</span><span class="p">),</span>
                <span class="n">nonce</span><span class="p">,</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_read</span><span class="p">(</span><span class="n">textptr</span><span class="p">,</span> <span class="n">textptr</span><span class="o">+</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)),</span>
            <span class="p">)</span>
        <span class="k">finally</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_reset</span><span class="p">()</span>
</code></pre></div></div>

<p>And <code class="language-plaintext highlighter-rouge">aead_unlock</code> is basically the same in reverse, but raises if the box
fails to unlock, perhaps due to tampering:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">aead_unlock</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">mac</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">nonce</span><span class="p">,</span> <span class="n">ad</span> <span class="o">=</span> <span class="sa">b</span><span class="s">""</span><span class="p">):</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">mac</span><span class="p">)</span> <span class="o">==</span> <span class="mi">16</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">==</span> <span class="mi">32</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">nonce</span><span class="p">)</span> <span class="o">==</span> <span class="mi">24</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">macptr</span>   <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
            <span class="n">keyptr</span>   <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">nonceptr</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="mi">24</span><span class="p">)</span>
            <span class="n">adptr</span>    <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ad</span><span class="p">))</span>
            <span class="n">textptr</span>  <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_alloc</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>

            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">mac</span><span class="p">,</span> <span class="n">macptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">keyptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">nonce</span><span class="p">,</span> <span class="n">nonceptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">ad</span><span class="p">,</span> <span class="n">adptr</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_write</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">textptr</span><span class="p">)</span>

            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">_unlock</span><span class="p">(</span>
                <span class="n">textptr</span><span class="p">,</span>
                <span class="n">macptr</span><span class="p">,</span>
                <span class="n">keyptr</span><span class="p">,</span>
                <span class="n">nonceptr</span><span class="p">,</span>
                <span class="n">adptr</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ad</span><span class="p">),</span>
                <span class="n">textptr</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">),</span>
            <span class="p">):</span>
                <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"AEAD mismatch"</span><span class="p">)</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_read</span><span class="p">(</span><span class="n">textptr</span><span class="p">,</span> <span class="n">textptr</span><span class="o">+</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>
        <span class="k">finally</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_reset</span><span class="p">()</span>
</code></pre></div></div>

<p>Usage:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mc</span> <span class="o">=</span> <span class="n">Monocypher</span><span class="p">()</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">generate_key</span><span class="p">()</span>
<span class="n">message</span> <span class="o">=</span> <span class="s">"Hello, world!"</span>
<span class="n">mac</span><span class="p">,</span> <span class="n">nonce</span><span class="p">,</span> <span class="n">encrypted</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">aead_lock</span><span class="p">(</span><span class="n">message</span><span class="p">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<p>Transmit <code class="language-plaintext highlighter-rouge">mac</code>, <code class="language-plaintext highlighter-rouge">nonce</code>, and <code class="language-plaintext highlighter-rouge">encrypted</code> to the other party (or your
future self), who already has the <code class="language-plaintext highlighter-rouge">key</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">decrypted</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">aead_unlock</span><span class="p">(</span><span class="n">encrypted</span><span class="p">,</span> <span class="n">mac</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">nonce</span><span class="p">)</span>
</code></pre></div></div>

<p>Find the <strong>complete source <a href="https://github.com/skeeto/scratch/tree/master/wasm-monocypher">in my scratch repository</a></strong>.</p>

<p>While I have a few reservations about wasmtime-py, it fascinates me how
well this all works. It’s been my hammer in search of a nail for some time
now.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Freestyle linked lists tricks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/12/31/"/>
    <id>urn:uuid:355dfc03-0e7c-4bae-92fe-5b52174de325</id>
    <updated>2025-12-31T11:59:59Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>Linked lists are a basic data structure building block, with especially
flexible allocation behavior. They’re not just a useful starting point,
but sometimes a sound foundation for future growth. I’m going to start
with the beginner stuff, then <em>without disrupting the original linked
list</em>, enhance it with new capabilities.</p>

<h3 id="linked-list-basics">Linked list basics</h3>

<p>For the sake of an interesting example, I’ll demonstrate with the same
concept as <a href="/blog/2025/01/19/">last time I talked about data structures</a>: a collection
of key/value strings, in the form of environment variables. This time
in linked list form:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>

<span class="kt">uint64_t</span> <span class="nf">hash64</span><span class="p">(</span><span class="n">Str</span><span class="p">);</span>
<span class="n">bool</span>     <span class="nf">equals</span><span class="p">(</span><span class="n">Str</span><span class="p">,</span> <span class="n">Str</span><span class="p">);</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">Env</span> <span class="n">Env</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">Env</span> <span class="p">{</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="n">Str</span>  <span class="n">key</span><span class="p">;</span>
    <span class="n">Str</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>It will be sourced from some string, formatted like the <code class="language-plaintext highlighter-rouge">env</code> program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Str</span> <span class="n">input</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span>
        <span class="s">"EDITOR=vim</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"HOME=/home/user</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"PATH=/bin:/usr/bin</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"SHELL=/bin/bash</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"TERM=xterm-256color</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"USER=user</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"SHELL=/bin/sh</span><span class="se">\n</span><span class="s">"</span>   <span class="c1">// &lt;- repeated entry</span>
    <span class="p">);</span>
</code></pre></div></div>
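
<p>The snippets above declare <code class="language-plaintext highlighter-rouge">hash64</code> and <code class="language-plaintext highlighter-rouge">equals</code> and use an <code class="language-plaintext highlighter-rouge">S</code> macro without showing definitions. For following along, here is one plausible set, offered as an assumption and not necessarily what the real code uses. In particular, FNV-1a is a stand-in hash picked for brevity.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

// Assumed definition: wrap a string literal, excluding its terminator.
#define S(s) (Str){s, (ptrdiff_t)sizeof(s)-1}

// Two strings are equal when lengths match and the bytes compare equal.
// The length check guards memcmp against null pointers in empty strings.
bool equals(Str a, Str b)
{
    return a.len == b.len &&
           (!a.len || !memcmp(a.data, b.data, (size_t)a.len));
}

// Stand-in hash (assumption): 64-bit FNV-1a over the string bytes.
uint64_t hash64(Str s)
{
    assert(s.len >= 0);
    uint64_t h = 0xcbf29ce484222325;  // FNV-1a offset basis
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= (unsigned char)s.data[i];
        h *= 0x100000001b3;  // FNV-1a 64-bit prime
    }
    return h;
}
```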

<p>And all the parser heavy lifting will be done by <a href="/blog/2025/03/02/">our ever-handy <code class="language-plaintext highlighter-rouge">cut</code>
function</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Str</span> <span class="n">tail</span><span class="p">;</span>
    <span class="n">Str</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Cut</span><span class="p">;</span>

<span class="n">Cut</span> <span class="nf">cut</span><span class="p">(</span><span class="n">Str</span><span class="p">,</span> <span class="kt">char</span><span class="p">);</span>
</code></pre></div></div>
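
<p>Only the prototype of <code class="language-plaintext highlighter-rouge">cut</code> appears here. As a rough sketch of what it might look like, assuming it splits at the first delimiter and leaves the tail empty when there is none (my guess at the behavior, built on <code class="language-plaintext highlighter-rouge">memchr</code>):</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct {
    Str tail;
    Str head;
} Cut;

// Split s at the first occurrence of c. The head is everything before
// the delimiter and the tail is everything after it. Without a
// delimiter the whole input becomes the head and the tail stays empty.
Cut cut(Str s, char c)
{
    assert(s.len >= 0);
    Cut r = {0};
    r.head = s;
    char *p = s.len ? memchr(s.data, c, (size_t)s.len) : 0;
    if (p) {
        r.head.len  = p - s.data;
        r.tail.data = p + 1;
        r.tail.len  = s.len - r.head.len - 1;
    }
    return r;
}
```

<p>With <code class="language-plaintext highlighter-rouge">tail</code> as the first member, an initializer like <code class="language-plaintext highlighter-rouge">Cut line = {s}</code> seeds the tail with the whole input, so a loop can keep cutting until the tail is empty.</p>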

<p>The simplest way to build up a linked list is like a stack, pushing
objects onto the front. Start with a zero-initialized <code class="language-plaintext highlighter-rouge">head</code> pointer, point the new
node at it, then make that node the new <code class="language-plaintext highlighter-rouge">head</code> element:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Env</span> <span class="o">*</span><span class="nf">parse_reversed</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// 1</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Cut</span> <span class="n">line</span> <span class="o">=</span> <span class="p">{</span><span class="n">s</span><span class="p">};</span> <span class="n">line</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="n">line</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">line</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">);</span>
        <span class="n">Cut</span>  <span class="n">pair</span>  <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">line</span><span class="p">.</span><span class="n">head</span><span class="p">,</span> <span class="sc">'='</span><span class="p">);</span>
        <span class="n">Env</span> <span class="o">*</span><span class="n">env</span>   <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">Env</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">key</span>   <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">head</span><span class="p">;</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">value</span> <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">tail</span><span class="p">;</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">next</span>  <span class="o">=</span> <span class="n">head</span><span class="p">;</span>  <span class="c1">// 2</span>
        <span class="n">head</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span>  <span class="c1">// 3</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That’s it, a complete linked list implementation in three lines of code.
No big deal. Because of the bump allocator, nodes are packed in order in
memory, so the usual cache objections for linked lists do not apply. LIFO
semantics mean the linked list is in reverse order from the source order.
If we’re doing a linear scan through the linked list, the last entry in
the source wins, which may be what you wanted:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">lookup_linear</span><span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="n">var</span><span class="p">;</span> <span class="n">var</span> <span class="o">=</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
<span class="p">}</span>

    <span class="c1">// ...</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span>  <span class="o">=</span> <span class="n">parse_reversed</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="n">Str</span> <span class="n">value</span> <span class="o">=</span> <span class="n">lookup_linear</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"SHELL"</span><span class="p">));</span>  <span class="c1">// &lt;- "/bin/sh"</span>
</code></pre></div></div>

<p>It’s just one more line of code to maintain the original order, using a
very simple double-pointer technique:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Env</span> <span class="o">*</span><span class="nf">parse_ordered</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Env</span>  <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// 1</span>
    <span class="n">Env</span> <span class="o">**</span><span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">head</span><span class="p">;</span>  <span class="c1">// 2</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Cut</span> <span class="n">line</span> <span class="o">=</span> <span class="p">{</span><span class="n">s</span><span class="p">};</span> <span class="n">line</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
        <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span>  <span class="c1">// 3</span>
        <span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>  <span class="c1">// 4</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>No branches necessary, nor dummy nodes. A pointer to the last pointer in
the list works even for empty lists. The <code class="language-plaintext highlighter-rouge">tail</code> pointer is unneeded once
the list is complete. This form has queue behavior.</p>
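<p>To see the two orders side by side, here is a minimal self-contained sketch. It is not the parser above: <code class="language-plaintext highlighter-rouge">Var</code>, <code class="language-plaintext highlighter-rouge">link_reversed</code>, <code class="language-plaintext highlighter-rouge">link_ordered</code>, and <code class="language-plaintext highlighter-rouge">find</code> are hypothetical names, nodes come from a caller-provided array instead of an arena, and keys are plain C strings compared with <code class="language-plaintext highlighter-rouge">strcmp</code> as a stand-in for <code class="language-plaintext highlighter-rouge">Str</code> and <code class="language-plaintext highlighter-rouge">equals</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for Env, using C strings instead of Str.
typedef struct Var Var;
struct Var {
    Var        *next;
    const char *key;
    const char *value;
};

// Prepend each pool entry: the list ends up in reverse (LIFO) order.
Var *link_reversed(Var *pool, ptrdiff_t n)
{
    assert(n >= 0);
    Var *head = 0;
    for (ptrdiff_t i = 0; i < n; i++) {
        pool[i].next = head;
        head = &pool[i];
    }
    return head;
}

// Append each pool entry through a pointer to the last link: the list
// keeps the original (FIFO) order, with no special case for empty lists.
Var *link_ordered(Var *pool, ptrdiff_t n)
{
    assert(n >= 0);
    Var  *head = 0;
    Var **tail = &head;
    for (ptrdiff_t i = 0; i < n; i++) {
        pool[i].next = 0;
        *tail = &pool[i];       // writes head first, then a next field
        tail  = &pool[i].next;  // advance to the new last link
    }
    return head;
}

// Linear scan from the head; returns the first matching value or null.
const char *find(Var *list, const char *key)
{
    for (Var *v = list; v; v = v->next) {
        if (!strcmp(v->key, key)) {
            return v->value;
        }
    }
    return 0;
}
```

<p>With made-up entries <code class="language-plaintext highlighter-rouge">SHELL=/bin/bash</code>, <code class="language-plaintext highlighter-rouge">HOME=/home/user</code>, <code class="language-plaintext highlighter-rouge">SHELL=/bin/sh</code> in that order, scanning the reversed list for <code class="language-plaintext highlighter-rouge">SHELL</code> returns the last entry, and scanning the ordered list returns the first.</p>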

<h3 id="faster-look-up-with-a-tree">Faster look-up with a tree</h3>

<p>If you’re doing many look-ups, or if the list is long, those linear scans
to find items in the list are not ideal. We can introduce an intrusive
hash map, in the form of <a href="/blog/2023/09/30/">a hash trie</a>, by adding two more pointers
to each linked list node:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">Env</span> <span class="n">Env</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">Env</span> <span class="p">{</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>  <span class="c1">// &lt;- hash map linkage</span>
    <span class="n">Str</span>  <span class="n">key</span><span class="p">;</span>
    <span class="n">Str</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>I’ve found it’s simplest to construct a node into the hash map, then link
it onto the list tail. That constructor looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Env</span> <span class="o">*</span><span class="nf">new_env</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Env</span> <span class="o">**</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">,</span> <span class="n">Str</span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">env</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">env</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">63</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">Env</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">env</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then we swap that into the <code class="language-plaintext highlighter-rouge">head</code>/<code class="language-plaintext highlighter-rouge">tail</code> version in place of the original
<code class="language-plaintext highlighter-rouge">new</code> macro call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Env</span> <span class="o">*</span><span class="nf">parse_mapped</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Env</span>  <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">Env</span> <span class="o">**</span><span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">head</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Cut</span> <span class="n">line</span> <span class="o">=</span> <span class="p">{</span><span class="n">s</span><span class="p">};</span> <span class="n">line</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
        <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">new_env</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">head</span><span class="p">,</span> <span class="n">pair</span><span class="p">.</span><span class="n">head</span><span class="p">,</span> <span class="n">pair</span><span class="p">.</span><span class="n">tail</span><span class="p">);</span>
        <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span>
        <span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is now a linked list and a hash map at the same time, built up piece
by piece without any resizing. We still have the original linked list, but
we can now search it in log time. The look-up function resembles the
constructor:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">lookup_logn</span><span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="n">env</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">env</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">63</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Because of the FIFO semantics, it finds the first match in the source:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span>   <span class="o">=</span> <span class="n">parse_mapped</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="n">Str</span>  <span class="n">value</span> <span class="o">=</span> <span class="n">lookup_logn</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"SHELL"</span><span class="p">));</span>  <span class="c1">// &lt;- "/bin/bash"</span>
</code></pre></div></div>

<p>The other matches are also in the tree, and we can find those as well by
continuing traversal. That is, it’s already a multi-map. This particular
interface can’t pick up where it left off, but we can build one that does
using an iterator/cursor:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span><span class="p">;</span>
    <span class="n">Str</span>      <span class="n">key</span><span class="p">;</span>
    <span class="n">Env</span>     <span class="o">*</span><span class="n">env</span><span class="p">;</span>
<span class="p">}</span> <span class="n">EnvIter</span><span class="p">;</span>

<span class="n">EnvIter</span> <span class="nf">new_enviter</span><span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">EnvIter</span><span class="p">){</span><span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">),</span> <span class="n">key</span><span class="p">,</span> <span class="n">env</span><span class="p">};</span>
<span class="p">}</span>

<span class="n">Str</span> <span class="nf">enviter_next</span><span class="p">(</span><span class="n">EnvIter</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">env</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">Env</span> <span class="o">*</span><span class="n">cur</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">env</span><span class="p">;</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">env</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">63</span><span class="p">];</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">hash</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">cur</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">cur</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Update</strong>: Thanks to <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CSJ2PR12MB79208563F4485DCAA27D5776A2BAA@SJ2PR12MB7920.namprd12.prod.outlook.com%3E">Daniel Kareh for a correction</a>.</p>

<p>Then we can use a loop to visit every match in source order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">parse_mapped</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">EnvIter</span> <span class="n">it</span> <span class="o">=</span> <span class="n">new_enviter</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"SHELL"</span><span class="p">));;)</span> <span class="p">{</span>
        <span class="n">Str</span> <span class="n">value</span> <span class="o">=</span> <span class="n">enviter_next</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">value</span><span class="p">.</span><span class="n">data</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>
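<p>As a rough check that duplicates really come back in source order, here is a self-contained sketch of the same multi-map idea. Everything in it is an assumption made for illustration: <code class="language-plaintext highlighter-rouge">Var</code>, <code class="language-plaintext highlighter-rouge">trie_insert</code>, <code class="language-plaintext highlighter-rouge">trie_next</code>, and <code class="language-plaintext highlighter-rouge">trie_find_all</code> are hypothetical names, keys are C strings, nodes are supplied by the caller, and FNV-1a stands in for <code class="language-plaintext highlighter-rouge">hash64</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for Env, trie linkage only.
typedef struct Var Var;
struct Var {
    Var        *child[2];
    const char *key;
    const char *value;
};

// 64-bit FNV-1a, an assumed substitute for the article's hash64.
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 0xcbf29ce484222325u;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 0x100000001b3u;
    }
    return h;
}

// Descend by successive top hash bits to the first empty link and place
// the node there. It never stops early on a matching key, so duplicates
// land deeper along the same path.
void trie_insert(Var **root, Var *node)
{
    node->child[0] = node->child[1] = 0;
    for (uint64_t h = fnv1a(node->key); *root; h <<= 1) {
        root = &(*root)->child[h>>63];
    }
    *root = node;
}

typedef struct {
    uint64_t    hash;
    const char *key;
    Var        *cur;
} TrieIter;

// Step down the key's path, returning the next matching value or null.
const char *trie_next(TrieIter *it)
{
    while (it->cur) {
        Var *cur = it->cur;
        it->cur = cur->child[it->hash>>63];
        it->hash <<= 1;
        if (!strcmp(it->key, cur->key)) {
            return cur->value;
        }
    }
    return 0;
}

// Store up to cap matches for key in out; returns how many were stored.
ptrdiff_t trie_find_all(Var *root, const char *key,
                        const char **out, ptrdiff_t cap)
{
    assert(cap >= 0);
    TrieIter  it = {fnv1a(key), key, root};
    ptrdiff_t n  = 0;
    while (n < cap) {
        const char *value = trie_next(&it);
        if (!value) break;
        out[n++] = value;
    }
    return n;
}
```

<p>Equal keys hash to the same path, and each insert walks past every node already on that path, so a later duplicate always sits deeper than an earlier one. That is why the iterator yields matches in insertion order regardless of which hash function is plugged in.</p>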

<h3 id="faster-look-up-with-an-index-table">Faster look-up with an index table</h3>

<p>If the list is static once constructed, or if look-ups happen much more
frequently than the list grows, we can find list items even faster by
constructing an index table over the list: <a href="/blog/2022/08/08/">an MSI hash table</a>. This
table avoids redundancy by <em>sharing structure with the list</em>. Because it’s
a flat table, if we keep adding to the list then eventually we’ll need to
reconstruct a larger table when it becomes overloaded.</p>

<p>The table itself has a very simple structure, just an array and its size,
expressed as a power-of-two exponent:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Env</span> <span class="o">**</span><span class="n">slots</span><span class="p">;</span>
    <span class="kt">int</span>   <span class="n">exp</span><span class="p">;</span>
<span class="p">}</span> <span class="n">EnvTable</span><span class="p">;</span>
</code></pre></div></div>

<p>We do not need the <code class="language-plaintext highlighter-rouge">child</code> pointers, and so linked list nodes are untouched.
That is, it’s not intrusive. In fact, we can build any arbitrary number of
tables over a list, perhaps indexing different properties for different
sorts of queries. The idea is that we build the list first, then create
the table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">EnvTable</span> <span class="nf">new_table</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// Compute list length</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="n">var</span><span class="p">;</span> <span class="n">var</span> <span class="o">=</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">len</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Then compute an appropriate table size</span>
    <span class="n">EnvTable</span> <span class="n">table</span> <span class="o">=</span> <span class="p">{};</span>
    <span class="n">table</span><span class="p">.</span><span class="n">exp</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">one</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="p">(</span><span class="n">one</span><span class="o">&lt;&lt;</span><span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">one</span><span class="o">&lt;&lt;</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="o">-</span><span class="mi">3</span><span class="p">))</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="o">++</span><span class="p">)</span> <span class="p">{}</span>
    <span class="n">table</span><span class="p">.</span><span class="n">slots</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">one</span><span class="o">&lt;&lt;</span><span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">Env</span> <span class="o">*</span><span class="p">);</span>

    <span class="c1">// Then insert linked list items into the table</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="n">var</span><span class="p">;</span> <span class="n">var</span> <span class="o">=</span> <span class="n">var</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">var</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">);</span>
        <span class="kt">size_t</span>   <span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">size_t</span>   <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
                <span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">var</span><span class="p">;</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">table</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
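<p>The sizing loop is terse, so here it is pulled out on its own as a small sketch. The helper name <code class="language-plaintext highlighter-rouge">table_exp</code> is hypothetical. The expression <code class="language-plaintext highlighter-rouge">(one&lt;&lt;exp) - (one&lt;&lt;(exp-3))</code> is seven eighths of the slot count, so the loop picks the smallest power-of-two table, starting at 8 slots, that the list fills to at most 7/8.</p>

```c
#include <assert.h>
#include <stddef.h>

// Smallest exponent, starting at 3, such that a table of 1<<exp slots
// holding len entries is at most 7/8 full. Assumes len is small enough
// that the shifts do not overflow ptrdiff_t.
int table_exp(ptrdiff_t len)
{
    assert(len >= 0);
    int       exp = 3;
    ptrdiff_t one = 1;
    for (; (one<<exp) - (one<<(exp-3)) < len; exp++) {}
    return exp;
}

// table_exp(7)    == 3   (8 slots hold up to 7 entries)
// table_exp(8)    == 4   (16 slots hold up to 14 entries)
// table_exp(15)   == 5   (32 slots hold up to 28 entries)
// table_exp(1000) == 11  (1024 slots would cap out at 896 entries)
```

<p>Because the table is never completely full, at least one slot stays empty, which is what lets a probing loop with no other exit condition terminate.</p>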

<p>Note how insertion only searches for an empty slot, not for a matching entry. That’s
because this too is a multi-map, also with elements in insertion order.
Look-ups are constant time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">lookup_constant</span><span class="p">(</span><span class="n">EnvTable</span> <span class="n">table</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span>
    <span class="kt">size_t</span>   <span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span>   <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
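<p>For experimenting with the probe sequence in isolation, here is a compact self-contained sketch of the same idea under simplifying assumptions: the table is a fixed 8 slots (<code class="language-plaintext highlighter-rouge">EXP</code> of 3) instead of being sized from the list, keys are C strings, and FNV-1a stands in for <code class="language-plaintext highlighter-rouge">hash64</code>. The names <code class="language-plaintext highlighter-rouge">Var</code>, <code class="language-plaintext highlighter-rouge">VarIndex</code>, <code class="language-plaintext highlighter-rouge">probe</code>, <code class="language-plaintext highlighter-rouge">build_index</code>, and <code class="language-plaintext highlighter-rouge">index_find</code> are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for Env, list linkage only.
typedef struct Var Var;
struct Var {
    Var        *next;
    const char *key;
    const char *value;
};

enum { EXP = 3 };  // fixed 8-slot table, good for at most 7 entries

typedef struct {
    Var *slots[1<<EXP];
} VarIndex;

// 64-bit FNV-1a, an assumed substitute for the article's hash64.
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 0xcbf29ce484222325u;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 0x100000001b3u;
    }
    return h;
}

// Next slot in the probe sequence for hash. The step is forced odd, so
// the sequence visits every slot of the power-of-two table.
static size_t probe(uint64_t hash, size_t i)
{
    size_t mask = ((size_t)1 << EXP) - 1;
    size_t step = (size_t)(hash >> (64 - EXP)) | 1;
    return (i + step) & mask;
}

// Index every list node in list order, taking the first empty slot.
VarIndex build_index(Var *list)
{
    VarIndex  t   = {0};
    ptrdiff_t len = 0;
    for (Var *v = list; v; v = v->next) {
        len++;
        assert(len < (1<<EXP));  // keep at least one slot empty
        uint64_t hash = fnv1a(v->key);
        for (size_t i = (size_t)hash;;) {
            i = probe(hash, i);
            if (!t.slots[i]) {
                t.slots[i] = v;
                break;
            }
        }
    }
    return t;
}

// First match along the probe sequence, or null at an empty slot.
const char *index_find(const VarIndex *t, const char *key)
{
    uint64_t hash = fnv1a(key);
    for (size_t i = (size_t)hash;;) {
        i = probe(hash, i);
        Var *v = t->slots[i];
        if (!v) {
            return 0;
        } else if (!strcmp(v->key, key)) {
            return v->value;
        }
    }
}
```

<p>Duplicate keys share a probe sequence, and a later insert has to step past the slot an earlier one already took. The first match found is therefore the one nearest the head of whichever list was indexed.</p>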

<p>It finds the earliest match in the list, meaning an index over the
“reverse” list will find the last entry in the source. The indexed-over
property is the input to <code class="language-plaintext highlighter-rouge">hash64</code> and <code class="language-plaintext highlighter-rouge">equals</code>. By using a different input
to these functions we could build another table on, say, value length if
that’s a property on which we needed to find elements efficiently. Again,
for multi-map iteration we need some kind of iterator or cursor:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">EnvTable</span> <span class="n">table</span><span class="p">;</span>
    <span class="n">Str</span>      <span class="n">key</span><span class="p">;</span>
    <span class="kt">size_t</span>   <span class="n">step</span><span class="p">;</span>
    <span class="kt">size_t</span>   <span class="n">i</span><span class="p">;</span>
<span class="p">}</span> <span class="n">TableIter</span><span class="p">;</span>

<span class="n">TableIter</span> <span class="nf">new_tableiter</span><span class="p">(</span><span class="n">EnvTable</span> <span class="n">table</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span>
    <span class="kt">size_t</span>   <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span>   <span class="n">idx</span>  <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">hash</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">TableIter</span><span class="p">){</span><span class="n">table</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">step</span><span class="p">,</span> <span class="n">idx</span><span class="p">};</span>
<span class="p">}</span>

<span class="n">Str</span> <span class="nf">table_next</span><span class="p">(</span><span class="n">TableIter</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">mask</span>  <span class="o">=</span> <span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">.</span><span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">Env</span>  <span class="o">**</span><span class="n">slots</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">.</span><span class="n">slots</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span> <span class="o">+</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">slots</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="p">(</span><span class="n">Str</span><span class="p">){};</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">slots</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span><span class="p">]</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">slots</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">i</span><span class="p">]</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Its usage looks just like the other multi-map:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">parse_ordered</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="n">EnvTable</span> <span class="n">table</span> <span class="o">=</span> <span class="n">new_table</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">env</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">TableIter</span> <span class="n">it</span> <span class="o">=</span> <span class="n">new_tableiter</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"SHELL"</span><span class="p">));;)</span> <span class="p">{</span>
        <span class="n">Str</span> <span class="n">value</span> <span class="o">=</span> <span class="n">table_next</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">value</span><span class="p">.</span><span class="n">data</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>With these techniques at hand, I can start with linked lists when they are
convenient, and later add needed features without fundamentally changing
the underlying data structure. None of this requires runtime support, and
so it fits comfortably on embedded systems, tiny WebAssembly programs,
etc.  All the above code is available ready to run: <a href="https://gist.github.com/skeeto/493823d5956dfdc1d95d8c390c2b0e1d"><code class="language-plaintext highlighter-rouge">list.c</code></a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Unix "find" expressions compiled to bytecode</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/12/23/"/>
    <id>urn:uuid:bbe2671b-378d-40b1-9564-c3a3b798dfb4</id>
    <updated>2025-12-23T04:20:22Z</updated>
    <category term="c"/><category term="compsci"/>
    <content type="html">
      <![CDATA[<p>In preparation for a future project, I was looking at the <a href="https://pubs.opengroup.org/onlinepubs/9799919799/utilities/find.html">unix
<code class="language-plaintext highlighter-rouge">find</code> utility</a>. It operates on file system hierarchies, with basic
operations selected and filtered using a specialized expression language.
Users compose operations using unary and binary operators, grouping with
parentheses for precedence. <code class="language-plaintext highlighter-rouge">find</code> may apply the expression to a great
many files, so compiling it into a bytecode, resolving as much as possible
ahead of time, and minimizing the per-element work, seems like a prudent
implementation strategy. With some thought, I worked out a technique to do
so, which was simpler than I expected, and I’m pleased with the results. I
was later surprised all the real world <code class="language-plaintext highlighter-rouge">find</code> implementations I examined
use <a href="https://craftinginterpreters.com/a-tree-walk-interpreter.html">tree-walk interpreters</a> instead. This article describes how my
compiler works, with a runnable example, and lists ideas for improvements.</p>

<p>For a quick overview, the syntax looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find [-H|-L] path... [expression...]
</code></pre></div></div>

<p>Technically at least one path is required, but most implementations imply
<code class="language-plaintext highlighter-rouge">.</code> when none are provided. If no expression is supplied, the default is
<code class="language-plaintext highlighter-rouge">-print</code>, i.e. print everything under each listed path. This prints the
whole tree, including directories, under the current directory:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find .
</code></pre></div></div>

<p>To only print files, we could use <code class="language-plaintext highlighter-rouge">-type f</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f -a -print
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">-a</code> is the logical AND binary operator. <code class="language-plaintext highlighter-rouge">-print</code> always evaluates
to true. It’s never necessary to write <code class="language-plaintext highlighter-rouge">-a</code>, and adjacent operations are
implicitly joined with <code class="language-plaintext highlighter-rouge">-a</code>. We can keep chaining them, such as finding
all executable files:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f -executable -print
</code></pre></div></div>

<p>If no <code class="language-plaintext highlighter-rouge">-exec</code>, <code class="language-plaintext highlighter-rouge">-ok</code>, or <code class="language-plaintext highlighter-rouge">-print</code> (or similar side-effect extensions like
<code class="language-plaintext highlighter-rouge">-print0</code> or <code class="language-plaintext highlighter-rouge">-delete</code>) are present, the whole expression is wrapped in an
implicit <code class="language-plaintext highlighter-rouge">( expr ) -print</code>. So we could also write this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f -executable
</code></pre></div></div>

<p>Use <code class="language-plaintext highlighter-rouge">-o</code> for logical OR. To print all files with the executable bit <em>or</em>
with a <code class="language-plaintext highlighter-rouge">.exe</code> extension:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f \( -executable -o -name '*.exe' \)
</code></pre></div></div>

<p>I needed parentheses because <code class="language-plaintext highlighter-rouge">-o</code> has lower precedence than <code class="language-plaintext highlighter-rouge">-a</code>, and
because parentheses are shell metacharacters I also needed to escape them
for the shell. It’s a shame <code class="language-plaintext highlighter-rouge">find</code> didn’t use <code class="language-plaintext highlighter-rouge">[</code> and <code class="language-plaintext highlighter-rouge">]</code> instead! There’s
also a unary logical NOT operator, <code class="language-plaintext highlighter-rouge">!</code>. To print all non-executable files:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -type f ! -executable
</code></pre></div></div>

<p>Binary operators are short-circuiting, so this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find -type d -a -exec du -sh {} +
</code></pre></div></div>

<p>Only lists the sizes of directories. For anything else, <code class="language-plaintext highlighter-rouge">-type d</code> fails, causing the
whole expression to evaluate to false without evaluating <code class="language-plaintext highlighter-rouge">-exec</code>. Or
equivalently with <code class="language-plaintext highlighter-rouge">-o</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find ! -type d -o -exec du -sh {} +
</code></pre></div></div>

<p>If it’s not a directory then the left-hand side evaluates to true, and the
right-hand side is not evaluated. All three implementations I examined
(GNU, BSD, BusyBox) have a <code class="language-plaintext highlighter-rouge">-regex</code> extension, and eagerly compile the
regular expression even if the operation is never evaluated:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find . -print -o -regex [
find: bad regex '[': Invalid regular expression
</code></pre></div></div>

<p>I was surprised by this because it doesn’t seem to be in the spirit of the
original utility (“The second expression shall not be evaluated if the
first expression is true.”), and I’m used to the idea of short-circuit
validation for the right-hand side of a logical expression. Recompiling
for each evaluation would be unwise, but it could happen lazily such that
an invalid regular expression only causes an error if it’s actually used.
No big deal, just a curiosity.</p>
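<p>As a rough illustration of the lazy approach, here is a minimal sketch
using POSIX <code class="language-plaintext highlighter-rouge">regcomp</code>. The <code class="language-plaintext highlighter-rouge">LazyRegex</code> type and <code class="language-plaintext highlighter-rouge">lazy_match</code> function are
hypothetical names for this example, not taken from any of the three
implementations, and unlike a real <code class="language-plaintext highlighter-rouge">-regex</code> it does not anchor the match
to the whole path.</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

// Hypothetical lazily-compiled -regex operand (illustration only).
typedef struct {
    const char *pattern;  // raw operand, exactly as given in argv
    regex_t     re;       // valid only when state == LAZY_READY
    enum { LAZY_NEW, LAZY_READY, LAZY_BAD } state;
} LazyRegex;

// Returns 1 on match, 0 on no match, -1 if the pattern is invalid.
// The pattern is compiled on first use and cached thereafter, so an
// operand that is never evaluated is never validated. Note: a real
// -regex matches the whole path; this sketch searches anywhere in it.
int lazy_match(LazyRegex *r, const char *path)
{
    assert(r->pattern);
    if (r->state == LAZY_NEW) {
        int err = regcomp(&r->re, r->pattern, REG_NOSUB);
        r->state = err ? LAZY_BAD : LAZY_READY;
    }
    if (r->state == LAZY_BAD) {
        return -1;
    }
    return regexec(&r->re, path, 0, 0, 0) == 0;
}
```

<p>A zero-initialized <code class="language-plaintext highlighter-rouge">LazyRegex</code> with its <code class="language-plaintext highlighter-rouge">pattern</code> set starts out
uncompiled. The sketch never calls <code class="language-plaintext highlighter-rouge">regfree</code>, which a long-running
program would want to do.</p>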

<h3 id="bytecode-design">Bytecode design</h3>

<p>A bytecode interpreter needs to track just one result at a time, making it
a single register machine, with a 1-bit register at that. I came up with
these five opcodes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>halt
not
braf   LABEL
brat   LABEL
action NAME [ARGS...]
</code></pre></div></div>

<p>Obviously <code class="language-plaintext highlighter-rouge">halt</code> stops the program. While I could just let it “run off the
end” it’s useful to have an actual instruction so that I can attach a
label and jump to it. The <code class="language-plaintext highlighter-rouge">not</code> opcode negates the register. <code class="language-plaintext highlighter-rouge">braf</code> is
“branch if false”, jumping (via relative immediate) to the labeled (in
printed form) instruction if the register is false. <code class="language-plaintext highlighter-rouge">brat</code> is “branch if
true”. Together they implement the <code class="language-plaintext highlighter-rouge">-a</code> and <code class="language-plaintext highlighter-rouge">-o</code> operators. In practice
there are no loops and jumps are always forward: <code class="language-plaintext highlighter-rouge">find</code> is <a href="/blog/2016/04/30/">not Turing
complete</a>.</p>
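<p>To make the machine concrete, here is a minimal interpreter sketch for
these five opcodes. The <code class="language-plaintext highlighter-rouge">Insn</code> encoding, the <code class="language-plaintext highlighter-rouge">Action</code> callback, and the
toy actions are assumptions made for this example, not the representation
used by my compiler.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { OP_HALT, OP_NOT, OP_BRAF, OP_BRAT, OP_ACTION } Op;

typedef struct {
    Op  op;
    int arg;  // branch: offset relative to the next instruction
              // action: an action id passed to the callback
} Insn;

typedef bool Action(int id, const char *path);

// Run a program against one path, returning the final 1-bit register.
bool run(const Insn *prog, Action *act, const char *path)
{
    bool reg = true;
    for (size_t pc = 0;;) {
        Insn in = prog[pc++];
        switch (in.op) {
        case OP_HALT:
            return reg;
        case OP_NOT:
            reg = !reg;
            break;
        case OP_BRAF:
        case OP_BRAT:
            assert(in.arg >= 0);  // jumps are always forward
            if (reg == (in.op == OP_BRAT)) pc += (size_t)in.arg;
            break;
        case OP_ACTION:
            reg = act(in.arg, path);
            break;
        }
    }
}

// Toy stand-ins for real actions (assumptions for the demo only).
enum { ACT_IS_FILE, ACT_PRINT };
int printed;  // counts -print side effects

bool toy_action(int id, const char *path)
{
    size_t n = strlen(path);
    switch (id) {
    case ACT_IS_FILE: return n && path[n-1] != '/';  // pretend "dir/" is a directory
    case ACT_PRINT:   printed++; return true;
    }
    return false;
}
```

<p>A program is just an array of <code class="language-plaintext highlighter-rouge">Insn</code> ending in <code class="language-plaintext highlighter-rouge">OP_HALT</code>, run once
per visited path.</p>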

<p>In a real implementation each possible action (<code class="language-plaintext highlighter-rouge">-name</code>, <code class="language-plaintext highlighter-rouge">-ok</code>, <code class="language-plaintext highlighter-rouge">-print</code>,
<code class="language-plaintext highlighter-rouge">-type</code>, etc.) would get a dedicated opcode. This requires implementing
each operator, at least in part, in order to correctly parse the whole
<code class="language-plaintext highlighter-rouge">find</code> expression. For now I’m just focused on the bytecode compiler, so
this opcode is a stand-in, and it kind of pretends based on looks. Each
action sets the register, and actions like <code class="language-plaintext highlighter-rouge">-print</code> always set it to true.
My compiler is <a href="https://github.com/skeeto/scratch/blob/c142e729/parsers/findc.c">called <strong><code class="language-plaintext highlighter-rouge">findc</code> (“find compiler”)</strong></a>.</p>

<p><strong>Update</strong>: Or try <a href="https://nullprogram.com/scratch/findc/">the <strong>online demo</strong></a> via Wasm! This version
includes a <a href="https://github.com/skeeto/scratch/commit/2c0a4b8f">peephole optimizer</a> I wrote after publishing this
article.</p>

<p>I assume readers of this program are familiar with <a href="/blog/2025/01/19/"><code class="language-plaintext highlighter-rouge">push</code> macro</a>

and <a href="/blog/2025/06/26/"><code class="language-plaintext highlighter-rouge">Slice</code> macro</a>. Because of the latter it requires a very
recent C compiler, like GCC 15 (e.g. via <a href="https://github.com/skeeto/w64devkit">w64devkit</a>) or Clang 22. Try
out some <code class="language-plaintext highlighter-rouge">find</code> commands and see how they appear as bytecode. The simplest
case is also optimal:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc
// path: .
        action  -print
        halt
</code></pre></div></div>

<p>Print the path then halt. Simple. Stepping it up:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc -type f -executable
// path: .
        action  -type f
        braf    L1
        action  -executable
L1:     braf    L2
        action  -print
L2:     halt
</code></pre></div></div>

<p>If the path is not a file, it skips over the rest of the program by way of
the second branch instruction. It’s correct, but already we can see room
for improvement. This would be better:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        action  -type f
        braf    L1
        action  -executable
        braf    L1
        action  -print
L1:     halt
</code></pre></div></div>

<p>More complex still:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc -type f \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L1
        action  -name *.exe
L1:     braf    L2
        action  -print
L2:     halt
</code></pre></div></div>

<p>Inside the parentheses, if <code class="language-plaintext highlighter-rouge">-executable</code> succeeds, the right-hand side is
skipped. Though the <code class="language-plaintext highlighter-rouge">brat</code> jumps straight to a <code class="language-plaintext highlighter-rouge">braf</code>. It would be better
to jump ahead one more instruction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        action  -type f
        braf    L2
        action  -executable
        brat    L1
        action  -name *.exe
        braf    L2
L1:     action  -print
L2:     halt
</code></pre></div></div>

<p>Silly things aren’t optimized either:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc ! ! -executable
// path: .
        action  -executable
        not
        not
        braf    L1
        action  -print
L1:     halt
</code></pre></div></div>

<p>Two <code class="language-plaintext highlighter-rouge">not</code> in a row cancel out, and so these instructions could be
eliminated. Overall this compiler could benefit from a <a href="https://en.wikipedia.org/wiki/Peephole_optimization">peephole
optimizer</a>, scanning over the program repeatedly, making small
improvements until no more can be made:</p>

<ul>
  <li>Delete <code class="language-plaintext highlighter-rouge">not</code>-<code class="language-plaintext highlighter-rouge">not</code>.</li>
  <li>A <code class="language-plaintext highlighter-rouge">brat</code> to a <code class="language-plaintext highlighter-rouge">braf</code> re-targets ahead one instruction, and vice versa.</li>
  <li>Jumping onto an identical jump adopts its target for itself.</li>
  <li>A <code class="language-plaintext highlighter-rouge">not</code>-<code class="language-plaintext highlighter-rouge">braf</code> might convert to a <code class="language-plaintext highlighter-rouge">brat</code>, and vice versa.</li>
  <li>Delete side-effect-free instructions before <code class="language-plaintext highlighter-rouge">halt</code> (e.g. <code class="language-plaintext highlighter-rouge">not</code>-<code class="language-plaintext highlighter-rouge">halt</code>).</li>
  <li>Exploit always-true actions, e.g. <code class="language-plaintext highlighter-rouge">-print</code>-<code class="language-plaintext highlighter-rouge">braf</code> can drop the branch.</li>
</ul>
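<p>The two branch re-targeting rules are the easiest to sketch because they
only change offsets and delete nothing, so no other branch needs fixing
up. The following is a rough sketch under an assumed encoding (relative
offsets counted from the next instruction); <code class="language-plaintext highlighter-rouge">thread_jumps</code> is a
hypothetical name, and this is not the optimizer linked above.</p>

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_HALT, OP_NOT, OP_BRAF, OP_BRAT, OP_ACTION } Op;

typedef struct {
    Op  op;
    int rel;  // branch offset relative to the next instruction
} Insn;

static bool is_branch(Op op)
{
    return op == OP_BRAF || op == OP_BRAT;
}

// Re-target branches that land on other branches, repeating until no
// more changes apply. Returns the number of re-targets. Assumes a
// well-formed program: forward jumps only, ending in OP_HALT.
int thread_jumps(Insn *prog, int len)
{
    int changes = 0;
    for (bool again = true; again;) {
        again = false;
        for (int i = 0; i < len; i++) {
            if (!is_branch(prog[i].op)) continue;
            int target = i + 1 + prog[i].rel;
            assert(target > i && target < len);
            if (!is_branch(prog[target].op)) continue;
            if (prog[target].op == prog[i].op) {
                // Identical jump: it will certainly be taken, so
                // adopt its target.
                target += 1 + prog[target].rel;
            } else {
                // Opposite jump: it will certainly fall through, so
                // land one instruction ahead.
                target += 1;
            }
            prog[i].rel = target - (i + 1);
            changes++;
            again = true;
        }
    }
    return changes;
}
```

<p>Branches do not modify the register, which is why the outcome of the
second branch is already known when the first one is taken. Each change
moves a target strictly forward, so the loop terminates.</p>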

<p>Writing a bunch of peephole pattern matchers sounds kind of fun. Though my
compiler would first need a slightly richer representation in order to
detect and fix up changes to branches. One more for the road:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ findc -type f ! \( -executable -o -name '*.exe' \)
// path: .
        action  -type f
        braf    L1
        action  -executable
        brat    L2
        action  -name *.exe
L2:     not
L1:     braf    L3
        action  -print
L3:     halt
</code></pre></div></div>

<p>The suboptimal jumps hint at my compiler’s structure. If you’re feeling up
for a challenge, pause here to consider how you’d build this compiler, and
how it might produce these particular artifacts.</p>

<h3 id="parsing-and-compiling">Parsing and compiling</h3>

<p>Before I even considered the shape of the bytecode I knew I needed to
convert <code class="language-plaintext highlighter-rouge">find</code> infix into a compiler-friendly postfix. That is, this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-type f -a ! ( -executable -o -name *.exe )
</code></pre></div></div>

<p>Becomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-type f -executable -name *.exe -o ! -a
</code></pre></div></div>

<p>Which, importantly, erases the parentheses. This comes in as an <code class="language-plaintext highlighter-rouge">argv</code>
array, so it’s already tokenized for us by the shell <a href="/blog/2022/02/18/">or runtime</a>. The
classic <a href="https://en.wikipedia.org/wiki/Shunting_yard_algorithm">shunting-yard algorithm</a> solves this problem easily enough.
We have an output queue that goes into the compiler, and a token stack for
tracking <code class="language-plaintext highlighter-rouge">-a</code>, <code class="language-plaintext highlighter-rouge">-o</code>, <code class="language-plaintext highlighter-rouge">!</code>, and <code class="language-plaintext highlighter-rouge">(</code>. Then we walk <code class="language-plaintext highlighter-rouge">argv</code> in order:</p>

<ul>
  <li>
    <p>Actions go straight into the output queue.</p>
  </li>
  <li>
    <p>If we see one of the special stack tokens we push it onto the stack,
first popping operators with greater precedence into the queue, stopping
at <code class="language-plaintext highlighter-rouge">(</code>.</p>
  </li>
  <li>
    <p>If we see <code class="language-plaintext highlighter-rouge">)</code> we pop the stack into the output queue until we see <code class="language-plaintext highlighter-rouge">(</code>.</p>
  </li>
</ul>

<p>When we’re out of tokens, pop the remaining stack into the queue. My
parser synthesizes <code class="language-plaintext highlighter-rouge">-a</code> where it’s implied, so the compiler always sees
logical AND. If the expression contains no <code class="language-plaintext highlighter-rouge">-exec</code>, <code class="language-plaintext highlighter-rouge">-ok</code>, or <code class="language-plaintext highlighter-rouge">-print</code>,
after processing is complete the parser puts <code class="language-plaintext highlighter-rouge">-print</code> then <code class="language-plaintext highlighter-rouge">-a</code> into the
queue, which effectively wraps the whole expression in <code class="language-plaintext highlighter-rouge">( expr ) -print</code>.
By clearing the stack first, the real expression is effectively wrapped in
parentheses, so no parenthesis tokens need to be synthesized.</p>
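<p>The parsing steps above can be sketched roughly as follows. This is a
simplified illustration, not the <code class="language-plaintext highlighter-rouge">findc</code> source: the names are
hypothetical, each action is assumed to arrive as a single token (operands
already attached), capacities are fixed, and the implicit <code class="language-plaintext highlighter-rouge">-print</code> is left
out.</p>

```c
#include <assert.h>
#include <string.h>

enum { MAXTOK = 64 };

typedef struct {
    const char *tok[MAXTOK];
    int         len;
} Toks;

static int prec(const char *t)
{
    if (!strcmp(t, "!"))  return 3;
    if (!strcmp(t, "-a")) return 2;
    if (!strcmp(t, "-o")) return 1;
    return 0;  // "(" acts as a barrier on the stack
}

static void push(Toks *t, const char *s)
{
    assert(t->len < MAXTOK);
    t->tok[t->len++] = s;
}

// Push an operator, first moving pending operators to the output.
// Binary operators are left-associative, so equal precedence pops too.
// The prefix "!" never pops anything.
static void push_op(Toks *out, Toks *stack, const char *op)
{
    int p = prec(op);
    while (p < 3 && stack->len && prec(stack->tok[stack->len-1]) >= p) {
        push(out, stack->tok[--stack->len]);
    }
    push(stack, op);
}

// Convert infix tokens to postfix. Returns 0 on success, or -1 on
// unbalanced parentheses.
int to_postfix(int argc, const char **argv, Toks *out)
{
    Toks stack = {0};
    int  operand_end = 0;  // did the previous token end an operand?
    out->len = 0;
    for (int i = 0; i < argc; i++) {
        const char *t = argv[i];
        int binary = !strcmp(t, "-a") || !strcmp(t, "-o");
        int rparen = !strcmp(t, ")");
        if (operand_end && !binary && !rparen) {
            push_op(out, &stack, "-a");  // synthesize the implied AND
        }
        if (binary || !strcmp(t, "!")) {
            push_op(out, &stack, t);
            operand_end = 0;
        } else if (!strcmp(t, "(")) {
            push(&stack, t);
            operand_end = 0;
        } else if (rparen) {
            for (;;) {
                if (!stack.len) return -1;
                const char *top = stack.tok[--stack.len];
                if (!strcmp(top, "(")) break;
                push(out, top);
            }
            operand_end = 1;
        } else {
            push(out, t);  // actions go straight to the output
            operand_end = 1;
        }
    }
    while (stack.len) {
        const char *top = stack.tok[--stack.len];
        if (!strcmp(top, "(")) return -1;
        push(out, top);
    }
    return 0;
}
```

<p>Feeding it the infix example from earlier in this section, with <code class="language-plaintext highlighter-rouge">-type f</code>
and <code class="language-plaintext highlighter-rouge">-name *.exe</code> each as one token, yields the postfix form shown
there.</p>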

<p>I’ve used the shunting-yard algorithm many times before, so this part was
easy. The new part was coming up with an algorithm to convert a series of
postfix tokens into bytecode. My solution is the compiler <strong>maintains a
stack of bytecode fragments</strong>. That is, each stack element is a sequence
of one or more bytecode instructions. Branches use relative addresses, so
they’re position-independent, and I can concatenate code fragments without
any branch fix-ups. It takes the following actions from queue tokens:</p>

<ul>
  <li>
    <p>For an action token, create an <code class="language-plaintext highlighter-rouge">action</code> instruction, and push it onto
the fragment stack as a new fragment.</p>
  </li>
  <li>
    <p>For a <code class="language-plaintext highlighter-rouge">!</code> token, pop the top fragment, append a <code class="language-plaintext highlighter-rouge">not</code> instruction, and
push it back onto the stack.</p>
  </li>
  <li>
    <p>For a <code class="language-plaintext highlighter-rouge">-a</code> token, pop the top two fragments, join them with a <code class="language-plaintext highlighter-rouge">braf</code> in
the middle which jumps just beyond the second fragment. That is, if the
first fragment evaluates to false, skip over the second fragment into
whatever follows.</p>
  </li>
  <li>
    <p>For a <code class="language-plaintext highlighter-rouge">-o</code> token, just like <code class="language-plaintext highlighter-rouge">-a</code> but use <code class="language-plaintext highlighter-rouge">brat</code>. If the first fragment
is true, we skip over the second fragment.</p>
  </li>
</ul>

<p>If the expression is valid, at the end of this process the stack contains
exactly one fragment. Append a <code class="language-plaintext highlighter-rouge">halt</code> instruction to this fragment, and
that’s our program! If the final fragment contained a branch just beyond
its end, this <code class="language-plaintext highlighter-rouge">halt</code> is that branch target. A few peephole optimizations
and it could probably be an optimal program for this instruction set.</p>
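<p>A compact sketch of the fragment stack, under assumptions of my own
choosing for this example: fixed-capacity fragments, a hypothetical <code class="language-plaintext highlighter-rouge">Insn</code>
layout, and branch offsets counted from the next instruction. The real
<code class="language-plaintext highlighter-rouge">findc</code> source differs in its details.</p>

```c
#include <assert.h>
#include <string.h>

typedef enum { OP_HALT, OP_NOT, OP_BRAF, OP_BRAT, OP_ACTION } Op;

typedef struct {
    Op          op;
    int         rel;   // branch: offset relative to the next instruction
    const char *name;  // action: the token it came from
} Insn;

enum { MAXINSN = 64, MAXFRAG = 16 };

typedef struct {
    Insn insn[MAXINSN];
    int  len;
} Frag;

static void emit(Frag *f, Insn in)
{
    assert(f->len < MAXINSN);
    f->insn[f->len++] = in;
}

// Append a branch to a that skips just past b, then append b itself.
// Offsets are relative, so b is copied without any fix-ups.
static void join(Frag *a, const Frag *b, Op branch)
{
    emit(a, (Insn){branch, b->len, 0});
    for (int i = 0; i < b->len; i++) {
        emit(a, b->insn[i]);
    }
}

// Compile postfix tokens into prog. Returns 0 on success, or -1 if the
// expression is invalid or too deep for the fixed-size stack.
int compile(int ntok, const char **tok, Frag *prog)
{
    Frag stack[MAXFRAG];
    int  top = 0;
    for (int i = 0; i < ntok; i++) {
        const char *t = tok[i];
        if (!strcmp(t, "!")) {
            if (top < 1) return -1;
            emit(&stack[top-1], (Insn){OP_NOT, 0, 0});
        } else if (!strcmp(t, "-a") || !strcmp(t, "-o")) {
            if (top < 2) return -1;
            top--;
            join(&stack[top-1], &stack[top], t[1] == 'a' ? OP_BRAF : OP_BRAT);
        } else {
            if (top == MAXFRAG) return -1;
            stack[top].len = 0;
            emit(&stack[top++], (Insn){OP_ACTION, 0, t});
        }
    }
    if (top != 1) return -1;
    *prog = stack[0];
    emit(prog, (Insn){OP_HALT, 0, 0});
    return 0;
}
```

<p>Compiling the postfix form of the “one more for the road” example,
followed by the synthesized <code class="language-plaintext highlighter-rouge">-print</code> and <code class="language-plaintext highlighter-rouge">-a</code>, produces the same nine
instructions and branch targets as the listing in the previous section.</p>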

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Closures as Win32 window procedures</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/12/12/"/>
    <id>urn:uuid:7bf46ec6-a8b2-4ffa-857a-86c040357702</id>
    <updated>2025-12-12T19:52:10Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Back in 2017 I wrote <a href="/blog/2017/01/08/">about a technique for creating closures in C</a>
using a <a href="/blog/2015/03/19/">JIT-compiled</a> wrapper. It’s neat, though rarely necessary in
real programs, so I don’t think about it often. I applied it to <code class="language-plaintext highlighter-rouge">qsort</code>,
which <a href="/blog/2023/02/11/">sadly</a> accepts no context pointer. More practical would be
working around <a href="/blog/2023/12/17/">insufficient custom allocator interfaces</a>, to
create allocation functions at run-time bound to a particular allocation
region. I’ve learned a lot since I last wrote about this subject, and <a href="https://lowkpro.com/blog/creating-c-closures-from-lua-closures.html">a
recent article</a> had me thinking about it again, and how I could do
better than before. In this article I will enhance Win32 window procedure
callbacks with a fifth argument, allowing us to more directly pass extra
context. I’m using <a href="https://github.com/skeeto/w64devkit">w64devkit</a> on x64, but everything here should
work out-of-the-box with any x64 toolchain that speaks GNU assembly.</p>

<p>A <a href="https://learn.microsoft.com/en-us/windows/win32/api/winuser/nc-winuser-wndproc">window procedure</a> has this prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LRESULT</span> <span class="nf">Wndproc</span><span class="p">(</span>
  <span class="n">HWND</span> <span class="n">hWnd</span><span class="p">,</span>
  <span class="n">UINT</span> <span class="n">Msg</span><span class="p">,</span>
  <span class="n">WPARAM</span> <span class="n">wParam</span><span class="p">,</span>
  <span class="n">LPARAM</span> <span class="n">lParam</span>
<span class="p">);</span>
</code></pre></div></div>

<p>To create a window we must first register a class with <code class="language-plaintext highlighter-rouge">RegisterClass</code>,
which accepts a set of properties describing a window class, including a
pointer to one of these functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="p">...;</span>

    <span class="n">RegisterClassA</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">WNDCLASSA</span><span class="p">){</span>
        <span class="c1">// ...</span>
        <span class="p">.</span><span class="n">lpfnWndProc</span>   <span class="o">=</span> <span class="n">my_wndproc</span><span class="p">,</span>
        <span class="p">.</span><span class="n">lpszClassName</span> <span class="o">=</span> <span class="s">"my_class"</span><span class="p">,</span>
        <span class="c1">// ...</span>
    <span class="p">});</span>

    <span class="n">HWND</span> <span class="n">hwnd</span> <span class="o">=</span> <span class="n">CreateWindowExA</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s">"my_class"</span><span class="p">,</span> <span class="p">...,</span> <span class="n">state</span><span class="p">);</span>
</code></pre></div></div>

<p>The thread drives a message pump with events from the operating system,
dispatching them to this procedure, which then manipulates the program
state in response:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="n">MSG</span> <span class="n">msg</span><span class="p">;</span> <span class="n">GetMessageW</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);)</span> <span class="p">{</span>
        <span class="n">TranslateMessage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">);</span>
        <span class="n">DispatchMessageW</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="p">);</span>  <span class="c1">// calls the window procedure</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>All four <code class="language-plaintext highlighter-rouge">WNDPROC</code> parameters are determined by Win32. There is no context
pointer argument. So how does this procedure access the program state? We
generally have two options:</p>

<ol>
  <li>Global variables. Yucky but easy. Frequently seen in tutorials.</li>
  <li>A <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code> pointer attached to the window.</li>
</ol>

<p>The second option takes some setup. Win32 passes the last <code class="language-plaintext highlighter-rouge">CreateWindowEx</code>
argument to the window procedure when the window is created, via <code class="language-plaintext highlighter-rouge">WM_CREATE</code>.
The procedure attaches the pointer to its window as <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code>. This
pointer is passed indirectly, through a <code class="language-plaintext highlighter-rouge">CREATESTRUCT</code>. So ultimately it
looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">case</span> <span class="n">WM_CREATE</span><span class="p">:</span>
        <span class="n">CREATESTRUCT</span> <span class="o">*</span><span class="n">cs</span> <span class="o">=</span> <span class="p">(</span><span class="n">CREATESTRUCT</span> <span class="o">*</span><span class="p">)</span><span class="n">lParam</span><span class="p">;</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">state</span> <span class="o">*</span><span class="p">)</span><span class="n">cs</span><span class="o">-&gt;</span><span class="n">lpCreateParams</span><span class="p">;</span>
        <span class="n">SetWindowLongPtr</span><span class="p">(</span><span class="n">hwnd</span><span class="p">,</span> <span class="n">GWLP_USERDATA</span><span class="p">,</span> <span class="p">(</span><span class="n">LONG_PTR</span><span class="p">)</span><span class="n">arg</span><span class="p">);</span>
        <span class="c1">// ...</span>
</code></pre></div></div>

<p>In future messages we can retrieve it with <code class="language-plaintext highlighter-rouge">GetWindowLongPtr</code>. Every time
I go through this I wish there was a better way. What if there was a fifth
window procedure parameter through which we could pass a context?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef LRESULT Wndproc5(HWND, UINT, WPARAM, LPARAM, void *);
</code></pre></div></div>

<p>We’ll build just this as a trampoline. The <a href="https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention">x64 calling convention</a>
passes the first four arguments in registers, and the rest are pushed on
the stack, including this new parameter. Our trampoline cannot just stuff
the extra parameter in a register, but will actually have to build a
stack frame. Slightly more complicated, but barely so.</p>

<h3 id="allocating-executable-memory">Allocating executable memory</h3>

<p>In previous articles, and in the programs where I’ve applied techniques
like this, I’ve allocated executable memory with <code class="language-plaintext highlighter-rouge">VirtualAlloc</code> (or <code class="language-plaintext highlighter-rouge">mmap</code>
elsewhere). This introduces a small challenge for solving the problem
generally: Allocations may be arbitrarily far from our code and data, out
of reach of relative addressing. If they’re further than 2G apart, we need
to encode absolute addresses, and in the simple case would just assume
they’re always too far apart.</p>

<p>These days I’ve more experience with executable formats, and allocation,
and I immediately see a better solution: Request a block of writable,
executable memory from the loader, then allocate our trampolines from it.
Other than being executable, this memory isn’t special, and <a href="/blog/2025/01/19/">allocation
works the usual way</a>, using functions unaware it’s executable. By
allocating through the loader, this memory will be part of our loaded
image, guaranteed to be close to our other code and data, allowing our JIT
compiler to assume <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models#small-code-model">a small code model</a>.</p>

<p>There are a number of ways to do this, and here’s one way to do it with
GNU-styled toolchains targeting COFF:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="nf">.section</span> <span class="nv">.exebuf</span><span class="p">,</span><span class="s">"bwx"</span>
        <span class="nf">.globl</span> <span class="nv">exebuf</span>
<span class="nl">exebuf:</span>	<span class="nf">.space</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">21</span>
</code></pre></div></div>

<p>This assembly program defines a new section named <code class="language-plaintext highlighter-rouge">.exebuf</code> containing 2M
of writable (<code class="language-plaintext highlighter-rouge">"w"</code>), executable (<code class="language-plaintext highlighter-rouge">"x"</code>) memory, allocated at run time just
like <code class="language-plaintext highlighter-rouge">.bss</code> (<code class="language-plaintext highlighter-rouge">"b"</code>). We’ll treat this like an arena out of which we can
allocate all trampolines we’ll probably ever need. With careful use of
<code class="language-plaintext highlighter-rouge">.pushsection</code> this could be basic inline assembly, but I’ve left it as a
separate source. On the C side I retrieve this like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="n">Arena</span> <span class="nf">get_exebuf</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">extern</span> <span class="kt">char</span> <span class="n">exebuf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">21</span><span class="p">];</span>
    <span class="n">Arena</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="n">exebuf</span><span class="p">,</span> <span class="n">exebuf</span><span class="o">+</span><span class="k">sizeof</span><span class="p">(</span><span class="n">exebuf</span><span class="p">)};</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately I have to repeat myself on the size. There are different
ways to deal with this, but this is simple enough for now. I would have
loved to define the array in C with the GCC <a href="https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variable-Attributes.html"><code class="language-plaintext highlighter-rouge">section</code> attribute</a>,
but as is usually the case with this attribute, it’s not up to the task,
lacking the ability to set section flags. Besides, by not relying on the
attribute, any C compiler could compile this source, and we only need a
GNU-style toolchain to create the tiny COFF object containing <code class="language-plaintext highlighter-rouge">exebuf</code>.</p>

<p>While we’re at it, a reminder of some other basic definitions we’ll need:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define S(s)            (Str){s, sizeof(s)-1}
#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>

<span class="n">Str</span> <span class="nf">clone</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="kt">char</span><span class="p">);</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which have been discussed at length in previous articles.</p>
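
<p>The <code class="language-plaintext highlighter-rouge">new</code> macro calls an <code class="language-plaintext highlighter-rouge">alloc</code> function that isn’t shown here. As a
reminder of its shape, here is a minimal sketch of a bump allocator with
that signature. The body is an assumption reconstructed from how the macro
calls it, not the exact definition from the earlier articles, and it
simply asserts when the arena is exhausted:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

// Hypothetical bump allocator matching the call shape of new():
// alloc(arena, count, sizeof(t), _Alignof(t)). Returns zeroed memory.
void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    // Bytes needed to round beg up to the next multiple of align,
    // which must be a power of two.
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    assert(count >= 0 && size > 0);
    // Out of memory: this sketch gives up rather than report failure.
    assert(count <= (a->end - a->beg - pad) / size);
    char *r = a->beg + pad;
    a->beg = r + count*size;
    return memset(r, 0, (size_t)(count*size));
}
```

<p>Nothing in it knows or cares that the arena might be executable, which is
why the same allocator can hand out trampolines.</p>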

<h3 id="trampoline-compiler">Trampoline compiler</h3>

<p>From here the plan is to create a function that accepts a <code class="language-plaintext highlighter-rouge">Wndproc5</code> and a
context pointer to bind, and returns a classic <code class="language-plaintext highlighter-rouge">WNDPROC</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WNDPROC</span> <span class="nf">make_wndproc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">Wndproc5</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>Our window procedure now gets a fifth argument with the program state:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LRESULT</span> <span class="nf">my_wndproc</span><span class="p">(</span><span class="n">HWND</span><span class="p">,</span> <span class="n">UINT</span><span class="p">,</span> <span class="n">WPARAM</span><span class="p">,</span> <span class="n">LPARAM</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When registering the class we wrap it in a trampoline compatible with
<code class="language-plaintext highlighter-rouge">RegisterClass</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">RegisterClassA</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">WNDCLASSA</span><span class="p">){</span>
        <span class="c1">// ...</span>
        <span class="p">.</span><span class="n">lpfnWndProc</span>   <span class="o">=</span> <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">),</span>
        <span class="p">.</span><span class="n">lpszClassName</span> <span class="o">=</span> <span class="s">"my_class"</span><span class="p">,</span>
        <span class="c1">// ...</span>
    <span class="p">});</span>
</code></pre></div></div>

<p>All windows using this class will readily have access to this state object
through their fifth parameter. It turns out setting up <code class="language-plaintext highlighter-rouge">exebuf</code> was the
more complicated part, and <code class="language-plaintext highlighter-rouge">make_wndproc</code> is quite simple!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WNDPROC</span> <span class="nf">make_wndproc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Wndproc5</span> <span class="n">proc</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">thunk</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span>
        <span class="s">"</span><span class="se">\x48\x83\xec\x28</span><span class="s">"</span>      <span class="c1">// sub   $40, %rsp</span>
        <span class="s">"</span><span class="se">\x48\xb8</span><span class="s">........"</span>      <span class="c1">// movq  $arg, %rax</span>
        <span class="s">"</span><span class="se">\x48\x89\x44\x24\x20</span><span class="s">"</span>  <span class="c1">// mov   %rax, 32(%rsp)</span>
        <span class="s">"</span><span class="se">\xe8</span><span class="s">...."</span>              <span class="c1">// call  proc</span>
        <span class="s">"</span><span class="se">\x48\x83\xc4\x28</span><span class="s">"</span>      <span class="c1">// add   $40, %rsp</span>
        <span class="s">"</span><span class="se">\xc3</span><span class="s">"</span>                  <span class="c1">// ret</span>
    <span class="p">);</span>
    <span class="n">Str</span> <span class="n">r</span>   <span class="o">=</span> <span class="n">clone</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">thunk</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">proc</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="mi">24</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span> <span class="mi">6</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">arg</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="mi">20</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rel</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">rel</span><span class="p">));</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">WNDPROC</span><span class="p">)</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The assembly allocates a new stack frame, with callee shadow space, and
with room for the new argument, which also happens to re-align the stack.
It stores the new argument for the <code class="language-plaintext highlighter-rouge">Wndproc5</code> just above the shadow space.
Then calls into the <code class="language-plaintext highlighter-rouge">Wndproc5</code> without touching other parameters. There
are two “patches” to fill out, which I’ve initially filled with dots: the
context pointer itself, and a 32-bit signed relative address for the call.
It’s going to be very near the callee. The only thing I don’t like about
this function is that I’ve manually worked out the patch offsets.</p>
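
<p>One way to avoid working out the offsets by hand is to let the compiler
sum the instruction lengths. This is only a sketch: the enumerator names
are hypothetical, and the byte strings are copied from the thunk above.
It doesn’t need Windows to compile, since it only measures string
literals:</p>

```c
#include <assert.h>
#include <stddef.h>

// Encoded length of each piece of the thunk, in order. Dots stand in for
// the operands patched at run time, as in make_wndproc.
enum {
    LEN_SUB      = sizeof("\x48\x83\xec\x28") - 1,      // sub   $40, %rsp
    LEN_MOVQ_OP  = sizeof("\x48\xb8") - 1,              // movq  opcode bytes
    LEN_MOVQ_IMM = sizeof("........") - 1,              // 64-bit immediate
    LEN_STORE    = sizeof("\x48\x89\x44\x24\x20") - 1,  // mov   %rax, 32(%rsp)
    LEN_CALL_OP  = sizeof("\xe8") - 1,                  // call  opcode byte
    LEN_CALL_REL = sizeof("....") - 1,                  // 32-bit displacement
    LEN_ADD      = sizeof("\x48\x83\xc4\x28") - 1,      // add   $40, %rsp
    LEN_RET      = sizeof("\xc3") - 1,                  // ret

    // Where the context pointer is patched in.
    ARG_OFFSET = LEN_SUB + LEN_MOVQ_OP,
    // Where the call displacement is patched in.
    REL_OFFSET = ARG_OFFSET + LEN_MOVQ_IMM + LEN_STORE + LEN_CALL_OP,
    // The displacement is relative to the end of the call instruction.
    REL_BASE   = REL_OFFSET + LEN_CALL_REL,
    THUNK_LEN  = REL_BASE + LEN_ADD + LEN_RET,
};

// The whole thunk as one literal, to check the pieces add up.
static const char thunk[] =
    "\x48\x83\xec\x28"
    "\x48\xb8" "........"
    "\x48\x89\x44\x24\x20"
    "\xe8" "...."
    "\x48\x83\xc4\x28"
    "\xc3";

static_assert(sizeof(thunk) - 1 == THUNK_LEN, "pieces must cover the thunk");

// Nonzero when the derived offsets agree with the hand-computed ones.
int thunk_offsets_ok(void)
{
    return ARG_OFFSET == 6 && REL_OFFSET == 20 && REL_BASE == 24;
}
```

<p>The derived values agree with the 6, 20, and 24 written by hand in
<code class="language-plaintext highlighter-rouge">make_wndproc</code>, and changing an instruction would update them without
recounting bytes.</p>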

<p>It’s probably not useful, but it’s easy to update the context pointer at
any time if you hold onto the trampoline pointer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">set_wndproc_arg</span><span class="p">(</span><span class="n">WNDPROC</span> <span class="n">p</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">memcpy</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="o">+</span><span class="mi">6</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">arg</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So, for instance:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">MyState</span> <span class="o">*</span><span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">...;</span>  <span class="c1">// multiple states</span>
    <span class="n">WNDPROC</span> <span class="n">proc</span> <span class="o">=</span> <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="c1">// ...</span>
    <span class="n">set_wndproc_arg</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>  <span class="c1">// switch states</span>
</code></pre></div></div>

<p>Though I expect the most common case is just creating multiple procedures:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">WNDPROC</span> <span class="n">procs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
        <span class="n">make_wndproc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">my_wndproc</span><span class="p">,</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span>
    <span class="p">};</span>
</code></pre></div></div>

<p>To my slight surprise these trampolines still work with an active <a href="https://learn.microsoft.com/en-us/windows/win32/secbp/control-flow-guard">Control
Flow Guard</a> system policy. Trampolines do not have stack unwind
entries, and I thought Windows might refuse to pass control to them.</p>

<p>Here’s a complete, runnable example if you’d like to try it yourself:
<a href="https://gist.github.com/skeeto/13363b78489b26bed7485ec0d6b2c7f8"><code class="language-plaintext highlighter-rouge">main.c</code> and <code class="language-plaintext highlighter-rouge">exebuf.s</code></a></p>

<h3 id="better-cases">Better cases</h3>

<p>This is more work than going through <code class="language-plaintext highlighter-rouge">GWLP_USERDATA</code>, and real programs
have a small, fixed number of window procedures — typically one — so this
isn’t the best example, but I wanted to illustrate with a real interface.
Again, perhaps the best real use is a library with a weak custom allocator
interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">malloc</span><span class="p">)(</span><span class="kt">size_t</span><span class="p">);</span>   <span class="c1">// no context pointer!</span>
    <span class="kt">void</span>  <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>     <span class="c1">// "</span>
<span class="p">}</span> <span class="n">Allocator</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">arena_malloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>

<span class="c1">// ...</span>

    <span class="n">Allocator</span> <span class="n">perm_allocator</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">make_trampoline</span><span class="p">(</span><span class="n">exearena</span><span class="p">,</span> <span class="n">arena_malloc</span><span class="p">,</span> <span class="n">perm</span><span class="p">),</span>
        <span class="p">.</span><span class="n">free</span>   <span class="o">=</span> <span class="n">noop_free</span><span class="p">,</span>
    <span class="p">};</span>
    <span class="n">Allocator</span> <span class="n">scratch_allocator</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">make_trampoline</span><span class="p">(</span><span class="n">exearena</span><span class="p">,</span> <span class="n">arena_malloc</span><span class="p">,</span> <span class="n">scratch</span><span class="p">),</span>
        <span class="p">.</span><span class="n">free</span>   <span class="o">=</span> <span class="n">noop_free</span><span class="p">,</span>
    <span class="p">};</span>
</code></pre></div></div>

<p>Something to keep in my back pocket for the future.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Hierarchical field sort with string interning</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/09/24/"/>
    <id>urn:uuid:30d4b889-d14b-4b32-b389-858fb3dde34b</id>
    <updated>2025-09-24T17:11:32Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>In a recent, real world problem I needed to load a heterogeneous sequence
of records from a buffer. Record layout is defined in a header before the
sequence. Each field is numeric, with a unique name composed of non-empty
alphanumeric period-delimited segments, where segments signify nested
structure. Field names are a comma-delimited list, in order of the record
layout. The catch motivating this article is that nested structures are
not necessarily contiguous. In my transformed representation I needed
nested structures to be contiguous. For illustrative purposes here, it
will be for JSON output. I came up with what I think is an interesting
solution, which I’ve implemented in C using <a href="/blog/2025/01/19/">techniques previously
discussed</a>.</p>

<p>The above description is probably confusing on its own, and an example is
worth a thousand words, so here’s a listing naming 7 fields:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>timestamp,point.x,point.y,foo.bar.z,point.z,foo.bar.y,foo.bar.x
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">point</code> is a substructure, as are <code class="language-plaintext highlighter-rouge">foo</code> and <code class="language-plaintext highlighter-rouge">bar</code>, but note they’re
interleaved in the record. So if a record contains these values:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{1758158348, 1.23, 4.56, -100, 7.89, -200, -300}
</code></pre></div></div>

<p>The JSON representation would look like:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"timestamp"</span><span class="p">:</span><span class="w"> </span><span class="mi">1758158348</span><span class="p">,</span><span class="w">
  </span><span class="nl">"point"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"x"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.23</span><span class="p">,</span><span class="w">
    </span><span class="nl">"y"</span><span class="p">:</span><span class="w"> </span><span class="mf">4.56</span><span class="p">,</span><span class="w">
    </span><span class="nl">"z"</span><span class="p">:</span><span class="w"> </span><span class="mf">7.89</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"foo"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"bar"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"z"</span><span class="p">:</span><span class="w"> </span><span class="mi">-100</span><span class="p">,</span><span class="w">
      </span><span class="nl">"y"</span><span class="p">:</span><span class="w"> </span><span class="mi">-200</span><span class="p">,</span><span class="w">
      </span><span class="nl">"x"</span><span class="p">:</span><span class="w"> </span><span class="mi">-300</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Notice <code class="language-plaintext highlighter-rouge">point.z</code> moved up and <code class="language-plaintext highlighter-rouge">foo.bar.z</code> down, so that substructures are
contiguous in this representation as required for JSON. Sorting the field
names lexicographically would group them together as a simple solution.
However, as an additional constraint I want to retain the original field
order as much as possible. For example, <code class="language-plaintext highlighter-rouge">timestamp</code> is first in both the
original and JSON representations, but sorting would put it last. If all
substructures are already contiguous, nothing should change.</p>

<h3 id="solution-with-string-interning">Solution with string interning</h3>

<p>My solution is to intern the segment strings, assigning each a unique,
monotonic integral token in the order they’re observed. In my program,
zero is reserved as a special “root” token, and so the first string has
the value 1. The concrete values aren’t important, only that they’re
assigned monotonically.</p>

<p>The trick is that a string is always interned in the “namespace” of a
previous token. That is, we’re building a <code class="language-plaintext highlighter-rouge">(token, string) -&gt; token</code> map.
For our segments that namespace is the token for the parent structure, and
the top-level fields are interned in the reserved “root” namespace. When
applied to the example, we get the token sequences:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>timestamp  -&gt; 1
point.x    -&gt; 2 3
point.y    -&gt; 2 4
foo.bar.z  -&gt; 5 6 7
point.z    -&gt; 2 8
foo.bar.y  -&gt; 5 6 9
foo.bar.x  -&gt; 5 6 10
</code></pre></div></div>

<p>And our map looks like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{0, "timestamp"} -&gt; 1
{0, "point"}     -&gt; 2
{2, "x"}         -&gt; 3
{2, "y"}         -&gt; 4
{0, "foo"}       -&gt; 5
{5, "bar"}       -&gt; 6
{6, "z"}         -&gt; 7
{2, "z"}         -&gt; 8
{6, "y"}         -&gt; 9
{6, "x"}         -&gt; 10
</code></pre></div></div>

<p>Notice how <code class="language-plaintext highlighter-rouge">"x"</code> is assigned 3 and 10 due to different namespaces. That’s
important because otherwise the fields of <code class="language-plaintext highlighter-rouge">foo.bar</code> would sort in the same
order as <code class="language-plaintext highlighter-rouge">point</code>. Namespace gives these fields unique identities.</p>
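
<p>To make the namespace rule concrete, here is a minimal sketch of the
interning step. It uses a fixed-size, linear-search table as a stand-in
for a real map, and the names <code class="language-plaintext highlighter-rouge">Table</code>, <code class="language-plaintext highlighter-rouge">intern</code>, and <code class="language-plaintext highlighter-rouge">tokenize</code> are
hypothetical, chosen only for this illustration:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t Token;

enum { MAXKEYS = 64, MAXDEPTH = 8 };

// Stand-in for the real map: a linear-search table of (namespace, segment)
// keys. Entry i holds token i+1, so tokens are assigned monotonically in
// the order keys are first observed.
typedef struct {
    Token       ns[MAXKEYS];
    const char *seg[MAXKEYS];
    size_t      len[MAXKEYS];
    Token       count;
} Table;

// Return the token for (ns, seg), assigning the next token if it's new.
// The table keeps a pointer to seg, which must outlive the table.
Token intern(Table *t, Token ns, const char *seg, size_t len)
{
    for (Token i = 0; i < t->count; i++) {
        if (t->ns[i]==ns && t->len[i]==len && !memcmp(t->seg[i], seg, len)) {
            return i + 1;
        }
    }
    assert(t->count < MAXKEYS);
    Token i = t->count++;
    t->ns[i]  = ns;
    t->seg[i] = seg;
    t->len[i] = len;
    return i + 1;
}

// Tokenize one field name such as "foo.bar.z" into out[], returning the
// number of tokens written. Each segment is interned in the namespace of
// the token before it, starting from the root namespace, zero.
int tokenize(Table *t, const char *field, Token *out)
{
    int   n  = 0;
    Token ns = 0;
    for (const char *p = field;;) {
        size_t len = strcspn(p, ".");
        assert(n < MAXDEPTH);
        ns = out[n++] = intern(t, ns, p, len);
        if (!p[len]) {
            return n;
        }
        p += len + 1;
    }
}
```

<p>Feeding it the seven example fields in order reproduces the token
sequences listed above, including the two different tokens for <code class="language-plaintext highlighter-rouge">"x"</code>.</p>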

<p>Once we have the token representation, sort lexicographically <em>by token</em>.
That pulls <code class="language-plaintext highlighter-rouge">point.z</code> up to its siblings.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>timestamp  -&gt; 1
point.x    -&gt; 2 3
point.y    -&gt; 2 4
point.z    -&gt; 2 8
foo.bar.z  -&gt; 5 6 7
foo.bar.y  -&gt; 5 6 9
foo.bar.x  -&gt; 5 6 10
</code></pre></div></div>

<p>Now we have the “output” order with minimal re-ordering. If substructures
were already contiguous, nothing changes. Assuming a reasonable map, this
is <code class="language-plaintext highlighter-rouge">O(n log n)</code>, primarily due to sorting.</p>
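
<p>The sort itself needs little more than a comparator. A minimal sketch,
assuming each field carries its token sequence in a small fixed array (the
<code class="language-plaintext highlighter-rouge">Field</code> layout and function names are hypothetical) and using the standard
<code class="language-plaintext highlighter-rouge">qsort</code> for brevity:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef ptrdiff_t Token;

// Hypothetical record for this sketch: a field name and its tokens.
typedef struct {
    const char *name;
    Token       tokens[4];
    int         len;
} Field;

// Lexicographic comparison of two token sequences. If one is a prefix of
// the other, the shorter sequence orders first.
int field_cmp(const void *pa, const void *pb)
{
    const Field *a = pa;
    const Field *b = pb;
    for (int i = 0; i < a->len && i < b->len; i++) {
        if (a->tokens[i] != b->tokens[i]) {
            return a->tokens[i] < b->tokens[i] ? -1 : +1;
        }
    }
    return (a->len > b->len) - (a->len < b->len);
}

// Sort fields into output order. qsort isn't a stable sort, but unique
// field names produce unique token sequences, so no two elements compare
// equal and stability doesn't matter.
void sort_fields(Field *fields, size_t n)
{
    assert(fields || !n);
    qsort(fields, n, sizeof(*fields), field_cmp);
}
```

<p>Applied to the seven example fields, this moves <code class="language-plaintext highlighter-rouge">point.z</code> up next to its
siblings and leaves everything else in its original relative order.</p>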

<h4 id="alternatives">Alternatives</h4>

<p>Before I thought of namespaces, my initial idea was to intern the whole
prefix of a segment. The sequence of look-ups would be:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"timestamp"    -&gt; 1  -&gt; {1}
"point"        -&gt; 2
"point.x"      -&gt; 3  -&gt; {2, 3}
"point"        -&gt; 2
"point.y"      -&gt; 4  -&gt; {2, 4}
"foo"          -&gt; 5
"foo.bar"      -&gt; 6
"foo.bar.z"    -&gt; 7  -&gt; {5, 6, 7}
"point"        -&gt; 2
"point.z"      -&gt; 8  -&gt; {2, 8}
"foo"          -&gt; 5
"foo.bar"      -&gt; 6
"foo.bar.y"    -&gt; 9  -&gt; {5, 6, 9}
"foo"          -&gt; 5
"foo.bar"      -&gt; 6
"foo.bar.x"    -&gt; 10 -&gt; {5, 6, 10}
</code></pre></div></div>

<p>Ultimately it produces the same tokens, and this is a more straightforward
<code class="language-plaintext highlighter-rouge">string -&gt; token</code> map. The prefixes are acting as namespaces. However, I
wrote it this way as a kind of visual proof: Notice the right triangle
shape formed by the strings for each field. From the area we can see that
processing prefixes as strings is <code class="language-plaintext highlighter-rouge">O(n^2)</code> quadratic time on the number of
segments! In my real problem the inputs were never large enough for this
to matter, but I hate <a href="https://randomascii.wordpress.com/category/quadratic/">leaving behind avoidable quadratic algorithms</a>.
Using a token as a namespace flattens the prefix to a constant size.</p>

<p>Another option is a different map for each namespace. So for <code class="language-plaintext highlighter-rouge">foo.bar.z</code>
look up the <code class="language-plaintext highlighter-rouge">"foo"</code> map <code class="language-plaintext highlighter-rouge">(string -&gt; map)</code> in the root <code class="language-plaintext highlighter-rouge">(string -&gt; map)</code>,
then within that look up the <code class="language-plaintext highlighter-rouge">"bar"</code> table <code class="language-plaintext highlighter-rouge">(string -&gt; token)</code> (since this
is the penultimate segment), then intern <code class="language-plaintext highlighter-rouge">"z"</code> within that to get its
token. That wouldn’t have quadratic time complexity, but it seems quite a
bit more complicated than a single, flat <code class="language-plaintext highlighter-rouge">(token, string) -&gt; token</code> map.</p>

<h3 id="implementation-in-c">Implementation in C</h3>

<p>Because <a href="/blog/2023/02/11/">the standard library has little useful for us</a>, I am
building on <a href="/blog/2025/01/19/"><strong>previously-established definitions</strong></a>, so refer to
that article for basic definitions like <code class="language-plaintext highlighter-rouge">Str</code>. To start off, tokens will
be a size-typed integer so we never need to worry about overflowing the
token counter. We’d run out of memory first:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">ptrdiff_t</span> <span class="n">Token</span><span class="p">;</span>
</code></pre></div></div>

<p>We’re building a <code class="language-plaintext highlighter-rouge">(token, string) -&gt; token</code> map, so we’ll need a hash
function for such keys:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash</span><span class="p">(</span><span class="n">Token</span> <span class="n">t</span><span class="p">,</span> <span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">t</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="n">r</span> <span class="o">*=</span> <span class="mi">1111111111111111111u</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The map itself is a forever-useful <a href="/blog/2023/09/30/">hash trie</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">Map</span> <span class="n">Map</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">Map</span> <span class="p">{</span>
    <span class="n">Map</span>  <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">Token</span> <span class="n">namespace</span><span class="p">;</span>
    <span class="n">Str</span>   <span class="n">segment</span><span class="p">;</span>
    <span class="n">Token</span> <span class="n">token</span><span class="p">;</span>
<span class="p">};</span>

<span class="n">Token</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">Map</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">Token</span> <span class="n">namespace</span><span class="p">,</span> <span class="n">Str</span> <span class="n">segment</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">namespace</span><span class="p">,</span> <span class="n">segment</span><span class="p">);</span> <span class="o">*</span><span class="n">m</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">namespace</span><span class="o">==</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">namespace</span> <span class="o">&amp;&amp;</span> <span class="n">equals</span><span class="p">(</span><span class="n">segment</span><span class="p">,</span> <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">segment</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">token</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">Map</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">namespace</span> <span class="o">=</span> <span class="n">namespace</span><span class="p">;</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">segment</span> <span class="o">=</span> <span class="n">segment</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">token</span><span class="p">;</span>  <span class="c1">// caller will assign</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We’ll use this map to convert a string naming a field into a sequence of
tokens, so we’ll <a href="/blog/2025/06/26/">need a slice</a>. Each field also has an offset within
the record and a type. Both follow from the field’s original position, so
I’ll track that position with an <code class="language-plaintext highlighter-rouge">index</code> field (e.g. into the original
header). We’ll also keep the original name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Str</span>          <span class="n">name</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span>    <span class="n">index</span><span class="p">;</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Token</span><span class="p">)</span> <span class="n">tokens</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Field</span><span class="p">;</span>
</code></pre></div></div>

<p>To sort fields we’ll need a comparator:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ptrdiff_t</span> <span class="nf">field_compare</span><span class="p">(</span><span class="n">Field</span> <span class="n">a</span><span class="p">,</span> <span class="n">Field</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">Token</span> <span class="n">d</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">b</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">d</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="n">len</span> <span class="o">-</span> <span class="n">b</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="n">len</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Because field names are unique, each token sequence is unique, and so we
need not use <code class="language-plaintext highlighter-rouge">index</code> in the comparator.</p>

<p>Finally down to business: <a href="/blog/2025/03/02/">cut up the list</a> and build the token
sequences with the established <code class="language-plaintext highlighter-rouge">push</code> macro. The sort function isn’t
interesting, and could be as simple as libc <code class="language-plaintext highlighter-rouge">qsort</code> with the above
comparator (and adapter), so I’m only listing the prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">field_sort</span><span class="p">(</span><span class="n">Slice</span><span class="p">(</span><span class="n">Field</span><span class="p">),</span> <span class="n">Arena</span> <span class="n">scratch</span><span class="p">);</span>

<span class="n">Slice</span><span class="p">(</span><span class="n">Field</span><span class="p">)</span> <span class="n">parse_fields</span><span class="p">(</span><span class="n">Str</span> <span class="n">fieldlist</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Field</span><span class="p">)</span> <span class="n">fields</span>  <span class="o">=</span> <span class="p">{};</span>
    <span class="n">Map</span>         <span class="o">*</span><span class="n">strtab</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span>    <span class="n">ntokens</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="k">for</span> <span class="p">(</span><span class="n">Cut</span> <span class="n">c</span> <span class="o">=</span> <span class="p">{.</span><span class="n">tail</span><span class="o">=</span><span class="n">fieldlist</span><span class="p">,</span> <span class="p">.</span><span class="n">ok</span><span class="o">=</span><span class="nb">true</span><span class="p">};</span> <span class="n">c</span><span class="p">.</span><span class="n">ok</span><span class="p">;)</span> <span class="p">{</span>
        <span class="n">c</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">','</span><span class="p">);</span>
        <span class="n">Field</span> <span class="n">field</span> <span class="o">=</span> <span class="p">{};</span>
        <span class="n">field</span><span class="p">.</span><span class="n">name</span>  <span class="o">=</span> <span class="n">c</span><span class="p">.</span><span class="n">head</span><span class="p">;</span>
        <span class="n">field</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">fields</span><span class="p">.</span><span class="n">len</span><span class="p">;</span>

        <span class="n">Token</span> <span class="n">prev</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">Cut</span> <span class="n">f</span> <span class="o">=</span> <span class="p">{.</span><span class="n">tail</span><span class="o">=</span><span class="n">field</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="p">.</span><span class="n">ok</span><span class="o">=</span><span class="nb">true</span><span class="p">};</span> <span class="n">f</span><span class="p">.</span><span class="n">ok</span><span class="p">;)</span> <span class="p">{</span>
            <span class="n">f</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'.'</span><span class="p">);</span>
            <span class="n">Token</span> <span class="o">*</span><span class="n">token</span> <span class="o">=</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">strtab</span><span class="p">,</span> <span class="n">prev</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">head</span><span class="p">,</span> <span class="n">a</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!*</span><span class="n">token</span><span class="p">)</span> <span class="p">{</span>
                <span class="o">*</span><span class="n">token</span> <span class="o">=</span> <span class="o">++</span><span class="n">ntokens</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">field</span><span class="p">.</span><span class="n">tokens</span><span class="p">)</span> <span class="o">=</span> <span class="o">*</span><span class="n">token</span><span class="p">;</span>
            <span class="n">prev</span> <span class="o">=</span> <span class="o">*</span><span class="n">token</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fields</span><span class="p">)</span> <span class="o">=</span> <span class="n">field</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">field_sort</span><span class="p">(</span><span class="n">fields</span><span class="p">,</span> <span class="o">*</span><span class="n">a</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">fields</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
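
<p>For the <code class="language-plaintext highlighter-rouge">qsort</code> route mentioned above, here is a minimal sketch of what the
adapter might look like. The names <code class="language-plaintext highlighter-rouge">field_compare_adapter</code> and
<code class="language-plaintext highlighter-rouge">field_sort_qsort</code> are hypothetical, and the pared-down <code class="language-plaintext highlighter-rouge">Token</code>, <code class="language-plaintext highlighter-rouge">Tokens</code>, and
<code class="language-plaintext highlighter-rouge">Field</code> are stand-ins I’m assuming so the block compiles on its own without
the <code class="language-plaintext highlighter-rouge">Slice</code> macro. The one thing worth getting right is reducing the
<code class="language-plaintext highlighter-rouge">ptrdiff_t</code> result to its sign, since a plain cast to <code class="language-plaintext highlighter-rouge">int</code> could truncate.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Stand-ins for the article's types (assumptions, not the real definitions).
typedef ptrdiff_t Token;
typedef struct { Token *data; ptrdiff_t len; } Tokens;
typedef struct { ptrdiff_t index; Tokens tokens; } Field;

// Same logic as the comparator above, with min() written out.
static ptrdiff_t field_compare(Field a, Field b)
{
    ptrdiff_t len = a.tokens.len < b.tokens.len ? a.tokens.len : b.tokens.len;
    for (ptrdiff_t i = 0; i < len; i++) {
        Token d = a.tokens.data[i] - b.tokens.data[i];
        if (d) {
            return d;
        }
    }
    return a.tokens.len - b.tokens.len;
}

// qsort wants an int, so return only the sign of the difference.
static int field_compare_adapter(const void *pa, const void *pb)
{
    ptrdiff_t d = field_compare(*(const Field *)pa, *(const Field *)pb);
    return (d > 0) - (d < 0);
}

// Hypothetical qsort-backed sort over a plain array of fields.
static void field_sort_qsort(Field *fields, ptrdiff_t len)
{
    assert(len >= 0);
    if (len > 1) {
        qsort(fields, (size_t)len, sizeof(*fields), field_compare_adapter);
    }
}
```

<p>Since <code class="language-plaintext highlighter-rouge">qsort</code> makes no stability promise, this leans on the uniqueness of
token sequences to get a deterministic order.</p>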

<p>Usage here suggests <code class="language-plaintext highlighter-rouge">Cut::ok</code> should be inverted to <code class="language-plaintext highlighter-rouge">Cut::done</code> so that it
better zero-initializes. Something I’ll need to consider. Because it’s all
allocated from an arena, no need for destructors or anything like that, so
this is the complete implementation. Back to the example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Str</span> <span class="n">fieldlist</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span>
        <span class="s">"timestamp,"</span>
        <span class="s">"point.x,"</span>
        <span class="s">"point.y,"</span>
        <span class="s">"foo.bar.z,"</span>
        <span class="s">"point.z,"</span>
        <span class="s">"foo.bar.y,"</span>
        <span class="s">"foo.bar.x"</span>
    <span class="p">);</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Field</span><span class="p">)</span> <span class="n">fields</span> <span class="o">=</span> <span class="n">parse_fields</span><span class="p">(</span><span class="n">fieldlist</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">fields</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">Str</span> <span class="n">name</span> <span class="o">=</span> <span class="n">fields</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">name</span><span class="p">;</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">name</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">name</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
        <span class="n">putchar</span><span class="p">(</span><span class="sc">'\n'</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>This program will print the proper output field order. In a real program
we’d hold onto the string table, define an inverse lookup to translate
tokens back into strings, and use it when producing output. I do just
that in my exploratory program, <a href="https://github.com/skeeto/scratch/blob/master/rec2json/rec2json.c"><strong><code class="language-plaintext highlighter-rouge">rec2json.c</code></strong></a>, written a little
differently than presented above. It uses the sorted tokens to compile a
simple bytecode program that, when run against a record, produces its JSON
representation. It compiles the example to:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OPEN          # print '{'
KEY     1     # print token 1 as a key, i.e. "timestamp:"
READ    0     # print double at record offset 0
COMMA         # print ','
KEY     2     # print token 2 as a key, i.e. "point:"
OPEN
KEY     3
READ    8     # print double at record offset 8
COMMA
KEY     4
READ    16
COMMA
KEY     8
READ    32
CLOSE         # print '}'
COMMA
KEY     5
OPEN
KEY     6
OPEN
KEY     7
READ    24
COMMA
KEY     9
READ    40
COMMA
KEY     10
READ    48
CLOSE
CLOSE
CLOSE
</code></pre></div></div>

<p>Seeing it written out, I notice more room for improvement. An optimization
pass could coalesce instructions so that, for instance, <code class="language-plaintext highlighter-rouge">OPEN</code> then <code class="language-plaintext highlighter-rouge">KEY</code>
<a href="/blog/2024/05/25/">concatenate</a> to a single string at compile time so that it only needs
one instruction. This program could be 15 instructions instead of 31. In
my real case I didn’t need anything quite this sophisticated, but it was
fun to explore.</p>
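
<p>As a rough check on that number, here is a sketch that only counts what such
a pass would leave behind. It assumes one particular reading of “coalesce”:
every maximal run of instructions that print fixed text (everything except
<code class="language-plaintext highlighter-rouge">READ</code>) merges into a single instruction. The <code class="language-plaintext highlighter-rouge">Op</code> enum and
<code class="language-plaintext highlighter-rouge">coalesced_count</code> are hypothetical names, not taken from <code class="language-plaintext highlighter-rouge">rec2json.c</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical opcodes mirroring the listing above.
typedef enum { OP_OPEN, OP_CLOSE, OP_COMMA, OP_KEY, OP_READ } Op;

// Count the instructions remaining after merging each maximal run of
// fixed-text instructions (anything but READ) into one instruction.
static ptrdiff_t coalesced_count(const Op *ops, ptrdiff_t len)
{
    assert(len >= 0);
    ptrdiff_t count  = 0;
    int       in_run = 0;
    for (ptrdiff_t i = 0; i < len; i++) {
        if (ops[i] == OP_READ) {
            count++;        // READ depends on the record, so it stays
            in_run = 0;
        } else if (!in_run) {
            count++;        // first instruction of a new fixed-text run
            in_run = 1;
        }
    }
    return count;
}
```

<p>Under that assumption the example has seven <code class="language-plaintext highlighter-rouge">READ</code> instructions separated by
eight fixed-text runs, which is where 15 comes from.</p>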

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Parameterized types in C using the new tag compatibility rule</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/06/26/"/>
    <id>urn:uuid:abb3bf93-074f-4876-8d46-42997edebb34</id>
    <updated>2025-06-26T23:49:53Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>C23 has <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3037.pdf">a new rule for struct, union, and enum compatibility</a>
finally appearing in compilers starting with GCC 15, released this past
April, and Clang later this year. The same struct defined in different
translation units (TU) has always been compatible — essential to how they
work. Until this rule change, each such definition within a TU was a
distinct, incompatible type. The new rule says that, <em>ackshually</em>, they
are compatible! This unlocks some type parameterization using macros.</p>

<p>How can a TU have multiple definitions of a struct? Scope. Prior to C23
this wouldn’t compile because the compound literal type and the return
type were distinct types:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">Example</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">;</span> <span class="p">};</span>

<span class="k">struct</span> <span class="n">Example</span> <span class="nf">example</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">Example</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">;</span> <span class="p">};</span>
    <span class="k">return</span> <span class="p">(</span><span class="k">struct</span> <span class="n">Example</span><span class="p">){</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Otherwise the definition of <code class="language-plaintext highlighter-rouge">struct Example</code> within <code class="language-plaintext highlighter-rouge">example</code> was fine, if
strange. At first this may not seem like a big deal, but let’s <a href="/blog/2025/01/19/">revisit my
technique for dynamic arrays</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">T</span>        <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">;</span>
<span class="p">}</span> <span class="n">SliceT</span><span class="p">;</span>
</code></pre></div></div>

<p>Where I write out one of these for each <code class="language-plaintext highlighter-rouge">T</code> that I might want to put into
a slice. With the new rule we can change it slightly, taking note of the
introduction of a tag (the name after <code class="language-plaintext highlighter-rouge">struct</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define Slice(T)        \
    struct Slice##T {   \
        T        *data; \
        ptrdiff_t len;  \
        ptrdiff_t cap;  \
    }
</span></code></pre></div></div>

<p>This makes the “write it out ahead of time” thing simpler, but with the
new rule we can skip the “ahead of time” part and conjure slice types on
demand. Each declaration with the same <code class="language-plaintext highlighter-rouge">T</code> is compatible with the others
due to matching tags and fields. So, for example, with this macro we can
declare functions using slices parameterized for different element types.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Slice</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="n">range</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">float</span> <span class="nf">mean</span><span class="p">(</span><span class="n">Slice</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>

<span class="n">Slice</span><span class="p">(</span><span class="n">Str</span><span class="p">)</span> <span class="n">split</span><span class="p">(</span><span class="n">Str</span><span class="p">,</span> <span class="kt">char</span> <span class="n">delim</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>
<span class="n">Str</span> <span class="nf">join</span><span class="p">(</span><span class="n">Slice</span><span class="p">(</span><span class="n">Str</span><span class="p">),</span> <span class="kt">char</span> <span class="n">delim</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Or using it with <a href="/blog/2025/03/02/">our model parser</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Vec3</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">v</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>
    <span class="kt">int32_t</span> <span class="n">n</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>
<span class="p">}</span> <span class="n">Face</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Vec3</span><span class="p">)</span> <span class="n">verts</span><span class="p">;</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Vec3</span><span class="p">)</span> <span class="n">norms</span><span class="p">;</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Face</span><span class="p">)</span> <span class="n">faces</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Model</span><span class="p">;</span>

<span class="k">typedef</span> <span class="n">Slice</span><span class="p">(</span><span class="n">Vec3</span><span class="p">)</span> <span class="n">Polygon</span><span class="p">;</span>
</code></pre></div></div>

<p>I worried these macros might confuse my tools, particularly <a href="https://github.com/universal-ctags/ctags">Universal
Ctags</a> because <a href="https://github.com/skeeto/w64devkit">it’s important to me</a>. Everything handles
prototypes better than expected, but ctags doesn’t see fields with slice
types. Overall they’re like a very limited form of C++ templates. Though
only the types are parameterized, not the functions operating on those
types. Outside of unwarranted macro abuse, this new technique does nothing
regarding generic functions. On the other hand, my generic slice function
complements the new technique, especially with the help of C23’s new
<code class="language-plaintext highlighter-rouge">typeof</code> to mitigate <code class="language-plaintext highlighter-rouge">_Alignof</code>’s limitations:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span> <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">,</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span> <span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="kt">int</span> <span class="n">align</span><span class="p">);</span>

<span class="cp">#define push(a, s)                          \
  ((s)-&gt;len == (s)-&gt;cap                     \
    ? (s)-&gt;data = push_(                    \
        (a),                                \
        (s)-&gt;data,                          \
        &amp;(s)-&gt;cap,                          \
        sizeof(*(s)-&gt;data),                 \
        _Alignof(typeof(*(s)-&gt;data))        \
      ),                                    \
      (s)-&gt;data + (s)-&gt;len++                \
    : (s)-&gt;data + (s)-&gt;len++)
</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">push_</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="o">*</span><span class="n">pcap</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="kt">int</span> <span class="n">align</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">cap</span> <span class="o">=</span> <span class="o">*</span><span class="n">pcap</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">!=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">data</span> <span class="o">+</span> <span class="n">cap</span><span class="o">*</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">copy</span> <span class="o">=</span> <span class="n">alloc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">cap</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">align</span><span class="p">);</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">cap</span><span class="o">*</span><span class="n">size</span><span class="p">);</span>
        <span class="n">data</span> <span class="o">=</span> <span class="n">copy</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kt">ptrdiff_t</span> <span class="n">extend</span> <span class="o">=</span> <span class="n">cap</span> <span class="o">?</span> <span class="n">cap</span> <span class="o">:</span> <span class="mi">4</span><span class="p">;</span>
    <span class="n">alloc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">extend</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">align</span><span class="p">);</span>
    <span class="o">*</span><span class="n">pcap</span> <span class="o">=</span> <span class="n">cap</span> <span class="o">+</span> <span class="n">extend</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This exploits the fact that implementations adopting the new tag rule also
have <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3322.pdf">the upcoming C2y null pointer rule</a> (note: also requires a
cooperating libc). Putting it together, now I can write stuff like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Slice</span><span class="p">(</span><span class="kt">int64_t</span><span class="p">)</span> <span class="n">generate_primes</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">limit</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Slice</span><span class="p">(</span><span class="kt">int64_t</span><span class="p">)</span> <span class="n">primes</span> <span class="o">=</span> <span class="p">{};</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">limit</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">primes</span><span class="p">)</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">for</span> <span class="p">(</span><span class="kt">int64_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">limit</span><span class="p">;</span> <span class="n">n</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">bool</span> <span class="n">valid</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">valid</span> <span class="o">&amp;&amp;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">primes</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">valid</span> <span class="o">=</span> <span class="n">n</span> <span class="o">%</span> <span class="n">primes</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">valid</span><span class="p">)</span> <span class="p">{</span>
            <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">primes</span><span class="p">)</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">primes</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>But it doesn’t take long to run into limitations. It makes little sense to
define, say, a <code class="language-plaintext highlighter-rouge">Map(K, V)</code> without a generic function to manipulate it.
This also doesn’t work:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Str</span><span class="p">)</span>          <span class="n">names</span><span class="p">;</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Slice</span><span class="p">(</span><span class="kt">float</span><span class="p">))</span> <span class="n">edges</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Graph</span><span class="p">;</span>
</code></pre></div></div>

<p>The culprit is <code class="language-plaintext highlighter-rouge">Slice##T</code> in the macro, which is required to establish a unique tag for
each element type. Because of the token pasting, the parameter to the macro
must be an identifier, so you have to build up to it with a typedef (or define
another macro), which sort of defeats the purpose, since the point was entirely convenience.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="n">Slice</span><span class="p">(</span><span class="kt">float</span><span class="p">)</span> <span class="n">Edges</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Str</span><span class="p">)</span>   <span class="n">names</span><span class="p">;</span>
    <span class="n">Slice</span><span class="p">(</span><span class="n">Edges</span><span class="p">)</span> <span class="n">edges</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Graph</span><span class="p">;</span>
</code></pre></div></div>
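
<p>To see why the nested form fails, here is a minimal sketch. It assumes the
macro looks roughly like the definition below. The exact fields and the
<code class="language-plaintext highlighter-rouge">count_edges</code> helper are illustrative guesses, not the definitions
used earlier. Each slice type is spelled out only once, so the sketch
compiles even without the new tag rule.</p>

```c
#include <stddef.h>

// Hypothetical reconstruction of the macro: the struct tag is formed by
// pasting "Slice" onto the element type name.
#define Slice(T)        \
    struct Slice##T {   \
        T        *data; \
        ptrdiff_t len;  \
        ptrdiff_t cap;  \
    }

// Slice(float) expands to: struct Slicefloat { float *data; ... }
typedef Slice(float) Edges;

// Slice(Edges) expands to: struct SliceEdges { Edges *data; ... }
typedef Slice(Edges) EdgeLists;

// Slice(Slice(float)) would not compile. Operands of ## are not
// macro-expanded before pasting, so "Slice" is pasted onto the first
// token of the argument, yielding: struct SliceSlice(float) { ... }

// Hypothetical helper: total number of edges across all lists.
static ptrdiff_t count_edges(EdgeLists lists)
{
    ptrdiff_t total = 0;
    for (ptrdiff_t i = 0; i < lists.len; i++) {
        total += lists.data[i].len;
    }
    return total;
}
```

<p>The typedef gives the inner slice a single-token name, which is all the
paste operator can work with.</p>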

<p>The benefits are small enough that perhaps it’s not worth the costs, but
it’s been at least worth investigating. I’ve written a small demo of the
technique if you’d like to see it in action, or test the abilities of your
local C implementation: <a href="https://gist.github.com/skeeto/3fe27cd81ca5bdb4926b12e03bdfbc62"><code class="language-plaintext highlighter-rouge">demo.c</code></a></p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>WebAssembly: How to allocate your allocator</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/04/19/"/>
    <id>urn:uuid:dc2863e4-9601-4e42-bbd2-3cb4a5315d4d</id>
    <updated>2025-04-19T03:18:20Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>An early, small hurdle <a href="/blog/2025/04/04/">diving into WebAssembly</a> was allocating my
allocator. On a server or desktop with virtual memory, the allocator asks
the operating system to map fresh pages into its address space (<a href="https://pubs.opengroup.org/onlinepubs/7908799/xsh/brk.html">sbrk</a>,
anonymous mmap, <a href="https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc">VirtualAlloc</a>), which it then dynamically allocates to
different purposes. In an embedded context, dynamic allocation memory is
typically a fixed, static region chosen at link time. The Wasm execution
environment more resembles an embedded system, but both kinds of obtaining
raw memory are viable and useful in different situations.</p>

<p>For the purposes of this discussion, the actual allocator isn’t important.
It could be <a href="/blog/2023/09/27/">a simple arena allocator</a>, or a more general purpose
<a href="https://github.com/skeeto/scratch/blob/master/misc/buddy.c">buddy allocator</a>. It could even be garbage collected with <a href="https://www.hboehm.info/gc/">Boehm
GC</a>. Though WebAssembly’s linear memory is a poor fit for such a
<a href="https://en.wikipedia.org/wiki/Tracing_garbage_collection#Precise_vs._conservative_and_internal_pointers">conservative</a> garbage collector. In a compact address space starting
at zero, and which doesn’t include code, memory addresses will be small
numbers, and less distinguishable from common integer values. There’s also
the issue that the garbage collector cannot scan the Wasm stack, which is
hidden from Wasm programs by design. Only the ABI stack is visible. So a
garbage collector requires cooperation from the compiler — essentially as a
distinct calling convention — to spill all heap pointers on the ABI stack
before function calls. Wasm C and C++ toolchains do not yet support this
in a practical capacity.</p>

<h3 id="exporting-a-static-heap">Exporting a static heap</h3>

<p>Let’s start with the embedded case because it’s simpler, and reserve a
dynamic memory region at link time. WebAssembly has just reached 8 years
old, so it’s early, and as we keep discovering, Wasm tooling is still
immature. <code class="language-plaintext highlighter-rouge">wasm-ld</code> doesn’t understand linker scripts, and there’s no
stable, low-level assembly language on which to build, e.g. to reserve
space, define symbols, etc. WAT is too high level and inflexible for this
purpose, as we’ll soon see. So our only option is to brute force it in a
high-level language:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">heap</span><span class="p">[</span><span class="mi">16</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">];</span>  <span class="c1">// 16MiB</span>
</code></pre></div></div>

<p>Plugging it into an arena allocator:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="n">Arena</span> <span class="nf">getarena</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Arena</span> <span class="n">a</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">a</span><span class="p">.</span><span class="n">beg</span> <span class="o">=</span> <span class="n">heap</span><span class="p">;</span>
    <span class="n">a</span><span class="p">.</span><span class="n">end</span> <span class="o">=</span> <span class="n">heap</span> <span class="o">+</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">heap</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
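
<p>For illustration, allocating from such an arena might look like the
following sketch. The <code class="language-plaintext highlighter-rouge">alloc</code> function, its signature, and the choice
not to zero the returned memory are assumptions made for this example, not
something defined in this article.</p>

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

// Hypothetical bump allocator: returns size bytes aligned to align (a
// power of two), or a null pointer if the arena cannot satisfy the
// request. The memory is not zeroed here.
static void *alloc(Arena *a, ptrdiff_t size, ptrdiff_t align)
{
    // Padding needed to bring beg up to the next multiple of align.
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg - pad;
    if (size < 0 || size > avail) {
        return 0;
    }
    char *r = a->beg + pad;
    a->beg = r + size;
    return r;
}
```

<p>An array of 100 <code class="language-plaintext highlighter-rouge">int</code> would then come from
<code class="language-plaintext highlighter-rouge">alloc(&amp;a, 100*sizeof(int), _Alignof(int))</code>, which is exactly the
kind of non-<code class="language-plaintext highlighter-rouge">char</code> use discussed next.</p>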

<p>Unfortunately <code class="language-plaintext highlighter-rouge">heap</code> isn’t generic memory, but a high-level variable with
a specific, fixed type. That’s why it would have been nice to reserve the
memory outside the high-level language. In practice this works fine so
long as everything is aligned, but strictly speaking, allocating any
variable except <code class="language-plaintext highlighter-rouge">char</code> from this arena involves incompatible loads and
stores on a <code class="language-plaintext highlighter-rouge">char</code> array. Clang doesn’t document any <a href="/blog/2024/12/20/">inline assembly
interface</a> for Wasm, but neither does Clang forbid it. That leaves
just enough room to launder the pointer if you’re worried about this
technicality:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Arena</span> <span class="nf">getarena</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Arena</span> <span class="n">a</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">a</span><span class="p">.</span><span class="n">beg</span> <span class="o">=</span> <span class="n">heap</span><span class="p">;</span>
    <span class="n">asm</span> <span class="p">(</span><span class="s">""</span> <span class="o">:</span> <span class="s">"+r"</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">beg</span><span class="p">));</span>  <span class="c1">// launder</span>
    <span class="n">a</span><span class="p">.</span><span class="n">end</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">beg</span> <span class="o">+</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">heap</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">+r</code> means <code class="language-plaintext highlighter-rouge">a.beg</code> is both input and output. The address of the heap
goes into the black box, and as far as the compiler is concerned, <em>some
mystery address comes out</em> which, critically, has no effective type. The
assembly block is empty (<code class="language-plaintext highlighter-rouge">""</code>), so it’s just a no-op, and we know (<em>wink
wink</em>) it’s really the same address. Because the heap was “used” by the
black box, Clang won’t optimize the heap out of existence beneath us. Also
note that <code class="language-plaintext highlighter-rouge">a.end</code> was derived from the laundered pointer.</p>

<p><strong>Update</strong>: The next C standard <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3238.pdf">will improve this situation</a> and
so pointer laundering will no longer be necessary. A straight <code class="language-plaintext highlighter-rouge">char</code>
array could be used as an arena.</p>

<p>This static variable technique works well only in an <em>exported</em> memory
configuration, which is what <code class="language-plaintext highlighter-rouge">wasm-ld</code> uses by default. When a module
exports its memory, it indicates how much linear memory it requires on
start, and the Wasm runtime allocates and zero-initializes it at module
initialization time. C and C++ toolchains depend on that runtime zeroing
to initialize static and global variables, which are defined to be so
initialized. Compilers generate code assuming these variables are zero
initialized. This same paradigm is used for <a href="https://en.wikipedia.org/wiki/.bss">.bss sections</a> in hosted
environments.</p>

<p>In an <em>imported</em> memory configuration, linear memory is uninitialized. The
memory may be re-used from, say, a destroyed module without zeroing, and
may contain arbitrary data. In that case, C and C++ toolchains must zero
the memory explicitly. It could potentially be done with a <code class="language-plaintext highlighter-rouge">memory.fill</code>
instruction in the <em>start section</em>, but LLVM does not support start
sections. Instead it uses an <a href="https://github.com/WebAssembly/bulk-memory-operations/blob/master/proposals/bulk-memory-operations/Overview.md"><em>active data segment</em></a> — a chunk of
data copied into linear memory by the Wasm runtime during initialization,
before running the start function.</p>

<p>That is, when importing memory, LLVM <em>actually stores all those
zeros in the Wasm module</em> so that the runtime can copy it into linear
memory. Wasm has no built-in compression, so <strong>your Wasm module will be at
least as large as your heap</strong>! Exporting or importing memory is determined
at <em>link-time</em>, so at <em>compile-time</em> the compiler must assume the worst
case. If you compile the example above, you get a 16MiB “object” file (in
Wasm format):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang --target=wasm32 -c -O example.c
$ du -h example.o
16.0M   example.o
</code></pre></div></div>

<p>The WAT version of this file is 48MiB — clearly unsuitable as a low-level
assembler. If linking with exported memory, <code class="language-plaintext highlighter-rouge">wasm-ld</code> discards all-zero
active data segments. If using an imported memory configuration, it’s
copied into the final image, producing a huge Wasm image, though highly
compressible. As a rule, avoid importing memory when using an LLVM
toolchain. Regardless, large heaps created this way will have a
significant compile-time cost.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time echo 'char heap[256&lt;&lt;20];' | clang --target=wasm32 -c -xc -
real    0m0.334s
user    0m0.013s
sys     0m0.262s
</code></pre></div></div>

<p>(If only Clang had some sort of “noinit” variable attribute in order to
allow <code class="language-plaintext highlighter-rouge">heap</code> to be uninitialized…)</p>

<h3 id="growing-a-dynamic-heap">Growing a dynamic heap</h3>

<p>Wasm programs can grow linear memory using an <a href="https://pubs.opengroup.org/onlinepubs/7908799/xsh/brk.html">sbrk</a>-like <a href="https://webassembly.github.io/spec/core/bikeshed/#syntax-instr-memory"><code class="language-plaintext highlighter-rouge">memory.grow</code>
instruction</a>. It operates in quantities of pages (64kB), and returns
the old memory size. Because memory starts at zero, the old memory size is
also the base address of the new allocation. Clang provides access to this
instruction via an undocumented built-in:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="nf">__builtin_wasm_memory_grow</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>The first parameter selects a memory because <a href="https://github.com/WebAssembly/multi-memory/blob/main/proposals/multi-memory/Overview.md">someday there might be more
than one</a>. From this built-in we can define <code class="language-plaintext highlighter-rouge">sbrk</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">sbrk</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">npages</span> <span class="o">=</span> <span class="p">(</span><span class="n">size</span> <span class="o">+</span> <span class="mh">0xffffu</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>  <span class="c1">// round up</span>
    <span class="kt">size_t</span> <span class="n">old</span>    <span class="o">=</span> <span class="n">__builtin_wasm_memory_grow</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">npages</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">old</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1ul</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="n">old</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To which Clang compiles (note the <code class="language-plaintext highlighter-rouge">memory.grow</code>):</p>

<div class="language-racket highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nf">func</span> <span class="nv">$sbrk</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">i32</span><span class="p">)</span> <span class="p">(</span><span class="nf">result</span> <span class="nv">i32</span><span class="p">)</span>
  <span class="p">(</span><span class="nf">select</span>
    <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">0</span><span class="p">)</span>
    <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">shl</span>
      <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">tee</span> <span class="mi">0</span>
        <span class="p">(</span><span class="nf">memory</span><span class="o">.</span><span class="nv">grow</span>
          <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">shr_u</span>
            <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">add</span>
              <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">)</span>
              <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">65535</span><span class="p">))</span>
            <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">16</span><span class="p">))))</span>
      <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">16</span><span class="p">))</span>
    <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">eq</span>
      <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">)</span>
      <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">-1</span><span class="p">))))</span>
</code></pre></div></div>
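
<p>The unit conversions are easy to get wrong, so here is a host-side model
that can be compiled and tested natively. Everything in it
(<code class="language-plaintext highlighter-rouge">Memory</code>, <code class="language-plaintext highlighter-rouge">pages_for</code>, <code class="language-plaintext highlighter-rouge">memory_grow</code>, <code class="language-plaintext highlighter-rouge">model_sbrk</code>) is a
hypothetical stand-in for this example. It only mirrors the behavior
described above: sizes go in as pages, the old size in pages comes out,
and failure is reported as -1.</p>

```c
#include <stddef.h>

// Hypothetical host-side model of a linear memory, counted in 64KiB pages.
typedef struct {
    size_t pages;      // current size
    size_t max_pages;  // limit beyond which growth fails
} Memory;

// Number of whole pages needed to hold size bytes (rounds up).
static size_t pages_for(size_t size)
{
    return (size + 0xffffu) >> 16;
}

// Models memory.grow: returns the old size in pages, or -1 on failure.
static size_t memory_grow(Memory *m, size_t npages)
{
    if (npages > m->max_pages - m->pages) {
        return (size_t)-1;
    }
    size_t old = m->pages;
    m->pages += npages;
    return old;
}

// Same logic as sbrk, but returning a byte offset instead of a pointer.
static size_t model_sbrk(Memory *m, size_t size)
{
    size_t old = memory_grow(m, pages_for(size));
    if (old == (size_t)-1) {
        return 0;
    }
    return old << 16;
}
```

<p>With two pages already present, a 1-byte request grows the model by one
whole page and returns offset 131072, which is the old size of two pages
shifted left by 16.</p>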

<p>Applying that to create an arena like before:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Arena</span> <span class="nf">newarena</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Arena</span> <span class="n">a</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">a</span><span class="p">.</span><span class="n">beg</span> <span class="o">=</span> <span class="n">sbrk</span><span class="p">(</span><span class="n">cap</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">beg</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">a</span><span class="p">.</span><span class="n">end</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">beg</span> <span class="o">+</span> <span class="n">cap</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now we can choose the size of the arena, and we can use this to create
multiple arenas (e.g. permanent, scratch, etc.). We could even continue
growing the last-created arena in-place when it’s full.</p>
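
<p>Growing in place might be sketched as follows. This is a host-side
model, not code from this article: <code class="language-plaintext highlighter-rouge">fake_memory</code>, <code class="language-plaintext highlighter-rouge">fake_sbrk</code>, and
<code class="language-plaintext highlighter-rouge">growarena</code> are hypothetical names, and the fixed buffer stands in for
linear memory so that the logic can be tested natively. On Wasm the
current end of memory would presumably come from the memory size rather
than a counter.</p>

```c
#include <stddef.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

enum { PAGE = 1 << 16, NPAGES = 8 };

// Hypothetical host-side stand-in for linear memory and sbrk: hands out
// 64KiB pages from a fixed buffer instead of using memory.grow.
static char      fake_memory[NPAGES * PAGE];
static ptrdiff_t fake_pages = 0;

static void *fake_sbrk(ptrdiff_t size)
{
    if (size < 0) {
        return 0;
    }
    ptrdiff_t npages = (size + PAGE - 1) / PAGE;  // round up
    if (npages > NPAGES - fake_pages) {
        return 0;
    }
    char *old = fake_memory + fake_pages*PAGE;
    fake_pages += npages;
    return old;
}

// Extend the arena by size bytes, but only if it was the last thing
// allocated, i.e. less than one page separates its end from the
// current end of memory.
static int growarena(Arena *a, ptrdiff_t size)
{
    char *top = fake_memory + fake_pages*PAGE;
    if (top - a->end >= PAGE) {
        return 0;  // something else was allocated after this arena
    }
    char *p = fake_sbrk(size);
    if (!p) {
        return 0;  // out of memory
    }
    a->end = p + size;  // also absorbs any slack in the old last page
    return 1;
}
```

<p>Checking adjacency before growing avoids stranding pages when the arena
turns out not to be the last one.</p>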

<p>If there was no <code class="language-plaintext highlighter-rouge">memory.grow</code> instruction, it could be implemented as a
request through an imported function. The embedder using the Wasm runtime
<a href="https://developer.mozilla.org/en-US/docs/WebAssembly/Reference/JavaScript_interface/Memory/grow">can grow the memory on the module’s behalf</a> in the same manner. But
as that documentation indicates, either way growing the memory comes with
a downside in the most common Wasm runtimes, browsers: It “detaches” the
memory from references, which complicates its use for the embedder. If a
Wasm module may grow its memory at any time, the embedder must reacquire
the memory handle after every call. It’s not difficult, but it’s easy to
forget, and mistakes are likely to go unnoticed until later.</p>

<h3 id="importing-a-dynamic-heap">Importing a dynamic heap</h3>

<p>There’s a middle ground where a Wasm module imports a dynamic-sized heap.
That is, linear memory beyond the module’s base initialization. This might
be the case, for instance, in a programming competition, where contestants
submit Wasm modules which must complete a task using the supplied memory.
In that case we don’t reserve a static heap, so we’re not facing the
storing-zeros issue. However, how do we “find” the memory? Linear memory
layout will look something like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 &lt;-- stack | data | heap --&gt; ?
|-----------------------------|
</code></pre></div></div>

<p>This diagram reflects the more sensible <code class="language-plaintext highlighter-rouge">wasm-ld --stack-first</code> layout,
where the ABI stack overflows off the bottom end of memory. The heap is
just excess memory beyond the data. To find the upper bound, Wasm has a
<a href="https://webassembly.github.io/spec/core/bikeshed/#syntax-instr-memory"><code class="language-plaintext highlighter-rouge">memory.size</code></a> instruction to query linear memory size, which again
Clang provides as an undocumented built-in:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="nf">__builtin_wasm_memory_size</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>
</code></pre></div></div>

<p>Like before, this returns the result in number of 64k pages. That’s the
high end. How do we find the low end? Similar to <code class="language-plaintext highlighter-rouge">__stack_pointer</code>, the
linker creates a <code class="language-plaintext highlighter-rouge">__heap_base</code> constant, which is the address delineating
data and heap in the diagram above. To use it, we need to declare it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">char</span> <span class="n">__heap_base</span><span class="p">[];</span>
</code></pre></div></div>

<p>Notice how it’s an array, not a pointer. It doesn’t <em>hold</em> an address, it
<em>is</em> an address. In an ELF context this would be called an <em>absolute symbol</em>.
That’s everything we need to find the bounds of the heap:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Arena</span> <span class="nf">getarena</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Arena</span> <span class="n">a</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">a</span><span class="p">.</span><span class="n">beg</span> <span class="o">=</span> <span class="n">__heap_base</span><span class="p">;</span>
    <span class="n">a</span><span class="p">.</span><span class="n">end</span> <span class="o">=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)(</span><span class="n">__builtin_wasm_memory_size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then we continue forward using whatever memory the embedder deigned to
provide. Hopefully it’s enough!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Lessons learned from my first dive into WebAssembly</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/04/04/"/>
    <id>urn:uuid:9881d125-2f2c-4fee-a959-222c9449399b</id>
    <updated>2025-04-04T04:01:20Z</updated>
    <category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p>It began as a <a href="https://www.coolmathgames.com/blog/how-to-play-lipuzz-water-sort">water sort puzzle</a> solver, constructed similarly to
<a href="/blog/2020/10/19/">my British Square solver</a>. It was nearly playable, so I added a user
interface <a href="/blog/2023/01/08/">with SDL2</a>. My wife enjoyed it on her desktop, but wished
to play on her phone. So then I needed to either rewrite it in JavaScript
and hope the solver was still fast enough for real-time use, or figure out
WebAssembly (Wasm). I succeeded, and now <a href="/water-sort/">my game runs in browsers</a>
(<a href="https://github.com/skeeto/scratch/tree/master/water-sort">source</a>). Like <a href="/blog/2025/03/06/">before</a>, next I ported <a href="/blog/2023/01/18/">my pkg-config clone</a>
to the Wasm System Interface (<a href="https://wasi.dev/">WASI</a>), whipped up a proof-of-concept UI,
and <a href="https://skeeto.github.io/u-config/">it too runs in browsers</a>. Neither uses a language runtime,
resulting in little 8kB and 28kB Wasm binaries respectively. In this
article I share my experiences and techniques.</p>

<p>Wasm is a <a href="https://webassembly.github.io/spec/">specification</a> defining an abstract stack machine with a
Harvard architecture, and related formats. There are just four types, i32,
i64, f32, and f64. It also has “linear” octet-addressable memory starting
at zero, with no alignment restrictions on loads and stores. Address zero
is a valid, writable address, which resurfaces some old-school, high-level
language challenges regarding null pointers. There are 32-bit and
64-bit flavors, though the latter remains experimental. That suits me: I
appreciate smaller pointers on 64-bit hosts, and I wish I could opt into
it more often (e.g. x32).</p>

<p>As browser tech goes, they chose an apt name: WebAssembly is to the web as
JavaScript is to Java.</p>

<p>There are distinct components at play, and much of the online discussion
doesn’t do a great job drawing lines between them:</p>

<ul>
  <li>
    <p>Wasm module: A compiled and linked image — like ELF or PE — containing
sections for code, types, globals, import table, export table, and so
on. The export table lists the module’s entry points. It has an optional
<em>start section</em> indicating which function initializes a loaded image.
(In practice almost nobody actually uses the start section.) A Wasm
module can only affect the outside world through imported functions.
Wasm itself defines no external interfaces for Wasm programs, not even
printing or logging.</p>
  </li>
  <li>
    <p>Wasm runtime: Loads Wasm modules, linking import table entries into the
module. Because Wasm modules include types, the runtime can type check
this linkage at load time. With imports resolved, it executes the start
function, if any, then executes zero or more of its entry points, which
hopefully invoke imported functions in such a way as to produce useful
results, or perhaps simply return useful outputs.</p>
  </li>
  <li>
    <p>Wasm compiler: Converts a high-level language to low-level Wasm. In
order to do so, it requires some kind of Application Binary Interface
(ABI) to map the high-level language concepts onto the machine. This
typically introduces additional execution elements, and it’s important
that we distinguish them from the abstract machine’s execution elements.
Clang is the only compiler we’ll be discussing in this article, though
there are many. During compilation the <em>function indices</em> are yet
unknown and so references will need to be patched in by a linker.</p>
  </li>
  <li>
    <p>Wasm linker: Settles the shape of the Wasm module and links up the
functions emitted by the compiler. LLVM comes with <code class="language-plaintext highlighter-rouge">wasm-ld</code>, and it
goes hand-in-hand with Clang as a compiler.</p>
  </li>
  <li>
    <p>Language runtime: Unless you’re hand-writing raw Wasm, your high-level
language probably has a standard library with operating system
interfaces. C standard library, POSIX interfaces, etc. This runtime
likely maps onto some standardized set of imports, most likely the
aforementioned WASI, which defines a set of POSIX-like functions that
Wasm modules may import. Because I <a href="/blog/2023/02/11/">think we could do better</a>,
<a href="/blog/2023/02/15/">as usual</a> <a href="/blog/2023/03/23/">around here</a>, in this article we’re going to
eschew the language runtime and code directly against raw WASI. You
still have <a href="/blog/2025/01/19/">easy access to hash tables and dynamic arrays</a>.</p>
  </li>
</ul>

<p>A combination of compiler-linker-runtime is conventionally called a
<em>toolchain</em>. However, because almost any Clang installation can target
Wasm out-of-the-box, and we’re skipping the language runtime, you can
compile any of the programs discussed in this article, including my game, with
nothing more than Clang (invoking <code class="language-plaintext highlighter-rouge">wasm-ld</code> implicitly). If you have a
Wasm runtime, which includes your browser, you can run them, too! Though
this article will mostly focus on WASI, and you’ll need a WASI-capable
runtime to run those examples, which doesn’t include browsers (short of
implementing the API with JavaScript).</p>

<p>I wasn’t particularly happy with the Wasm runtimes I tried, so I cannot
enthusiastically recommend one. I’d love if I could point to one and say,
“Use the same Clang to compile the runtime that you’re using to compile
Wasm!” Alas, I had issues compiling, the runtime was buggy, or WASI was
incomplete. However, <a href="https://wazero.io/">wazero</a> (Go) was the easiest for me to use and it
worked well enough, so I will use it in examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ go install github.com/tetratelabs/wazero/cmd/wazero@latest
</code></pre></div></div>

<p>The Wasm Binary Toolkit (<a href="https://github.com/WebAssembly/wabt">WABT</a>) is good to have on hand when working
with Wasm, particularly <code class="language-plaintext highlighter-rouge">wasm2wat</code> to inspect Wasm modules, sort of like
<code class="language-plaintext highlighter-rouge">objdump</code> or <code class="language-plaintext highlighter-rouge">readelf</code>. It converts Wasm to the WebAssembly Text Format
(WAT).</p>

<p>Learning Wasm I had quite some difficulty finding information. Outside of
the Wasm specification, which, despite its length, is merely a narrow
slice of the ecosystem, important technical details are scattered all over
the place. Some is only available as source code, some is buried in comments
on GitHub issues, and some is lost behind dead links as repositories have moved.
Large parts of LLVM are undocumented beyond a mention of their existence. WASI
has no documentation in a web-friendly format — so I have nothing to link
from here when I mention its system calls — just some IDL sources in a Git
repository. An old <a href="https://github.com/WebAssembly/wasi-libc/blob/e9524a09/libc-bottom-half/headers/public/wasi/api.h"><code class="language-plaintext highlighter-rouge">wasi.h</code></a> was the most readable, complete
source of truth I could find.</p>

<p>Fortunately Wasm is old enough that <a href="/blog/2024/11/10/">LLMs</a> are well-versed in it, and
simply asking questions, or for usage examples, was more effective than
searching online. If you’re stumped on how to achieve something in the
Wasm ecosystem, try asking a state-of-the-art LLM for help.</p>

<h3 id="example-programs">Example programs</h3>

<p>Let’s go over concrete examples to lay some foundations. Consider this
simple C function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">norm</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To compile to Wasm (32-bit) with Clang, we use the <code class="language-plaintext highlighter-rouge">--target=wasm32</code> option:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang -c --target=wasm32 -O example.c
</code></pre></div></div>

<p>The object file <code class="language-plaintext highlighter-rouge">example.o</code> is in Wasm format, so WABT can examine it.
Here’s the output of <code class="language-plaintext highlighter-rouge">wasm2wat -f</code>, where <code class="language-plaintext highlighter-rouge">-f</code> produces output in the
“folded” format, which is how I prefer to read it.</p>

<div class="language-racket highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">module</span>
  <span class="p">(</span><span class="nf">type</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="p">(</span><span class="nf">func</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">f32</span> <span class="nv">f32</span><span class="p">)</span> <span class="p">(</span><span class="nf">result</span> <span class="nv">f32</span><span class="p">)))</span>
  <span class="p">(</span><span class="nf">import</span> <span class="s">"env"</span> <span class="s">"__linear_memory"</span> <span class="p">(</span><span class="nf">memory</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="mi">0</span><span class="p">))</span>
  <span class="p">(</span><span class="nf">func</span> <span class="nv">$norm</span> <span class="p">(</span><span class="nf">type</span> <span class="mi">0</span><span class="p">)</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">f32</span> <span class="nv">f32</span><span class="p">)</span> <span class="p">(</span><span class="nf">result</span> <span class="nv">f32</span><span class="p">)</span>
    <span class="p">(</span><span class="nf">f32</span><span class="o">.</span><span class="nv">add</span>
      <span class="p">(</span><span class="nf">f32</span><span class="o">.</span><span class="nv">mul</span>
        <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">)</span>
        <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">))</span>
      <span class="p">(</span><span class="nf">f32</span><span class="o">.</span><span class="nv">mul</span>
        <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">1</span><span class="p">)))))</span>
</code></pre></div></div>

<p>We can see <a href="https://github.com/WebAssembly/tool-conventions/blob/main/BasicCABI.md">the ABI</a> taking shape: Clang has predictably mapped
<code class="language-plaintext highlighter-rouge">float</code> into <code class="language-plaintext highlighter-rouge">f32</code>. It similarly maps <code class="language-plaintext highlighter-rouge">char</code>, <code class="language-plaintext highlighter-rouge">short</code>, <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">long</code>
onto <code class="language-plaintext highlighter-rouge">i32</code>. In 64-bit Wasm, the Clang ABI is LP64 and maps <code class="language-plaintext highlighter-rouge">long</code> onto
<code class="language-plaintext highlighter-rouge">i64</code>. There’s also a <code class="language-plaintext highlighter-rouge">$norm</code> function which takes two <code class="language-plaintext highlighter-rouge">f32</code> parameters
and returns an <code class="language-plaintext highlighter-rouge">f32</code>.</p>
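
<p>A few more signatures make the mapping concrete. This is only a sketch:
the function names are made up for illustration, and the Wasm signatures in
the comments are what the linked ABI document implies for <code class="language-plaintext highlighter-rouge">wasm32</code>, not
output copied from a particular Clang build.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical functions. Each comment gives the expected wasm32 signature.
int widen(char c, short s)            // (param i32 i32) (result i32)
{
    return c + s;
}

long long mixed(long a, long long b)  // (param i32 i64) (result i64)
{
    return a + b;
}

double halve(float f)                 // (param f32) (result f64)
{
    return f / 2;
}
</code></pre></div></div>

<p>Compiling these with the command above and running <code class="language-plaintext highlighter-rouge">wasm2wat -f</code> on the
object file is a quick way to check the mapping against your own Clang.</p>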

<p>Getting a little more complex:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">import_name</span><span class="p">(</span><span class="s">"f"</span><span class="p">)))</span>
<span class="kt">void</span> <span class="nf">f</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">export_name</span><span class="p">(</span><span class="s">"example"</span><span class="p">)))</span>
<span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">f</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">import_name</code> function attribute indicates the module will not define
it, even in another translation unit, and that it intends to import it.
That is, <code class="language-plaintext highlighter-rouge">wasm-ld</code> will place it in the import table. The <code class="language-plaintext highlighter-rouge">export_name</code>
function attribute indicates it’s an entry point, and so <code class="language-plaintext highlighter-rouge">wasm-ld</code> will
list it in the export table. Linking it will make things a little clearer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang --target=wasm32 -nostdlib -Wl,--no-entry -O example.c
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-nostdlib</code> is because we won’t be using a language runtime, and
<code class="language-plaintext highlighter-rouge">--no-entry</code> to tell the linker not to implicitly export a function
(default: <code class="language-plaintext highlighter-rouge">_start</code>) as an entry point. You might think this is connected
with the Wasm <em>start function</em>, but <code class="language-plaintext highlighter-rouge">wasm-ld</code> does not support the <em>start
section</em> at all! We’ll have use for an entry point later. The folded WAT:</p>

<div class="language-racket highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">module</span> <span class="nv">$a</span><span class="o">.</span><span class="nv">out</span>
  <span class="p">(</span><span class="nf">type</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="p">(</span><span class="nf">func</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">i32</span><span class="p">)))</span>
  <span class="p">(</span><span class="nf">import</span> <span class="s">"env"</span> <span class="s">"f"</span> <span class="p">(</span><span class="nf">func</span> <span class="nv">$f</span> <span class="p">(</span><span class="nf">type</span> <span class="mi">0</span><span class="p">)))</span>
  <span class="p">(</span><span class="nf">func</span> <span class="nv">$example</span> <span class="p">(</span><span class="nf">type</span> <span class="mi">0</span><span class="p">)</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">i32</span><span class="p">)</span>
    <span class="p">(</span><span class="nf">local</span> <span class="nv">i32</span><span class="p">)</span>
    <span class="p">(</span><span class="nf">global</span><span class="o">.</span><span class="nv">set</span> <span class="nv">$__stack_pointer</span>
      <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">tee</span> <span class="mi">1</span>
        <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">sub</span>
          <span class="p">(</span><span class="nf">global</span><span class="o">.</span><span class="nv">get</span> <span class="nv">$__stack_pointer</span><span class="p">)</span>
          <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">16</span><span class="p">))))</span>
    <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">store</span> <span class="nv">offset=12</span>
      <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">1</span><span class="p">)</span>
      <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">))</span>
    <span class="p">(</span><span class="nf">call</span> <span class="nv">$f</span>
      <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">add</span>
        <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">12</span><span class="p">)))</span>
    <span class="p">(</span><span class="nf">global</span><span class="o">.</span><span class="nv">set</span> <span class="nv">$__stack_pointer</span>
      <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">add</span>
        <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">16</span><span class="p">))))</span>
  <span class="p">(</span><span class="nf">table</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="mi">1</span> <span class="mi">1</span> <span class="nv">funcref</span><span class="p">)</span>
  <span class="p">(</span><span class="nf">memory</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="mi">2</span><span class="p">)</span>
  <span class="p">(</span><span class="nf">global</span> <span class="nv">$__stack_pointer</span> <span class="p">(</span><span class="nf">mut</span> <span class="nv">i32</span><span class="p">)</span> <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">66560</span><span class="p">))</span>
  <span class="p">(</span><span class="nf">export</span> <span class="s">"memory"</span> <span class="p">(</span><span class="nf">memory</span> <span class="mi">0</span><span class="p">))</span>
  <span class="p">(</span><span class="nf">export</span> <span class="s">"example"</span> <span class="p">(</span><span class="nf">func</span> <span class="nv">$example</span><span class="p">)))</span>
</code></pre></div></div>

<p>There’s a lot to unfold:</p>

<ul>
  <li>
    <p>Pointers were mapped onto <code class="language-plaintext highlighter-rouge">i32</code>. Pointers are a high-level concept, and
linear memory is addressed by an integral offset. This is typical of
assembly after all.</p>
  </li>
  <li>
    <p>There’s now a <code class="language-plaintext highlighter-rouge">__stack_pointer</code>, which is part of the Clang ABI, not
Wasm. The Wasm abstract machine is a stack machine, but that stack
doesn’t exist in linear memory. So you cannot take the address of values
on the Wasm stack. There are lots of things C needs from a stack that
Wasm doesn’t provide. So, <em>in addition to the Wasm stack</em>, Clang
maintains another downward-growing stack in linear memory for these
purposes, and the <code class="language-plaintext highlighter-rouge">__stack_pointer</code> global is the stack register of its
ABI. We can see it’s allocated something like 64kB for the stack. (It’s
a little more because program data is placed below the stack.)</p>
  </li>
  <li>
    <p>It should be mostly readable without knowing Wasm: The function
subtracts a 16-byte stack frame, stores a copy of the argument in it,
then uses its memory offset for the first parameter to the import <code class="language-plaintext highlighter-rouge">f</code>.
Why 16 bytes when it only needs 4? Because the stack is kept 16-byte
aligned. Before returning, the function restores the stack pointer.</p>
  </li>
</ul>
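
<p>If the WAT is hard to follow, the same steps can be modeled in plain C.
This is a rough sketch, not anything Clang emits: the <code class="language-plaintext highlighter-rouge">memory</code> array,
the <code class="language-plaintext highlighter-rouge">stack_pointer</code> variable, and the stand-in for the import <code class="language-plaintext highlighter-rouge">f</code> are
assumptions for illustration. Only the constants come from the listing
above.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

// Hypothetical model of the module: two 64kB pages of linear memory and
// the __stack_pointer global at its initial value.
static uint8_t  memory[2 * 65536];
static uint32_t stack_pointer = 66560;
static int32_t  seen;  // what the stand-in import observed

// Stand-in for the import f: it receives an i32 offset, not a C pointer.
static void f(uint32_t addr)
{
    memcpy(&amp;seen, memory + addr, sizeof(seen));
}

// Mirrors the WAT for example(), one statement per instruction group.
static void example(int32_t x)
{
    uint32_t frame = stack_pointer - 16;          // i32.sub, local.tee 1
    stack_pointer = frame;                        // global.set
    memcpy(memory + frame + 12, &amp;x, sizeof(x));   // i32.store offset=12
    f(frame + 12);                                // call $f
    stack_pointer = frame + 16;                   // global.set (restore)
}
</code></pre></div></div>

<p>After <code class="language-plaintext highlighter-rouge">example(1234)</code> returns in this model, <code class="language-plaintext highlighter-rouge">seen</code> holds 1234 and
<code class="language-plaintext highlighter-rouge">stack_pointer</code> is back at 66560.</p>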

<p>As mentioned earlier, address zero is valid as far as the Wasm runtime is
concerned, though dereferences are still undefined in C. This makes it
more difficult to catch bugs. Given a null pointer this function would
most likely read a zero at address zero and the program keeps running:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">get</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In WAT:</p>

<div class="language-racket highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nf">func</span> <span class="nv">$get</span> <span class="p">(</span><span class="nf">type</span> <span class="mi">0</span><span class="p">)</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">i32</span><span class="p">)</span> <span class="p">(</span><span class="nf">result</span> <span class="nv">i32</span><span class="p">)</span>
  <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">load</span>
    <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">)))</span>
</code></pre></div></div>

<p>Since the “hardware” won’t fault for us, ask Clang to do it instead:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang ... -fsanitize=undefined -fsanitize-trap ...
</code></pre></div></div>

<p>Now in WAT:</p>

<div class="language-racket highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">module</span>
  <span class="p">(</span><span class="nf">type</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="p">(</span><span class="nf">func</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">i32</span><span class="p">)</span> <span class="p">(</span><span class="nf">result</span> <span class="nv">i32</span><span class="p">)))</span>
  <span class="p">(</span><span class="nf">import</span> <span class="s">"env"</span> <span class="s">"__linear_memory"</span> <span class="p">(</span><span class="nf">memory</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="mi">0</span><span class="p">))</span>
  <span class="p">(</span><span class="nf">func</span> <span class="nv">$get</span> <span class="p">(</span><span class="nf">type</span> <span class="mi">0</span><span class="p">)</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">i32</span><span class="p">)</span> <span class="p">(</span><span class="nf">result</span> <span class="nv">i32</span><span class="p">)</span>
    <span class="p">(</span><span class="nf">block</span>  <span class="c1">;; label = @1</span>
      <span class="p">(</span><span class="nf">block</span>  <span class="c1">;; label = @2</span>
        <span class="p">(</span><span class="nf">br_if</span> <span class="mi">0</span> <span class="p">(</span><span class="nf">;@2;</span><span class="p">)</span>
          <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">eqz</span>
            <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">)))</span>
        <span class="p">(</span><span class="nf">br_if</span> <span class="mi">1</span> <span class="p">(</span><span class="nf">;@1;</span><span class="p">)</span>
          <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">eqz</span>
            <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">and</span>
              <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">)</span>
              <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">3</span><span class="p">)))))</span>
      <span class="p">(</span><span class="nf">unreachable</span><span class="p">))</span>
    <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">load</span>
      <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">))))</span>
</code></pre></div></div>

<p>Given a null pointer, <code class="language-plaintext highlighter-rouge">get</code> executes the <code class="language-plaintext highlighter-rouge">unreachable</code> instruction,
causing the runtime to trap. In practice this is unrecoverable. Consider:
nothing will restore <code class="language-plaintext highlighter-rouge">__stack_pointer</code>, and so the stack will “leak” the
existing frames. (This can be worked around by exporting <code class="language-plaintext highlighter-rouge">__stack_pointer</code>
and <code class="language-plaintext highlighter-rouge">__stack_high</code> via the <code class="language-plaintext highlighter-rouge">--export</code> linker flag, then restoring the
stack pointer in the runtime after traps.)</p>
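
<p>Written by hand, the inserted check amounts to something like the
following. It’s a sketch of the logic in the WAT above, with a made-up
function name, and it hard-codes the 4-byte alignment mask that appears in
that listing.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

// Hypothetical hand-written equivalent of the sanitized get().
static int get_checked(int *p)
{
    // Trap on a null pointer or one that isn't 4-byte aligned.
    if (!p || ((uintptr_t)p &amp; 3)) {
        __builtin_trap();
    }
    return *p;
}
</code></pre></div></div>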

<p>Wasm was extended with <a href="https://github.com/WebAssembly/bulk-memory-operations">bulk memory operations</a>, and so there are
single instructions for <code class="language-plaintext highlighter-rouge">memset</code> and <code class="language-plaintext highlighter-rouge">memmove</code>, which Clang maps onto the
built-ins:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">clear</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">long</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__builtin_memset</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(<a href="https://releases.llvm.org/20.1.0/docs/ReleaseNotes.html#changes-to-the-webassembly-backend">Below LLVM 20</a> you will need the undocumented <code class="language-plaintext highlighter-rouge">-mbulk-memory</code>
option.) In WAT we see this as <code class="language-plaintext highlighter-rouge">memory.fill</code>:</p>

<div class="language-racket highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">module</span>
  <span class="p">(</span><span class="nf">type</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="p">(</span><span class="nf">func</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">i32</span> <span class="nv">i32</span><span class="p">)))</span>
  <span class="p">(</span><span class="nf">import</span> <span class="s">"env"</span> <span class="s">"__linear_memory"</span> <span class="p">(</span><span class="nf">memory</span> <span class="p">(</span><span class="nf">;0;</span><span class="p">)</span> <span class="mi">0</span><span class="p">))</span>
  <span class="p">(</span><span class="nf">func</span> <span class="nv">$clear</span> <span class="p">(</span><span class="nf">type</span> <span class="mi">0</span><span class="p">)</span> <span class="p">(</span><span class="nf">param</span> <span class="nv">i32</span> <span class="nv">i32</span><span class="p">)</span>
    <span class="p">(</span><span class="nf">block</span>  <span class="c1">;; label = @1</span>
      <span class="p">(</span><span class="nf">br_if</span> <span class="mi">0</span> <span class="p">(</span><span class="nf">;@1;</span><span class="p">)</span>
        <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">eqz</span>
          <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">1</span><span class="p">)))</span>
      <span class="p">(</span><span class="nf">memory</span><span class="o">.</span><span class="nv">fill</span>
        <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">0</span><span class="p">)</span>
        <span class="p">(</span><span class="nf">i32</span><span class="o">.</span><span class="nv">const</span> <span class="mi">0</span><span class="p">)</span>
        <span class="p">(</span><span class="nf">local</span><span class="o">.</span><span class="nv">get</span> <span class="mi">1</span><span class="p">)))))</span>
</code></pre></div></div>

<p>That’s great! I wish this worked so well outside of Wasm. It’s one reason
<a href="https://github.com/skeeto/w64devkit">w64devkit</a> has <code class="language-plaintext highlighter-rouge">-lmemory</code>, after all. Similarly <code class="language-plaintext highlighter-rouge">__builtin_trap()</code> maps
onto the <code class="language-plaintext highlighter-rouge">unreachable</code> instruction, so we can reliably generate those as
well.</p>
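
<p>One use for that is a tiny assertion macro. This is a minimal sketch: the
names <code class="language-plaintext highlighter-rouge">ASSERT</code> and <code class="language-plaintext highlighter-rouge">lookup</code> are hypothetical, chosen only for this
example.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical assertion macro: a failed check executes a trap, which
// on Wasm should come out as the unreachable instruction.
#define ASSERT(c) while (!(c)) __builtin_trap()

// Example use: a bounds-checked table lookup.
static int lookup(const int *table, long len, long i)
{
    ASSERT(i &gt;= 0 &amp;&amp; i &lt; len);
    return table[i];
}
</code></pre></div></div>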

<p>What about structures? They’re passed by address. A structure argument goes
on the stack, then its address is passed. To return a structure, a function
accepts an implicit <em>out</em> parameter in which to write the return. This
isn’t unusual, except that it’s challenging to manage across module
boundaries, i.e. in imports and exports, because caller and callee are in
different address spaces. It’s especially tricky to return a structure
from an export, as the caller must somehow allocate space in the callee’s
address space for the result. The <a href="https://github.com/WebAssembly/multi-value/blob/master/proposals/multi-value/Overview.md">multi-value extension</a>
solves this, but using it in C involves an ABI change, which is still
experimental.</p>
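
<p>To picture the lowering, here is the same function written twice in C:
once as the source has it, and once roughly as it looks at the Wasm level.
The <code class="language-plaintext highlighter-rouge">Vec2</code> type and both function names are hypothetical, and the
second form is a sketch assuming the ABI document linked earlier, with the
implicit <em>out</em> parameter first.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct { float x, y; } Vec2;  // hypothetical example type

// What the source says: structures in, structure out.
static Vec2 vec2_add(Vec2 a, Vec2 b)
{
    return (Vec2){a.x + b.x, a.y + b.y};
}

// Roughly what crosses the function boundary: three addresses, each an
// i32 offset into linear memory, and no return value.
static void vec2_add_lowered(Vec2 *out, const Vec2 *a, const Vec2 *b)
{
    *out = (Vec2){a-&gt;x + b-&gt;x, a-&gt;y + b-&gt;y};
}
</code></pre></div></div>

<p>Across a module boundary, all three addresses must refer to the callee’s
linear memory, which is where the difficulty comes from.</p>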

<h3 id="water-sort-game">Water Sort Game</h3>

<p>Something you might not have expected: My water sort game imports no
functions! It only exports three functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>      <span class="nf">game_init</span><span class="p">(</span><span class="n">i32</span> <span class="n">seed</span><span class="p">);</span>
<span class="n">DrawList</span> <span class="o">*</span><span class="nf">game_render</span><span class="p">(</span><span class="n">i32</span> <span class="n">width</span><span class="p">,</span> <span class="n">i32</span> <span class="n">height</span><span class="p">,</span> <span class="n">i32</span> <span class="n">mousex</span><span class="p">,</span> <span class="n">i32</span> <span class="n">mousey</span><span class="p">);</span>
<span class="kt">void</span>      <span class="nf">game_update</span><span class="p">(</span><span class="n">i32</span> <span class="n">input</span><span class="p">,</span> <span class="n">i32</span> <span class="n">mousex</span><span class="p">,</span> <span class="n">i32</span> <span class="n">mousey</span><span class="p">,</span> <span class="n">i64</span> <span class="n">now</span><span class="p">);</span>
</code></pre></div></div>

<p>The game uses <a href="https://www.youtube.com/watch?v=DYWTw19_8r4">IMGUI-style</a> rendering. The caller passes in the
inputs, and the game returns a kind of <em>display list</em> telling it what to
draw. In the SDL version these turn into SDL renderer calls. In the web
version, these turn into canvas draws, and “mouse” inputs may be touch
events. It plays and feels the same on both platforms. Simple!</p>
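
<p>The shape of that interface can be sketched in a few lines. To be clear,
this is not the game’s actual <code class="language-plaintext highlighter-rouge">DrawList</code>: the <code class="language-plaintext highlighter-rouge">SketchOp</code> and
<code class="language-plaintext highlighter-rouge">SketchList</code> types and both functions are hypothetical stand-ins that
only show the division of labor between game and platform.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

// Hypothetical display list: a flat array of filled rectangles.
typedef struct { int32_t x, y, w, h; uint32_t rgb; } SketchOp;
typedef struct { int32_t len; SketchOp ops[16]; } SketchList;

// Game side: inputs in, list of drawing commands out. No platform calls.
static SketchList *sketch_render(int32_t width, int32_t height,
                                 int32_t mousex, int32_t mousey)
{
    static SketchList dl;
    dl.len = 0;
    dl.ops[dl.len++] = (SketchOp){0, 0, width, height, 0x202020};
    dl.ops[dl.len++] = (SketchOp){mousex - 4, mousey - 4, 8, 8, 0xffffff};
    return &amp;dl;
}

// Platform side: walk the list and draw. Here "drawing" just sums the
// filled area, standing in for an SDL or canvas fill-rectangle call.
static int64_t sketch_present(const SketchList *dl)
{
    int64_t area = 0;
    for (int32_t i = 0; i &lt; dl-&gt;len; i++) {
        area += (int64_t)dl-&gt;ops[i].w * dl-&gt;ops[i].h;
    }
    return area;
}
</code></pre></div></div>

<p>Because the game side never calls out, swapping platforms means rewriting
only the loop that consumes the list.</p>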

<p>I didn’t realize it at the time, but building the SDL version first was
critical to my productivity. <strong>Debugging Wasm programs is really dang
hard!</strong> Wasm tooling has yet to catch up with 1995, let alone 2025.
Source-level debugging is still experimental and impractical. Developing
applications on the Wasm platform is about as ergonomic as <a href="/blog/2018/04/13/">developing
in MS-DOS</a>. Instead, develop on a platform much better suited for
it, then <em>port</em> your application to Wasm after you’ve <a href="/blog/2025/02/05/">got the issues
worked out</a>. The less Wasm-specific code you write, the better, even
if it means writing more code overall. Treat it as you would some weird
embedded target.</p>

<p>The game comes with 10,000 seeds. I generated ~200 million puzzles, sorted
them by difficulty, and skimmed the top 10k most challenging. In the game
they’re still sorted by increasing difficulty, so it gets harder as you
make progress.</p>

<h3 id="wasm-system-interface">Wasm System Interface</h3>

<p>WASI allows us to get a little more hands on. Let’s start with a Hello
World program. A WASI application exports a traditional <code class="language-plaintext highlighter-rouge">_start</code> entry
point which returns nothing and takes no arguments. I’m also going to set
up some basic typedefs:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">char</span>       <span class="n">u8</span><span class="p">;</span>
<span class="k">typedef</span>   <span class="kt">signed</span> <span class="kt">int</span>        <span class="n">i32</span><span class="p">;</span>
<span class="k">typedef</span>   <span class="kt">signed</span> <span class="kt">long</span> <span class="kt">long</span>  <span class="n">i64</span><span class="p">;</span>
<span class="k">typedef</span>   <span class="kt">signed</span> <span class="kt">long</span>       <span class="n">iz</span><span class="p">;</span>

<span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">wasm-ld</code> will automatically export this function, so we don’t need an
<code class="language-plaintext highlighter-rouge">export_name</code> attribute. This program successfully does nothing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm &amp;&amp; echo ok
ok
</code></pre></div></div>

<p>To write output WASI defines <code class="language-plaintext highlighter-rouge">fd_write()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="n">iz</span>  <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">IoVec</span><span class="p">;</span>

<span class="cp">#define WASI(s) __attribute((import_module("wasi_unstable"),import_name(s)))
</span><span class="n">WASI</span><span class="p">(</span><span class="s">"fd_write"</span><span class="p">)</span>  <span class="n">i32</span>  <span class="nf">fd_write</span><span class="p">(</span><span class="n">i32</span><span class="p">,</span> <span class="n">IoVec</span> <span class="o">*</span><span class="p">,</span> <span class="n">iz</span><span class="p">,</span> <span class="n">iz</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Technically those <code class="language-plaintext highlighter-rouge">iz</code> variables are supposed to be <code class="language-plaintext highlighter-rouge">size_t</code>, passed
through Wasm as <code class="language-plaintext highlighter-rouge">i32</code>, but this is a foreign function, I know the ABI, and
so <a href="/blog/2023/05/31/">I can do as I please</a>. I absolutely love that WASI barely uses
null-terminated strings, not even for paths, which is a breath of fresh
air, but they still <a href="https://www.youtube.com/watch?v=wvtFGa6XJDU">marred the API with unsigned sizes</a>. Which I
choose to ignore.</p>

<p>This function is shaped like <a href="https://pubs.opengroup.org/onlinepubs/9799919799/functions/writev.html">POSIX <code class="language-plaintext highlighter-rouge">writev()</code></a>. I’ve also set it
up for import, including a module name. The oldest, most stable version of
WASI is called <code class="language-plaintext highlighter-rouge">wasi_unstable</code>. (I suppose it shouldn’t be surprising that
finding information in this ecosystem is difficult.)</p>

<p>Every returning WASI function returns an <code class="language-plaintext highlighter-rouge">errno</code> value, with zero as
success rather than some kind of <a href="/blog/2016/09/23/">in-band signaling</a>. Hence the
final out parameter, which receives the byte count that POSIX <code class="language-plaintext highlighter-rouge">writev()</code>
would return directly.</p>

<p>Armed with this function, let’s use it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">u8</span>    <span class="n">msg</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"hello world</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">IoVec</span> <span class="n">iov</span>   <span class="o">=</span> <span class="p">{</span><span class="n">msg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">};</span>
    <span class="n">iz</span>    <span class="n">len</span>   <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">fd_write</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">iov</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm
hello world
</code></pre></div></div>

<p>Keep going and you’ll have <a href="/blog/2023/02/13/">something like <code class="language-plaintext highlighter-rouge">printf</code></a> before long. If
the write fails, we should probably communicate the error with at least
the exit status. Because <code class="language-plaintext highlighter-rouge">_start</code> doesn’t return a status, we need to
exit, for which we have <code class="language-plaintext highlighter-rouge">proc_exit</code>. It doesn’t return, so no <code class="language-plaintext highlighter-rouge">errno</code>
return value.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WASI</span><span class="p">(</span><span class="s">"proc_exit"</span><span class="p">)</span> <span class="kt">void</span> <span class="nf">proc_exit</span><span class="p">(</span><span class="n">i32</span><span class="p">);</span>

<span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">i32</span> <span class="n">err</span> <span class="o">=</span> <span class="n">fd_write</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">iov</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">len</span><span class="p">);</span>
    <span class="n">proc_exit</span><span class="p">(</span><span class="o">!!</span><span class="n">err</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To get the command line arguments, call <code class="language-plaintext highlighter-rouge">args_sizes_get</code> to get the size,
allocate some memory, then <code class="language-plaintext highlighter-rouge">args_get</code> to read the arguments. Same goes for
the environment with a similar pair of functions. The sizes do not include
a null pointer terminator, which is sensible.</p>

<p>Now that you know how to find and use these functions, you don’t need me
to go through each one. However, <em>opening files</em> is a special, complicated
case:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WASI</span><span class="p">(</span><span class="s">"path_open"</span><span class="p">)</span> <span class="n">i32</span> <span class="nf">path_open</span><span class="p">(</span><span class="n">i32</span><span class="p">,</span><span class="n">i32</span><span class="p">,</span><span class="n">u8</span><span class="o">*</span><span class="p">,</span><span class="n">iz</span><span class="p">,</span><span class="n">i32</span><span class="p">,</span><span class="n">i64</span><span class="p">,</span><span class="n">i64</span><span class="p">,</span><span class="n">i32</span><span class="p">,</span><span class="n">i32</span><span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>That’s 9 parameters — and I had thought <a href="https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew">Win32 <code class="language-plaintext highlighter-rouge">CreateFileW</code></a> was
over the top. It’s even more complex than it looks. It works more like
<a href="https://pubs.opengroup.org/onlinepubs/9799919799/functions/openat.html">POSIX <code class="language-plaintext highlighter-rouge">openat()</code></a>, except there’s no current working directory
and so no <code class="language-plaintext highlighter-rouge">AT_FDCWD</code>. Every file and directory is opened <em>relative to</em>
another directory, and absolute paths are invalid. If there’s no
<code class="language-plaintext highlighter-rouge">AT_FDCWD</code>, how does one open the <em>first</em> directory? That’s called a
<em>preopen</em> and it’s core to the file system security mechanism of WASI.</p>

<p>The Wasm runtime preopens zero or more directories before starting the
program and assigns them the lowest numbered file descriptors starting at
file descriptor 3 (after standard input, output, and error). A program
intending to use <code class="language-plaintext highlighter-rouge">path_open</code> must first traverse the file descriptors,
probing for preopens with <code class="language-plaintext highlighter-rouge">fd_prestat_get</code> and retrieving their path name
with <code class="language-plaintext highlighter-rouge">fd_prestat_dir_name</code>. This name may or may not map back onto a real
system path, and so this is a kind of virtual file system for the Wasm
module. The probe stops on the first error.</p>

<p>To open an absolute path, it must find a matching preopen, then from it
construct a path relative to that directory. This part I much dislike, as
the module must contain complex path parsing functionality even in the
simple case. Opening files is the most complex piece of the whole API.</p>
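
<p>As a rough illustration of that matching step, here is a minimal sketch.
It assumes null-terminated strings and paths that are already normalized
(no <code class="language-plaintext highlighter-rouge">.</code> or <code class="language-plaintext highlighter-rouge">..</code> components), and <code class="language-plaintext highlighter-rouge">match_preopen</code> with its parameters is
a hypothetical helper, not part of WASI or u-config. It picks the longest
preopen name that is a whole-component prefix of the absolute path and
yields the remainder as the relative path.</p>

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: given the preopen names collected during the
// probe, find the longest one that is a directory prefix of an absolute
// path. Returns its index (or -1 if none match) and stores the path
// relative to that preopen in *rel. Assumes normalized paths.
static int match_preopen(const char **names, int count,
                         const char *path, const char **rel)
{
    int    best    = -1;
    size_t bestlen = 0;
    for (int i = 0; i < count; i++) {
        size_t len = strlen(names[i]);
        // Ignore trailing slashes so "/" becomes "" and "/usr/" "/usr"
        while (len > 0 && names[i][len-1] == '/') {
            len--;
        }
        if (strncmp(path, names[i], len) != 0) {
            continue;  // not a prefix
        }
        if (path[len] != '/' && path[len] != 0) {
            continue;  // prefix ends mid-component, e.g. "/usrlocal"
        }
        if (best < 0 || len > bestlen) {
            best    = i;
            bestlen = len;
        }
    }
    if (best < 0) {
        return -1;
    }
    const char *r = path + bestlen;
    while (*r == '/') {
        r++;  // relative paths must not begin with a slash
    }
    *rel = *r ? r : ".";  // the preopen itself
    return best;
}
```

<p>With preopens <code class="language-plaintext highlighter-rouge">/</code> and <code class="language-plaintext highlighter-rouge">/usr</code>, the path <code class="language-plaintext highlighter-rouge">/usr/include/SDL2</code> would match
the second and leave <code class="language-plaintext highlighter-rouge">include/SDL2</code> as the relative path. The preopen’s file descriptor number
would be tracked alongside each name during the probe.</p>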

<p>I mentioned before that program data is below the Clang stack. With the
stack growing down, this sounds like a bad idea. A stack overflow quietly
clobbers your data, and is difficult to recognize. More sensible to put
the stack at the bottom so that it overflows off the bottom of memory and
causes a fast fault. Fortunately there’s a switch for that:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang --target=wasm32 ... -Wl,--stack-first ...
</code></pre></div></div>

<p>This is what you want by default. The actual default layout is left over
from an early design flaw in <code class="language-plaintext highlighter-rouge">wasm-ld</code>, and it’s an oversight that it has
not yet been corrected.</p>

<h3 id="u-config">u-config</h3>

<p>The above is in action in the <a href="https://github.com/skeeto/u-config/blob/0c86829e/main_wasm.c">u-config Wasm port</a>. You can download
the Wasm module, <a href="https://skeeto.github.io/u-config/pkg-config.wasm">pkg-config.wasm</a>, used in the web demo to run it in
your favorite WASI-capable Wasm runtime:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ wazero run pkg-config.wasm --modversion pkg-config
0.33.3
</code></pre></div></div>

<p>Run this way there are no preopens, so it cannot read any files. The <code class="language-plaintext highlighter-rouge">-mount</code>
option maps real file system paths to preopens. This mounts the entire
root file system read-only (<code class="language-plaintext highlighter-rouge">ro</code>) as <code class="language-plaintext highlighter-rouge">/</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ wazero run -mount /::ro pkg-config.wasm --cflags sdl2
-I/usr/include/SDL2 -D_REENTRANT
</code></pre></div></div>

<p>I doubt this is useful for anything, but it was a vehicle for learning and
trying Wasm, and the results are pretty neat.</p>

<p>In the next article I discuss <a href="/blog/2025/04/19/">allocating the allocator</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A more robust raw OpenBSD syscall demo</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/03/06/"/>
    <id>urn:uuid:f7101ee1-a2e6-4895-b763-bd7b2a842280</id>
    <updated>2025-03-06T02:43:20Z</updated>
    <category term="c"/><category term="bsd"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Ted Unangst published <a href="https://flak.tedunangst.com/post/dude-where-are-your-syscalls"><em>dude, where are your syscalls?</em></a> on flak
yesterday, with a neat demonstration of OpenBSD’s <a href="https://undeadly.org/cgi?action=article;sid=20230222064027">pinsyscall</a>
security feature, whereby only pre-registered addresses are allowed to
make system calls. Whether it strengthens or weakens security is <a href="https://isopenbsdsecu.re/mitigations/pinsyscall/">up for
debate</a>, but regardless it’s an interesting, low-level programming
challenge. The original demo is fragile for multiple reasons, and requires
manually locating and entering addresses for each build. In this article I
show how to fix it. To prove that it’s robust, I ported an entire, real
application to use raw system calls on OpenBSD.</p>

<p>The original program uses ARM64 assembly. I’m a lot more comfortable with
x86-64 assembly, plus that’s the hardware I have readily on hand. So the
assembly language will be different, but all the concepts apply to both
these architectures. Almost none of these OpenBSD system interfaces are
formally documented (or stable for that matter), and I had to dig around
the OpenBSD source tree to figure it out (along with a <a href="https://news.ycombinator.com/item?id=26290723">helpful jart
nudge</a>). So don’t be afraid to get your hands dirty.</p>

<p>There are lots of subtle problems in the original demo, so let’s go
through the program piece by piece, starting with the entry point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">start</span><span class="p">()</span>
<span class="p">{</span>
        <span class="n">w</span><span class="p">(</span><span class="s">"hello</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
        <span class="n">x</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function is registered as the entry point in the ELF image, so it has
no caller. <del>That means no return address on the stack, so the stack is
not aligned for a function.</del> (<strong>Correction</strong>: The stack alignment issue is
true for x86, but not ARM, so the original demo is fine.) In toy programs
that goes unnoticed, but compilers generate code assuming the stack is
aligned. In a real application this is likely to crash deep on the first
SIMD register spill.</p>

<p>We could fix this with a <a href="https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-force_005falign_005farg_005fpointer-function-attribute_002c-x86"><code class="language-plaintext highlighter-rouge">force_align_arg_pointer</code></a> attribute, at
least for architectures that support it, but I prefer to write the entry
point in assembly. Especially so we can access the command line arguments
and environment variables, which is necessary in a real application. That
happens to work the same as it does on Linux, so here’s my old, familiar
entry point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="p">(</span>
    <span class="s">"        .globl _start</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"_start: mov   %rsp, %rdi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"        call  start</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Per the ABI, the first argument passes through <code class="language-plaintext highlighter-rouge">rdi</code>, so I pass a copy of
the stack pointer, <code class="language-plaintext highlighter-rouge">rsp</code>, as it appeared on entry. Entry point arguments
<code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, and <code class="language-plaintext highlighter-rouge">envp</code> are all pushed on the stack at <code class="language-plaintext highlighter-rouge">rsp</code>, so the
first real function can retrieve it all from just the stack pointer. The
original demo won’t use it, though. Using <code class="language-plaintext highlighter-rouge">call</code> to pass control pushes a
return address, which will never be used, and aligns the stack for the
first real function. I name it <code class="language-plaintext highlighter-rouge">_start</code> because that’s what the linker
expects, which makes things go a little smoother. Conveniently, the
original didn’t use this name.</p>
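
<p>To make “retrieve it all from just the stack pointer” concrete, here is
a minimal sketch of the decoding a real application might do in its first
function. The <code class="language-plaintext highlighter-rouge">Args</code> struct and <code class="language-plaintext highlighter-rouge">decode_stack</code> are hypothetical names for
illustration, and the layout assumed is the usual one: <code class="language-plaintext highlighter-rouge">argc</code> in the first
slot, then the <code class="language-plaintext highlighter-rouge">argv</code> pointers, a null, the <code class="language-plaintext highlighter-rouge">envp</code> pointers, and another
null.</p>

```c
#include <stdint.h>

// Hypothetical container for the decoded entry point arguments
typedef struct {
    int    argc;
    char **argv;
    char **envp;
} Args;

// Decode the initial stack given the stack pointer as it appeared on
// entry. Every slot is pointer-sized, so the stack is treated as an
// array of pointers whose first element is really an integer.
static Args decode_stack(char **sp)
{
    Args a;
    a.argc = (int)(intptr_t)sp[0];  // first slot holds argc
    a.argv = sp + 1;                // argv[0] follows immediately
    a.envp = a.argv + a.argc + 1;   // skip argv's null terminator
    return a;
}
```

<p>In this arrangement the C function called from <code class="language-plaintext highlighter-rouge">_start</code> would accept the
pointer passed in <code class="language-plaintext highlighter-rouge">rdi</code> as its sole parameter and hand it to something like
<code class="language-plaintext highlighter-rouge">decode_stack</code>.</p>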

<p>Next up, the “write” function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">w</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">what</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="kr">__asm</span><span class="p">(</span>
<span class="s">"       mov x2, x1;"</span>
<span class="s">"       mov x1, x0;"</span>
<span class="s">"       mov w0, #1;"</span>
<span class="s">"       mov x8, #4;"</span>
<span class="s">"       svc #0;"</span>
        <span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There are two <a href="/blog/2024/12/20/">serious problems with this assembly block</a>. First, the
function arguments are not necessarily in those registers by the time
control reaches the basic assembly block. The function prologue could move
them around. Even more so if this function was inlined. This is exactly
the problem <em>extended</em> inline assembly is intended to solve. Second, it
clobbers a number of registers. Compilers assume this does not happen when
generating their own code. This sort of assembly falls apart the moment it
comes into contact with a non-zero optimization level.</p>

<p>Solving this is just a matter of using inline assembly properly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">w</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">what</span><span class="p">,</span> <span class="kt">long</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">err</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">rax</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>  <span class="c1">// SYS_write</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"+a"</span><span class="p">(</span><span class="n">rax</span><span class="p">),</span> <span class="s">"+d"</span><span class="p">(</span><span class="n">len</span><span class="p">),</span> <span class="s">"=@ccc"</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"D"</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">what</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">err</span> <span class="o">?</span> <span class="o">-</span><span class="n">rax</span> <span class="o">:</span> <span class="n">rax</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’ve enhanced it a bit, returning a <a href="/blog/2016/09/23/">Linux-style negative errno</a> on
error. In the BSD ecosystem, syscall errors are indicated using the carry
flag, which here is output into <code class="language-plaintext highlighter-rouge">err</code> via <code class="language-plaintext highlighter-rouge">=@ccc</code>. When set, the return
value is an errno. Further, the OpenBSD kernel uses both <code class="language-plaintext highlighter-rouge">rax</code> and <code class="language-plaintext highlighter-rouge">rdx</code>
for return values, so I’ve also listed <code class="language-plaintext highlighter-rouge">rdx</code> as an input+output despite
not consuming the result. Despite all these changes, this function is not
yet complete! We’ll get back to it later.</p>

<p>The “exit” function, <code class="language-plaintext highlighter-rouge">x</code>, is just fine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">x</span><span class="p">()</span> <span class="p">{</span>
        <span class="kr">__asm</span><span class="p">(</span>
<span class="s">"       mov x8, #1;"</span>
<span class="s">"       svc #0;"</span>
        <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It doesn’t set an exit status, so it passes garbage instead, but otherwise
this works. No inputs, plus clobbers and outputs don’t matter when control
never returns. In a real application I might write it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">x</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"syscall"</span> <span class="o">::</span> <span class="s">"a"</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">status</span><span class="p">));</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function will need a little additional work later, too.</p>

<p>The <code class="language-plaintext highlighter-rouge">ident</code> section is basically fine as-is:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__asm</span><span class="p">(</span><span class="s">" .section </span><span class="se">\"</span><span class="s">.note.openbsd.ident</span><span class="se">\"</span><span class="s">, </span><span class="se">\"</span><span class="s">a</span><span class="se">\"\n</span><span class="s">"</span>
<span class="s">"       .p2align 2</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .long   8</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .long   4</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .long   1</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .ascii </span><span class="se">\"</span><span class="s">OpenBSD</span><span class="se">\\</span><span class="s">0</span><span class="se">\"\n</span><span class="s">"</span>
<span class="s">"       .long   0</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"       .previous</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
</code></pre></div></div>

<p>The compiler assumes the current section remains the same at the end of
the assembly block, which here is accomplished with <code class="language-plaintext highlighter-rouge">.previous</code>. Though it
clobbers the assembler’s remembered “other” section and so may interfere
with surrounding code using <code class="language-plaintext highlighter-rouge">.previous</code>. Better to use <code class="language-plaintext highlighter-rouge">.pushsection</code> and
<code class="language-plaintext highlighter-rouge">.popsection</code> for good stack discipline. There are many such examples in
the OpenBSD source tree.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="p">(</span>
    <span class="s">".pushsection .note.openbsd.ident, </span><span class="se">\"</span><span class="s">a</span><span class="se">\"\n</span><span class="s">"</span>
    <span class="s">".long  8, 4, 1, 0x6e65704f, 0x00445342, 0</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">".popsection</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>
</code></pre></div></div>
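
<p>The two hexadecimal constants are just the original <code class="language-plaintext highlighter-rouge">"OpenBSD\0"</code> string
packed into 32-bit words, assuming a little-endian target such as x86-64.
The leading 8, 4, and 1 follow the usual ELF note header layout of name
size, descriptor size, and type. A small sketch, with <code class="language-plaintext highlighter-rouge">pack_le32</code> as a
hypothetical helper, shows where the constants come from:</p>

```c
#include <stdint.h>

// Hypothetical helper: pack four bytes into a 32-bit word the way a
// little-endian assembler would lay out a .long, first byte lowest.
static uint32_t pack_le32(const char *s)
{
    const unsigned char *b = (const unsigned char *)s;
    return (uint32_t)b[0]       | (uint32_t)b[1] <<  8 |
           (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}
```

<p>Here <code class="language-plaintext highlighter-rouge">pack_le32("Open")</code> yields <code class="language-plaintext highlighter-rouge">0x6e65704f</code> and <code class="language-plaintext highlighter-rouge">pack_le32("BSD")</code>, whose
fourth byte is the string’s null terminator, yields <code class="language-plaintext highlighter-rouge">0x00445342</code>.</p>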

<p>Now the trickiest part, the pinsyscall table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">whats</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">offset</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">sysno</span><span class="p">;</span>
<span class="p">}</span> <span class="n">happening</span><span class="p">[]</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">section</span><span class="p">(</span><span class="s">".openbsd.syscalls"</span><span class="p">)))</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">{</span> <span class="mh">0x104f4</span><span class="p">,</span> <span class="mi">4</span> <span class="p">},</span>
        <span class="p">{</span> <span class="mh">0x10530</span><span class="p">,</span> <span class="mi">1</span> <span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Those offsets — offsets from the beginning of the ELF image — were entered
manually, and it kind of ruins the whole demo. We don’t have a good way to
get at those offsets from C, or any high level language. However, we can
solve that by tweaking the inline assembly with some labels:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noinline</span><span class="p">))</span>
<span class="kt">long</span> <span class="nf">w</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">what</span><span class="p">,</span> <span class="kt">long</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"_w: syscall"</span>
        <span class="c1">// ...</span>
    <span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">noinline</span><span class="p">,</span><span class="n">noreturn</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">x</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"_x: syscall"</span>
        <span class="c1">// ...</span>
    <span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Very importantly I’ve added <code class="language-plaintext highlighter-rouge">noinline</code> to prevent these functions from
being inlined into additional copies of the <code class="language-plaintext highlighter-rouge">syscall</code> instruction, which
of course won’t be registered. This also prevents duplicate labels causing
assembler errors. Once we have the labels, we can use them in an assembly
block listing the allowed syscall instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="p">(</span>
    <span class="s">".pushsection .openbsd.syscalls</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">".long  _x, 1</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">".long  _w, 4</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">".popsection</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>
</code></pre></div></div>

<p>That lets the linker solve the offsets problem, which is its main job
after all. With these changes the demo works reliably, even under high
optimization levels. I suggest these flags:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -static -nostdlib -no-pie -o where where.c
</code></pre></div></div>

<p>Disabling PIE with <code class="language-plaintext highlighter-rouge">-no-pie</code> is necessary in real applications or else
strings won’t work. You can apply more flags to strip it down further, but
these are the flags generally necessary to compile these sorts of programs
on at least OpenBSD 7.6.</p>

<p>So, how do I know this stuff works in general? Because I ported <a href="/blog/2023/01/18/">my ultra
portable pkg-config clone, u-config</a>, to use raw OpenBSD syscalls:
<strong><a href="https://github.com/skeeto/u-config/blob/openbsd/openbsd_main.c"><code class="language-plaintext highlighter-rouge">openbsd_main.c</code></a></strong>. Everything still works at high optimization
levels.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c libmemory.a
$ ./pkg-config --cflags --libs libcurl
-I/usr/local/include -L/usr/local/lib -lcurl
</code></pre></div></div>

<p>Because the new syscall wrappers behave just like Linux system calls, it
leverages the <code class="language-plaintext highlighter-rouge">linux_noarch.c</code> platform, and the whole port is ~70 lines
of code. A few more flags (<code class="language-plaintext highlighter-rouge">-fno-stack-protector</code>, <code class="language-plaintext highlighter-rouge">-Oz</code>, <code class="language-plaintext highlighter-rouge">-s</code>, etc.), and
it squeezes into a slim 21.6K static binary.</p>

<p>Despite making no libc calls, it’s not possible to stop compilers from
fabricating (<a href="/blog/2024/11/10/">hallucinating?</a>) string function calls, so the build
above depends on external definitions. In the command above, <code class="language-plaintext highlighter-rouge">libmemory.a</code>
comes from <a href="https://github.com/skeeto/w64devkit/blob/master/src/libmemory.c"><code class="language-plaintext highlighter-rouge">libmemory.c</code></a> found <a href="/blog/2024/02/05/">in w64devkit</a>. Alternatively,
<a href="https://flak.tedunangst.com/post/you-dont-link-all-of-libc">and on topic</a>, you could link the OpenBSD libc string functions by
omitting <code class="language-plaintext highlighter-rouge">libmemory.a</code> from the build.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -static -nostartfiles -no-pie -o pkg-config openbsd_main.c
</code></pre></div></div>

<p>Though it pulls in a lot of bloat (~8x size increase), and teasing out the
necessary objects isn’t trivial.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Robust Wavefront OBJ model parsing in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/03/02/"/>
    <id>urn:uuid:852fe937-3510-4752-a9a8-97fde5321e7e</id>
    <updated>2025-03-02T23:22:58Z</updated>
    <category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><a href="https://en.wikipedia.org/wiki/Wavefront_.obj_file">Wavefront OBJ</a> is a line-oriented, text format for 3D geometry. It’s
widely supported by modeling software, easy to parse, and trivial to emit,
much like <a href="/blog/2017/11/03/">Netpbm for 2D image data</a>. Poke around hobby 3D graphics
projects and you’re likely to find a bespoke OBJ parser. These typically
only load their own model data, so robustness doesn’t much matter, but they
usually have hard limitations and don’t stand up to <a href="/blog/2025/02/05/">fuzz testing</a>.
This article presents a robust, partial OBJ parser in C with no hard-coded
limitations, written from scratch. Like <a href="/blog/2025/01/19/">similar articles</a>, it’s not
<em>really</em> about OBJ but demonstrating some techniques you’ve probably never
seen before.</p>

<p>If you’d like to see the ready-to-run full source: <a href="https://github.com/skeeto/scratch/blob/master/misc/objrender.c"><code class="language-plaintext highlighter-rouge">objrender.c</code></a>.
All images are screenshots of this program.</p>

<p>First let’s establish the requirements. By <em>robust</em> I mean no undefined
behavior for any input, valid or invalid; no out-of-bounds accesses, no
signed overflows. Input is otherwise not validated. Invalid input may load
as valid by chance, which will render as either garbage or nothing. The
behavior will also not vary by locale.</p>

<p>We’re also only worried about vertices, normals, and triangle faces with
normals. In OBJ these are <code class="language-plaintext highlighter-rouge">v</code>, <code class="language-plaintext highlighter-rouge">vn</code>, and <code class="language-plaintext highlighter-rouge">f</code> elements. Normals let us
light the model effectively while checking our work. A cube fitting this
subset of OBJ might look like:</p>

<pre><code class="language-obj">v  -1.00 -1.00 -1.00
v  -1.00 +1.00 -1.00
v  +1.00 +1.00 -1.00
v  +1.00 -1.00 -1.00
v  -1.00 -1.00 +1.00
v  -1.00 +1.00 +1.00
v  +1.00 +1.00 +1.00
v  +1.00 -1.00 +1.00

vn +1.00  0.00  0.00
vn -1.00  0.00  0.00
vn  0.00 +1.00  0.00
vn  0.00 -1.00  0.00
vn  0.00  0.00 +1.00
vn  0.00  0.00 -1.00

f   3//1  7//1  8//1
f   3//1  8//1  4//1
f   1//2  5//2  6//2
f   1//2  6//2  2//2
f   7//3  3//3  2//3
f   7//3  2//3  6//3
f   4//4  8//4  5//4
f   4//4  5//4  1//4
f   8//5  7//5  6//5
f   8//5  6//5  5//5
f   3//6  4//6  1//6
f   3//6  1//6  2//6
</code></pre>

<p><img src="/img/objrender/cube.png" alt="" /></p>

<p>Take note:</p>

<ul>
  <li>Some fields are separated by more than one space.</li>
  <li>Vertices and normals are fractional (floating point).</li>
  <li>Faces use 1-indexing instead of 0-indexing.</li>
  <li>Faces in this model lack a texture index, hence <code class="language-plaintext highlighter-rouge">//</code> (empty).</li>
</ul>

<p>Inputs may have other data, but we’ll skip over it, including face texture
indices, or face elements beyond the third. Some of the models I’d like to
test have <em>relative</em> indices, so I want to support those, too. A relative
index refers <em>backwards</em> from the last vertex, so the order of the lines
in an OBJ file matters. For example, the cube faces above could have instead
been written:</p>

<pre><code class="language-obj">f  -6//-6 -2//-6 -1//-6
f  -6//-6 -1//-6 -5//-6
f  -8//-5 -4//-5 -3//-5
f  -8//-5 -3//-5 -7//-5
f  -2//-4 -6//-4 -7//-4
f  -2//-4 -7//-4 -3//-4
f  -5//-3 -1//-3 -4//-3
f  -5//-3 -4//-3 -8//-3
f  -1//-2 -2//-2 -3//-2
f  -1//-2 -3//-2 -4//-2
f  -6//-1 -5//-1 -8//-1
f  -6//-1 -8//-1 -7//-1
</code></pre>

<p>Due to this the parser cannot be blind to line order, and it must handle
negative indices. Relative indexing has the nice effect that we can group
faces, and those groups are <em>relocatable</em>. We can reorder them without
renumbering the faces, or concatenate models just by concatenating their
OBJ files.</p>
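<p>As a rough illustration of what handling both kinds of index involves,
here is a minimal sketch that maps either form to a zero-based array index.
The name <code class="language-plaintext highlighter-rouge">resolve</code>, its signature, and the <code class="language-plaintext highlighter-rouge">-1</code> sentinel are assumptions made
for this example, not taken from the parser. <code class="language-plaintext highlighter-rouge">count</code> is the number of
vertices (or normals) defined so far at the point where the face appears.</p>

```c
#include <stdint.h>

// Hypothetical helper (an assumption, not from objrender.c): convert a
// 1-based absolute or negative relative OBJ index into a 0-based array
// index. "count" is the number of elements seen so far and must be
// non-negative. Returns -1 for zero and for out-of-range indices.
int32_t resolve(int32_t index, int32_t count)
{
    if (index > 0 && index <= count) {
        return index - 1;      // absolute: 1..count
    }
    if (index < 0 && index >= -count) {
        return count + index;  // relative: -1 is the newest element
    }
    return -1;
}
```

<p>With all eight cube vertices loaded, <code class="language-plaintext highlighter-rouge">resolve(3, 8)</code> and <code class="language-plaintext highlighter-rouge">resolve(-6, 8)</code>
both give 2, matching the two spellings of the first face above. Neither
branch negates <code class="language-plaintext highlighter-rouge">index</code>, so even the most negative input cannot overflow.</p>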

<h3 id="the-fundamentals">The fundamentals</h3>

<p>To start off, we’ll be <a href="/blog/2023/09/27/">using an arena</a> of course, trivializing
memory management while swiping aside all hard-coded limits. A quick
reminder of the interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="c1">// Always returns an aligned pointer inside the arena. Allocations are</span>
<span class="c1">// zeroed. Does not return on OOM (never returns a null pointer).</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">align</span><span class="p">);</span>
</code></pre></div></div>
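<p>The interface above leaves the definition of <code class="language-plaintext highlighter-rouge">alloc</code> to the linked
article. For readers who want something to compile against, here is one
possible definition. It is a sketch under stated assumptions: <code class="language-plaintext highlighter-rouge">count</code> is
non-negative, <code class="language-plaintext highlighter-rouge">size</code> is positive, <code class="language-plaintext highlighter-rouge">align</code> is a power of two, the arena is
backed by a real buffer, and <code class="language-plaintext highlighter-rouge">abort</code> stands in for whatever
out-of-memory policy the program actually wants.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

// One possible definition (an assumption, not necessarily the one used
// by objrender.c). Assumes count >= 0, size > 0, power-of-two align.
void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    // Bytes needed to round a->beg up to the requested alignment.
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg;
    // Divide instead of multiplying so the check itself cannot overflow.
    if (pad > avail || count > (avail - pad)/size) {
        abort();  // placeholder OOM policy: never return null
    }
    char *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}
```

<p>With this in place, <code class="language-plaintext highlighter-rouge">new(&amp;arena, 8, float)</code> returns eight zeroed,
properly aligned floats, and freeing is just a matter of discarding or
resetting the arena.</p>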

<p>Also, no null-terminated strings, perhaps the main source of problems with
bespoke parsers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define S(s)    (Str){s, sizeof(s)-1}
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>
</code></pre></div></div>

<p>Pointer arithmetic is error prone, so the tricky stuff is relegated to a
handful of functions, each of which can be exhaustively validated almost
at a glance:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">span</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">beg</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">len</span>  <span class="o">=</span> <span class="n">beg</span> <span class="o">?</span> <span class="n">end</span><span class="o">-</span><span class="n">beg</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">_Bool</span> <span class="nf">equals</span><span class="p">(</span><span class="n">Str</span> <span class="n">a</span><span class="p">,</span> <span class="n">Str</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="o">==</span><span class="n">b</span><span class="p">.</span><span class="n">len</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="o">!</span><span class="n">a</span><span class="p">.</span><span class="n">len</span> <span class="o">||</span> <span class="o">!</span><span class="n">memcmp</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="p">));</span>
<span class="p">}</span>

<span class="n">Str</span> <span class="nf">trimleft</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="o">&lt;=</span><span class="sc">' '</span><span class="p">;</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="o">++</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="o">--</span><span class="p">)</span> <span class="p">{}</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">Str</span> <span class="nf">trimright</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span> <span class="o">&amp;&amp;</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">&lt;=</span><span class="sc">' '</span><span class="p">;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="o">--</span><span class="p">)</span> <span class="p">{}</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">Str</span> <span class="nf">substring</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">s</span><span class="p">.</span><span class="n">data</span> <span class="o">+=</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">s</span><span class="p">.</span><span class="n">len</span>  <span class="o">-=</span> <span class="n">i</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Each avoids the purposeless special cases around null pointers (i.e.
zero-initialized <code class="language-plaintext highlighter-rouge">Str</code> objects) that would otherwise work out naturally.
The space character and all control characters are treated as whitespace
for simplicity. When I started writing this parser, I didn’t define all
these functions up front. I defined them as needed. (A <a href="/blog/2023/02/11/">good standard
library</a> would have provided similar definitions out-of-the-box.) If
you’re worried about misuse, add the appropriate assertions.</p>
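<p>As an example of such an assertion, here is <code class="language-plaintext highlighter-rouge">substring</code> with a bounds
check added. The precondition <code class="language-plaintext highlighter-rouge">0 &lt;= i &lt;= s.len</code> is an assumption about
the intended contract, inferred from how the function is used later.</p>

```c
#include <assert.h>
#include <stddef.h>

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

// Same as substring() above, plus an assumed precondition: the offset
// must land inside the string or exactly at its end.
Str substring(Str s, ptrdiff_t i)
{
    assert(i >= 0 && i <= s.len);
    if (i) {
        s.data += i;
        s.len  -= i;
    }
    return s;
}
```

<p>The assertion compiles away under <code class="language-plaintext highlighter-rouge">NDEBUG</code>, so it costs nothing in a
release build while still catching a bad offset under a fuzzer.</p>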

<p>A powerful and useful string function I’ve discovered, and which I use in
every string-heavy program, is <code class="language-plaintext highlighter-rouge">cut</code>, a concept I shamelessly stole <a href="https://pkg.go.dev/strings#Cut">from
the Go standard library</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Str</span>   <span class="n">head</span><span class="p">;</span>
    <span class="n">Str</span>   <span class="n">tail</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">ok</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Cut</span><span class="p">;</span>

<span class="n">Cut</span> <span class="nf">cut</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="kt">char</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Cut</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="k">return</span> <span class="n">r</span><span class="p">;</span>  <span class="c1">// null pointer special case</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">cut</span> <span class="o">=</span> <span class="n">beg</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">cut</span><span class="o">&lt;</span><span class="n">end</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">cut</span><span class="o">!=</span><span class="n">c</span><span class="p">;</span> <span class="n">cut</span><span class="o">++</span><span class="p">)</span> <span class="p">{}</span>
    <span class="n">r</span><span class="p">.</span><span class="n">ok</span>   <span class="o">=</span> <span class="n">cut</span> <span class="o">&lt;</span> <span class="n">end</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">head</span> <span class="o">=</span> <span class="n">span</span><span class="p">(</span><span class="n">beg</span><span class="p">,</span> <span class="n">cut</span><span class="p">);</span>
    <span class="n">r</span><span class="p">.</span><span class="n">tail</span> <span class="o">=</span> <span class="n">span</span><span class="p">(</span><span class="n">cut</span><span class="o">+</span><span class="n">r</span><span class="p">.</span><span class="n">ok</span><span class="p">,</span> <span class="n">end</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It slices, it dices, it juliennes! Need to iterate over lines? Cut it up:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Cut</span> <span class="n">c</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">c</span><span class="p">.</span><span class="n">tail</span> <span class="o">=</span> <span class="n">input</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">c</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">);</span>
        <span class="n">Str</span> <span class="n">line</span> <span class="o">=</span> <span class="n">c</span><span class="p">.</span><span class="n">head</span><span class="p">;</span>
        <span class="c1">// ... process line ...</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Need to iterate over the fields in a line? Cut the line on the field
separator. Then cut the field on the element separator. No allocation, no
mutation (<code class="language-plaintext highlighter-rouge">strtok</code>).</p>
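<p>To make that concrete, here is a sketch of both levels of cutting. The
helpers <code class="language-plaintext highlighter-rouge">countfields</code> and <code class="language-plaintext highlighter-rouge">splitelem</code> and the <code class="language-plaintext highlighter-rouge">Elem</code> struct are
hypothetical names invented for this illustration. The definitions of
<code class="language-plaintext highlighter-rouge">span</code>, <code class="language-plaintext highlighter-rouge">trimleft</code>, and <code class="language-plaintext highlighter-rouge">cut</code> are repeated from above only so the block
compiles on its own. It assumes fields are separated by spaces.</p>

```c
#include <stddef.h>

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct { char *data; ptrdiff_t len; } Str;
typedef struct { Str head; Str tail; _Bool ok; } Cut;

// span, trimleft, and cut: repeated from the definitions above.
Str span(char *beg, char *end)
{
    Str r = {0};
    r.data = beg;
    r.len  = beg ? end-beg : 0;
    return r;
}

Str trimleft(Str s)
{
    for (; s.len && *s.data<=' '; s.data++, s.len--) {}
    return s;
}

Cut cut(Str s, char c)
{
    Cut r = {0};
    if (!s.len) return r;
    char *beg = s.data;
    char *end = s.data + s.len;
    char *cut = beg;
    for (; cut<end && *cut!=c; cut++) {}
    r.ok   = cut < end;
    r.head = span(beg, cut);
    r.tail = span(cut+r.ok, end);
    return r;
}

// Hypothetical: count the space-separated fields on a line, skipping
// the runs of whitespace between them.
ptrdiff_t countfields(Str line)
{
    ptrdiff_t n = 0;
    Cut c = {0};
    c.tail = line;
    while (c.tail.len) {
        c = cut(trimleft(c.tail), ' ');
        n += c.head.len > 0;
    }
    return n;
}

// Hypothetical: split a face element such as "3//1" on '/' into its
// vertex, texture, and normal parts. Missing parts come back empty.
typedef struct { Str v, vt, vn; } Elem;

Elem splitelem(Str field)
{
    Elem e = {0};
    Cut c = cut(field, '/');
    e.v   = c.head;
    c     = cut(c.tail, '/');
    e.vt  = c.head;
    e.vn  = c.tail;
    return e;
}
```

<p>Every <code class="language-plaintext highlighter-rouge">Str</code> produced here points into the original buffer, so nothing is
copied and the input is never modified.</p>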

<h3 id="reading-input">Reading input</h3>

<p>Unlike <a href="/blog/2025/02/17/">a program designed to process arbitrarily large inputs</a>, the
intention here is to load the entire model into memory. We don’t need to
fiddle around with loading a line of input at a time (<code class="language-plaintext highlighter-rouge">fgets</code>, <code class="language-plaintext highlighter-rouge">getline</code>,
etc.) — the usual approach with OBJ parsers. If the OBJ source cannot fit
in memory, then the model won’t fit in memory. This greatly simplifies the
parser and makes it faster, while lifting hard-coded limits like maximum
line length.</p>

<p>The simple arena I use makes whole-file loading <em>so easy</em>. Read straight
into the arena without checking the file size (<code class="language-plaintext highlighter-rouge">ftell</code>, etc.), which means
streaming inputs (i.e. pipes) work automatically.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">loadfile</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">r</span>  <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">len</span>  <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">len</span>  <span class="o">=</span> <span class="n">fread</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Without buffered input, you may need a loop around the read:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">loadfile</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">fd</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">cap</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">ptrdiff_t</span> <span class="n">n</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="n">cap</span><span class="o">-</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">r</span><span class="p">;</span>  <span class="c1">// ignoring read errors</span>
        <span class="p">}</span>
        <span class="n">r</span><span class="p">.</span><span class="n">len</span> <span class="o">+=</span> <span class="n">n</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You might consider triggering an out-of-memory error if the arena was
filled to the brim, which almost certainly means the input was truncated.
Though that’s likely to happen anyway because the next allocation from
that arena will fail.</p>

<p>Side note: When using a multi-GB arena, issuing such huge read requests
stress tests the underlying IO system. I’ve found libc bugs this way. In
this case I <a href="/blog/2023/01/08/">used SDL2</a> for the demo, and SDL lost the ability to
read files after I increased the arena size to 4GB in order to test a
<a href="https://casual-effects.com/data/">gigantic model</a> (“Power Plant”). I’ve run into this before, and
I assumed it was another Microsoft CRT bug. After investigating deeper for
this article, I learned it’s an ancient SDL bug that’s made it all the way
into SDL3. <code class="language-plaintext highlighter-rouge">-Wconversion</code> warns about it, but <a href="https://github.com/libsdl-org/SDL-historical-archive/commit/e6ab3592e">was accidentally squelched
in the 64-bit port back in 2009</a>. It seems nobody else loads files
this way, so watch out for platform bugs if you use this technique!</p>

<h3 id="parsing-data">Parsing data</h3>

<p>In practice, rendering systems limit counts to the 32-bit range, which is
reasonable. So in the OBJ parser, vertex and normal indices will be 32-bit
integers. Negatives will be needed for at least relative indexing. Parsing
from a <code class="language-plaintext highlighter-rouge">Str</code> means functions expecting null termination, like <code class="language-plaintext highlighter-rouge">strtol</code>, are off limits.
So here’s a function to parse a signed integer out of a <code class="language-plaintext highlighter-rouge">Str</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">parseint</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span>    <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">sign</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">switch</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
        <span class="k">case</span> <span class="sc">'+'</span><span class="p">:</span>            <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="sc">'-'</span><span class="p">:</span> <span class="n">sign</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
        <span class="k">default</span> <span class="o">:</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">10</span><span class="o">*</span><span class="n">r</span> <span class="o">+</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="sc">'0'</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span> <span class="o">*</span> <span class="n">sign</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">uint32_t</code> means it’s free to overflow. If it overflows, the input was
invalid. If it doesn’t hold an integer, the input was invalid. In either
case it will read a harmless, garbage result. Despite being unsigned, it
works just fine with negative inputs thanks to two’s complement.</p>

<p>For floats I didn’t intend to parse exponential notation, but some models
I wanted to test actually <em>did</em> use it — probably by accident — so I added
it anyway. That requires a function to compute the exponent.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">expt10</span><span class="p">(</span><span class="kt">int32_t</span> <span class="n">e</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span>   <span class="n">y</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
    <span class="kt">float</span>   <span class="n">x</span> <span class="o">=</span> <span class="n">e</span><span class="o">&lt;</span><span class="mi">0</span> <span class="o">?</span> <span class="mi">0</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span> <span class="o">:</span> <span class="n">e</span><span class="o">&gt;</span><span class="mi">0</span> <span class="o">?</span> <span class="mi">10</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">n</span> <span class="o">=</span> <span class="n">e</span><span class="o">&lt;</span><span class="mi">0</span> <span class="o">?</span> <span class="n">e</span> <span class="o">:</span> <span class="o">-</span><span class="n">e</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">n</span> <span class="o">/=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">y</span> <span class="o">*=</span> <span class="n">n</span><span class="o">%</span><span class="mi">2</span> <span class="o">?</span> <span class="n">x</span> <span class="o">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">x</span> <span class="o">*=</span> <span class="n">x</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That’s exponentiation by squaring, <a href="/blog/2024/05/24/">avoiding signed overflow</a> on the
exponent. Traditionally a negative exponent is inverted, but applying
unary <code class="language-plaintext highlighter-rouge">-</code> to an arbitrary integer might overflow (consider -2147483648).
So instead I iterate from the negative end. The negative range is larger
than the positive, after all. Finally we can parse floats:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">parsefloat</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">r</span>    <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">sign</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">exp</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">switch</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
        <span class="k">case</span> <span class="sc">'+'</span><span class="p">:</span>            <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="sc">'-'</span><span class="p">:</span> <span class="n">sign</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="sc">'.'</span><span class="p">:</span> <span class="n">exp</span>  <span class="o">=</span>  <span class="mi">1</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="sc">'E'</span><span class="p">:</span>
        <span class="k">case</span> <span class="sc">'e'</span><span class="p">:</span> <span class="n">exp</span>  <span class="o">=</span> <span class="n">exp</span> <span class="o">?</span> <span class="n">exp</span> <span class="o">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
                  <span class="n">exp</span> <span class="o">*=</span> <span class="n">expt10</span><span class="p">(</span><span class="n">parseint</span><span class="p">(</span><span class="n">substring</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)));</span>
                  <span class="n">i</span>    <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span>
                  <span class="k">break</span><span class="p">;</span>
        <span class="k">default</span> <span class="o">:</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">10</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="o">*</span><span class="n">r</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="sc">'0'</span><span class="p">);</span>
                  <span class="n">exp</span> <span class="o">*=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">sign</span> <span class="o">*</span> <span class="n">r</span> <span class="o">*</span> <span class="p">(</span><span class="n">exp</span> <span class="o">?</span> <span class="n">exp</span> <span class="o">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Probably not as precise as <code class="language-plaintext highlighter-rouge">strtof</code>, but good enough for loading a model.
It’s also ~30% faster for this purpose than my system’s <code class="language-plaintext highlighter-rouge">strtof</code>. If it
hits an exponent, it combines <code class="language-plaintext highlighter-rouge">parseint</code> and <code class="language-plaintext highlighter-rouge">expt10</code> to augment the
result so far. At least for all the models I tried, the exponent only
appeared for tiny values. They round to zero with no visible effects, so
you can cut the implementation by more than half in one fell swoop if you
wish (no more <code class="language-plaintext highlighter-rouge">expt10</code> nor <code class="language-plaintext highlighter-rouge">substring</code> either):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="k">switch</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
        <span class="c1">// ...</span>
        <span class="k">case</span> <span class="sc">'E'</span><span class="p">:</span>
        <span class="k">case</span> <span class="sc">'e'</span><span class="p">:</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// probably small *shrug*</span>
        <span class="c1">// ...</span>
        <span class="p">}</span>
</code></pre></div></div>

<p>Why not <code class="language-plaintext highlighter-rouge">strtof</code>? That has the rather annoying requirement that input is
null terminated, which is not the case here. Worse, it’s <a href="https://github.com/mpv-player/mpv/commit/1e70e82b">affected by the
locale</a> and behaves neither consistently nor reliably.</p>
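
<p>To make the first complaint concrete, here is a rough sketch of what it
takes to hand a non-terminated slice to <code class="language-plaintext highlighter-rouge">strtof</code>. The name
<code class="language-plaintext highlighter-rouge">strtof_slice</code> and the 64-byte token cap are assumptions made up for this
illustration, not part of the parser:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (an assumption for illustration): parse a float
// from a slice that is not null terminated by first copying it into a
// terminated buffer.
float strtof_slice(const char *data, ptrdiff_t len)
{
    char buf[64];  // arbitrary cap on token length for this sketch
    assert(len >= 0);
    if (len > (ptrdiff_t)sizeof(buf) - 1) {
        len = (ptrdiff_t)sizeof(buf) - 1;  // truncate overlong tokens
    }
    if (len) {
        memcpy(buf, data, (size_t)len);
    }
    buf[len] = 0;
    // The copy fixes termination only. The decimal point strtof accepts
    // still depends on the current locale (LC_NUMERIC).
    return strtof(buf, 0);
}
```

<p>Every token pays for a copy, and the result still hinges on whatever
locale the rest of the program happened to select.</p>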

<p>A vertex is three floats separated by whitespace. So combine <code class="language-plaintext highlighter-rouge">cut</code> and
<code class="language-plaintext highlighter-rouge">parsefloat</code> to parse one.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">v</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>
<span class="p">}</span> <span class="n">Vert</span><span class="p">;</span>

<span class="n">Vert</span> <span class="nf">parsevert</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Vert</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">Cut</span> <span class="n">c</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">trimleft</span><span class="p">(</span><span class="n">s</span><span class="p">),</span> <span class="sc">' '</span><span class="p">);</span>
    <span class="n">r</span><span class="p">.</span><span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">parsefloat</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">head</span><span class="p">);</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">trimleft</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">tail</span><span class="p">),</span> <span class="sc">' '</span><span class="p">);</span>
    <span class="n">r</span><span class="p">.</span><span class="n">v</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">parsefloat</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">head</span><span class="p">);</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">trimleft</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">tail</span><span class="p">),</span> <span class="sc">' '</span><span class="p">);</span>
    <span class="n">r</span><span class="p">.</span><span class="n">v</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">parsefloat</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">head</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">cut</code> parses a field between every space, including empty fields between
adjacent spaces, so <code class="language-plaintext highlighter-rouge">trimleft</code> discards extra space before cutting. If the
line ends early, this passes empty strings into <code class="language-plaintext highlighter-rouge">parsefloat</code> which come
out as zeros. No special checks required for invalid input.</p>
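
<p>A small sketch of that behavior, for anyone who wants to poke at it in
isolation. The bodies of <code class="language-plaintext highlighter-rouge">cut</code> and <code class="language-plaintext highlighter-rouge">trimleft</code> below are stand-ins written
for this illustration, and <code class="language-plaintext highlighter-rouge">field</code> is a hypothetical helper, so treat the
details as assumptions:</p>

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct {
    Str head;
    Str tail;
} Cut;

// Split s at the first occurrence of c. With no delimiter the whole
// input becomes the head and the tail is empty.
Cut cut(Str s, char c)
{
    assert(s.len >= 0);
    Cut r = {0};
    ptrdiff_t i = 0;
    while (i < s.len && s.data[i] != c) {
        i++;
    }
    r.head.data = s.data;
    r.head.len  = i;
    if (i < s.len) {
        r.tail.data = s.data + i + 1;
        r.tail.len  = s.len - i - 1;
    }
    return r;
}

// Drop leading spaces and control characters.
Str trimleft(Str s)
{
    while (s.len && (unsigned char)*s.data <= ' ') {
        s.data++;
        s.len--;
    }
    return s;
}

// Hypothetical helper: return the nth (zero-based) space-separated
// field, or an empty string when the line ends early.
Str field(Str line, int n)
{
    Cut c = {0};
    c.tail = line;
    for (int i = 0; i <= n; i++) {
        c = cut(trimleft(c.tail), ' ');
    }
    return c.head;
}
```

<p>Given a line with only two numbers, asking for the third field yields a
zero-length string rather than an error, which is what lets the vertex
parser skip explicit checks.</p>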

<p>A face is a set of three vertex indices and three normal indices, and
parses almost the same way. Relative indices are immediately converted to
absolute indices using the number of vertices/normals so far.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">v</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>
    <span class="kt">int32_t</span> <span class="n">n</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>
<span class="p">}</span> <span class="n">Face</span><span class="p">;</span>

<span class="k">static</span> <span class="n">Face</span> <span class="nf">parseface</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">nverts</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">nnorms</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Face</span> <span class="n">r</span>      <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">Cut</span>  <span class="n">fields</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">fields</span><span class="p">.</span><span class="n">tail</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">fields</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">trimleft</span><span class="p">(</span><span class="n">fields</span><span class="p">.</span><span class="n">tail</span><span class="p">),</span> <span class="sc">' '</span><span class="p">);</span>
        <span class="n">Cut</span> <span class="n">elem</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">fields</span><span class="p">.</span><span class="n">head</span><span class="p">,</span> <span class="sc">'/'</span><span class="p">);</span>
        <span class="n">r</span><span class="p">.</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">parseint</span><span class="p">(</span><span class="n">elem</span><span class="p">.</span><span class="n">head</span><span class="p">);</span>
        <span class="n">elem</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">elem</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'/'</span><span class="p">);</span>  <span class="c1">// skip texture</span>
        <span class="n">elem</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">elem</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'/'</span><span class="p">);</span>
        <span class="n">r</span><span class="p">.</span><span class="n">n</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">parseint</span><span class="p">(</span><span class="n">elem</span><span class="p">.</span><span class="n">head</span><span class="p">);</span>

        <span class="c1">// Process relative subscripts</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">r</span><span class="p">.</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">r</span><span class="p">.</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">nverts</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">n</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">r</span><span class="p">.</span><span class="n">n</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">r</span><span class="p">.</span><span class="n">n</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">nnorms</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">nverts</code> must be non-negative, and a relative index is negative by
definition, adding them together can never overflow. If there are too many
vertices, the result might be truncated, as indicated by the cast. That’s
fine. Just invalid input.</p>

<p>There’s an interesting interview question here: Consider this alternative
to the above, maintaining the explicit cast to dismiss the <code class="language-plaintext highlighter-rouge">-Wconversion</code>
warning.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            <span class="n">r</span><span class="p">.</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">nverts</span><span class="p">);</span>
</code></pre></div></div>

<p>Is it equivalent? Can this overflow? (Answers: No and yes.) If yes, under
what conditions? Unfortunately a fuzz test would never hit it.</p>
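
<p>One way to explore the question without invoking undefined behavior is
to redo both computations in 64-bit arithmetic and check whether the
32-bit addition would have left its range. This is only a sketch: the
function names are hypothetical, <code class="language-plaintext highlighter-rouge">int64_t</code> stands in for <code class="language-plaintext highlighter-rouge">ptrdiff_t</code>, and
<code class="language-plaintext highlighter-rouge">wrap32</code> assumes the cast truncates in the usual two’s complement way.</p>

```c
#include <assert.h>
#include <stdint.h>

// Model two's complement truncation of an int64_t to int32_t without
// relying on an implementation-defined conversion.
static int64_t wrap32(int64_t x)
{
    uint32_t u = (uint32_t)x;  // well-defined: reduces modulo 2^32
    return u > (uint32_t)INT32_MAX ? (int64_t)u - INT64_C(4294967296)
                                   : (int64_t)u;
}

// First form: sum in 64 bits, then truncate once at the end.
int64_t resolve_wide(int32_t idx, int64_t nverts)
{
    assert(idx < 0 && nverts >= 0 && nverts < INT64_MAX);
    return wrap32(idx + 1 + nverts);
}

// Second form: truncate (1 + nverts) first, then add in 32 bits.
// Returns 1 when that 32-bit addition would overflow.
int narrow_overflows(int32_t idx, int64_t nverts)
{
    assert(idx < 0 && nverts >= 0 && nverts < INT64_MAX);
    int64_t sum = idx + wrap32(1 + nverts);  // exact, computed in 64 bits
    return sum < INT32_MIN || sum > INT32_MAX;
}
```

<p>Trying <code class="language-plaintext highlighter-rouge">narrow_overflows</code> with small and then very large vertex counts
shows where the two forms part ways, and why a fuzzer working with small
inputs has no realistic chance of getting there.</p>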

<h3 id="putting-it-together">Putting it together</h3>

<p>For this case, a model is three arrays: vertices, normals, and faces.
While faces only support 32-bit indexing, I use <code class="language-plaintext highlighter-rouge">ptrdiff_t</code> for the counts
in order to skip overflow checks. There cannot possibly be more vertices
than bytes of source, so these counts cannot overflow.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Vert</span>     <span class="o">*</span><span class="n">verts</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">nverts</span><span class="p">;</span>
    <span class="n">Vert</span>     <span class="o">*</span><span class="n">norms</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">nnorms</span><span class="p">;</span>
    <span class="n">Face</span>     <span class="o">*</span><span class="n">faces</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">nfaces</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Model</span><span class="p">;</span>

<span class="n">Model</span> <span class="nf">parseobj</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">Str</span><span class="p">);</span>
</code></pre></div></div>

<p>They’d probably look a little nicer as <a href="/blog/2023/10/05/">dynamic arrays</a>, but we won’t
need that machinery. That’s because the parser makes two passes over the
OBJ source, the first time to count:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Model</span> <span class="n">m</span>     <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">Cut</span>   <span class="n">lines</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>

    <span class="n">lines</span><span class="p">.</span><span class="n">tail</span> <span class="o">=</span> <span class="n">obj</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">lines</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">lines</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">lines</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">);</span>
        <span class="n">Cut</span> <span class="n">fields</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">trimright</span><span class="p">(</span><span class="n">lines</span><span class="p">.</span><span class="n">head</span><span class="p">),</span> <span class="sc">' '</span><span class="p">);</span>
        <span class="n">Str</span> <span class="n">kind</span> <span class="o">=</span> <span class="n">fields</span><span class="p">.</span><span class="n">head</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="s">"v"</span><span class="p">),</span> <span class="n">kind</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">m</span><span class="p">.</span><span class="n">nverts</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="s">"vn"</span><span class="p">),</span> <span class="n">kind</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">m</span><span class="p">.</span><span class="n">nnorms</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="s">"f"</span><span class="p">),</span> <span class="n">kind</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">m</span><span class="p">.</span><span class="n">nfaces</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>It’s a lightweight pass, skipping over the numeric data. With that
information collected, we can allocate the model:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">m</span><span class="p">.</span><span class="n">verts</span>  <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">nverts</span><span class="p">,</span> <span class="n">Vert</span><span class="p">);</span>
    <span class="n">m</span><span class="p">.</span><span class="n">norms</span>  <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">nnorms</span><span class="p">,</span> <span class="n">Vert</span><span class="p">);</span>
    <span class="n">m</span><span class="p">.</span><span class="n">faces</span>  <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">nfaces</span><span class="p">,</span> <span class="n">Face</span><span class="p">);</span>
    <span class="n">m</span><span class="p">.</span><span class="n">nverts</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">nnorms</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">nfaces</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>On the next pass we call <code class="language-plaintext highlighter-rouge">parsevert</code> and <code class="language-plaintext highlighter-rouge">parseface</code> to fill it out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">lines</span><span class="p">.</span><span class="n">tail</span> <span class="o">=</span> <span class="n">obj</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">lines</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">lines</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">lines</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">);</span>
        <span class="n">Cut</span> <span class="n">fields</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">trimright</span><span class="p">(</span><span class="n">lines</span><span class="p">.</span><span class="n">head</span><span class="p">),</span> <span class="sc">' '</span><span class="p">);</span>
        <span class="n">Str</span> <span class="n">kind</span> <span class="o">=</span> <span class="n">fields</span><span class="p">.</span><span class="n">head</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="s">"v"</span><span class="p">),</span> <span class="n">kind</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">m</span><span class="p">.</span><span class="n">verts</span><span class="p">[</span><span class="n">m</span><span class="p">.</span><span class="n">nverts</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">parsevert</span><span class="p">(</span><span class="n">fields</span><span class="p">.</span><span class="n">tail</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="s">"vn"</span><span class="p">),</span> <span class="n">kind</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">m</span><span class="p">.</span><span class="n">norms</span><span class="p">[</span><span class="n">m</span><span class="p">.</span><span class="n">nnorms</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">parsevert</span><span class="p">(</span><span class="n">fields</span><span class="p">.</span><span class="n">tail</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="s">"f"</span><span class="p">),</span> <span class="n">kind</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">m</span><span class="p">.</span><span class="n">faces</span><span class="p">[</span><span class="n">m</span><span class="p">.</span><span class="n">nfaces</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">parseface</span><span class="p">(</span><span class="n">fields</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">nverts</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">nnorms</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>At this point the model is parsed, though it’s not necessarily consistent.
Face indices may still be out of range. The next step is to transform it
into a more useful representation.</p>

<h3 id="transformation">Transformation</h3>

<p>Rendering the model is the easiest way to verify it came out alright, and
it’s generally useful for debugging problems. Because it basically does
all the hard work for us, and doesn’t require <a href="https://www.khronos.org/opengl/wiki/OpenGL_Loading_Library">ridiculous contortions to
access</a>, I’m going to render with old school OpenGL 1.1. It provides a
<a href="https://registry.khronos.org/OpenGL-Refpages/gl2.1/xhtml/glInterleavedArrays.xml"><code class="language-plaintext highlighter-rouge">glInterleavedArrays</code></a> function with a bunch of predefined formats.
The one that interests me is <code class="language-plaintext highlighter-rouge">GL_N3F_V3F</code>, where each vertex is a normal
and a position. Each face is three such elements. I came up with this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>  <span class="c1">// GL_N3F_V3F</span>
    <span class="n">Vert</span> <span class="n">n</span><span class="p">,</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span> <span class="n">N3FV3F</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">N3FV3F</span>   <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">N3FV3Fs</span><span class="p">;</span>

<span class="c1">// Transform a model into a GL_N3F_V3F representation.</span>
<span class="n">N3FV3Fs</span> <span class="nf">n3fv3fize</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">Model</span><span class="p">);</span>
</code></pre></div></div>
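
<p>Passing this array to OpenGL with a stride of zero relies on the structs
being tightly packed: six floats per element, normal first, then position.
A sketch of compile-time checks for that assumption follows. The checks and
the <code class="language-plaintext highlighter-rouge">nfloats</code> helper are my own additions for illustration, and they
require C11 for <code class="language-plaintext highlighter-rouge">_Static_assert</code>:</p>

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float v[3];
} Vert;

typedef struct {  // GL_N3F_V3F
    Vert n, v;
} N3FV3F[3];

// GL_N3F_V3F with a stride of zero means six tightly packed floats per
// element. These checks fail the build if the compiler inserts padding
// or if float is not four bytes.
_Static_assert(sizeof(float) == 4, "expected 32-bit float");
_Static_assert(sizeof(Vert) == 3*sizeof(float), "Vert has padding");
_Static_assert(sizeof(N3FV3F) == 3*6*sizeof(float), "N3FV3F has padding");

// Hypothetical helper: number of floats in ntriangles faces, e.g. for
// sizing a buffer.
ptrdiff_t nfloats(ptrdiff_t ntriangles)
{
    assert(ntriangles >= 0);
    return ntriangles * (ptrdiff_t)(sizeof(N3FV3F) / sizeof(float));
}
```

<p>On mainstream compilers these all pass, so the checks cost nothing and
only document the layout the renderer depends on.</p>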

<p>If you’re being precise you’d use <code class="language-plaintext highlighter-rouge">GLfloat</code>, but this is good enough for
me. By using a different arena for this step, we can discard the OBJ data
once it’s in the “local” format. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Arena</span> <span class="n">perm</span>    <span class="o">=</span> <span class="p">{...};</span>
    <span class="n">Arena</span> <span class="n">scratch</span> <span class="o">=</span> <span class="p">{...};</span>

    <span class="n">N3FV3Fs</span> <span class="o">*</span><span class="n">scene</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="o">&amp;</span><span class="n">perm</span><span class="p">,</span> <span class="n">nmodels</span><span class="p">,</span> <span class="n">N3FV3Fs</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">nmodels</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">Arena</span> <span class="n">temp</span>  <span class="o">=</span> <span class="n">scratch</span><span class="p">;</span>  <span class="c1">// free OBJ at end of iteration</span>
        <span class="n">Str</span>   <span class="n">obj</span>   <span class="o">=</span> <span class="n">loadfile</span><span class="p">(</span><span class="o">&amp;</span><span class="n">temp</span><span class="p">,</span> <span class="n">path</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
        <span class="n">Model</span> <span class="n">model</span> <span class="o">=</span> <span class="n">parseobj</span><span class="p">(</span><span class="o">&amp;</span><span class="n">temp</span><span class="p">,</span> <span class="n">obj</span><span class="p">);</span>
        <span class="n">scene</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>    <span class="o">=</span> <span class="n">n3fv3fize</span><span class="p">(</span><span class="o">&amp;</span><span class="n">perm</span><span class="p">,</span> <span class="n">model</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>
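
<p>The “free at end of iteration” comment works because an arena is just a
pair of pointers, so copying it by value and allocating from the copy
leaves the original untouched. Here is a minimal sketch of that idea. The
<code class="language-plaintext highlighter-rouge">Arena</code> layout and the <code class="language-plaintext highlighter-rouge">alloc</code> and <code class="language-plaintext highlighter-rouge">avail</code> functions are assumptions for
illustration, not necessarily the definitions used in this program:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

// Bump allocator: return size bytes aligned to align (a power of two),
// or null when the arena is exhausted.
void *alloc(Arena *a, ptrdiff_t size, ptrdiff_t align)
{
    assert(size >= 0 && align > 0 && (align & (align - 1)) == 0);
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (size > a->end - a->beg - pad) {
        return 0;  // out of memory
    }
    char *r = a->beg + pad;
    a->beg = r + size;  // only this Arena value advances
    return r;
}

// Bytes still available in an arena.
ptrdiff_t avail(Arena a)
{
    return a.end - a.beg;
}
```

<p>Allocating through a copy advances only the copy. When the copy goes out
of scope at the end of the loop body, the original <code class="language-plaintext highlighter-rouge">scratch</code> still reports
the same space available, so the next model reuses the same memory.</p>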

<p>The conversion allocates the <code class="language-plaintext highlighter-rouge">GL_N3F_V3F</code> array, discards invalid faces,
and copies the valid faces into the array:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">N3FV3Fs</span> <span class="nf">n3fv3fize</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Model</span> <span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">N3FV3Fs</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">nfaces</span><span class="p">,</span> <span class="n">N3FV3F</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">f</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">f</span> <span class="o">&lt;</span> <span class="n">m</span><span class="p">.</span><span class="n">nfaces</span><span class="p">;</span> <span class="n">f</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">_Bool</span> <span class="n">valid</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">valid</span> <span class="o">&amp;=</span> <span class="n">m</span><span class="p">.</span><span class="n">faces</span><span class="p">[</span><span class="n">f</span><span class="p">].</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&gt;</span><span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">m</span><span class="p">.</span><span class="n">faces</span><span class="p">[</span><span class="n">f</span><span class="p">].</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&lt;=</span><span class="n">m</span><span class="p">.</span><span class="n">nverts</span><span class="p">;</span>
            <span class="n">valid</span> <span class="o">&amp;=</span> <span class="n">m</span><span class="p">.</span><span class="n">faces</span><span class="p">[</span><span class="n">f</span><span class="p">].</span><span class="n">n</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&gt;</span><span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">m</span><span class="p">.</span><span class="n">faces</span><span class="p">[</span><span class="n">f</span><span class="p">].</span><span class="n">n</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&lt;=</span><span class="n">m</span><span class="p">.</span><span class="n">nnorms</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">valid</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">ptrdiff_t</span> <span class="n">t</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">t</span><span class="p">][</span><span class="n">i</span><span class="p">].</span><span class="n">n</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">norms</span><span class="p">[</span><span class="n">m</span><span class="p">.</span><span class="n">faces</span><span class="p">[</span><span class="n">f</span><span class="p">].</span><span class="n">n</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
                <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">t</span><span class="p">][</span><span class="n">i</span><span class="p">].</span><span class="n">v</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="n">verts</span><span class="p">[</span><span class="n">m</span><span class="p">.</span><span class="n">faces</span><span class="p">[</span><span class="n">f</span><span class="p">].</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here’s what that looks like in OpenGL with <a href="https://chuck.stanford.edu/chugl/examples/data/models/suzanne.obj"><code class="language-plaintext highlighter-rouge">suzanne.obj</code></a> and
<a href="https://casual-effects.com/data/"><code class="language-plaintext highlighter-rouge">bmw.obj</code></a>:</p>

<p><img src="/img/objrender/suzanne.png" alt="" /></p>

<p><img src="/img/objrender/bmw.png" alt="" /></p>

<p>This was a fun little project, and perhaps you learned a new technique or
two after checking it out.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Meet the new xxd for w64devkit: rexxd</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/02/17/"/>
    <id>urn:uuid:a3ad2465-f53c-43d3-acc7-988d9d4d3989</id>
    <updated>2025-02-17T00:49:49Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>xxd is a versatile hexdump utility with a “reverse” feature, originally
written between 1990–1996. The Vim project soon adopted it, and it’s lived
there ever since. If you have Vim, you also have xxd. Its primary use
cases are (1) the basis for a hex editor due to its <code class="language-plaintext highlighter-rouge">-r</code> reverse option
that can <em>unhexdump</em> its previous output, and (2) a data embedding tool
for C and C++ (<code class="language-plaintext highlighter-rouge">-i</code>). The former provides Vim’s rudimentary hex editor
functionality. The second case is of special interest to <a href="https://github.com/skeeto/w64devkit">w64devkit</a>:
<code class="language-plaintext highlighter-rouge">xxd -i</code> appears in many builds that <a href="/blog/2016/11/15/">embed arbitrary data</a>. It’s
important that w64devkit has a compatible implementation, and a freshly
rewritten, improved xxd, <strong><a href="https://github.com/skeeto/w64devkit/blob/master/src/rexxd.c">rexxd</a></strong>, now replaces the original xxd (as
<code class="language-plaintext highlighter-rouge">xxd</code>).</p>

<p>For those unfamiliar with xxd, examples are in order. Its default hexdump
output looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo hello world | xxd | tee dump
00000000: 6865 6c6c 6f20 776f 726c 640a            hello world.
</code></pre></div></div>

<p>Octets display in pairs with an ASCII text listing on the right. All
configurable. I can run this in reverse (<code class="language-plaintext highlighter-rouge">-r</code>), recovering the original
input:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ xxd -r dump
hello world
</code></pre></div></div>

<p>The tool reads the offset before the colon, the hexadecimal octets, and
ignores the text column. By editing <code class="language-plaintext highlighter-rouge">dump</code> with a text editor, I can
change the raw octets of the original input. From this point of view, the
hexdump is actually a program of two alternating instructions: seek and
write. xxd <em>seeks</em> to the offset, <em>writes</em> the octets, then repeats. It
also doesn’t truncate the output file, so a hexdump can express binary
patches as a seek/write program.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo hello world &gt;hello
$ echo 6: 65766572796f6e650a | xxd -r - hello
$ cat hello
hello everyone
</code></pre></div></div>

<p>That seeks to offset <code class="language-plaintext highlighter-rouge">0x6</code>, then writes the 9 octets. The xxd parser is
flexible, and I did not need to follow the default format. It figured out
the format on its own, and rexxd further improves on this. We can use it
to create large files out of thin air, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo 3fffffff: 00 | xxd -r - &gt;1G
</code></pre></div></div>

<p>This command creates an all-zero, 1GiB file, <code class="language-plaintext highlighter-rouge">1G</code>, by seeking to just
before 1GiB then writing a zero. I used <code class="language-plaintext highlighter-rouge">&gt;1G</code> so that the shell would
truncate the file before starting <code class="language-plaintext highlighter-rouge">xxd</code> — in case it was larger or
contained non-zeros.</p>

<p>This is a “smart seek” of course, and it’s not literally seeking on every
line. The tool tracks its file position and only seeks when necessary. If
seeking fails, it simulates the seek using a write if possible. When would
it not be possible? Lines need not be in order, of course, and so it may
need to seek backwards. Lines can also overlap in contents. If it weren’t
for buffering — or if rexxd had a <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/UnifiedBufferCache">unified buffer cache</a> — then by
using the same file for input and output an “xxd program” could write new
instructions for itself and <a href="/blog/2016/04/30/">accidentally become Turing-complete</a>.</p>
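<p>To make the smart seek concrete, here is a minimal sketch of the idea, not rexxd’s actual code. The in-memory <code class="language-plaintext highlighter-rouge">Out</code> struct, its <code class="language-plaintext highlighter-rouge">seekable</code> flag, and the function names are all assumptions for illustration, standing in for a real file or pipe:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical in-memory output standing in for a file or a pipe.
// Zero-initialize it so that seeked-over gaps read back as zeros.
typedef struct {
    uint8_t data[64];
    int64_t pos;       // tracked file position
    int64_t len;       // high-water mark
    int     seekable;  // 0 simulates a pipe: every seek fails
    int     nseeks;    // number of real seeks issued
} Out;

int out_write(Out *o, const uint8_t *buf, int64_t n)
{
    if (n > (int64_t)sizeof(o->data) - o->pos) return 0;
    memcpy(o->data + o->pos, buf, (size_t)n);
    o->pos += n;
    if (o->pos > o->len) o->len = o->pos;
    return 1;
}

int out_seek(Out *o, int64_t off)
{
    if (!o->seekable || off < 0 || off > (int64_t)sizeof(o->data)) return 0;
    o->nseeks++;
    o->pos = off;
    return 1;
}

// Move to off, seeking only when necessary. If the seek fails and the
// target lies ahead, simulate the seek by writing zeros.
int out_goto(Out *o, int64_t off)
{
    static const uint8_t zeros[16] = {0};
    if (off == o->pos) return 1;   // already there, no seek
    if (out_seek(o, off)) return 1;
    if (off < o->pos) return 0;    // cannot simulate a backwards seek
    while (o->pos < off) {
        int64_t n = off - o->pos;
        if (n > (int64_t)sizeof(zeros)) n = (int64_t)sizeof(zeros);
        if (!out_write(o, zeros, n)) return 0;
    }
    return 1;
}
```

<p>With <code class="language-plaintext highlighter-rouge">seekable</code> cleared, a forward <code class="language-plaintext highlighter-rouge">out_goto</code> pads with zeros while a backward one fails, which is the “when would it not be possible” case.</p>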

<p>The other common mode, <code class="language-plaintext highlighter-rouge">-i</code>, looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo hello world &gt;hello
$ xxd -i hello hello.c
</code></pre></div></div>

<p>Which produces this <code class="language-plaintext highlighter-rouge">hello.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">hello</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="mh">0x68</span><span class="p">,</span> <span class="mh">0x65</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x6f</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="mh">0x77</span><span class="p">,</span> <span class="mh">0x6f</span><span class="p">,</span> <span class="mh">0x72</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x64</span><span class="p">,</span> <span class="mh">0x0a</span>
<span class="p">};</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">hello_len</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span>
</code></pre></div></div>

<p>Note how it converted the file name into variable names. Characters
disallowed in variable names become underscores <code class="language-plaintext highlighter-rouge">_</code>. When reading from
standard input, xxd only emits the octets, unless the new-ish <code class="language-plaintext highlighter-rouge">-n</code> name
option is given, in which case that becomes the variable name. This
remains popular because, <a href="https://en.cppreference.com/w/c/preprocessor/embed"><code class="language-plaintext highlighter-rouge">#embed</code></a> notwithstanding, as of this
writing all major toolchains remain stubborn about embedding data on their
own.</p>
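<p>As a rough illustration of the name conversion, here is an ASCII-only sketch. The helper <code class="language-plaintext highlighter-rouge">c_ident</code> is hypothetical and is not taken from xxd or rexxd. It only applies the underscore rule described above, and leaves a leading digit alone:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copy name into dst, replacing every byte that is
// not an ASCII letter, digit, or underscore with '_'. Returns 0 if dst is
// too small. ASCII-only, and a leading digit is left as-is.
int c_ident(char *dst, size_t cap, const char *name)
{
    size_t len = strlen(name);
    if (len >= cap) return 0;
    for (size_t i = 0; i < len; i++) {
        char c = name[i];
        int ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                 (c >= '0' && c <= '9') || c == '_';
        dst[i] = ok ? c : '_';
    }
    dst[len] = 0;
    return 1;
}
```

<p>Under these assumptions a file named <code class="language-plaintext highlighter-rouge">my-data.bin</code> would yield the variable name <code class="language-plaintext highlighter-rouge">my_data_bin</code>.</p>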

<h3 id="the-case-for-replacement">The case for replacement</h3>

<p>The idea of replacing it began with backporting the <code class="language-plaintext highlighter-rouge">-n</code> name option to
Vim 9.0 xxd. The feature did not appear in a release until a year ago, 28
years after <code class="language-plaintext highlighter-rouge">-i</code>, despite its obviousness. I’ve also felt that <code class="language-plaintext highlighter-rouge">xxd</code> is
slower than it could be, and a momentary examination reveals it’s buggier
than it ought to be. <a href="/blog/2025/02/05/">As expected</a>, a few seconds of fuzz testing
<code class="language-plaintext highlighter-rouge">xxd -r</code> reveals bugs, and it doesn’t even require writing a single line
of code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ afl-gcc -fsanitize=address,undefined xxd.c
$ mkdir inputs
$ echo &gt;inputs/sample
$ afl-fuzz -i inputs/ -o fuzzout/ ./a.out -r
</code></pre></div></div>

<p>The Windows port is lacking in the usual ways, unable to handle Unicode
paths. The new Vim 9.1 xxd <code class="language-plaintext highlighter-rouge">-R</code> color feature broke the Windows port, and
if w64devkit included Vim 9.1 then I’d need to patch out the new bugs. As
demonstrated above, at least it’s trivial to compile! It’s a single source
file, <code class="language-plaintext highlighter-rouge">xxd.c</code>, and requires no configuration. I love that.</p>

<p>The more I looked, the more problems I found. It’s not doing anything
terribly complex, so I expected it wouldn’t be difficult to rewrite it
with a better foundation. So I did. Ignoring tests and documentation, my
rewrite is about twice as long. In exchange, it’s <em>substantially faster</em>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dd if=/dev/urandom of=bigfile bs=1M count=64

$ time orig-xxd bigfile dump
real    0m 4.40s
user    0m 2.89s
sys     0m 1.46s

$ time rexxd bigfile dump
real    0m 0.31s
user    0m 0.07s
sys     0m 0.21s
</code></pre></div></div>

<p>Same in reverse:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time orig-xxd -r dump nul
real    0m 5.81s
user    0m 5.67s
sys     0m 0.07s

$ time rexxd -r dump nul
real    0m 0.33s
user    0m 0.23s
sys     0m 0.09s
</code></pre></div></div>

<p>Or embedding data with rexxd:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time orig-xxd -i bigfile bigfile.c
real    0m 10.32s
user    0m 9.85s
sys     0m 0.37s

$ time rexxd -i bigfile bigfile.c
real    0m 0.40s
user    0m 0.07s
sys     0m 0.34s
</code></pre></div></div>

<p>I wanted to keep it portable and simple, so that’s without <a href="/blog/2021/12/04/">fancy SIMD
processing</a>. Just <a href="/blog/2022/04/30/">SWAR parsing</a>, <a href="/blog/2017/10/06/">branch avoidance</a>,
no division on hot paths, and sound architecture. I also optimized for the
typical case at the cost of the atypical case. It’s a little unfair to
compare it to a program probably first written on a 16-bit machine, but
there was time for it to pick up these techniques over the decades, too.</p>
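<p>For a taste of the SWAR style, here is a small sketch that decodes eight hex digits into four octets with no per-digit branches. It is my own illustration, not code from rexxd. The function name <code class="language-plaintext highlighter-rouge">hex8</code> is hypothetical, and it assumes all eight inputs are valid hex digits, where a real parser must also validate:</p>

```c
#include <assert.h>
#include <stdint.h>

// Decode 8 hex digits into 4 octets, treating a 64-bit word as eight
// 8-bit lanes. Assumes every input byte is a valid hex digit.
void hex8(uint8_t out[4], const uint8_t s[8])
{
    // Assemble little-endian so the result does not depend on host
    // byte order. Compilers typically reduce this to a single load.
    uint64_t x = 0;
    for (int i = 0; i < 8; i++) {
        x |= (uint64_t)s[i] << (8*i);
    }

    // Per lane: digits keep their low nibble; letters (bit 6 set) keep
    // their low nibble plus 9. No lane ever carries into its neighbor.
    uint64_t lo = x & 0x0f0f0f0f0f0f0f0f;
    uint64_t hi = (x >> 6) & 0x0101010101010101;
    uint64_t n  = lo + 9*hi;

    // Merge each pair of nibbles into the even-numbered lanes.
    uint64_t p = (n << 4 | n >> 8) & 0x00ff00ff00ff00ff;
    for (int i = 0; i < 4; i++) {
        out[i] = (uint8_t)(p >> (16*i));
    }
}
```

<p>Given <code class="language-plaintext highlighter-rouge">68656c6c</code> it produces the octets <code class="language-plaintext highlighter-rouge">68 65 6c 6c</code>, the start of the hexdump shown earlier.</p>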

<p>Unicode support works well:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat π
3.14159265358979323846264338327950288419716939937510582097494
$ rexxd -i π π.c
</code></pre></div></div>

<p>Producing this source with Unicode variables:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="err">π</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="mh">0x33</span><span class="p">,</span> <span class="mh">0x2e</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span> <span class="mh">0x39</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x36</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span> <span class="mh">0x33</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span>
  <span class="c1">// ...</span>
  <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x0a</span>
<span class="p">};</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="err">π</span><span class="n">_len</span> <span class="o">=</span> <span class="mi">62</span><span class="p">;</span>
</code></pre></div></div>

<p>Whereas the original xxd on Windows has the <a href="/blog/2021/12/30/">usual CRT problems</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ orig-xxd -i π
orig-xxd: p: No such file or directory
</code></pre></div></div>

<p>It also struggles with 64-bit offsets, particularly on 32-bit hosts and
LLP64 hosts like Windows. In contrast, I designed rexxd to robustly
process file offsets as 64-bit on all hosts. Its tests operate on a
virtual file system with virtual files at those sizes, so those paths
really have been tested, too.</p>

<p>The original xxd only uses static allocation, which places small range
limits on the configuration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ orig-xxd -c 1000
orig-xxd: invalid number of columns (max. 256).
</code></pre></div></div>

<p>In rexxd everything is <a href="/blog/2023/09/27/">arena allocated</a> of course, and options are
limited only by the available memory, so the above, and more, would work.
The arena helps make the SWAR tricks possible, too, providing a fast
runway to load more data at a time.</p>

<p>While reverse engineering the original, I documented bugs I discovered and
noted them with a <code class="language-plaintext highlighter-rouge">BUG:</code> comment, in case you want to see more. I’m not aiming
for bug compatibility, so these are not present in rexxd.</p>

<h3 id="platform-layer">Platform layer</h3>

<p>The <a href="https://manpages.debian.org/bookworm/xxd/xxd.1.en.html">xxd man page</a> suggests using strace to examine the execution of
<code class="language-plaintext highlighter-rouge">-r</code> reverse. That is, to monitor the seeks and writes of a binary patch
in order to debug it. That’s so insightful that I decided to build that as
a new <code class="language-plaintext highlighter-rouge">-x</code> option (think <code class="language-plaintext highlighter-rouge">sh -x</code>). That is, <em>rexxd has a built-in strace
on all platforms!</em> The trace is expressed in terms of unix system calls,
even on Windows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ printf '00:41 \n02:42 \n04:43' | rexxd -x -r - data.bin
open("data.bin", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 19
write(1, "A", 1) = 1
lseek(1, 2, SEEK_SET) = 2
read(0, ..., 4096) = 0
write(1, "B", 1) = 1
lseek(1, 4, SEEK_SET) = 4
write(1, "C", 1) = 1
exit(0) = ?
</code></pre></div></div>

<p>Is this doing some kind of self-<a href="/blog/2018/06/23/">ptrace</a> debugger voodoo? Nope. Like
<a href="/blog/2023/01/18/">u-config</a>, it has a <em>platform layer</em>, and it simply logs the platform
layer calls — except for the trace printout itself of course. While the
intention is to debug binary patches, it was also quite insightful in
examining rexxd itself. It helped me spot that rexxd flushed more often
than strictly necessary.</p>
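<p>The logging idea can be sketched in a few lines. This is a simplified illustration under my own assumptions: the <code class="language-plaintext highlighter-rouge">Tracer</code> struct, <code class="language-plaintext highlighter-rouge">traced_seek</code>, and the stub <code class="language-plaintext highlighter-rouge">plt_seek</code> are hypothetical. The trace line is stored in a buffer here, where a real tool would write it to standard error:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { int64_t pos; } Plt;  // hypothetical platform state

// Stub platform call: absolute seeks on a single fake file.
int64_t plt_seek(Plt *plt, int32_t fd, int64_t off, int32_t whence)
{
    (void)fd;
    (void)whence;
    plt->pos = off;
    return off;
}

// Hypothetical application-side tracing state.
typedef struct {
    Plt *plt;
    int  enabled;    // nonzero when tracing is requested
    char line[128];  // most recent trace line
} Tracer;

// Call through to the platform layer, then log the call and its result
// in strace-like form. The platform layer knows nothing about tracing.
int64_t traced_seek(Tracer *t, int32_t fd, int64_t off)
{
    int64_t r = plt_seek(t->plt, fd, off, 0);
    if (t->enabled) {
        snprintf(t->line, sizeof(t->line),
                 "lseek(%d, %lld, SEEK_SET) = %lld",
                 (int)fd, (long long)off, (long long)r);
    }
    return r;
}
```

<p>Because the wrapper sits above the platform layer, the same trace format comes out regardless of which platform implements <code class="language-plaintext highlighter-rouge">plt_seek</code>.</p>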

<p>To port rexxd to any system, define <code class="language-plaintext highlighter-rouge">Plt</code> as needed, implement these five
<code class="language-plaintext highlighter-rouge">plt_</code> functions, then call <code class="language-plaintext highlighter-rouge">xxd</code>. The five functions mostly have the
expected unix-like semantics:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">Plt</span> <span class="n">Plt</span><span class="p">;</span>
<span class="n">b32</span>  <span class="nf">plt_open</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="n">b32</span> <span class="n">trunc</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">);</span>
<span class="n">i64</span>  <span class="nf">plt_seek</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">i64</span> <span class="n">off</span><span class="p">,</span> <span class="n">i32</span> <span class="n">whence</span><span class="p">);</span>
<span class="n">i32</span>  <span class="nf">plt_read</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="n">i32</span> <span class="n">len</span><span class="p">);</span>
<span class="n">b32</span>  <span class="nf">plt_write</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="n">i32</span> <span class="n">len</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">plt_exit</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span><span class="p">);</span>
<span class="n">i32</span>  <span class="nf">xxd</span><span class="p">(</span><span class="n">i32</span> <span class="n">argc</span><span class="p">,</span> <span class="n">u8</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">byte</span> <span class="o">*</span><span class="n">heap</span><span class="p">,</span> <span class="n">iz</span> <span class="n">heapsize</span><span class="p">);</span>
</code></pre></div></div>

<p>If the platform wants these functions to be “virtual” then it can put
function pointers in the <code class="language-plaintext highlighter-rouge">Plt</code> struct. Otherwise it stores anything it
might need in <code class="language-plaintext highlighter-rouge">Plt</code>. Global variables are never necessary. The application
layer doesn’t use the standard library except (indirectly) <code class="language-plaintext highlighter-rouge">memset</code> and
<code class="language-plaintext highlighter-rouge">memcpy</code>, and it allocates everything it uses from the provided <code class="language-plaintext highlighter-rouge">heap</code>
parameter.</p>
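<p>Here is a sketch of what the “virtual” arrangement might look like for one function. The struct layout and the <code class="language-plaintext highlighter-rouge">mem_read</code> test double are my assumptions, not rexxd’s definitions. Only the <code class="language-plaintext highlighter-rouge">plt_read</code> signature comes from the interface above:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t u8;
typedef int32_t i32;

typedef struct Plt Plt;
struct Plt {
    i32 (*read)(Plt *, u8 *, i32);  // "virtual" method
    // Backing state for the in-memory implementation below.
    u8 *src;
    i32 len;
    i32 off;
};

// The fixed entry point that the application calls simply dispatches.
i32 plt_read(Plt *plt, u8 *buf, i32 len)
{
    return plt->read(plt, buf, len);
}

// Hypothetical test double: reads from memory instead of descriptor 0.
i32 mem_read(Plt *plt, u8 *buf, i32 len)
{
    i32 avail = plt->len - plt->off;
    i32 n = len < avail ? len : avail;
    memcpy(buf, plt->src + plt->off, (size_t)n);
    plt->off += n;
    return n;
}
```

<p>All state lives in the <code class="language-plaintext highlighter-rouge">Plt</code> instance, so several independent virtual platforms can coexist in one process without globals.</p>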

<p><code class="language-plaintext highlighter-rouge">plt_open</code> is a little unusual in that it picks the file descriptor: 0 to
replace standard input, or 1 to replace standard output. All platforms
currently use a virtual file descriptor table, and these do not map onto
the real process file descriptors. But they could! Calls are straced in
the application layer, so they log virtual file descriptors as seen by
rexxd. The arena parameter offers scratch space for the Windows platform
layer to convert paths from narrow to wide for <code class="language-plaintext highlighter-rouge">CreateFileW</code>, so it can
handle <a href="https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation">long path names</a> with ease.</p>

<p><code class="language-plaintext highlighter-rouge">plt_read</code> doesn’t accept a file descriptor because there’s only one from
which to read, 0. <code class="language-plaintext highlighter-rouge">plt_write</code> on the other hand allows writing to standard
error, 2.</p>

<p><code class="language-plaintext highlighter-rouge">plt_exit</code> doesn’t return, of course. In tests it <a href="/blog/2023/02/12/">longjmps</a> back
to the top level, as though returning from <code class="language-plaintext highlighter-rouge">xxd</code> with a status. This lets
me skip allocation null pointer checks, with OOM unwinding safely back to
the top level. Since rexxd allocates everything from the arena, it’s all
automatically deallocated, so it’s a clean exit.</p>
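<p>A minimal sketch of that pattern, assuming a bump allocator that carries a <code class="language-plaintext highlighter-rouge">jmp_buf</code> pointer. The names <code class="language-plaintext highlighter-rouge">alloc</code> and <code class="language-plaintext highlighter-rouge">run</code> are hypothetical, and in rexxd the unwinding goes through <code class="language-plaintext highlighter-rouge">plt_exit</code> rather than directly from the allocator:</p>

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char    *beg;
    char    *end;
    jmp_buf *oom;  // where to unwind when memory runs out
} Arena;

// Bump-allocate zeroed memory. Never returns null: on exhaustion it
// unwinds to the top level instead. Assumes size >= 0 and that align
// is a power of two.
void *alloc(Arena *a, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (size > a->end - a->beg - pad) {
        longjmp(*a->oom, 1);  // does not return
    }
    char *r = a->beg + pad;
    a->beg += pad + size;
    return memset(r, 0, (size_t)size);
}

// Hypothetical top level: returns 0 on success, 1 if allocation failed.
int run(ptrdiff_t request)
{
    char    heap[256];
    jmp_buf oom;
    Arena   a = {heap, heap + sizeof(heap), &oom};
    if (setjmp(oom)) {
        return 1;  // everything in the arena is discarded with it
    }
    char *p = alloc(&a, request, 1);  // no null check needed
    p[0] = 1;
    return 0;
}
```

<p>The caller of <code class="language-plaintext highlighter-rouge">alloc</code> never inspects the result, and a request larger than the heap still comes back to <code class="language-plaintext highlighter-rouge">run</code> as an ordinary status.</p>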

<p>On Windows, <code class="language-plaintext highlighter-rouge">plt_seek</code> calls <a href="https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-setfilepointerex"><code class="language-plaintext highlighter-rouge">SetFilePointerEx</code></a>. I learned the
hard way that the behavior of calling it on a non-file is undefined, not
an error, so at least one <code class="language-plaintext highlighter-rouge">GetFileType</code> call is mandatory. I also learned
that Windows will successfully seek all the way to <code class="language-plaintext highlighter-rouge">INT64_MAX</code>. If the
file system doesn’t support that offset, it’s a write failure <em>later</em>. For
correct operation, rexxd must take care not to overflow its own internal
file position tracking near these offsets, since Windows allows seeks to
operate at the edge until the first flush. Tests run on a virtual file
system thanks to the platform layer, and some tests permit huge seeks and
simulate impossibly enormous files in order to probe behavior at the
extremes.</p>

<p>This is in contrast to Linux, where seeking beyond the underlying file
system’s supported file size is a seek error. For example, on ext4 with
the default configuration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo ffffffff000: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040320, SEEK_SET) = 17592186040320
read(0, ..., 4096) = 0
write(1, "\0", 1) = -1
exit(3) = ?
</code></pre></div></div>

<p>We can see the seek succeeded, then the write failed because it went one
byte beyond the file system limit. Seeking one byte further causes the
seek itself to fail (22 <code class="language-plaintext highlighter-rouge">EINVAL</code>), and rexxd falls back on write until
it fills the storage and runs out of space:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo ffffffff001: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040321, SEEK_SET) = -1
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
...
</code></pre></div></div>

<p>Mostly for fun, I wrote a libc-free platform layer using <a href="/blog/2023/03/23/">raw Linux system
calls</a>, and it maps <em>almost</em> perfectly onto the kernel interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">Plt</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">fds</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span> <span class="p">};</span>

<span class="n">b32</span> <span class="nf">plt_open</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="n">plt</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="n">b32</span> <span class="n">trunc</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">i32</span> <span class="n">mode</span> <span class="o">=</span> <span class="n">fd</span> <span class="o">?</span> <span class="n">O_CREAT</span><span class="o">|</span><span class="n">O_WRONLY</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">mode</span> <span class="o">|=</span> <span class="n">trunc</span> <span class="o">?</span> <span class="n">O_TRUNC</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="n">fd</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">i32</span><span class="p">)</span><span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_open</span><span class="p">,</span> <span class="p">(</span><span class="n">uz</span><span class="p">)</span><span class="n">path</span><span class="p">,</span> <span class="n">mode</span><span class="p">,</span> <span class="mo">0666</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="n">fd</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">i64</span> <span class="nf">plt_seek</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="n">plt</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">i64</span> <span class="n">off</span><span class="p">,</span> <span class="n">i32</span> <span class="n">whence</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_lseek</span><span class="p">,</span> <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="n">fd</span><span class="p">],</span> <span class="n">off</span><span class="p">,</span> <span class="n">whence</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">i32</span> <span class="nf">plt_read</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="n">plt</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="n">i32</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">i32</span><span class="p">)</span><span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_read</span><span class="p">,</span> <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="n">uz</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">b32</span> <span class="nf">plt_write</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="n">plt</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="n">i32</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">len</span> <span class="o">==</span> <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_write</span><span class="p">,</span> <span class="n">plt</span><span class="o">-&gt;</span><span class="n">fds</span><span class="p">[</span><span class="n">fd</span><span class="p">],</span> <span class="p">(</span><span class="n">uz</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">plt_exit</span><span class="p">(</span><span class="n">Plt</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span> <span class="n">r</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall3</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On Windows I use the <a href="/blog/2023/05/31/">artisanal function prototypes</a> of which I’ve
grown so fond. It’s also my first time using w64devkit’s <code class="language-plaintext highlighter-rouge">-lmemory</code> in a
serious application. I’m using <a href="/blog/2024/02/05/"><code class="language-plaintext highlighter-rouge">-lchkstk</code></a> in the “xxd as a DLL”
platform layer, too, but that one’s just a toy. In that one I use <code class="language-plaintext highlighter-rouge">alloca</code>
to allocate an arena, which is a rather novel combination, and the large
stack frame requires a stack probe. Otherwise none of rexxd requires stack
probes.</p>

<p>w64devkit’s new <code class="language-plaintext highlighter-rouge">xxd.exe</code> is delightfully tidy as viewed by <a href="/blog/2024/06/30/">peports</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ du -h xxd.exe
28.0K   xxd.exe
$ peports xxd.exe
KERNEL32.dll
        0       CreateFileW
        0       ExitProcess
        0       GetCommandLineW
        0       GetFileType
        0       GetStdHandle
        0       MultiByteToWideChar
        0       ReadFile
        0       SetFilePointerEx
        0       VirtualAlloc
        0       WideCharToMultiByte
        0       WriteFile
SHELL32.dll
        0       CommandLineToArgvW
</code></pre></div></div>

<h3 id="other-notes">Other notes</h3>

<p><a href="/blog/2023/02/13/">Buffered output</a> and buffered input are custom tailored for rexxd.
When parsing line-oriented input, like <code class="language-plaintext highlighter-rouge">-r</code>, it attempts to parse out of
a <em>view</em> of the input buffer, no copying. The view is the <a href="/blog/2025/01/19/">usual string
representation</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef struct {
    u8 *data;
    iz  len;
} Str;
</code></pre></div></div>

<p>Does it fail if the line is longer than the buffer? If it straddles reads,
does that hurt efficiency? The answer to both is “no” due to the spillover
arena. <code class="language-plaintext highlighter-rouge">Input</code> is the buffered input struct, and here’s the interface to
get the next line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Str nextline(Input *, Arena *);
</code></pre></div></div>

<p>If the line isn’t entirely contained in the input buffer, the complete
line is <a href="/blog/2024/05/25/">concatenated</a> in the arena. So it comfortably handles
huge lines while no-copy optimizing for typical short, non-straddling
lines. With a per-iteration arena, any arena-backed line is automatically
freed at the end of the iteration, so it’s all transparent:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">Arena</span> <span class="n">scratch</span> <span class="o">=</span> <span class="n">perm</span><span class="p">;</span>
        <span class="n">Str</span> <span class="n">line</span> <span class="o">=</span> <span class="n">nextline</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
        <span class="c1">// ... line may point into an Input or scratch ...</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>If the line doesn’t fit in the arena, it triggers OOM handling. That is,
it calls <code class="language-plaintext highlighter-rouge">plt_exit</code> and something platform-appropriate happens without
returning. Beats the pants off <a href="https://man7.org/linux/man-pages/man3/getline.3.html">old <code class="language-plaintext highlighter-rouge">getline</code></a>!</p>
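
<p>To make the spillover idea concrete, here’s a minimal sketch of how such a <code class="language-plaintext highlighter-rouge">nextline</code> might work. It is not rexxd’s actual code: the <code class="language-plaintext highlighter-rouge">Input</code> layout, the tiny 8-byte buffer, the in-memory source, and the <code class="language-plaintext highlighter-rouge">refill</code>, <code class="language-plaintext highlighter-rouge">push</code>, and <code class="language-plaintext highlighter-rouge">concat</code> helpers are all assumptions for illustration, and OOM handling is reduced to an assertion.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { char *beg, *end; } Arena;
typedef struct { unsigned char *data; ptrdiff_t len; } Str;

// Hypothetical buffered input. The "file" is an in-memory source, and
// the buffer is deliberately tiny so that lines straddle refills.
typedef struct {
    unsigned char *src;       // unread source bytes
    ptrdiff_t      srclen;
    unsigned char  buf[8];
    ptrdiff_t      off, len;  // unread region is buf[off..len)
} Input;

void refill(Input *b)
{
    ptrdiff_t cap = sizeof(b->buf);
    ptrdiff_t n = b->srclen<cap ? b->srclen : cap;
    if (n) {
        memcpy(b->buf, b->src, n);
        b->src += n;
        b->srclen -= n;
    }
    b->off = 0;
    b->len = n;
}

// Copy n bytes to the tip of the arena, with no padding between calls.
unsigned char *push(Arena *a, unsigned char *p, ptrdiff_t n)
{
    assert(n <= a->end - a->beg);  // stand-in for real OOM handling
    unsigned char *r = (unsigned char *)a->beg;
    if (n) memcpy(r, p, n);
    a->beg += n;
    return r;
}

// Append tail to head, extending in place when head already sits at
// the tip of the arena.
Str concat(Arena *a, Str head, Str tail)
{
    if (!head.data || (char *)head.data+head.len != a->beg) {
        head.data = push(a, head.data, head.len);
    }
    push(a, tail.data, tail.len);
    head.len += tail.len;
    return head;
}

// Returns the next line without its newline, or a null Str at end of
// input. A view into the buffer is valid only until the next call.
Str nextline(Input *b, Arena *a)
{
    Str line = {0};
    for (;;) {
        if (b->off == b->len) {
            refill(b);
            if (!b->len) {
                return line;  // end of input: partial last line, or null
            }
        }
        unsigned char *beg = b->buf + b->off;
        ptrdiff_t avail = b->len - b->off;
        unsigned char *nl = memchr(beg, '\n', avail);
        Str part = {beg, nl ? nl-beg : avail};
        b->off += part.len + !!nl;
        if (nl && !line.data) {
            return part;  // common case: a view, no copy
        }
        line = concat(a, line, part);  // straddles a refill: spill over
        if (nl) {
            return line;
        }
    }
}
```

<p>In this sketch a short line comes back as a pointer into <code class="language-plaintext highlighter-rouge">buf</code>, while a line longer than the buffer is assembled piece by piece at the tip of the arena. The caller cannot tell the difference, and doesn’t need to.</p>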

<p>I came up with a <code class="language-plaintext highlighter-rouge">maxof</code> macro that evaluates to the maximum of any integral
type, signed or unsigned. It appears in <a href="/blog/2024/05/24/">overflow checks</a> and more.
I really like how it turned out. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">pos</span> <span class="o">&gt;</span> <span class="n">maxof</span><span class="p">(</span><span class="n">i64</span><span class="p">)</span> <span class="o">-</span> <span class="n">off</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// overflow</span>
    <span class="p">}</span>
    <span class="n">pos</span> <span class="o">+=</span> <span class="n">off</span><span class="p">;</span>
</code></pre></div></div>

<p>Or:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">i32</span> <span class="nf">trunc32</span><span class="p">(</span><span class="n">iz</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">n</span><span class="o">&gt;</span><span class="n">maxof</span><span class="p">(</span><span class="n">i32</span><span class="p">)</span> <span class="o">?</span> <span class="n">maxof</span><span class="p">(</span><span class="n">i32</span><span class="p">)</span> <span class="o">:</span> <span class="p">(</span><span class="n">i32</span><span class="p">)</span><span class="n">n</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
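
<p>The macro itself isn’t shown above, so here is one way such a macro could be written. Treat this definition as an assumption rather than rexxd’s own: it relies on 8-bit bytes and integer types without padding bits, and <code class="language-plaintext highlighter-rouge">addsat</code> is a hypothetical helper added only to exercise it.</p>

```c
#include <assert.h>
#include <stdint.h>

// Assumed definition, not necessarily the one in rexxd. For unsigned
// types, -1 wraps around to the maximum. For signed types, the maximum
// is built from the second-highest bit so no step overflows. Comparing
// against 1 instead of 0 avoids "always false" warnings on unsigned.
#define maxof(t) \
    ((t)-1 < 1 ? (((t)1 << (sizeof(t)*8 - 2)) - 1)*2 + 1 : (t)-1)

// Hypothetical helper: add a non-negative offset, clamping at the
// maximum instead of overflowing.
int64_t addsat(int64_t pos, int64_t off)
{
    assert(off >= 0);
    return pos > maxof(int64_t) - off ? maxof(int64_t) : pos + off;
}
```

<p>Since the argument is a type, not a value, the result is a constant expression and costs nothing at run time.</p>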

<p>Now that I have <code class="language-plaintext highlighter-rouge">-lmemory</code> and generally solved string function issues for
myself, I leaned into <code class="language-plaintext highlighter-rouge">__builtin_memset</code> and <code class="language-plaintext highlighter-rouge">__builtin_memcpy</code> for this
project. Despite <code class="language-plaintext highlighter-rouge">restrict</code>, it’s surprisingly difficult to get compilers
to optimize loops into semantically equivalent string function calls. An
explicit built-in solves that. It also produces faster debug builds, which
is what I run while I work. At <code class="language-plaintext highlighter-rouge">-O0</code>, rexxd is about half the speed of a
release build.</p>

<p>Other than <code class="language-plaintext highlighter-rouge">-x</code>, I don’t plan on inventing new features. I’d like to
maintain compatibility with the <code class="language-plaintext highlighter-rouge">xxd</code> found everywhere else, and I don’t
expect adoption beyond w64devkit. Overall the project took about twice as
long as I anticipated — two weekends instead of one — but it turned out
better than I expected and I’m very pleased with the results.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Tips for more effective fuzz testing with AFL++</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/02/05/"/>
    <id>urn:uuid:eff3b773-99ee-4c38-9f9c-f51294a1b9e0</id>
    <updated>2025-02-05T18:03:55Z</updated>
    <category term="c"/><category term="cpp"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p>Fuzz testing is incredibly effective for mechanically discovering software
defects, yet remains underused and neglected. Pick any program that must
gracefully accept complex input, written <em>in any language</em>, which has not
yet been fuzzed, and fuzz testing usually reveals at least one bug.
At least one program currently installed on your own computer certainly
qualifies. Perhaps even most of them. <a href="https://danluu.com/everything-is-broken/">Everything is broken</a> and
low-hanging fruit is everywhere. After fuzz testing ~1,000 projects <a href="/blog/2019/01/25/">over
the past six years</a>, I’ve accumulated tips for picking that fruit.
The checklist format has worked well in the past (<a href="/blog/2024/12/20/">1</a>, <a href="/blog/2023/01/08/">2</a>), so
I’ll use it again. This article discusses <a href="https://aflplus.plus/">AFL++</a> on source-available
C and C++ targets, running on glibc-based Linux distributions, currently
the <em>indisputable</em> best fuzzing platform for C and C++.</p>

<p>My tips complement the official, upstream documentation, so consult them,
too:</p>

<ul>
  <li><a href="https://afl-1.readthedocs.io/en/latest/tips.html">Performance Tips</a> on the AFL++ website</li>
  <li><a href="https://lcamtuf.coredump.cx/afl/technical_details.txt">Technical “whitepaper” for afl-fuzz</a></li>
</ul>

<p>Even if a program has been fuzz tested, applying the techniques in this
article may reveal defects missed by previous fuzz testing.</p>

<h3 id="1-configure-sanitizers-and-assertions">(1) Configure sanitizers and assertions</h3>

<p>More assertions means more effective fuzzing, and sanitizers are a kind of
automatically-inserted assertions. By default, fuzz with both Address
Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ afl-gcc-fast -g3 -fsanitize=address,undefined ...
</code></pre></div></div>

<p>ASan’s default configuration is not ideal, and should be adjusted via the
<code class="language-plaintext highlighter-rouge">ASAN_OPTIONS</code> environment variable. If customized at all, AFL++ requires
at least these options:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export ASAN_OPTIONS="abort_on_error=1:halt_on_error=1:symbolize=0"
</code></pre></div></div>

<p>Except <code class="language-plaintext highlighter-rouge">symbolize=0</code>, <a href="/blog/2022/06/26/">this <em>ought to be</em> the ASan default</a>. When
debugging a discovered crash, you’ll want UBSan set up the same way so
that it behaves under a debugger. To improve fuzzing, make ASan even
more sensitive to defects by detecting use-after-return bugs. It slows
fuzzing slightly, but it’s well worth the cost:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ASAN_OPTIONS+=":detect_stack_use_after_return=1"
</code></pre></div></div>

<p>By default ASan fills the first 4KiB of fresh allocations with a pattern,
to help detect use-after-free bugs. That’s not nearly enough for fuzzing.
Crank it up to completely fill virtually all allocations with a pattern:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ASAN_OPTIONS+=":max_malloc_fill_size=$((1&lt;&lt;30))"
</code></pre></div></div>

<p>In the default configuration, if a program allocates more than 4KiB with
<code class="language-plaintext highlighter-rouge">malloc</code> then, say, uses <code class="language-plaintext highlighter-rouge">strlen</code> on the uninitialized memory, no bug will
be detected. There’s almost certainly a zero somewhere after 4KiB. Until I
noticed it, the 4KiB limit hid a number of bugs from my fuzz testing. Per
(4), fully filling allocations with a pattern better isolates tests when
using persistent mode.</p>

<p>When fuzzing C++ and linking GCC’s libstdc++, consider <code class="language-plaintext highlighter-rouge">-D_GLIBCXX_DEBUG</code>.
ASan cannot “see” out-of-bounds accesses within a container’s capacity,
and the extra assertions fill in the gaps. Mind that it changes the ABI,
though fuzz testing will instantly highlight such mismatches.</p>

<h3 id="2-prefer-the-persistent-mode">(2) Prefer the persistent mode</h3>

<p>While AFL++ can fuzz many programs in-place without writing a single line
of code (<code class="language-plaintext highlighter-rouge">afl-gcc</code>, <code class="language-plaintext highlighter-rouge">afl-clang</code>), prefer AFL++’s <a href="https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.persistent_mode.md">persistent mode</a>
(<code class="language-plaintext highlighter-rouge">afl-gcc-fast</code>, <code class="language-plaintext highlighter-rouge">afl-clang-fast</code>). It’s typically an order of magnitude
faster and worth the effort. Though it also has pitfalls (see (4), (5)). I
keep a file on hand, <code class="language-plaintext highlighter-rouge">fuzztmpl.c</code> — the progenitor of all my fuzz testers:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="n">__AFL_FUZZ_INIT</span><span class="p">();</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__AFL_INIT</span><span class="p">();</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">src</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">__AFL_FUZZ_TESTCASE_BUF</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">__AFL_LOOP</span><span class="p">(</span><span class="mi">10000</span><span class="p">))</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">__AFL_FUZZ_TESTCASE_LEN</span><span class="p">;</span>
        <span class="n">src</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
        <span class="c1">// ... send src to target ...</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I <a href="https://vimhelp.org/insert.txt.html#%3Aread"><code class="language-plaintext highlighter-rouge">:r</code></a> this into my Vim buffer, then modify as needed. It’s a
stripped and improved version of the official template, which itself has a
serious flaw (see (5)). There are unstated constraints about the position
of <code class="language-plaintext highlighter-rouge">buf</code> and <code class="language-plaintext highlighter-rouge">len</code> in the code, so if in doubt, refer to the original
template.</p>

<h3 id="3-include-source-files-not-header-files">(3) Include source files, not header files</h3>

<p>We’re well into the 21st century. Nobody is compiling software on 16-bit
machines anymore. Don’t get hung up on the one translation unit (TU) per
source file mindset. When fuzz testing, we need at most two TUs: One TU
for instrumented code and one TU for uninstrumented code. In most cases
the latter takes the form of a library (libc, libstdc++, etc.) and we
don’t need to think about it.</p>

<p>Fuzz testing typically requires only a subset of the program. Including
just those sources straight in the template is both effective and simple.
In my template I put includes just <em>above</em> <code class="language-plaintext highlighter-rouge">unistd.h</code> so that the header
isn’t visible to the sources unless they include it themselves.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"src/utils.c"</span><span class="cp">
#include</span> <span class="cpf">"src/parser.c"</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>I know, if you’ve never seen this before it looks bonkers. This isn’t what
they taught you in college. Trust me, <a href="https://en.wikipedia.org/wiki/Unity_build">this simple technique</a> will
save you a thousand lines of build configuration. Otherwise you’ll need to
manage different object files between fuzz testing and otherwise.</p>

<p>Perhaps more importantly, you can now fuzz test <em>any arbitrary function</em>
in the program, including static functions! They’re all right there in the
same TU. You’re not limited to public-facing interfaces. Perhaps you can
skip (7) and test against a better internal interface. It also gives you
direct access to static variables so that you can clear/reset them between
tests, per (4).</p>

<p>Programs are often not designed for fuzz testing, or testing generally,
and it may be difficult to tease apart tightly-coupled components. Many of
the programs I’ve fuzz tested look like this. This technique lets you take
a hacksaw to the program and substitute troublesome symbols just for fuzz
testing without modifying a single original source line. For example, if
the source I’m testing contains a <code class="language-plaintext highlighter-rouge">main</code> function, I can remove it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define main oldmain
#  include "src/utils.c"
#  include "src/parser.c"
#undef main
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>Sure, better to improve the program so that such hacks are unnecessary,
but in most cases I’m fuzz testing as part of a drive-by review of some open
source project. It allows me to quickly discover defects in the original,
unmodified program, and produces simpler bug reports like, “Compile with
ASan, open this 50-byte file, and then the program will crash.”</p>

<h3 id="4-isolate-fuzz-tests-from-each-other">(4) Isolate fuzz tests from each other</h3>

<p>Tests should be unaffected by previous tests. This is challenging in
persistent mode, sometimes even impractical. That means resetting all
global state, even something like the internal <code class="language-plaintext highlighter-rouge">strtok</code> buffer if that
function is used. Add fuzz testing to your list of reasons to eschew
global variables.</p>

<p>It’s mitigated by (1), but otherwise uninitialized heap memory may hold
contents from previous tests, breaking isolation. Besides interference
with fuzzing instrumentation, bugs found this way are wickedly difficult
to reproduce.</p>

<p>Don’t pass uninitialized memory into a test, e.g. an output parameter
allocated on the stack. Zero-initialize or fill it with a pattern. If it
accepts an arena, fill it with a pattern before each test.</p>
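
<p>As a rough sketch of that per-test reset, assuming a made-up target with one global counter, an output struct, and a scratch buffer (none of these names come from a real program):</p>

```c
#include <assert.h>
#include <string.h>

enum { PATTERN = 0xa5 };

typedef struct {
    int  count;
    char name[8];
} Result;

static int ncalls;  // hypothetical global state inside the target

// Hypothetical target: counts 'x' bytes, leaves name untouched, and
// returns how many times it has been called.
int target(unsigned char *src, int len, Result *out)
{
    out->count = 0;
    for (int i = 0; i < len; i++) {
        out->count += src[i] == 'x';
    }
    return ++ncalls;
}

// One isolated test: reset globals, then poison everything the target
// might read before writing.
int runtest(unsigned char *src, int len, Result *out, void *scratch, size_t cap)
{
    ncalls = 0;
    memset(out, PATTERN, sizeof(*out));
    memset(scratch, PATTERN, cap);
    return target(src, len, out);
}
```

<p>A non-zero pattern tends to be more revealing than zeros, since a target that forgets to initialize a field then sees obvious garbage rather than a plausible default.</p>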

<p>Typically you have little control over heap addresses, which likely vary
across tests and depend on the behavior of previous tests. If the program
<a href="/blog/2025/01/19/#hash-hardening-bonus">depends on address values</a>, this may affect the results and make
reproduction difficult, so watch for that.</p>

<h3 id="5-do-not-test-directly-on-the-fuzz-test-buffer">(5) Do not test directly on the fuzz test buffer</h3>

<p>Passing <code class="language-plaintext highlighter-rouge">buf</code> and <code class="language-plaintext highlighter-rouge">len</code> straight into the target is the most common
mistake, especially when fuzzing better-designed C programs, and
particularly because the official template encourages it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">myprogram</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>  <span class="c1">// BAD!</span>
</code></pre></div></div>

<p>While it’s a great sign the program doesn’t depend on null termination, it
creates a subtle trap. The underlying buffer allocated by AFL++ is larger
than <code class="language-plaintext highlighter-rouge">len</code>, and ASan will not detect read overflows on inputs! Instead
pass a copy sized to fit, which is the purpose of <code class="language-plaintext highlighter-rouge">src</code> in my template.
Adjust the type of <code class="language-plaintext highlighter-rouge">src</code> as needed.</p>

<p>If the program expects null-terminated input then you’ll need to do this
anyway in order to append the null byte. If it accepts an “owning” type
like <code class="language-plaintext highlighter-rouge">std::string</code>, then it’s also already done on your behalf. With
“non-owning” views like <code class="language-plaintext highlighter-rouge">std::string_view</code> you’ll still want to make your own
size-fit copy.</p>
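
<p>A size-fit copy can be wrapped in a small helper. This is only a sketch, and <code class="language-plaintext highlighter-rouge">fitcopy</code> is a name I made up for it. It makes a fresh allocation of exactly the input length, plus one byte when the target wants null termination, so that ASan’s redzone begins immediately after the last input byte:</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns a heap copy of exactly len bytes, plus
// a null terminator if requested. An empty, unterminated copy is a
// zero-size allocation, so any access to it should trip ASan.
char *fitcopy(const unsigned char *buf, size_t len, int terminate)
{
    size_t size = len + (terminate != 0);
    char *src = malloc(size);
    assert(src || !size);
    if (len) memcpy(src, buf, len);
    if (terminate) src[len] = 0;
    return src;
}
```

<p>Inside the persistent loop this would replace the <code class="language-plaintext highlighter-rouge">realloc</code> and <code class="language-plaintext highlighter-rouge">memcpy</code> pair, with the fuzz buffer and length passed straight in.</p>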

<p>If you see a program’s checked-in fuzz test using <code class="language-plaintext highlighter-rouge">buf</code> directly, make
this change and see if anything new pops out. It’s worked for me on a
number of occasions.</p>

<h3 id="6-dont-bother-freeing-memory">(6) Don’t bother freeing memory</h3>

<p>In general, avoid doing work irrelevant to the fuzz test. The official
tips say to “use a simpler target” and “instrument just what you need,”
and keeping destructors out of the tests helps in both cases. Unless the
program is especially memory-hungry, you won’t run out of memory before
AFL++ resets the target process.</p>

<p>If not for (1), it also helps with isolation (4), as different tests are
less likely to be contaminated with uninitialized memory from previous tests.</p>

<p>As an exception, if you want your destructor included in the fuzz test,
then use it in the test. Also, it’s easy to exhaust non-memory resources,
particularly file descriptors, and you may need to <a href="https://man7.org/linux/man-pages/man2/close_range.2.html">clean those up</a>
in order to fuzz test reliably.</p>

<p>Of course, if the target uses <a href="/blog/2023/09/27/">arena allocation</a> then none of this
matters! It also makes for perfect isolation, as even addresses won’t vary
between tests.</p>

<h3 id="7-use-a-memory-file-descriptor-to-back-named-paths">(7) Use a memory file descriptor to back named paths</h3>

<p>Many interfaces are, shall we say, <em>not so well-designed</em> and only accept
input from a named file system path, insisting on opening and reading the
file themselves. Testing such interfaces presents challenges, especially
if you’re interested in parallel fuzzing. Fortunately there’s usually an
easy out: Create a memory file descriptor and use its <code class="language-plaintext highlighter-rouge">/proc</code> name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">memfd_create</span><span class="p">(</span><span class="s">"fuzz"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="mi">3</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(...)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">pwrite</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">myprogram</span><span class="p">(</span><span class="s">"/proc/self/fd/3"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With standard input as 0, output as 1, and error as 2, I’ve assumed the
memory file descriptor will land on 3, which makes the test code a little
simpler. If it’s not 3 then something’s probably gone wrong anyway, and
aborting is the best option. If you don’t want to assume, use <code class="language-plaintext highlighter-rouge">snprintf</code>
or whatever to construct the path name from <code class="language-plaintext highlighter-rouge">fd</code>.</p>
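
<p>For the non-assuming variant, a sketch along these lines should do. The helper names <code class="language-plaintext highlighter-rouge">fdpath</code> and <code class="language-plaintext highlighter-rouge">fdload</code> are hypothetical, and it assumes glibc on Linux with <code class="language-plaintext highlighter-rouge">/proc</code> mounted:</p>

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes memfd_create in glibc's sys/mman.h
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: writes the /proc name of fd into path.
// Returns 1 on success, 0 if the path did not fit.
int fdpath(int fd, char *path, size_t cap)
{
    int n = snprintf(path, cap, "/proc/self/fd/%d", fd);
    return n > 0 && (size_t)n < cap;
}

// Hypothetical helper: replaces the file's contents with buf[0..len).
// pwrite leaves the file offset alone, so it stays at the beginning.
int fdload(int fd, const void *buf, size_t len)
{
    return !ftruncate(fd, 0) && pwrite(fd, buf, len, 0) == (ssize_t)len;
}
```

<p>Construct the path once before the loop, call <code class="language-plaintext highlighter-rouge">fdload</code> on each iteration, then hand the path to the target.</p>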

<p>Using <code class="language-plaintext highlighter-rouge">pwrite</code> (instead of <code class="language-plaintext highlighter-rouge">write</code>) leaves the file description offset at
the beginning of the file.</p>

<p>Thanks to the memory file descriptor, fuzz test data doesn’t land in
permanent storage, so less wear and tear on your SSD from the occasional
flush. Because of <code class="language-plaintext highlighter-rouge">/proc</code>, the file is unique to the process despite the
common path name, so no problems parallel fuzzing. No cleanup needed,
either.</p>

<p>If the program wants a file descriptor — i.e. it wants a socket because
you’re fuzzing some internal function — pass the file descriptor directly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">myprogram</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>If it accepts a <code class="language-plaintext highlighter-rouge">FILE *</code>, you <em>could</em> <code class="language-plaintext highlighter-rouge">fopen</code> the <code class="language-plaintext highlighter-rouge">/proc</code> path, but better
to use <code class="language-plaintext highlighter-rouge">fmemopen</code> to create a <code class="language-plaintext highlighter-rouge">FILE *</code> on the object:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">myprogram</span><span class="p">(</span><span class="n">fmemopen</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">));</span>
</code></pre></div></div>

<p>Note how, per (6), we don’t need to bother with <code class="language-plaintext highlighter-rouge">fclose</code> because it’s not
associated with a file descriptor.</p>
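
<p>Here is a slightly fuller sketch using POSIX <code class="language-plaintext highlighter-rouge">fmemopen</code>. The line-counting target and the <code class="language-plaintext highlighter-rouge">fuzzone</code> wrapper are hypothetical stand-ins. Some libc versions reject a zero-length buffer, so an empty input may need special handling:</p>

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  // fmemopen is POSIX.1-2008
#endif
#include <assert.h>
#include <stdio.h>

// Hypothetical target that insists on reading from a FILE *.
int countlines(FILE *f)
{
    int n = 0;
    for (int c; (c = fgetc(f)) != EOF;) {
        n += c == '\n';
    }
    return n;
}

// One test: wrap the input bytes in a read-only stream and run the
// target. The stream is deliberately never closed, as discussed.
int fuzzone(char *src, size_t len)
{
    FILE *f = fmemopen(src, len, "rb");
    assert(f);  // may fail for len == 0 on some libc versions
    return countlines(f);
}
```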

<h3 id="8-configure-the-target-for-smaller-buffers">(8) Configure the target for smaller buffers</h3>

<p>A common sight in <a href="http://catb.org/jargon/html/C/C-Programmers-Disease.html">diseased programs</a> is “generous” fixed buffer
sizes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define MY_MAX_BUFFER_LENGTH 65536
</span>
<span class="kt">void</span> <span class="nf">example</span><span class="p">(...)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">path</span><span class="p">[</span><span class="n">PATH_MAX</span><span class="p">];</span>  <span class="c1">// typically 4,096</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">MY_MAX_BUFFER_LENGTH</span><span class="p">];</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These huge buffers tend to hide bugs. Turn those stones over! It takes a
lot of fuzzing time to max them out and excite the unhappy paths — or the
super-unhappy paths, overflows. Better if the fuzz test can reach worst
case conditions quickly and explore the execution paths out of it.</p>

<p>So when you see these, cut them way down, possibly using (3). Change 65536
to, say, 16 and see what happens. If fuzzing finds a crash on the short
buffer, typically extending the input to crash on the original buffer size
is straightforward, e.g. repeat one of the bytes even more than it already
repeats.</p>

<h3 id="conclusion-and-samples">Conclusion and samples</h3>

<p>Hopefully something here will help you catch a defect that would have
otherwise gone unnoticed. Even better, perhaps awareness of these fuzzing
techniques will prevent the bug in the first place. Thanks to my template,
some solid tooling, and the know-how in this article, I can whip up a fuzz
test in a couple of minutes. But that ease means I discard it just as
casually, and so I don’t take time to capture and catalog most. If you’d
like to see some samples, <a href="https://old.reddit.com/r/C_Programming/comments/15wouat/_/jx2ld4a/">I do have an old, short list</a>. Perhaps
after another kiloproject of fuzz testing I’ll pick up more techniques.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Examples of quick hash tables and dynamic arrays in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2025/01/19/"/>
    <id>urn:uuid:d139d0bc-af7b-4e0e-94f2-566312f92290</id>
    <updated>2025-01-19T04:10:33Z</updated>
    <category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p>This article durably captures <a href="https://old.reddit.com/r/C_Programming/comments/1hrvhfl/_/m51saq2/">my reddit comment</a> showing techniques
for <code class="language-plaintext highlighter-rouge">std::unordered_map</code> and <code class="language-plaintext highlighter-rouge">std::vector</code> equivalents in C programs. The
core, important features of these data structures require only a dozen or
so lines of code apiece. They compile quickly, and tend to run faster in
debug builds than <em>release builds</em> of their C++ equivalents. What they
lack in genericity they compensate in simplicity. Nothing here will be
new. Everything has been covered in greater detail previously, which I
will reference when appropriate.</p>

<p>For a concrete goal, we will build a data structure representing a
process environment, along with related functionality to make it more
interesting. That is, we’ll build a string-to-string map.</p>

<h3 id="allocator">Allocator</h3>

<p>The foundation is our allocator, a simple <a href="/blog/2023/09/27/">bump allocator</a>, so
we’ll start there:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">align</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">pad</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">count</span> <span class="o">&lt;</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">-</span> <span class="n">pad</span><span class="p">)</span><span class="o">/</span><span class="n">size</span><span class="p">);</span>  <span class="c1">// TODO: OOM policy</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">+</span> <span class="n">pad</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">+=</span> <span class="n">pad</span> <span class="o">+</span> <span class="n">count</span><span class="o">*</span><span class="n">size</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">memset</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">count</span><span class="o">*</span><span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Allocating through the <code class="language-plaintext highlighter-rouge">new</code> macro eliminates several classes of common
defects in C programs. If we get our types mixed up we get errors, or at
least warnings. Our <a href="/blog/2024/05/24/">size calculations cannot overflow</a>. We cannot
accidentally use uninitialized memory. We cannot leak memory; deallocating
is implicit. The main downside is that it doesn’t fit some less common
allocator requirements.</p>
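
<p>To illustrate implicit deallocation, here’s a small usage sketch. The allocator is repeated from above so the example stands alone, and <code class="language-plaintext highlighter-rouge">sumsquares</code> is a hypothetical function invented for the demonstration. Because it receives the arena by value, its allocations vanish from the caller’s point of view when it returns:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    assert(count < (a->end - a->beg - pad)/size);  // TODO: OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, count*size);
}

// Hypothetical example: builds a temporary table of squares in scratch
// memory. The arena is passed by value, so the caller's copy is
// untouched and the table is freed implicitly on return.
int64_t sumsquares(int n, Arena scratch)
{
    int64_t *sq = new(&scratch, n, int64_t);
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        sq[i] = (int64_t)(i + 1)*(i + 1);
        sum += sq[i];
    }
    return sum;
}
```

<p>Passing a pointer to the arena instead would make the allocation outlive the call, which is how results get returned to the caller.</p>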

<h3 id="strings">Strings</h3>

<p>Next, a string representation. Classic <a href="https://www.symas.com/post/the-sad-state-of-c-strings">null-terminated strings are an
error-prone paradigm</a>, so we’ll use <a href="/blog/2024/04/14/">counted strings</a> instead:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define S(s)    (Str){s, sizeof(s)-1}
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>
</code></pre></div></div>

<p>This is equivalent to a <code class="language-plaintext highlighter-rouge">std::string_view</code> in C++. The macro allows us to
efficiently convert string literals into <code class="language-plaintext highlighter-rouge">Str</code> objects. Because our data
structures are backed by arenas, we won’t care whether a particular string
is backed by a static string, arena, memory map, etc. We’ll also need a
function to compare strings for equality:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">equals</span><span class="p">(</span><span class="n">Str</span> <span class="n">a</span><span class="p">,</span> <span class="n">Str</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">len</span> <span class="o">!=</span> <span class="n">b</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">a</span><span class="p">.</span><span class="n">len</span> <span class="o">||</span> <span class="o">!</span><span class="n">memcmp</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">!a.len</code> appears superfluous, but it’s necessary: <code class="language-plaintext highlighter-rouge">memcmp</code> <a href="/blog/2023/02/11/#strings">arbitrarily
forbids null pointers</a>, and we may be passed a zero-initialized
<code class="language-plaintext highlighter-rouge">Str</code>. Though <a href="https://developers.redhat.com/articles/2024/12/11/making-memcpynull-null-0-well-defined">this is scheduled to be corrected</a>.</p>
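
<p>To see these pieces working together, here’s a minimal, self-contained
sketch. The <code class="language-plaintext highlighter-rouge">substr</code> helper is hypothetical, introduced only for illustration.
It shows that a counted string can view part of another string without
copying, which null termination cannot do in general:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

_Bool equals(Str a, Str b)
{
    if (a.len != b.len) {
        return 0;
    }
    return !a.len || !memcmp(a.data, b.data, a.len);
}

// Hypothetical helper, not part of the article's toolbox: view the
// bytes [beg, end) of s without copying.
Str substr(Str s, ptrdiff_t beg, ptrdiff_t end)
{
    assert(0 &lt;= beg &amp;&amp; beg &lt;= end &amp;&amp; end &lt;= s.len);
    Str r = {0};
    if (s.data) {  // no pointer arithmetic on null
        r.data = s.data + beg;
    }
    r.len = end - beg;
    return r;
}
</code></pre></div></div>

<p>With these definitions, <code class="language-plaintext highlighter-rouge">equals(substr(S("hello world"), 6, 11), S("world"))</code>
holds, and a zero-initialized <code class="language-plaintext highlighter-rouge">Str</code> compares equal to <code class="language-plaintext highlighter-rouge">S("")</code> without ever
handing a null pointer to <code class="language-plaintext highlighter-rouge">memcmp</code>.</p>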

<p>We’ll need a string hash function, too:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash64</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mh">0x100</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span> <span class="o">^=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mi">255</span><span class="p">;</span>
        <span class="n">h</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is an FNV-style hash. The “basis” keeps strings of nulls from getting
stuck at zero, and the multiplier is my favorite prime number. Character
data is fixed to 0–255 rather than allowing the signedness of <code class="language-plaintext highlighter-rouge">char</code> to
influence the results. As a multiplicative hash, the high bits are mixed
better than the low bits, and our maps will take that into account.</p>
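
<p>A small sketch makes the role of the high bits concrete. The <code class="language-plaintext highlighter-rouge">topbits</code>
helper is an assumption of mine for illustration, not something this
article defines. It keeps only the top <code class="language-plaintext highlighter-rouge">exp</code> bits, the same kind of shift the
maps below apply to the hash:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

uint64_t hash64(Str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i &lt; s.len; i++) {
        h ^= s.data[i] &amp; 255;
        h *= 1111111111111111111;
    }
    return h;
}

// Hypothetical helper, not from the article: derive a table index in
// [0, 1&lt;&lt;exp) from the better-mixed high bits of a hash.
int32_t topbits(uint64_t hash, int exp)
{
    assert(exp &gt;= 1 &amp;&amp; exp &lt;= 31);  // result must fit an int32_t
    return (int32_t)(hash &gt;&gt; (64 - exp));
}
</code></pre></div></div>

<p>The empty string hashes to the basis, <code class="language-plaintext highlighter-rouge">0x100</code>, and thanks to that basis
a string of one null byte and a string of two null bytes hash differently.</p>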

<h3 id="flat-hash-map">Flat hash map</h3>

<p>We have a couple of string-to-string map options. The more restrictive, but
more efficient — in terms of memory use and speed — is a <a href="/blog/2022/08/08/">Mask-Step-Index
(MSI) hash table</a>. I don’t think it fits our problem as well as the
next option, particularly because it puts a hard limit on unique keys, but
it’s worth evaluating. Let’s call it <code class="language-plaintext highlighter-rouge">FlatEnv</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="p">{</span> <span class="n">ENVEXP</span> <span class="o">=</span> <span class="mi">10</span> <span class="p">};</span>  <span class="c1">// support up to 1,000 unique keys</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Str</span> <span class="n">keys</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">ENVEXP</span><span class="p">];</span>
    <span class="n">Str</span> <span class="n">vals</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">ENVEXP</span><span class="p">];</span>
<span class="p">}</span> <span class="n">FlatEnv</span><span class="p">;</span>
</code></pre></div></div>

<p>It’s nothing more than two fixed-length arrays, storing keys and values
separately. Keys with null pointers are empty slots, so a zero-initialized
<code class="language-plaintext highlighter-rouge">FlatEnv</code> is an empty table. They come out of an arena ready-to-use:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">FlatEnv</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">FlatEnv</span><span class="p">);</span>  <span class="c1">// new, empty environment</span>
</code></pre></div></div>

<p>Now we leverage <code class="language-plaintext highlighter-rouge">equals</code> and <code class="language-plaintext highlighter-rouge">hash64</code> for a double-hashed, open address
search on the keys array:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="o">*</span><span class="nf">flatlookup</span><span class="p">(</span><span class="n">FlatEnv</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">ENVEXP</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">ENVEXP</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">env</span><span class="o">-&gt;</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">vals</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">vals</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>By returning a pointer to the unmodified value slot, this function covers
both lookup and insertion. So that’s the entire hash table implementation.
To insert, the caller assigns the slot. For mere lookup, check the slot
for a null pointer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">FlatEnv</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">FlatEnv</span><span class="p">);</span>

    <span class="c1">// insert</span>
    <span class="o">*</span><span class="n">flatlookup</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"hello"</span><span class="p">))</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span><span class="s">"world"</span><span class="p">);</span>

    <span class="c1">// lookup</span>
    <span class="n">Str</span> <span class="n">val</span> <span class="o">=</span> <span class="o">*</span><span class="n">flatlookup</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">key</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">val</span><span class="p">.</span><span class="n">data</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%.*s = %.*s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">key</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="n">key</span><span class="p">.</span><span class="n">data</span><span class="p">,</span>
                                <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">val</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="n">val</span><span class="p">.</span><span class="n">data</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>To iterate over the map entries, iterate over the arrays, skipping null
entries. Per the <code class="language-plaintext highlighter-rouge">ENVEXP</code> comment, it’s hard-coded to support up to 1,000
unique keys (1,024 slots, leaving some to spare). The table itself doesn’t
enforce this limit and will turn into an infinite loop if you insert too
many keys. To support scaling, we could design the map to have dynamic
table sizes, track the number of unique keys, and resize the table
(allocate new arrays) when the load factor crosses a threshold. Resizing
sounds messy and complicated, so fortunately there’s another option.</p>
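
<p>Short of resizing, the missing limit check can at least be sketched in a
few lines. Both <code class="language-plaintext highlighter-rouge">flatlen</code> and <code class="language-plaintext highlighter-rouge">flatset</code> are hypothetical helpers of my own
naming, and the linear scan is deliberately naive: a serious version would
track the key count rather than recount on every insert. <code class="language-plaintext highlighter-rouge">flatlen</code> doubles
as an example of iteration that skips empty slots:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

_Bool equals(Str a, Str b)
{
    if (a.len != b.len) {
        return 0;
    }
    return !a.len || !memcmp(a.data, b.data, a.len);
}

uint64_t hash64(Str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i &lt; s.len; i++) {
        h ^= s.data[i] &amp; 255;
        h *= 1111111111111111111;
    }
    return h;
}

enum { ENVEXP = 10 };
typedef struct {
    Str keys[1&lt;&lt;ENVEXP];
    Str vals[1&lt;&lt;ENVEXP];
} FlatEnv;

Str *flatlookup(FlatEnv *env, Str key)
{
    uint64_t hash = hash64(key);
    uint32_t mask = (1&lt;&lt;ENVEXP) - 1;
    uint32_t step = (hash&gt;&gt;(64 - ENVEXP)) | 1;
    for (int32_t i = hash;;) {
        i = (i + step) &amp; mask;
        if (!env-&gt;keys[i].data) {
            env-&gt;keys[i] = key;
            return env-&gt;vals + i;
        } else if (equals(env-&gt;keys[i], key)) {
            return env-&gt;vals + i;
        }
    }
}

// Hypothetical helper: count occupied key slots by walking the array
// and skipping empty (null) entries.
ptrdiff_t flatlen(FlatEnv *env)
{
    ptrdiff_t n = 0;
    for (int i = 0; i &lt; 1&lt;&lt;ENVEXP; i++) {
        n += !!env-&gt;keys[i].data;
    }
    return n;
}

// Hypothetical guarded insert: trap before the table can fill up and
// send flatlookup into an endless probe. Conservative, since it also
// trips when updating an existing key in a table at the limit.
void flatset(FlatEnv *env, Str key, Str val)
{
    assert(flatlen(env) &lt; 1000);
    *flatlookup(env, key) = val;
}
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">flatlookup</code> claims a key slot even when called as a mere
lookup on a missing key, so <code class="language-plaintext highlighter-rouge">flatlen</code> counts those, too. A zero-initialized
<code class="language-plaintext highlighter-rouge">FlatEnv</code> in static storage works here just as well as one from an arena.</p>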

<h3 id="hierarchical-hash-map">Hierarchical hash map</h3>

<p>If the number of keys is unbounded, <a href="/blog/2023/09/30/">hash tries</a> work better. Trees
scale well, and we can allocate nodes out of the arena as it grows. We’ll
use a 4-ary trie, a good default that balances size and performance:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">Env</span> <span class="n">Env</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">Env</span> <span class="p">{</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">Str</span>  <span class="n">key</span><span class="p">;</span>
    <span class="n">Str</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>An empty map is just a null pointer, and so, again, these maps come
ready-to-use in their zero state:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// new, empty environment</span>
</code></pre></div></div>

<p>The implementation is equally brief:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="o">*</span><span class="nf">lookup</span><span class="p">(</span><span class="n">Env</span> <span class="o">**</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">env</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">env</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">a</span><span class="p">)</span> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">Env</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">env</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like before, this covers both lookup and insertion, though the mode is
determined explicitly by the arena pointer. Without an arena, it’s a
lookup, which doesn’t require allocation. With an arena, it creates an
entry if necessary and, like before, returns a pointer into the map so
that the caller can assign it. Usage differs only slightly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// insert</span>
    <span class="o">*</span><span class="n">lookup</span><span class="p">(</span><span class="o">&amp;</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"hello"</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">)</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span><span class="s">"world"</span><span class="p">);</span>

    <span class="c1">// lookup</span>
    <span class="n">Str</span> <span class="o">*</span><span class="n">val</span> <span class="o">=</span> <span class="n">lookup</span><span class="p">(</span><span class="o">&amp;</span><span class="n">env</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">val</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%.*s = %.*s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">key</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="n">key</span><span class="p">.</span><span class="n">data</span><span class="p">,</span>
                                <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">val</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">,</span> <span class="n">val</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>
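
<p>For experimenting, the following sketch compiles on its own. The <code class="language-plaintext highlighter-rouge">Arena</code>
and <code class="language-plaintext highlighter-rouge">alloc</code> definitions are restated under the assumption that they match
the allocator from earlier in this article, and <code class="language-plaintext highlighter-rouge">envget</code> is a hypothetical
convenience wrapper for lookups with a default:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
#define S(s)            (Str){s, sizeof(s)-1}

// Minimal bump allocator, restated as an assumption so the example
// stands alone. It simply asserts when out of memory.
typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a-&gt;beg &amp; (align - 1);
    assert(count &lt; (a-&gt;end - a-&gt;beg - pad)/size);
    void *r = a-&gt;beg + pad;
    a-&gt;beg += pad + count*size;
    return memset(r, 0, count*size);
}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

_Bool equals(Str a, Str b)
{
    if (a.len != b.len) {
        return 0;
    }
    return !a.len || !memcmp(a.data, b.data, a.len);
}

uint64_t hash64(Str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i &lt; s.len; i++) {
        h ^= s.data[i] &amp; 255;
        h *= 1111111111111111111;
    }
    return h;
}

typedef struct Env Env;
struct Env {
    Env *child[4];
    Str  key;
    Str  value;
};

Str *lookup(Env **env, Str key, Arena *a)
{
    for (uint64_t h = hash64(key); *env; h &lt;&lt;= 2) {
        if (equals(key, (*env)-&gt;key)) {
            return &amp;(*env)-&gt;value;
        }
        env = &amp;(*env)-&gt;child[h&gt;&gt;62];
    }
    if (!a) return 0;
    *env = new(a, 1, Env);
    (*env)-&gt;key = key;
    return &amp;(*env)-&gt;value;
}

// Hypothetical convenience wrapper: look up a key without allocating,
// falling back to a caller-supplied default when it is absent.
Str envget(Env **env, Str key, Str fallback)
{
    Str *val = lookup(env, key, 0);
    return val ? *val : fallback;
}
</code></pre></div></div>

<p>Given an <code class="language-plaintext highlighter-rouge">Arena a</code> over some buffer and <code class="language-plaintext highlighter-rouge">Env *env = 0</code>, the assignment
<code class="language-plaintext highlighter-rouge">*lookup(&amp;env, S("hello"), &amp;a) = S("world")</code> inserts, after which
<code class="language-plaintext highlighter-rouge">envget(&amp;env, S("hello"), S("?"))</code> yields <code class="language-plaintext highlighter-rouge">world</code> while a missing key
yields <code class="language-plaintext highlighter-rouge">?</code>.</p>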

<p>We’ll come back around to iteration later.</p>

<h3 id="string-concatenation">String concatenation</h3>

<p>Next I’d like a function that takes an <code class="language-plaintext highlighter-rouge">Env</code> and produces an <code class="language-plaintext highlighter-rouge">envp</code> data
structure as expected by <a href="https://man7.org/linux/man-pages/man2/execve.2.html"><code class="language-plaintext highlighter-rouge">execve(2)</code></a>. Then we can use this map as
the environment in a child process. We’ll need some string manipulation,
particularly <a href="/blog/2024/05/25/">string concatenation</a>. The core is a copy function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">copy</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Str</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="kt">char</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As with <code class="language-plaintext highlighter-rouge">memcmp</code>, <code class="language-plaintext highlighter-rouge">memcpy</code> arbitrarily forbids null pointers, so we need a
special case should the input be a zero <code class="language-plaintext highlighter-rouge">Str</code>. Now we
can easily concatenate strings, <em>in-place if possible</em>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="nf">concat</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Str</span> <span class="n">head</span><span class="p">,</span> <span class="n">Str</span> <span class="n">tail</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">head</span><span class="p">.</span><span class="n">data</span> <span class="o">||</span> <span class="n">head</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="n">head</span><span class="p">.</span><span class="n">len</span> <span class="o">!=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">head</span> <span class="o">=</span> <span class="n">copy</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">head</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">head</span><span class="p">.</span><span class="n">len</span> <span class="o">+=</span> <span class="n">copy</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">tail</span><span class="p">).</span><span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Yet again, <code class="language-plaintext highlighter-rouge">!head.data</code> is a special check because pointer arithmetic on
null (i.e. adding zero to null) is arbitrarily disallowed. Worrying about
this is exhausting, isn’t it? That language fix can’t come soon enough.
This one’s already fixed in C++.</p>
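
<p>To make the in-place behavior concrete, here’s a self-contained sketch.
The <code class="language-plaintext highlighter-rouge">Arena</code> and <code class="language-plaintext highlighter-rouge">alloc</code> definitions are restated on the assumption that
they match the allocator from earlier in this article, and <code class="language-plaintext highlighter-rouge">keyval</code> is a
hypothetical helper of my own naming:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))
#define S(s)            (Str){s, sizeof(s)-1}

// Minimal bump allocator, restated as an assumption so the example
// stands alone. It simply asserts when out of memory.
typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a-&gt;beg &amp; (align - 1);
    assert(count &lt; (a-&gt;end - a-&gt;beg - pad)/size);
    void *r = a-&gt;beg + pad;
    a-&gt;beg += pad + count*size;
    return memset(r, 0, count*size);
}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

Str copy(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, s.len, char);
    if (r.len) memcpy(r.data, s.data, r.len);
    return r;
}

Str concat(Arena *a, Str head, Str tail)
{
    if (!head.data || head.data+head.len != a-&gt;beg) {
        head = copy(a, head);
    }
    head.len += copy(a, tail).len;
    return head;
}

// Hypothetical helper: build "key=value" in the arena. At most the
// first concat copies key; the second extends the result in place
// because it already ends at the tip of the arena.
Str keyval(Arena *a, Str key, Str value)
{
    Str r = concat(a, key, S("="));
    r = concat(a, r, value);
    return r;
}
</code></pre></div></div>

<p>Called on a fresh arena, <code class="language-plaintext highlighter-rouge">keyval(&amp;a, S("HOME"), S("/root"))</code> returns the
10-byte string <code class="language-plaintext highlighter-rouge">HOME=/root</code> sitting at the very start of the arena, and a
further <code class="language-plaintext highlighter-rouge">concat(&amp;a, r, S("!"))</code> returns the same <code class="language-plaintext highlighter-rouge">data</code> pointer with <code class="language-plaintext highlighter-rouge">len</code>
grown to 11.</p>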

<p>That’s enough to get the ball rolling on <code class="language-plaintext highlighter-rouge">FlatEnv</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">**</span><span class="nf">flat_to_envp</span><span class="p">(</span><span class="n">FlatEnv</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span>    <span class="n">cap</span>  <span class="o">=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">ENVEXP</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">**</span><span class="n">envp</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">cap</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">int</span>    <span class="n">len</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">cap</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">vals</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">Str</span> <span class="n">pair</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">keys</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"="</span><span class="p">));</span>
            <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">vals</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
            <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"</span><span class="se">\0</span><span class="s">"</span><span class="p">));</span>
            <span class="n">envp</span><span class="p">[</span><span class="n">len</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">envp</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Simple, right? Traditional string handling in C is an error-prone pain,
but with a better set of primitives it’s a breeze. Plus we’re doing this
all with essentially no runtime. In use this might look like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">shellexec</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">cmd</span><span class="p">,</span> <span class="n">FlatEnv</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span>  <span class="o">*</span><span class="n">argv</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="s">"sh"</span><span class="p">,</span> <span class="s">"-c"</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>
    <span class="kt">char</span> <span class="o">**</span><span class="n">envp</span>   <span class="o">=</span> <span class="n">flat_to_envp</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="n">execve</span><span class="p">(</span><span class="s">"/bin/sh"</span><span class="p">,</span> <span class="n">argv</span><span class="p">,</span> <span class="n">envp</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>By virtue of the scratch arena, the <code class="language-plaintext highlighter-rouge">envp</code> object is automatically freed
should <code class="language-plaintext highlighter-rouge">execve</code> fail. (If that should even matter.) Considering this, if
you’re itching to write the fastest shell ever devised, arena allocation
and the techniques in this article would probably get you most of the way
there. Nobody writes shells this way.</p>

<h3 id="dynamic-arrays">Dynamic arrays</h3>

<p>To implement the <code class="language-plaintext highlighter-rouge">envp</code> conversion for the hash trie <code class="language-plaintext highlighter-rouge">Env</code>, let’s add one
more tool to our toolbox: dynamic arrays. Our <code class="language-plaintext highlighter-rouge">std::vector</code> equivalent.
We’ll start with <a href="/blog/2023/10/05/">a familiar slice header</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>    <span class="o">**</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">;</span>
<span class="p">}</span> <span class="n">EnvpSlice</span><span class="p">;</span>
</code></pre></div></div>

<p>The bad news is that we don’t have templates, and so we’ll need to define
one such structure for each type of which we want a dynamic array. This
one is set up to create an <code class="language-plaintext highlighter-rouge">envp</code> array. The good news is that manipulation
occurs through generic code, so everything else is reusable.</p>

<p>I want a <code class="language-plaintext highlighter-rouge">push</code> macro that creates an empty slot in which to insert a new
value, evaluating to a pointer to this slot. Usually that means
incrementing <code class="language-plaintext highlighter-rouge">len</code>, but when out of room it will need to expand the
underlying storage. It’s clearer to start with example usage. Imagine
using it with the previous <code class="language-plaintext highlighter-rouge">flat_to_envp</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">**</span><span class="nf">flat_to_envp</span><span class="p">(</span><span class="n">FlatEnv</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">EnvpSlice</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">ENVEXP</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">vals</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// ... concat as before ...</span>
            <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">r</span><span class="p">)</span> <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">r</span><span class="p">);</span>  <span class="c1">// terminal null pointer</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Continuing the theme, a zero-initialized slice is a ready-to-use empty
slice, and most begin life this way. The immediate dereference on <code class="language-plaintext highlighter-rouge">push</code>
is just like those calls to <code class="language-plaintext highlighter-rouge">lookup</code>. If expansion is needed, the <code class="language-plaintext highlighter-rouge">push</code>
macro’s job is to pull fields off the slice and pass them into a helper
function, which manipulates the slice header in a type-agnostic way that
remains legal under strict aliasing:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">push_</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="o">*</span><span class="n">pcap</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">);</span>

<span class="cp">#define push(a, s) \
  ((s)-&gt;len == (s)-&gt;cap \
    ? (s)-&gt;data = push_((a), (s)-&gt;data, &amp;(s)-&gt;cap, sizeof(*(s)-&gt;data)), \
      (s)-&gt;data + (s)-&gt;len++ \
    : (s)-&gt;data + (s)-&gt;len++)
</span></code></pre></div></div>

<p>The internals of that helper look an awful lot like <code class="language-plaintext highlighter-rouge">concat</code>, with the
same in-place-if-possible behavior:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="p">{</span> <span class="n">SLICE_INITIAL_CAP</span> <span class="o">=</span> <span class="mi">4</span> <span class="p">};</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">push_</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="o">*</span><span class="n">pcap</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">cap</span>   <span class="o">=</span> <span class="o">*</span><span class="n">pcap</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">align</span> <span class="o">=</span> <span class="k">_Alignof</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">data</span> <span class="o">||</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">!=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">data</span> <span class="o">+</span> <span class="n">cap</span><span class="o">*</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">copy</span> <span class="o">=</span> <span class="n">alloc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">cap</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">align</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">cap</span><span class="o">*</span><span class="n">size</span><span class="p">);</span>
        <span class="n">data</span> <span class="o">=</span> <span class="n">copy</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kt">ptrdiff_t</span> <span class="n">extend</span> <span class="o">=</span> <span class="n">cap</span> <span class="o">?</span> <span class="n">cap</span> <span class="o">:</span> <span class="n">SLICE_INITIAL_CAP</span><span class="p">;</span>
    <span class="n">alloc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">extend</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>  <span class="c1">// already aligned</span>
    <span class="o">*</span><span class="n">pcap</span> <span class="o">=</span> <span class="n">cap</span> <span class="o">+</span> <span class="n">extend</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(<strong>Update</strong>: Aleh pointed out an inefficiency in the original code:
<a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCAB2_dQWNOKCSCa8L8khH2W0eunsKK-_CkJZaDUpRAA4AFMG8Jg@mail.gmail.com%3E">applying alignment in the second <code class="language-plaintext highlighter-rouge">alloc</code> may introduce unnecessary
fragmentation</a>. This has been corrected above.)</p>

<p>For unfathomable reasons, standard C does not permit <code class="language-plaintext highlighter-rouge">_Alignof</code> on
expressions, so slice data is simply pointer-aligned. (The more shrewd
might consider <code class="language-plaintext highlighter-rouge">max_align_t</code>.) Like concatenation, we copy the object to
the beginning of the arena if necessary, and extend the allocation by
allocating the usual way, being careful not to increment the capacity
until after it succeeds.</p>

<p><strong>Update</strong>: <a href="https://old.reddit.com/r/C_Programming/comments/1i74hii/_/m8l40fo/">NRK points out</a> we can use <code class="language-plaintext highlighter-rouge">__typeof__</code> (extension) or
<code class="language-plaintext highlighter-rouge">typeof</code> (C23), to work around this syntactical limitation of <code class="language-plaintext highlighter-rouge">_Alignof</code>.
Convert the <code class="language-plaintext highlighter-rouge">align</code> local variable into a parameter:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">push_</span><span class="p">(...,</span> <span class="kt">ptrdiff_t</span> <span class="n">align</span><span class="p">);</span>
</code></pre></div></div>

<p>Then in the macro pass it via <code class="language-plaintext highlighter-rouge">_Alignof(__typeof__(…))</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define push(a, s) \
  ((s)-&gt;len == (s)-&gt;cap \
    ? (s)-&gt;data = push_((a), (s)-&gt;data, &amp;(s)-&gt;cap, \
          sizeof(*(s)-&gt;data), _Alignof(__typeof__(*(s)-&gt;data))), \
      (s)-&gt;data + (s)-&gt;len++ \
    : (s)-&gt;data + (s)-&gt;len++)
</span></code></pre></div></div>

<p>Spelled as an extension, it already works with all major C compilers from
the past decade, and without requiring special compiler flags.</p>

<p>We can now use <code class="language-plaintext highlighter-rouge">push</code> on any structure with <code class="language-plaintext highlighter-rouge">data</code>, <code class="language-plaintext highlighter-rouge">len</code>, and <code class="language-plaintext highlighter-rouge">cap</code>
fields of the appropriate types.</p>
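
<p>As a minimal sketch of that reuse, here’s the same <code class="language-plaintext highlighter-rouge">push</code> applied to a
second slice type. The <code class="language-plaintext highlighter-rouge">IntSlice</code> type and the <code class="language-plaintext highlighter-rouge">squares</code> function are
hypothetical, invented for illustration. The <code class="language-plaintext highlighter-rouge">Arena</code> and <code class="language-plaintext highlighter-rouge">alloc</code> here are
simplified stand-ins, assumed only to match the signatures used in this
article, and running out of memory simply aborts.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

// Stand-in bump allocator (an assumption of this sketch).
typedef struct {
    char *beg;
    char *end;
} Arena;

static void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    assert(align &gt; 0 &amp;&amp; (align &amp; (align - 1)) == 0);  // power of two
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a-&gt;beg &amp; (uintptr_t)(align - 1));
    ptrdiff_t avail = a-&gt;end - a-&gt;beg - pad;
    if (avail &lt; 0 || count &gt; avail/size) {
        abort();  // out of memory: this sketch simply gives up
    }
    char *r = a-&gt;beg + pad;
    a-&gt;beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}

enum { SLICE_INITIAL_CAP = 4 };

// Same helper as above: copy to the arena tip if needed, then extend.
static void *push_(Arena *a, void *data, ptrdiff_t *pcap, ptrdiff_t size)
{
    ptrdiff_t cap   = *pcap;
    ptrdiff_t align = _Alignof(void *);

    if (!data || a-&gt;beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        if (data) memcpy(copy, data, (size_t)(cap*size));
        data = copy;
    }

    ptrdiff_t extend = cap ? cap : SLICE_INITIAL_CAP;
    alloc(a, extend, size, 1);  // already aligned
    *pcap = cap + extend;
    return data;
}

#define push(a, s) \
  ((s)-&gt;len == (s)-&gt;cap \
    ? (s)-&gt;data = push_((a), (s)-&gt;data, &amp;(s)-&gt;cap, sizeof(*(s)-&gt;data)), \
      (s)-&gt;data + (s)-&gt;len++ \
    : (s)-&gt;data + (s)-&gt;len++)

// Hypothetical second slice type: only the element type differs.
typedef struct {
    int      *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} IntSlice;

// Collects the squares 0, 1, 4, ... of the first n integers.
static IntSlice squares(Arena *a, int n)
{
    IntSlice r = {0};
    for (int i = 0; i &lt; n; i++) {
        *push(a, &amp;r) = i * i;
    }
    return r;
}
</code></pre></div></div>

<p>Nothing in <code class="language-plaintext highlighter-rouge">push</code> or <code class="language-plaintext highlighter-rouge">push_</code> changed between the two slice types. With
nothing else allocated in between, every growth step after the first
extends the slice in place.</p>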

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>With that in place, we can define a simple, recursive version of the
<code class="language-plaintext highlighter-rouge">envp</code> builder for <code class="language-plaintext highlighter-rouge">Env</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define countof(a)  ((ptrdiff_t)(sizeof(a) / sizeof(*(a))))
</span>
<span class="n">EnvpSlice</span> <span class="nf">env_to_envp_</span><span class="p">(</span><span class="n">EnvpSlice</span> <span class="n">r</span><span class="p">,</span> <span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">Str</span> <span class="n">pair</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
        <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"="</span><span class="p">));</span>
        <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">);</span>
        <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"</span><span class="se">\0</span><span class="s">"</span><span class="p">));</span>
        <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">r</span><span class="p">)</span> <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">countof</span><span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">);</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">r</span> <span class="o">=</span> <span class="n">env_to_envp_</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">a</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">char</span> <span class="o">**</span><span class="nf">env_to_envp</span><span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">EnvpSlice</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">env_to_envp_</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">a</span><span class="p">);</span>
    <span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">r</span><span class="p">);</span>  <span class="c1">// null pointer terminator</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As is often the case, the recursive part doesn’t fit the final interface,
so the core is a helper, and the caller-facing part is an adapter. I’m not
<em>entirely</em> comfortable with this function, though. When working with huge
environments (over ~100k entries), the recursive implementation
will non-deterministically blow the stack if the trie winds up lopsided.
Or deterministically for chosen pathological inputs, because the hash
function isn’t seeded.</p>

<p>Instead we could use a stack data structure backed by the arena to
traverse the trie. If passed a secondary scratch arena, we’d use that
arena for this stack, but I’m sticking to the original interface. Here’s
what that looks like, with an extra trick thrown in just to show off:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">**</span><span class="nf">env_to_envp_safe</span><span class="p">(</span><span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">EnvpSlice</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>

    <span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
        <span class="n">Env</span> <span class="o">*</span><span class="n">env</span><span class="p">;</span>
        <span class="kt">int</span>  <span class="n">index</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">Frame</span><span class="p">;</span>
    <span class="n">Frame</span> <span class="n">init</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>  <span class="c1">// small size optimization</span>

    <span class="k">struct</span> <span class="p">{</span>
        <span class="n">Frame</span>    <span class="o">*</span><span class="n">data</span><span class="p">;</span>
        <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
        <span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">stack</span> <span class="o">=</span> <span class="p">{</span><span class="n">init</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">countof</span><span class="p">(</span><span class="n">init</span><span class="p">)};</span>

    <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">stack</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">Frame</span><span class="p">){</span><span class="n">env</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">stack</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">Frame</span> <span class="o">*</span><span class="n">top</span> <span class="o">=</span> <span class="n">stack</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="n">stack</span><span class="p">.</span><span class="n">len</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">top</span><span class="o">-&gt;</span><span class="n">env</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">stack</span><span class="p">.</span><span class="n">len</span><span class="o">--</span><span class="p">;</span>

        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">top</span><span class="o">-&gt;</span><span class="n">index</span> <span class="o">==</span> <span class="n">countof</span><span class="p">(</span><span class="n">top</span><span class="o">-&gt;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">Str</span> <span class="n">pair</span> <span class="o">=</span> <span class="n">top</span><span class="o">-&gt;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
            <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"="</span><span class="p">));</span>
            <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">top</span><span class="o">-&gt;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">);</span>
            <span class="n">pair</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">pair</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"</span><span class="se">\0</span><span class="s">"</span><span class="p">));</span>
            <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">r</span><span class="p">)</span> <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
            <span class="n">stack</span><span class="p">.</span><span class="n">len</span><span class="o">--</span><span class="p">;</span>

        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">top</span><span class="o">-&gt;</span><span class="n">index</span><span class="o">++</span><span class="p">;</span>
            <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">stack</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">Frame</span><span class="p">){</span><span class="n">top</span><span class="o">-&gt;</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="mi">0</span><span class="p">};</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="n">push</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">r</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">init</code> array is a form of <a href="/blog/2016/10/07/">small-size optimization</a>. It’s used at
first, and sufficient for nearly all inputs. So no stack litter in the
arena. If it’s not enough, then <code class="language-plaintext highlighter-rouge">push</code> will <em>automatically move the stack
into the arena</em>. I think that’s a super duper neato trick!</p>

<p>Alternatively, as discussed in the original hash trie article,
we could instead add a <code class="language-plaintext highlighter-rouge">next</code> field to <code class="language-plaintext highlighter-rouge">Env</code> as an intrusive linked list
that chains the nodes together in insertion order. Or, to look at it another
way, <code class="language-plaintext highlighter-rouge">Env</code> is a linked list with an <em>intrusive hash trie</em> for O(log n)
searches on the list. That’s a lot simpler, has other useful properties,
and only costs one extra pointer per entry. And we wouldn’t need slices,
which was my motivation for choosing the non-linked-list approach above.</p>
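
<p>For comparison, here’s a rough sketch of what the linked-list variant
might look like. The <code class="language-plaintext highlighter-rouge">Env</code> layout, the <code class="language-plaintext highlighter-rouge">next</code> field, and <code class="language-plaintext highlighter-rouge">list_to_envp</code>
are assumptions made for this example, as are the stand-in <code class="language-plaintext highlighter-rouge">Arena</code> and
<code class="language-plaintext highlighter-rouge">alloc</code>. It also builds each pair with <code class="language-plaintext highlighter-rouge">memcpy</code> rather than <code class="language-plaintext highlighter-rouge">concat</code> to stay
self-contained. Because the list can be counted up front, the <code class="language-plaintext highlighter-rouge">envp</code>
array is allocated once at its final size.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

// Stand-in bump allocator (an assumption of this sketch).
typedef struct {
    char *beg;
    char *end;
} Arena;

static void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    assert(align &gt; 0 &amp;&amp; (align &amp; (align - 1)) == 0);  // power of two
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a-&gt;beg &amp; (uintptr_t)(align - 1));
    ptrdiff_t avail = a-&gt;end - a-&gt;beg - pad;
    if (avail &lt; 0 || count &gt; avail/size) {
        abort();  // out of memory: this sketch simply gives up
    }
    char *r = a-&gt;beg + pad;
    a-&gt;beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct Env Env;
struct Env {
    Env *child[4];
    Str  key;
    Str  value;
    Env *next;  // hypothetical: chains nodes in insertion order
};

// Builds a null-terminated envp array by walking the list from head.
static char **list_to_envp(Env *head, Arena *a)
{
    ptrdiff_t n = 0;
    for (Env *e = head; e; e = e-&gt;next) {
        n++;
    }

    // One extra slot for the terminator. alloc zeroes, so it is null.
    char **envp = alloc(a, n + 1, sizeof(char *), _Alignof(char *));

    ptrdiff_t i = 0;
    for (Env *e = head; e; e = e-&gt;next) {
        // key + '=' + value + null byte (the zeroed final byte)
        char *pair = alloc(a, e-&gt;key.len + e-&gt;value.len + 2, 1, 1);
        if (e-&gt;key.len) {
            memcpy(pair, e-&gt;key.data, (size_t)e-&gt;key.len);
        }
        pair[e-&gt;key.len] = '=';
        if (e-&gt;value.len) {
            memcpy(pair + e-&gt;key.len + 1, e-&gt;value.data, (size_t)e-&gt;value.len);
        }
        envp[i++] = pair;
    }
    return envp;
}
</code></pre></div></div>

<p>There’s no recursion and no explicit stack, so trie shape no longer
matters for this traversal, and the output comes out in insertion order.</p>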

<h3 id="hash-hardening-bonus">Hash hardening (bonus)</h3>

<p>Okay, I lied, this is something new. Think of it as your special treat for
sticking with me so far.</p>

<p>Unseeded hash maps come with a classic security vulnerability: If
populated with untrusted keys, an attacker could choose colliding keys and
produce worst case behavior in the hash map. That is, MSI hash tables
reduce to linear scans, and hash tries reduce to linked lists. Worse, the
recursive <code class="language-plaintext highlighter-rouge">envp</code> function blows the stack, though we already solved that
issue.</p>

<p>If we want to foil such attacks, we can seed the hash so that an attacker
cannot devise collisions. They’d need to discover the seed. We might even
call that seed a “key,” but this is a non-cryptographic hash so I’m going
to avoid that term. The usual implementation of this concept involves
generating a seed, sometimes per table, and storing it somewhere. However,
we can leverage an existing security mechanism, gaining this feature at
basically no cost: Address Space Layout Randomization (ASLR). First, let’s
augment the string hash function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash64</span><span class="p">(</span><span class="n">Str</span> <span class="n">s</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">seed</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">seed</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span> <span class="o">^=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mi">255</span><span class="p">;</span>
        <span class="n">h</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In <code class="language-plaintext highlighter-rouge">flatlookup</code> we can use the address of the <code class="language-plaintext highlighter-rouge">FlatEnv</code> as our seed:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="o">*</span><span class="nf">flatlookup</span><span class="p">(</span><span class="n">FlatEnv</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">env</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Recall it’s allocated out of our arena (via <code class="language-plaintext highlighter-rouge">new</code>), and ASLR gives our
arena a random offset. On top of that, a <code class="language-plaintext highlighter-rouge">FlatEnv</code> seed depends precisely
on the amount of memory allocated earlier. An environment variable name or
value being slightly longer or shorter will reshuffle the whole table if
allocated in the arena before the <code class="language-plaintext highlighter-rouge">FlatEnv</code>.</p>

<p>It’s slightly trickier with hash tries. The root pointer isn’t required to
be fixed. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// ... insert keys ...</span>
    <span class="n">Env</span> <span class="o">*</span><span class="n">myenv</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span>
    <span class="c1">// ... lookup keys in myenv ...</span>
</code></pre></div></div>

<p>We could disallow this, but it would be easy to forget (e.g. while you’re
refactoring and not thinking about it) and difficult to detect.
Difficult-to-detect bugs keep me awake at night. Instead we can use the
root node to seed the trie:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Str</span> <span class="o">*</span><span class="nf">lookup</span><span class="p">(</span><span class="n">Env</span> <span class="o">**</span><span class="n">env</span><span class="p">,</span> <span class="n">Str</span> <span class="n">key</span><span class="p">,</span> <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">seed</span> <span class="o">=</span> <span class="n">env</span> <span class="o">?</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="o">*</span><span class="n">env</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">seed</span><span class="p">);</span> <span class="o">*</span><span class="n">env</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>At first this seems like it couldn’t work, like a chicken-and-egg problem.
There’s no root node at first, so we can’t know the seed yet. Though think
about it a little longer and it should be obvious: The hash is unused when
inserting the very first element. It simply becomes the root of the trie.
The seed is irrelevant until the second insert, at which point we’ve
established a seed. This delay in establishing the seed means hash tries are
even more randomized.</p>
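
<p>Here’s a minimal, self-contained sketch of how the pieces might fit
together. The <code class="language-plaintext highlighter-rouge">Env</code> layout, <code class="language-plaintext highlighter-rouge">equals</code>, and the body of <code class="language-plaintext highlighter-rouge">lookup</code> are my
assumptions about the parts elided above, and the <code class="language-plaintext highlighter-rouge">Arena</code> and <code class="language-plaintext highlighter-rouge">alloc</code> are
simplified stand-ins. Note how the very first insert skips the loop
entirely, so the zero seed is never used to pick a branch.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

// Stand-in bump allocator (an assumption of this sketch).
typedef struct {
    char *beg;
    char *end;
} Arena;

static void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    assert(align &gt; 0 &amp;&amp; (align &amp; (align - 1)) == 0);  // power of two
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a-&gt;beg &amp; (uintptr_t)(align - 1));
    ptrdiff_t avail = a-&gt;end - a-&gt;beg - pad;
    if (avail &lt; 0 || count &gt; avail/size) {
        abort();  // out of memory: this sketch simply gives up
    }
    char *r = a-&gt;beg + pad;
    a-&gt;beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct Env Env;
struct Env {
    Env *child[4];
    Str  key;
    Str  value;
};

static int equals(Str a, Str b)
{
    return a.len == b.len &amp;&amp;
           (!a.len || !memcmp(a.data, b.data, (size_t)a.len));
}

static uint64_t hash64(Str s, uint64_t seed)
{
    uint64_t h = seed;
    for (ptrdiff_t i = 0; i &lt; s.len; i++) {
        h ^= s.data[i] &amp; 255;
        h *= 1111111111111111111;
    }
    return h;
}

// Returns the value slot for key. With a non-null arena, a missing key
// is inserted. With a null arena, a missing key yields a null pointer.
static Str *lookup(Env **env, Str key, Arena *a)
{
    uint64_t seed = (uintptr_t)*env;  // root address, or 0 while empty
    for (uint64_t h = hash64(key, seed); *env; h &lt;&lt;= 2) {
        if (equals(key, (*env)-&gt;key)) {
            return &amp;(*env)-&gt;value;
        }
        env = &amp;(*env)-&gt;child[h &gt;&gt; 62];
    }
    if (!a) {
        return 0;
    }
    *env = alloc(a, 1, sizeof(Env), _Alignof(Env));
    (*env)-&gt;key = key;
    return &amp;(*env)-&gt;value;
}
</code></pre></div></div>

<p>The root never moves once created, so every later insert and lookup
derives the same seed from it, no matter which variable holds the root
pointer.</p>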

<p>With the proper tools and representations, working in C isn’t difficult
even if you need containers and string manipulation. Aside from <code class="language-plaintext highlighter-rouge">memcmp</code>
and <code class="language-plaintext highlighter-rouge">memcpy</code> — each easily replaceable — we did all this without runtime
assistance, not even its allocator. What a pleasant way to work!</p>

<p>Source from this article in runnable form, which I used to test my samples:
<a href="https://gist.github.com/skeeto/42d8a23871642696b6b8de30d9222328"><code class="language-plaintext highlighter-rouge">example.c</code></a></p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Rules to avoid common extended inline assembly mistakes</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/12/20/"/>
    <id>urn:uuid:594e546f-15c7-4834-bece-9c9f24122a01</id>
    <updated>2024-12-20T19:46:48Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>GCC and Clang inline assembly is an interface between high and low level
programming languages. It is subtle and treacherous. Many are ensnared in
its traps, usually unknowingly. As such, the <code class="language-plaintext highlighter-rouge">asm</code> keyword is essentially
the <code class="language-plaintext highlighter-rouge">unsafe</code> keyword of C and C++. Nearly every inline assembly tutorial,
including <a href="https://web.archive.org/web/20241216071150/https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html">the awful ibiblio page</a> at the top of search engines for
decades, propagates fundamental, serious mistakes, and <em>most examples are
incorrect</em>. The dangerous part is that the examples <em>usually</em> produce the
expected results! The situation is dire. This article isn’t a tutorial,
but basic rules to avoid the most common mistakes, or to spot them in code
review.</p>

<p><strong>The focus is entirely <em>extended assembly</em>, and not <em>basic assembly</em></strong>,
which has different rules. The former is any inline assembly statement
with constraints or clobbers. That is, there’s a colon <code class="language-plaintext highlighter-rouge">:</code> token between
the <code class="language-plaintext highlighter-rouge">asm</code> parentheses. Basic assembly is blunt and has fewer uses, mostly
at the top level or in <a href="/blog/2023/03/23/">“naked” functions</a>, making misuse less
likely.</p>

<h3 id="1-avoid-inline-assembly-if-possible">(1) Avoid inline assembly if possible</h3>

<p>Because it’s so treacherous, the first rule is to avoid it if at all
possible. Modern compilers are loaded with intrinsics and built-ins that
replace nearly all the old inline assembly use cases. They allow access to
low level features from the high level language. No need to bridge the gap
between low and high yourself when there’s an intrinsic.</p>
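
<p>As a minimal sketch of what that looks like in practice, assuming GCC or
Clang, here are three operations that old tutorials implement with inline
assembly, each written with a built-in instead. The wrapper names are made
up for illustration.</p>

```c
#include <stdint.h>

// Byte swap, in place of "bswap" inline assembly.
static uint32_t swap32(uint32_t x)
{
    return __builtin_bswap32(x);
}

// Population count, in place of "popcnt" inline assembly.
static int popcount64(uint64_t x)
{
    return __builtin_popcountll(x);
}

// Count trailing zeros, in place of "bsf" inline assembly. The built-in
// is undefined for zero, so that case is handled here.
static int ctz32(uint32_t x)
{
    return x ? __builtin_ctz(x) : 32;
}
```

<p>The compiler understands these completely, so it can constant-fold,
schedule, and pick the best instruction for the target, none of which is
possible through an opaque assembly block.</p>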

<p>Compilers do not have built-ins for system calls, and occasionally <a href="/blog/2024/01/28/">lack a
useful intrinsic</a>. Other times you might be building <a href="https://github.com/skeeto/scratch/blob/fbd3260e/misc/buddy.c#L594-#L616">foundational
infrastructure</a>. These remaining cases are mostly about interacting
with external interfaces, not optimization nor performance.</p>

<h3 id="2-it-should-nearly-always-be-volatile">(2) It should nearly always be volatile</h3>

<p>Falling right out of rule (1), the remaining inline assembly cases nearly
always have side effects beyond output constraints. That includes memory
accesses, and it certainly includes system calls. Because of this, inline
assembly should usually have the <code class="language-plaintext highlighter-rouge">volatile</code> qualifier.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="nf">volatile</span> <span class="p">(</span> <span class="p">...</span> <span class="p">);</span>
</code></pre></div></div>

<p>This prevents compilers from eliding or re-ordering the assembly. As a
special rule, inline assembly lacking output constraints is implicitly
volatile. Despite this, <em>please use <code class="language-plaintext highlighter-rouge">volatile</code> anyway!</em> When I do not see
<code class="language-plaintext highlighter-rouge">volatile</code> it’s likely a defect. Stopping to consider if it’s this special
case slows understanding and impedes code review.</p>

<p>Tutorials often use <code class="language-plaintext highlighter-rouge">__volatile__</code>. Do not do this. It is an ancient alias
keyword to support pre-standard compilers lacking the <code class="language-plaintext highlighter-rouge">volatile</code> keyword.
This is not your situation. When I see <code class="language-plaintext highlighter-rouge">__volatile__</code> it likely means you
copy-pasted the inline assembly from somewhere without understanding it.
It’s a red flag that draws my attention for even more careful review.</p>

<p>Side note: <code class="language-plaintext highlighter-rouge">__asm</code> or <code class="language-plaintext highlighter-rouge">__asm__</code> is fine, and even required in some cases
(e.g. <code class="language-plaintext highlighter-rouge">-std=cXX</code>). I usually write it <code class="language-plaintext highlighter-rouge">asm</code>.</p>

<h3 id="3-it-probably-needs-a-memory-clobber">(3) It probably needs a memory clobber</h3>

<p>The <code class="language-plaintext highlighter-rouge">"memory"</code> clobber is orthogonal to <code class="language-plaintext highlighter-rouge">volatile</code>, each serving different
purposes. It’s less often needed than <code class="language-plaintext highlighter-rouge">volatile</code>, but typical remaining
inline assembly cases require it. If memory is accessed in any way while
executing the assembly, you need a memory clobber. This includes most
system calls, and definitely a generic <code class="language-plaintext highlighter-rouge">syscall</code> wrapper.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">asm</span> <span class="nf">volatile</span> <span class="p">(...</span> <span class="o">:</span> <span class="s">"memory"</span><span class="p">);</span>
</code></pre></div></div>

<p>In code review, if you do not see a <code class="language-plaintext highlighter-rouge">"memory"</code> clobber, give it extra
scrutiny. It’s probably missing. If it’s truly unnecessary, I suggest
documenting such in a comment so that reviewers know the omission is
considered and intentional.</p>

<p>The clobber prevents compilers from re-ordering loads and stores around
the assembly. It would be disastrous, for example, if a <code class="language-plaintext highlighter-rouge">write(2)</code> system
call occurred before the program populated the output buffer! In this
case, <code class="language-plaintext highlighter-rouge">volatile</code> would prevent a follow-up <code class="language-plaintext highlighter-rouge">write(2)</code> from being optimized
out while <code class="language-plaintext highlighter-rouge">"memory"</code> forces memory stores to occur before the system call.</p>
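
<p>To make rules (2) and (3) concrete, here is a sketch of a three-argument
system call wrapper. It assumes x86-64 Linux: the system call number goes in
<code class="language-plaintext highlighter-rouge">rax</code>, arguments in <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, and <code class="language-plaintext highlighter-rouge">rdx</code>, and the instruction
overwrites <code class="language-plaintext highlighter-rouge">rcx</code> and <code class="language-plaintext highlighter-rouge">r11</code>. The names <code class="language-plaintext highlighter-rouge">syscall3</code> and <code class="language-plaintext highlighter-rouge">sys_write</code> are
hypothetical, and the number 1 for <code class="language-plaintext highlighter-rouge">write(2)</code> applies to that platform
only.</p>

```c
#include <stddef.h>

// Generic three-argument system call (assumes x86-64 Linux).
// Spelled __asm__ so it also compiles under -std=cXX.
static long syscall3(long n, long a, long b, long c)
{
    long r;
    __asm__ volatile (          // rule 2: side effects beyond the output
        "syscall"
        : "=a"(r)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11",         // overwritten by the syscall instruction
          "memory"              // rule 3: the kernel may read/write memory
    );
    return r;
}

// Hypothetical write(2) wrapper; returns a negative errno on failure.
static long sys_write(int fd, const void *buf, size_t len)
{
    return syscall3(1, fd, (long)buf, (long)len);
}
```

<p>Without the <code class="language-plaintext highlighter-rouge">"memory"</code> clobber a compiler would be free to delay the
stores that fill <code class="language-plaintext highlighter-rouge">buf</code> until after the system call.</p>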

<h3 id="4-never-modify-input-constraints">(4) Never modify input constraints</h3>

<p>It’s easy not to modify inputs, so this is mostly about ignorance, but
this rule is broken with shocking frequency. Most of the time you can get
away with it, right up until certain configurations have a heisenbug. In
most cases this can be fixed by changing an input into a read-write output
constraint with <code class="language-plaintext highlighter-rouge">"+"</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="nf">volatile</span> <span class="p">(</span><span class="s">"..."</span> <span class="o">::</span> <span class="s">"r"</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">:</span> <span class="p">...);</span>  <span class="c1">// before</span>
<span class="n">asm</span> <span class="nf">volatile</span> <span class="p">(</span><span class="s">"..."</span> <span class="o">:</span> <span class="s">"+r"</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">:</span> <span class="p">...);</span>  <span class="c1">// after</span>
</code></pre></div></div>

<p>If you hadn’t been using <code class="language-plaintext highlighter-rouge">volatile</code> (in violation of rule 2) then now
suddenly you’d need it because there’s an output constraint. This happens
often.</p>
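
<p>A classic instance, sketched here assuming x86 (32-bit or 64-bit) and the
ABI's guarantee that the direction flag is clear: <code class="language-plaintext highlighter-rouge">rep stosb</code> advances the
destination pointer and counts the length down to zero, so both operands
must be read-write outputs even though the caller never looks at the
results. The function name <code class="language-plaintext highlighter-rouge">fill</code> is hypothetical.</p>

```c
#include <stddef.h>

// Store len copies of c at dst using "rep stosb" (assumes x86).
//
// Broken form often seen in tutorials, which modifies its inputs:
//   asm volatile ("rep stosb" :: "D"(dst), "c"(len), "a"(c) : "memory");
static void fill(void *dst, unsigned char c, size_t len)
{
    __asm__ volatile (
        "rep stosb"
        : "+D"(dst), "+c"(len)  // rdi and rcx are modified: read-write
        : "a"(c)                // al is only read: plain input
        : "memory"              // rule 3: the instruction stores to memory
    );
}
```

<p>With the broken form the compiler assumes <code class="language-plaintext highlighter-rouge">dst</code> and <code class="language-plaintext highlighter-rouge">len</code> still hold
their original values afterward, and may reuse those registers accordingly.</p>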

<h3 id="5-never-call-functions-from-inline-assembly">(5) Never call functions from inline assembly</h3>

<p>Many things can go wrong because the semantics cannot be expressed using
inline assembly constraints. The stack may not be aligned, and you’ll
clobber the redzone. (Yes, there’s a <code class="language-plaintext highlighter-rouge">"redzone"</code> clobber, but it’s
insufficient to actually make a function call.) Do not do it. Tutorials
like to show it because it makes for a simple demonstration, but all those
examples are littered with defects.</p>

<p>System calls are fine. Basic assembly may call functions when used at the
top level or inside naked functions. The <code class="language-plaintext highlighter-rouge">goto</code> qualifier, used correctly, allows jumps
to be safely expressed to the compiler. Just don’t use <code class="language-plaintext highlighter-rouge">call</code> in extended
assembly.</p>

<h3 id="6-do-not-define-absolute-assembly-labels">(6) Do not define absolute assembly labels</h3>

<p>That is, if you need to jump within your assembly block, such as for a
loop, do not write a named label:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myloop:
    ...
    jz myloop
</code></pre></div></div>

<p>Your inline assembly is part of a function, and that function may be
cloned or inlined, in which case there will be <em>multiple copies of your
assembly block</em> in the translation unit. The assembler will see duplicate
label names and reject the program. Until that function is inlined,
perhaps at a high optimization level, this will likely work as expected.
On the plus side it’s a loud compile time error when it doesn’t work.</p>

<p>In inline assembly you can have the compiler generate a unique label with
<code class="language-plaintext highlighter-rouge">%=</code>, but my preferred solution is the <a href="https://sourceware.org/binutils/docs/as/Symbol-Names.html">local labels</a> feature of the
assembler:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0:
    ...
    jz 0b
</code></pre></div></div>

<p>In this case the assembler generates unique labels, and the number <code class="language-plaintext highlighter-rouge">0</code>
isn’t the literal label name. <code class="language-plaintext highlighter-rouge">0b</code> (“backward”) refers to the previous <code class="language-plaintext highlighter-rouge">0</code>
label, and <code class="language-plaintext highlighter-rouge">0f</code> (“forward”) would refer to the next <code class="language-plaintext highlighter-rouge">0</code> label. Perfectly
unambiguous.</p>
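
<p>Putting it together, here is a sketch of a loop inside extended assembly
using local labels. It assumes x86-64 with the default AT&amp;T syntax, and the
function name <code class="language-plaintext highlighter-rouge">count_bytes</code> is hypothetical. Since the pointer is advanced
by the assembly, it is a read-write output per rule (4).</p>

```c
#include <stddef.h>

// Length of a null-terminated string, with the loop written in
// assembly (assumes x86-64, AT&T syntax).
static size_t count_bytes(const char *s)
{
    const char *p = s;
    __asm__ volatile (
        "0:\n"
        "    cmpb $0, (%0)\n"   // reached the terminator?
        "    je   1f\n"         // yes: jump forward to the next "1" label
        "    inc  %0\n"
        "    jmp  0b\n"         // no: jump backward to the previous "0"
        "1:\n"
        : "+r"(p)
        :
        : "memory", "cc"
    );
    return (size_t)(p - s);
}
```

<p>However many times the compiler inlines or clones this function, each
copy’s <code class="language-plaintext highlighter-rouge">0b</code> and <code class="language-plaintext highlighter-rouge">1f</code> resolve to the nearest labels in that copy.</p>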

<h3 id="naturally-occurring-practice-problems">Naturally occurring practice problems</h3>

<p>Now that you’ve made it this far, here’s an exercise for practice: Search
online for “inline assembly tutorial” and count the defects you find by
applying my 6 rules. You’ll likely find at least one per result that isn’t
<a href="https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html">official compiler documentation</a>. Besides tutorials and reviewing
real programs, you could <a href="/blog/2024/11/10/">ask an LLM to generate inline assembly</a>, as
they’ve been trained to produce these common defects.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Windows dynamic linking depends on the active code page</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/10/07/"/>
    <id>urn:uuid:cc7861a5-aaa0-4a27-8867-9f48cf72e444</id>
    <updated>2024-10-07T19:50:17Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Windows paths have been <a href="https://simonsapin.github.io/wtf-8/#ill-formed-utf-16">WTF-16</a>-encoded for decades, but module names
in the <a href="/blog/2024/06/30/">import tables</a> of <a href="https://learn.microsoft.com/en-us/windows/win32/debug/pe-format">Portable Executable</a> are octets.
If a name contains values beyond ASCII — technically out of spec — then
the dynamic linker must somehow decode those octets into Unicode in order
to construct a lookup path. There are multiple ways this could be done,
and the most obvious is the process’s active code page (ACP), which is
exactly what happens. As a consequence, the specific DLL loaded by the
linker may depend on the system code page. In this article I’ll contrive
such a situation.</p>

<p><a href="https://learn.microsoft.com/en-us/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibrarya">LoadLibraryA</a> is a similar situation, and potentially applies the code
page to a longer portion of the module path. <a href="https://learn.microsoft.com/en-us/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibraryw">LoadLibraryW</a> is
unaffected, at least for the directly-named module, because it’s Unicode
all the way through.</p>

<p>For my contrived demonstration I came up with two names that to
English-reading eyes appear as two words with extraneous markings:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">Ãµral.dll</code>: CP-1252=<code class="language-plaintext highlighter-rouge">"C3 B5 …"</code></li>
  <li><code class="language-plaintext highlighter-rouge">õral.dll</code>: CP-1252=<code class="language-plaintext highlighter-rouge">"F5 …"</code>; UTF-8=<code class="language-plaintext highlighter-rouge">"C3 B5 …"</code></li>
</ul>

<p>Both end with <code class="language-plaintext highlighter-rouge">ral.dll</code>. I’ve included the <a href="https://en.wikipedia.org/wiki/Windows-1252">CP-1252</a> encoding for the
differing prefixes, and the UTF-8 encoding for the second. I’m using
CP-1252 because it’s the most common system code page in the world,
especially in the Western hemisphere. Due to case insensitivity, the actual
DLL may be named <code class="language-plaintext highlighter-rouge">ãµral.dll</code> — i.e. to match the second library case — but
the module name <em>must</em> be encoded as uppercase when <a href="/blog/2021/05/31/">building the import
library</a>. Alternatively the second could be <code class="language-plaintext highlighter-rouge">Õral.dll</code>, particularly
because I won’t use it when constructing an import library.</p>

<p>The plan is to store the octets <code class="language-plaintext highlighter-rouge">C3 B5 …</code> in the import table. A process
using CP-1252 decodes it to <code class="language-plaintext highlighter-rouge">Ãµral.dll</code>. In the UTF-8 code page it decodes
to <code class="language-plaintext highlighter-rouge">õral.dll</code>. For testing we can use an <a href="/blog/2021/12/30/">application manifest</a> to
control the code page for a particular PE image — a lot easier than
changing the system code page. Otherwise, this trick could dynamically
change the behavior of a program in response to the system code page
without actually inspecting the active code page.</p>
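
<p>The two interpretations of those prefix octets can be sketched in portable
C. This is only an illustration of the arithmetic, not what the loader
actually runs: the function names are hypothetical, the CP-1252 decoder
covers only the octets that map directly to the same code point, and the
UTF-8 decoder handles only two-octet sequences.</p>

```c
#include <stdint.h>

// Decode one octet as CP-1252. Octets 0x00 to 0x7f and 0xa0 to 0xff map
// to the code point of the same value. The 0x80 to 0x9f block needs a
// table, which this sketch omits, so it returns -1 for those.
static int32_t decode_cp1252(unsigned char b)
{
    return (b >= 0x80 && b <= 0x9f) ? -1 : (int32_t)b;
}

// Decode a two-octet UTF-8 sequence, or return -1 if it is not one.
static int32_t decode_utf8_2(unsigned char b0, unsigned char b1)
{
    if ((b0 & 0xe0) != 0xc0 || (b1 & 0xc0) != 0x80) {
        return -1;
    }
    int32_t c = ((int32_t)(b0 & 0x1f) << 6) | (b1 & 0x3f);
    return c < 0x80 ? -1 : c;  // reject overlong encodings
}
```

<p>Under CP-1252 the octets <code class="language-plaintext highlighter-rouge">C3 B5</code> decode to two code points, U+00C3 and
U+00B5. Under UTF-8 the same pair decodes to the single code point U+00F5.</p>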

<p>The libraries will have a single function <code class="language-plaintext highlighter-rouge">get</code>, which returns a string
indicating which library was loaded:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define X(s) #s
#define S(s) X(s)
</span><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">get</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">S</span><span class="p">(</span><span class="n">V</span><span class="p">);</span> <span class="p">}</span>
</code></pre></div></div>

<p>Constructing the import library can be tricky because you must consider
how the toolchain, editors, and shells decode and encode text, which may
involve the build system’s code page. It’s shockingly difficult to script!
Binutils <code class="language-plaintext highlighter-rouge">dlltool</code> cannot process these names and cannot be used at all.
With bleeding edge <a href="https://github.com/skeeto/w64devkit">w64devkit</a> I could reliably construct the DLLs and
import library like so, even in a script (Windows 10 and later only):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -DV=UTF-8 -o Õral.dll  detect.c
$ gcc -shared -DV=ANSI  -o Ãµral.dll detect.c -Wl,--out-implib=detect.lib
</code></pre></div></div>

<p>That produces two DLLs and one import library, <code class="language-plaintext highlighter-rouge">detect.lib</code>, with the
desired module name octets. A straightforward MSVC <code class="language-plaintext highlighter-rouge">cl</code> invocation also
works so long as it’s not from a batch file. It will quite correctly warn
about the strange name situation, which I like. My test program, <code class="language-plaintext highlighter-rouge">main.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">char</span> <span class="o">*</span><span class="nf">get</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="n">puts</span><span class="p">(</span><span class="n">get</span><span class="p">());</span> <span class="p">}</span>
</code></pre></div></div>

<p>I link <code class="language-plaintext highlighter-rouge">detect.lib</code> when I build it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -o main.exe main.c detect.lib
</code></pre></div></div>

<p>I designed <a href="/blog/2024/06/30/"><code class="language-plaintext highlighter-rouge">peports</code></a> to print non-ASCII octets unambiguously
(<code class="language-plaintext highlighter-rouge">\xXX</code>), and it’s the only tool I know that does so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ peports main.exe | tail -n 2
\xc3\xb5ral.dll
        1       get
</code></pre></div></div>

<p>The module name has the <code class="language-plaintext highlighter-rouge">C3 B5 …</code> prefix octets. When I run it under my
system code page, CP-1252:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./main
ANSI
</code></pre></div></div>

<p>If I <a href="/blog/2021/12/30/">add a UTF-8 manifest</a>, even just a “side-by-side” manifest, it
loads the other library despite an identical import table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -o main.exe main.c detect.lib libwinsane.o
$ ./main
UTF-8
</code></pre></div></div>

<p>Again, without the manifest, if I switched my system code page to UTF-8
then <code class="language-plaintext highlighter-rouge">UTF-8</code> would still be the result.</p>

<p>I can’t think of much practical use for this trick outside of malware. In
a real program it would be simpler to inspect the code page, and there’s no
benefit to avoiding such a check if it’s needed. Malware could use it to
trick inspection tools and scanners that decode module names differently
than the dynamic linker. Such tools often incorrectly assume UTF-8, which
is what motivated this article.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Giving C++ std::regex a C makeover</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/09/04/"/>
    <id>urn:uuid:83fb81ed-290e-4bc7-87bd-d0bbc6c01d25</id>
    <updated>2024-09-04T17:15:07Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re working in C using one of the major toolchains — that is,
it’s mainly a C++ implementation — and you need regular expressions. You
could integrate a library, but there’s a regex implementation in the C++
standard library included with your compiler, just within reach. As a
resourceful engineer, using an asset already in hand seems prudent. But
it’s a C++ interface, and you’re using C instead of C++ for a reason,
perhaps <em>to avoid dealing with C++</em>. Have no worries. This article is
about wrapping <a href="https://en.cppreference.com/w/cpp/regex"><code class="language-plaintext highlighter-rouge">std::regex</code></a> in a tidy C interface which not only
hides all the C++ machinery, but <em>utterly tames it</em>. It’s not so much
practical as a potpourri of interesting techniques.</p>

<p>If you’d like to skip ahead, here’s the full source up front. Tested with
<a href="https://github.com/skeeto/w64devkit">w64devkit</a>, MSVC <code class="language-plaintext highlighter-rouge">cl</code>, and <code class="language-plaintext highlighter-rouge">clang-cl</code>: <strong><a href="https://github.com/skeeto/scratch/tree/master/regex-wrap">scratch/regex-wrap</a></strong></p>

<h3 id="interface-design">Interface design</h3>

<p>The C interface I came up with, <code class="language-plaintext highlighter-rouge">regex.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma once
#include</span> <span class="cpf">&lt;stddef.h&gt;</span><span class="cp">
</span>
<span class="cp">#define S(s) (str){s, sizeof(s)-1}
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">str</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">arena</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">regex</span> <span class="n">regex</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">str</span>      <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">strlist</span><span class="p">;</span>

<span class="n">regex</span>  <span class="o">*</span><span class="nf">regex_new</span><span class="p">(</span><span class="n">str</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>
<span class="n">strlist</span> <span class="nf">regex_match</span><span class="p">(</span><span class="n">regex</span> <span class="o">*</span><span class="p">,</span> <span class="n">str</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Longtime readers will find it familiar: <a href="/blog/2023/10/08/">my favorite</a> non-owning,
counted string form in place of null-terminated strings — similar to C++
<code class="language-plaintext highlighter-rouge">std::string_view</code> — and <a href="/blog/2023/09/27/">arena allocation</a>. Yes, such fundamental
types wouldn’t “belong” to a regex library like this, but imagine they’re
standardized by the project or whatever. Also, this is purely a C header,
not a C/C++ polyglot, and will not be used by the C++ portion.</p>

<p>In particular note the lack of “free” functions. <strong>The regex engine
allocates everything in the arena</strong>, including all temporary working
memory used while compiling, matching, etc. So in a sense, it could be
called <a href="/blog/2018/06/10/">a <em>non-allocating library</em></a>. This requires a bit of C++
abuse: I will not call some C++ regex destructors. It shouldn’t matter
because they only redundantly manage memory in the arena.  (If regex
objects are holding file handles or something else unnecessary then the
implementation is so poor as to not be worth using, and we should just use a
better regex library.)</p>

<p>Now’s a good time to mention a caveat: In order to pull this off the regex
library lives in its own Dynamic-Link Library with its own copy of the C++
standard library, i.e. statically linked. My demo is Windows-only, but
this concept theoretically extends to shared objects on Linux. Since it’s
a C interface that doesn’t expose standard library objects, the DLL can be
used by programs compiled with different toolchains. Though that wouldn’t
apply to my inciting hypothetical.</p>

<p>Example usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">regex</span>  <span class="o">*</span><span class="n">re</span> <span class="o">=</span> <span class="n">regex_new</span><span class="p">(</span><span class="n">S</span><span class="p">(</span><span class="s">"(</span><span class="se">\\</span><span class="s">w+)"</span><span class="p">),</span> <span class="n">perm</span><span class="p">);</span>
<span class="n">str</span>     <span class="n">s</span>  <span class="o">=</span> <span class="n">S</span><span class="p">(</span><span class="s">"Hello, world! This is a test."</span><span class="p">);</span>
<span class="n">strlist</span> <span class="n">m</span>  <span class="o">=</span> <span class="n">regex_match</span><span class="p">(</span><span class="n">re</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">m</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%2td = %.*s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">m</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">len</span><span class="p">,</span> <span class="n">m</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This program prints:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 0 = Hello
 1 = world
 2 = This
 3 = is
 4 = a
 5 = test
</code></pre></div></div>

<p>If matching lots of source strings, scope the arena to the loop and then
the results, and any regex working memory, are automatically freed in O(1)
at the end of each iteration:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ninputs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">arena</span>   <span class="n">scratch</span> <span class="o">=</span> <span class="o">*</span><span class="n">perm</span><span class="p">;</span>
    <span class="n">strlist</span> <span class="n">matches</span> <span class="o">=</span> <span class="n">regex_match</span><span class="p">(</span><span class="n">re</span><span class="p">,</span> <span class="n">inputs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
    <span class="c1">// ... consume matches ...</span>
<span class="p">}</span>
</code></pre></div></div>
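
<p>To see why the scoped copy frees everything at once, here is a minimal
sketch of a from-the-end arena allocator over the same <code class="language-plaintext highlighter-rouge">arena</code> struct. The
<code class="language-plaintext highlighter-rouge">arena_alloc</code> helper is hypothetical and not part of the interface above. It
assumes <code class="language-plaintext highlighter-rouge">align</code> is a power of two and returns null when out of memory.</p>

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Hypothetical helper: carve size bytes off the end of the arena,
// aligned to align (a power of two). Returns null if it does not fit.
static void *arena_alloc(arena *a, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t avail = a->end - a->beg;
    if (size < 0 || size > avail) {
        return 0;
    }
    char     *p   = a->end - size;
    ptrdiff_t pad = (ptrdiff_t)((uintptr_t)p & (uintptr_t)(align - 1));
    if (pad > avail - size) {
        return 0;
    }
    a->end = p - pad;  // round the result down to the alignment
    return a->end;
}
```

<p>Allocating only moves <code class="language-plaintext highlighter-rouge">end</code> in whichever <code class="language-plaintext highlighter-rouge">arena</code> struct was passed in.
When <code class="language-plaintext highlighter-rouge">scratch</code> is a by-value copy of <code class="language-plaintext highlighter-rouge">*perm</code>, the original never changes,
so discarding the copy at the end of the iteration releases everything
allocated through it.</p>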

<h3 id="c-implementation">C++ implementation</h3>

<p>On the C++ side the first thing I do is replace <code class="language-plaintext highlighter-rouge">new</code> and <code class="language-plaintext highlighter-rouge">delete</code>, which
is how I force it to allocate from the arena. This replaces <code class="language-plaintext highlighter-rouge">new</code>/<code class="language-plaintext highlighter-rouge">delete</code>
<em>globally</em>, but recall that the regex library has its own, private C++
implementation. Replacements apply only to itself even if there’s other
C++ present in the process. If this is the only C++ in the process then it
doesn’t require such careful isolation.</p>

<p>I can’t tell <code class="language-plaintext highlighter-rouge">std::regex</code> about the arena — it calls <code class="language-plaintext highlighter-rouge">operator new</code> the
usual way, without extra arguments — so I have to smuggle it in through a
thread-local variable:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">thread_local</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">;</span>
</code></pre></div></div>

<p>If I’m sure the library is only used by a single thread then I can omit
<code class="language-plaintext highlighter-rouge">thread_local</code>, but it’s useful here to demonstrate and measure. Using it
in my operator replacements:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="k">operator</span> <span class="nf">new</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span> <span class="n">align</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">arena</span>    <span class="o">*</span><span class="n">a</span>     <span class="o">=</span> <span class="n">perm</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">ssize</span> <span class="o">=</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">pad</span>   <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">&amp;</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ssize</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">ssize</span> <span class="o">&gt;</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">-</span> <span class="n">pad</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">bad_alloc</span><span class="p">{};</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-=</span> <span class="n">size</span> <span class="o">+</span> <span class="n">pad</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="o">*</span><span class="k">operator</span> <span class="k">new</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="k">operator</span> <span class="k">new</span><span class="p">(</span>
        <span class="n">size</span><span class="p">,</span>
        <span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="p">(</span><span class="n">__STDCPP_DEFAULT_NEW_ALIGNMENT__</span><span class="p">)</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Starting in C++17, replacing the global allocator requires definitions for
both plain <code class="language-plaintext highlighter-rouge">new</code>/<code class="language-plaintext highlighter-rouge">delete</code> and aligned <code class="language-plaintext highlighter-rouge">new</code>/<code class="language-plaintext highlighter-rouge">delete</code>. The <a href="https://en.cppreference.com/w/cpp/memory/new/operator_new">many other
variants</a>, including arrays, call these four and so may be skipped.
Allocating over-aligned objects isn’t a special case for arenas, so I
implemented plain <code class="language-plaintext highlighter-rouge">new</code> by calling aligned <code class="language-plaintext highlighter-rouge">new</code>. I’d prefer to <a href="/blog/2024/04/14/">allocate
through a template</a> so that I can “see” the type, but that’s not an
option in this case.</p>

<p>After converting to signed sizes <a href="/blog/2024/05/24/">because they’re simpler</a>, it’s the
usual from-the-end allocation. I prefer <code class="language-plaintext highlighter-rouge">-fno-exceptions</code> but <code class="language-plaintext highlighter-rouge">std::regex</code>
is inherently <em>exceptional</em> — and I mean that in at least two bad ways —
so they’re required. The good news is this library gracefully and reliably
handles out-of-memory errors. (The arena makes this trivial to test, so
try it for yourself!)</p>

<p>I added a little extra flair replacing <code class="language-plaintext highlighter-rouge">delete</code>:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="k">operator</span> <span class="k">delete</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span> <span class="k">noexcept</span> <span class="p">{}</span>
<span class="kt">void</span> <span class="k">operator</span> <span class="k">delete</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">align_val_t</span><span class="p">)</span> <span class="k">noexcept</span> <span class="p">{}</span>

<span class="kt">void</span> <span class="k">operator</span> <span class="k">delete</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span> <span class="k">noexcept</span>
<span class="p">{</span>
    <span class="n">arena</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">perm</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">==</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The two mandatory replacements are no-ops because that’s simply how arenas
work. We don’t free individual objects, but many at once. It’s <em>completely
optional</em>, but I also replaced sized <code class="language-plaintext highlighter-rouge">delete</code> for little other reason than
<a href="/blog/2023/12/17/">sized deallocation is cool</a>. C++ destructs in reverse order, so
this is likely to work out. At least with GCC libstdc++, it freed about a
third of the workspace memory before returning to C. I’d rather it didn’t
try to free anything at all, but since it’s going to call <code class="language-plaintext highlighter-rouge">delete</code> anyway
I can get some use out of it.</p>
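
<p>To see why reverse-order destruction matters for the sized <code class="language-plaintext highlighter-rouge">delete</code> trick, here is a small model in plain C. It is only a sketch: the <code class="language-plaintext highlighter-rouge">alloc</code>, <code class="language-plaintext highlighter-rouge">release</code>, and <code class="language-plaintext highlighter-rouge">leftover</code> names are hypothetical, and alignment padding is ignored, which the real replacement above does not do.</p>

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical minimal arena that allocates downward from `end`.
typedef struct {
    char *beg;
    char *end;
} arena;

// Allocate from the end of the arena; returns null when out of memory.
// Assumption: alignment padding is ignored to keep the model small.
static void *alloc(arena *a, ptrdiff_t size)
{
    assert(size >= 0);
    if (size > a->end - a->beg) {
        return 0;
    }
    return a->end -= size;
}

// Sized release: reclaim only if p is the most recent live allocation.
static void release(arena *a, void *p, ptrdiff_t size)
{
    if (a->end == (char *)p) {
        a->end += size;
    }
}

// Allocate three objects (8, 16, 24 bytes), release them in the given
// order (indices into allocation order), and return the bytes still used.
static ptrdiff_t leftover(int first, int second, int third)
{
    static char mem[64];
    arena a = {mem, mem + sizeof(mem)};
    ptrdiff_t sizes[3] = {8, 16, 24};
    void *p[3];
    for (int i = 0; i < 3; i++) {
        p[i] = alloc(&a, sizes[i]);
    }
    int order[3] = {first, second, third};
    for (int i = 0; i < 3; i++) {
        release(&a, p[order[i]], sizes[order[i]]);
    }
    return (mem + sizeof(mem)) - a.end;
}
```

<p>Releasing in reverse allocation order, <code class="language-plaintext highlighter-rouge">leftover(2, 1, 0)</code>, recovers everything. Releasing in allocation order, <code class="language-plaintext highlighter-rouge">leftover(0, 1, 2)</code>, recovers only the last object, because the first two are not at the end of the arena when they are released.</p>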

<p>Interesting side note: In a rough benchmark these replacements made MSVC
<code class="language-plaintext highlighter-rouge">std::regex</code> matching four times faster! I expected a <em>small</em> speedup, but
not that. In the typical case it appears to be wasting most of its time on
allocation. On the other hand, libstdc++ <code class="language-plaintext highlighter-rouge">std::regex</code> is overall quite a
bit slower than MSVC, and my replacements had no performance effect. It’s
spending its time elsewhere, and the small gains are lost interacting with
the thread-local.</p>

<p>Finally the meat:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="s">"C"</span> <span class="n">std</span><span class="o">::</span><span class="n">regex</span> <span class="o">*</span><span class="nf">regex_new</span><span class="p">(</span><span class="n">str</span> <span class="n">re</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">perm</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="k">return</span> <span class="k">new</span> <span class="n">std</span><span class="o">::</span><span class="n">regex</span><span class="p">(</span><span class="n">re</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">re</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="n">re</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">catch</span> <span class="p">(...)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="p">{};</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It sets the thread-local to the arena, then constructs with “iterators” at
each end of the input. All exceptions are caught and turned into a null
return. Depending on need, we may want to indicate <em>why</em> it failed — out
of memory, invalid regex, etc. — by returning an error value of some sort.
An exercise for the reader.</p>

<p>The matcher is a little more complicated:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="s">"C"</span> <span class="n">strlist</span> <span class="nf">regex_match</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">regex</span> <span class="o">*</span><span class="n">re</span><span class="p">,</span> <span class="n">str</span> <span class="n">s</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">perm</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cregex_iterator</span> <span class="n">it</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="o">*</span><span class="n">re</span><span class="p">);</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cregex_iterator</span> <span class="n">end</span><span class="p">;</span>

        <span class="n">strlist</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{};</span>
        <span class="n">r</span><span class="p">.</span><span class="n">len</span>  <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">distance</span><span class="p">(</span><span class="n">it</span><span class="p">,</span> <span class="n">end</span><span class="p">);</span>
        <span class="n">r</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="k">new</span> <span class="n">str</span><span class="p">[</span><span class="n">r</span><span class="p">.</span><span class="n">len</span><span class="p">]();</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">it</span> <span class="o">!=</span> <span class="n">end</span><span class="p">;</span> <span class="n">it</span><span class="o">++</span><span class="p">,</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">position</span><span class="p">();</span>
            <span class="n">r</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">len</span>  <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">length</span><span class="p">();</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="n">r</span><span class="p">;</span>

    <span class="p">}</span> <span class="k">catch</span> <span class="p">(...)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="p">{};</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I create a <code class="language-plaintext highlighter-rouge">char *</code> “cregex” iterator, again giving it each end of the
input. I hope it’s not just making a copy (MSVC <code class="language-plaintext highlighter-rouge">std::regex</code> does <em>grumble
grumble</em>). The result is allocated out of the arena. As before, exceptions
convert to a null return. Callers can distinguish errors because no-match
results have a non-null pointer. The iterator, being a local variable, is
destroyed before returning, uselessly calling <code class="language-plaintext highlighter-rouge">delete</code>. I could avoid this
by allocating it with <code class="language-plaintext highlighter-rouge">new</code>, but in practice it doesn’t matter.</p>

<p>You might have noticed the lack of <code class="language-plaintext highlighter-rouge">__declspec(dllexport)</code>. <a href="/blog/2023/08/27/">DEF files are
great</a>, and I’ve come to appreciate and prefer them. GCC and MSVC
accept them as another input on the command line, and the source need not
be aware of exports. My <code class="language-plaintext highlighter-rouge">regex.def</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LIBRARY regex
EXPORTS
regex_new
regex_match
</code></pre></div></div>

<p>In w64devkit, the command to build the DLL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ g++ -shared -std=c++17 -o regex.dll regex.cpp regex.def
</code></pre></div></div>

<p>The MSVC command almost maps 1:1 to the GCC command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /LD /std:c++17 /EHsc regex.cpp regex.def
</code></pre></div></div>

<p>In either case only the C interface is exported (via <a href="/blog/2024/06/30/">peports</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ peports -e regex.dll
EXPORTS
        1       regex_match
        2       regex_new
</code></pre></div></div>

<h3 id="reasons-against">Reasons against</h3>

<p>Though this library is conveniently on hand, and my minimalist C wrapper
interface is nicer than a typical C regex library interface, and even
hides some <code class="language-plaintext highlighter-rouge">std::regex</code> problems, trade-offs must be considered:</p>

<ul>
  <li>No Unicode support, particularly UTF-8</li>
  <li><code class="language-plaintext highlighter-rouge">std::regex</code> implementations are universally poor and slow</li>
  <li>libstdc++ <code class="language-plaintext highlighter-rouge">std::regex</code> is especially slow to compile</li>
  <li>Isolating in a DLL (if needed) is inconvenient</li>
  <li>DLL is 200K (MSVC) to 700K (GCC) or so</li>
</ul>

<p>Depending on what I’m doing, some of these may have me looking elsewhere.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Deep list copy: More than meets the eye</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/07/31/"/>
    <id>urn:uuid:1eb18920-bb29-4d8a-b9f5-2495d3eab697</id>
    <updated>2024-07-31T18:49:57Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>I recently came across a take-home C programming test which had more depth
and complexity than I suspect the interviewer intended. While considering
it, I also came up with a novel, or at least unconventional, solution. The
problem is to deep copy a linked list where each node references a random
list element in addition to usual linkage — similar to <a href="https://leetcode.com/problems/copy-list-with-random-pointer/">LeetCode problem
138</a>. This reference is one of identity rather than value, which has
murky consequences.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">node</span> <span class="n">node</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">node</span> <span class="p">{</span>
    <span class="n">node</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="n">node</span> <span class="o">*</span><span class="n">ref</span><span class="p">;</span>   <span class="c1">// arbitrary node in the list, or null</span>
<span class="p">};</span>

<span class="n">node</span> <span class="o">*</span><span class="nf">deepcopy</span><span class="p">(</span><span class="n">node</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>In the copy, nodes have individual lifetimes allocated using <code class="language-plaintext highlighter-rouge">malloc</code>
which the caller is responsible for freeing. <a href="https://www.youtube.com/watch?v=f4ioc8-lDc0&amp;t=4407s">While thickheaded</a>, this
is conventional, and I cannot blame the test’s designer for sticking to
familiar textbook concepts. My special solution handles this constraint in
stride. (In a well-written program the whole list would have <a href="/blog/2023/09/27/">a single
lifetime</a> likely shared with yet more objects.)</p>

<p>Ignoring <code class="language-plaintext highlighter-rouge">ref</code>, copying the normal list linkage is trivial. Walk the
original list, allocate a new node each iteration, and append it to the
result. The hard part is resolving <code class="language-plaintext highlighter-rouge">ref</code>. Given an arbitrary node pointer,
we must determine to which of the original list nodes it points, then find
the node at the matching position in the new list. Naively we could scan
the old list to search for a match:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">node</span> <span class="o">*</span><span class="n">old</span> <span class="o">=</span> <span class="n">oldlist</span><span class="p">;</span>
<span class="n">node</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="n">newlist</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">old</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">old</span><span class="o">-&gt;</span><span class="n">ref</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">node</span> <span class="o">*</span><span class="n">findold</span> <span class="o">=</span> <span class="n">oldlist</span><span class="p">;</span>
        <span class="n">node</span> <span class="o">*</span><span class="n">findnew</span> <span class="o">=</span> <span class="n">newlist</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">old</span><span class="o">-&gt;</span><span class="n">ref</span> <span class="o">==</span> <span class="n">findold</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">new</span><span class="o">-&gt;</span><span class="n">ref</span> <span class="o">=</span> <span class="n">findnew</span><span class="p">;</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="n">findold</span> <span class="o">=</span> <span class="n">findold</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
            <span class="n">findnew</span> <span class="o">=</span> <span class="n">findnew</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">old</span> <span class="o">=</span> <span class="n">old</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="n">new</span> <span class="o">=</span> <span class="n">new</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The nested loops are obviously quadratic time. That won’t scale well. To
do better we need some way to map, by identity, old list nodes onto new
list nodes. However, <a href="/blog/2016/05/30/">pointers do not necessarily have a value on which we
could key a map</a>. Other languages do not even expose such a concept,
or at least hide it behind some “unsafe” mechanism. In that case it seems
the best we could do is quadratic time.</p>

<h3 id="solution-by-temporary-mutation">Solution by temporary mutation</h3>

<p>If we’re free to <em>temporarily</em> modify the original list, then we can use
memory as a map. After all, memory itself is a kind of pointer-to-object
map! Since we only get one such map per process, we’ll need to commandeer
the original list during the copy. The trick is to interleave the two
lists when constructing the new list:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>old1 -&gt; new1 -&gt; old2 -&gt; new2 -&gt; ... -&gt; null
</code></pre></div></div>

<p>That might look like (note the double-skip per iteration):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="n">node</span> <span class="o">*</span><span class="n">old</span> <span class="o">=</span> <span class="n">oldlist</span><span class="p">;</span> <span class="n">old</span><span class="p">;</span> <span class="n">old</span> <span class="o">=</span> <span class="n">old</span><span class="o">-&gt;</span><span class="n">next</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">node</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">new</span><span class="p">));</span>
    <span class="n">new</span><span class="o">-&gt;</span><span class="n">ref</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">new</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">old</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="n">old</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">new</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When we have a pointer to an old list node, the node itself points to the
matching new list node.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="n">node</span> <span class="o">*</span><span class="n">old</span> <span class="o">=</span> <span class="n">oldlist</span><span class="p">;</span> <span class="n">old</span><span class="p">;</span> <span class="n">old</span> <span class="o">=</span> <span class="n">old</span><span class="o">-&gt;</span><span class="n">next</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">old</span><span class="o">-&gt;</span><span class="n">ref</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">old</span><span class="o">-&gt;</span><span class="n">next</span><span class="o">-&gt;</span><span class="n">ref</span> <span class="o">=</span> <span class="n">old</span><span class="o">-&gt;</span><span class="n">ref</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then before returning we’d need to deinterleave the lists, restoring the
old list and separating it from the result. This solution is linear time
and doesn’t require dealing with the concept of identity. However, modifying
the original list isn’t always possible: it won’t work if it’s accessed
concurrently — shared with another thread, accessed in a signal handler,
or something else reentrant — or if it’s in read-only memory.</p>
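
<p>Putting the passes together, including the deinterleave step described above, might look like the following sketch. The <code class="language-plaintext highlighter-rouge">count</code> and <code class="language-plaintext highlighter-rouge">ok</code> bookkeeping and the cleanup on <code class="language-plaintext highlighter-rouge">malloc</code> failure are my own assumptions about reasonable error handling, not part of the original test.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct node node;
struct node {
    node *next;
    node *ref;   // arbitrary node in the list, or null
};

// Deep copy by temporary interleaving. Returns null for an empty list,
// or if malloc fails, in which case the original list is still restored.
node *deepcopy(node *oldlist)
{
    // Pass 1: old1 -> new1 -> old2 -> new2 -> ... -> null
    ptrdiff_t count = 0;  // number of copies spliced in so far
    int ok = 1;
    for (node *old = oldlist; old; old = old->next->next) {
        node *new = malloc(sizeof(*new));
        if (!new) {
            ok = 0;
            break;
        }
        new->ref  = 0;
        new->next = old->next;
        old->next = new;
        count++;
    }

    // Pass 2: each old node now points at its copy, so resolve refs.
    if (ok) {
        for (node *old = oldlist; old; old = old->next->next) {
            if (old->ref) {
                old->next->ref = old->ref->next;
            }
        }
    }

    // Pass 3: deinterleave, restoring the original and linking the copy.
    node  *head = 0;
    node **tail = &head;
    node  *old  = oldlist;
    for (ptrdiff_t i = 0; i < count; i++) {
        node *new = old->next;
        old->next = new->next;  // original linkage restored
        *tail = new;
        tail  = &new->next;
        old   = old->next;
    }
    *tail = 0;
    assert(!ok || !old);  // a complete copy consumed the whole list

    if (!ok) {  // discard the partial copy
        while (head) {
            node *next = head->next;
            free(head);
            head = next;
        }
    }
    return head;
}
```

<p>The third pass only walks the pairs that were actually spliced in, so the original list comes back intact even when the copy is abandoned halfway.</p>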

<h3 id="solution-by-intrusive-hash-map">Solution by intrusive hash map</h3>

<p>If we can obtain a stable value from a pointer, i.e. <code class="language-plaintext highlighter-rouge">uintptr_t</code> — in
practice virtually always true — then there’s an interesting <code class="language-plaintext highlighter-rouge">O(n log n)</code>
solution using an <a href="/blog/2023/09/30/">intrusive map</a> which doesn’t modify the original
list. This is my own novel solution. The result will be simultaneously a
linked list and a hash map, and the caller won’t even know it! Because the
map is built into the list, with a caller-managed lifetime, we won’t free
anything before returning.</p>

<p>To start, linked list nodes are embedded at the front of hash trie nodes.
The caller will see this initial field, but not the hash trie fields.
Being at the front, the caller can still <code class="language-plaintext highlighter-rouge">free</code> them by this “internal”
pointer, which allows the hash trie to be invisible.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">map</span> <span class="n">map</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">map</span> <span class="p">{</span>
    <span class="n">node</span>  <span class="n">new</span><span class="p">;</span>
    <span class="n">map</span>  <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">node</span> <span class="o">*</span><span class="n">old</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The “key” is <code class="language-plaintext highlighter-rouge">old</code> and the “value” is <code class="language-plaintext highlighter-rouge">new</code>. Lookup and insert use the
usual “upsert” construction oriented around zero-initialization:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">node</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">map</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">node</span> <span class="o">*</span><span class="n">old</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">old</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// map null to null</span>
    <span class="p">}</span>

    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">old</span> <span class="o">*</span> <span class="mi">1111111111111111111ull</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="o">*</span><span class="n">m</span><span class="p">;</span> <span class="n">hash</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">old</span> <span class="o">==</span> <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">old</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">new</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>

    <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">map</span><span class="p">));</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">old</span> <span class="o">=</span> <span class="n">old</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">new</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the matching node doesn’t yet exist, the function creates it. Also note
how it returns an internal pointer. With “upsert” semantics, loop copying
is trivialized:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">node</span> <span class="o">*</span><span class="nf">deepcopy</span><span class="p">(</span><span class="n">node</span> <span class="o">*</span><span class="n">head</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">map</span> <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">node</span> <span class="o">*</span><span class="n">old</span> <span class="o">=</span> <span class="n">head</span><span class="p">;</span> <span class="n">old</span><span class="p">;</span> <span class="n">old</span> <span class="o">=</span> <span class="n">old</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">node</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m</span><span class="p">,</span> <span class="n">old</span><span class="p">);</span>
        <span class="n">new</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m</span><span class="p">,</span> <span class="n">old</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">);</span>
        <span class="n">new</span><span class="o">-&gt;</span><span class="n">ref</span>  <span class="o">=</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m</span><span class="p">,</span> <span class="n">old</span><span class="o">-&gt;</span><span class="n">ref</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m</span><span class="p">,</span> <span class="n">head</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
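
<p>To check that the caller really can treat the result as an ordinary list, here is a self-contained sketch. It restates the definitions above so that it compiles on its own. The <code class="language-plaintext highlighter-rouge">freelist</code>, <code class="language-plaintext highlighter-rouge">indexof</code>, and <code class="language-plaintext highlighter-rouge">mirrors</code> helpers are hypothetical, as is the choice to treat a failed <code class="language-plaintext highlighter-rouge">calloc</code> as fatal.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct node node;
struct node {
    node *next;
    node *ref;
};

typedef struct map map;
struct map {
    node  new;       // must come first so free() accepts a node pointer
    map  *child[4];
    node *old;
};

static node *upsert(map **m, node *old)
{
    if (!old) {
        return 0;  // map null to null
    }
    uint64_t hash = (uintptr_t)old * 1111111111111111111ull;
    for (; *m; hash <<= 2) {
        if (old == (*m)->old) {
            return &(*m)->new;
        }
        m = &(*m)->child[hash>>62];
    }
    *m = calloc(1, sizeof(map));
    if (!*m) {
        abort();  // assumption: treat out-of-memory as fatal
    }
    (*m)->old = old;
    return &(*m)->new;
}

node *deepcopy(node *head)
{
    map *m = 0;
    for (node *old = head; old; old = old->next) {
        node *new = upsert(&m, old);
        new->next = upsert(&m, old->next);
        new->ref  = upsert(&m, old->ref);
    }
    return upsert(&m, head);
}

// Hypothetical helper: free a copy through plain node pointers.
void freelist(node *head)
{
    while (head) {
        node *next = head->next;
        free(head);
        head = next;
    }
}

// Position of target within the list, or -1 if absent (or null).
static ptrdiff_t indexof(node *head, node *target)
{
    ptrdiff_t i = 0;
    for (node *n = head; n; n = n->next, i++) {
        if (n == target) {
            return i;
        }
    }
    return -1;
}

// Hypothetical check: 1 if both lists have the same length and every
// ref lands on the same position within its own list.
int mirrors(node *orig, node *copy)
{
    node *o = orig, *c = copy;
    for (; o && c; o = o->next, c = c->next) {
        if (indexof(orig, o->ref) != indexof(copy, c->ref)) {
            return 0;
        }
    }
    return !o && !c;
}
```

<p>Because each list node sits at the front of its trie node, <code class="language-plaintext highlighter-rouge">freelist</code> releases the whole hidden map as a side effect, provided every <code class="language-plaintext highlighter-rouge">ref</code> points into the list as the problem statement requires.</p>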

<p>These easy-to-implement hash tries continue to be generally useful and
elegant, even with traditional memory management. Cloneable, runnable
source with tests is <a href="https://gist.github.com/skeeto/9aedc59629de75c07a9533dcfb83af66">available as a gist</a> if you’d like to play
around with it yourself.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Arenas and the almighty concatenation operator</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/05/25/"/>
    <id>urn:uuid:e88784ce-08fb-40d2-b6ad-c3d9af3cf5bc</id>
    <updated>2024-05-25T00:00:00Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>I continue to streamline <a href="/blog/2023/09/27/">an arena-based paradigm</a>, and stumbled
upon a concise technique for dynamic growth — an efficient, generic
“concatenate anything to anything” within an arena built atop a core of
9-ish lines of code. The key insight originated from a reader suggestion
about <a href="/blog/2023/10/05/">dynamic arrays</a>. The subject of concatenation can be a string,
dynamic array, or even something else. The “system” is extensible, and
especially useful for path handling.</p>

<p>Continuing <a href="/blog/2024/04/14/">from last time</a>, the examples are in light, C-style C++.
I chose it because templates and function overloading express the concepts
succinctly. It uses no standard library functionality, so converting to C,
or similar, should be straightforward. The core concatenation “operator”:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="n">T</span> <span class="nf">concat</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">T</span> <span class="n">head</span><span class="p">,</span> <span class="n">T</span> <span class="n">tail</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)(</span><span class="n">head</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="n">head</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="o">!=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">head</span> <span class="o">=</span> <span class="n">T</span><span class="p">{</span><span class="n">a</span><span class="p">,</span> <span class="n">head</span><span class="p">};</span>
    <span class="p">}</span>
    <span class="n">head</span><span class="p">.</span><span class="n">len</span> <span class="o">+=</span> <span class="n">T</span><span class="p">{</span><span class="n">a</span><span class="p">,</span> <span class="n">tail</span><span class="p">}.</span><span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This concatenates two objects of the same type in the arena, and does so
<em>in place</em> if possible. That is, we can efficiently build a value piece by
piece. The type <code class="language-plaintext highlighter-rouge">T</code> must have <code class="language-plaintext highlighter-rouge">data</code> and <code class="language-plaintext highlighter-rouge">len</code> members, and a “copy”
constructor that makes a copy of the given object at <em>the front of the
arena</em>. Size integer overflows and out-of-memory errors are, as usual,
handled by the arena. In particular, note that the <code class="language-plaintext highlighter-rouge">len</code> addition happens
after allocation.</p>

<p>Since the front-of-the-arena business is implicit, consider <code class="language-plaintext highlighter-rouge">assert</code>ing it if
you’re worried. I’ve also considered declaring a <code class="language-plaintext highlighter-rouge">clone</code> “operator” where
that behavior is an explicit part of its interface.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Make a copy of the object at the front of the arena.</span>
<span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span> <span class="n">T</span> <span class="nf">clone</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">T</span><span class="p">);</span>

<span class="c1">// In concat, replace the T{} constructors with clone:</span>
    <span class="n">head</span> <span class="o">=</span> <span class="n">clone</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">head</span><span class="p">);</span>
    <span class="n">head</span><span class="p">.</span><span class="n">len</span> <span class="o">+=</span> <span class="n">clone</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">tail</span><span class="p">).</span><span class="n">len</span><span class="p">;</span>
</code></pre></div></div>

<p>Strings are perhaps the most interesting subject of concatenation. Here’s
a compatible string, <code class="language-plaintext highlighter-rouge">str</code>, definition from my previous article:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">str</span> <span class="p">{</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="kt">uint8_t</span>    <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="kt">char</span> <span class="k">const</span> <span class="o">*</span><span class="n">cdata</span><span class="p">;</span>
    <span class="p">};</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="n">str</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="n">str</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">beg</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">end</span><span class="p">)</span> <span class="o">:</span> <span class="n">data</span><span class="p">{</span><span class="n">beg</span><span class="p">},</span> <span class="n">len</span><span class="p">{</span><span class="n">end</span><span class="o">-</span><span class="n">beg</span><span class="p">}</span> <span class="p">{}</span>

    <span class="k">template</span><span class="o">&lt;</span><span class="kt">ptrdiff_t</span> <span class="n">N</span><span class="p">&gt;</span>
    <span class="k">constexpr</span> <span class="n">str</span><span class="p">(</span><span class="kt">char</span> <span class="k">const</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="p">)[</span><span class="n">N</span><span class="p">])</span> <span class="o">:</span> <span class="n">cdata</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="n">len</span><span class="p">{</span><span class="n">N</span><span class="o">-</span><span class="mi">1</span><span class="p">}</span> <span class="p">{}</span>

    <span class="n">str</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">str</span><span class="p">);</span>  <span class="c1">// TODO</span>

    <span class="kt">uint8_t</span> <span class="o">&amp;</span><span class="k">operator</span><span class="p">[](</span><span class="kt">ptrdiff_t</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This has <code class="language-plaintext highlighter-rouge">data</code>, <code class="language-plaintext highlighter-rouge">len</code>, and the necessary constructor declaration. Before
showing the constructor definition, here’s an arena following the usual
formula, which should be familiar to those who’ve been following along:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">arena</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">,</span> <span class="k">typename</span> <span class="o">...</span><span class="nc">A</span><span class="p">&gt;</span>
<span class="n">T</span> <span class="o">*</span><span class="n">makefront</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">count</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">A</span> <span class="p">...</span><span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">size</span>  <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">);</span>
    <span class="kt">ptrdiff_t</span> <span class="n">align</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">&amp;</span> <span class="p">(</span><span class="k">alignof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">count</span> <span class="o">&lt;</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">-</span> <span class="n">align</span><span class="p">)</span><span class="o">/</span><span class="n">size</span><span class="p">);</span>  <span class="c1">// OOM</span>
    <span class="n">T</span> <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="n">T</span> <span class="o">*</span><span class="p">)(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">+</span> <span class="n">align</span><span class="p">);</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">+=</span> <span class="n">align</span> <span class="o">+</span> <span class="n">size</span><span class="o">*</span><span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">new</span> <span class="p">(</span><span class="n">r</span><span class="o">+</span><span class="n">i</span><span class="p">)</span> <span class="n">T</span><span class="p">(</span><span class="n">args</span><span class="p">...);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note how it bumps <code class="language-plaintext highlighter-rouge">beg</code>, not <code class="language-plaintext highlighter-rouge">end</code>, because it’s allocated at the front.
That opens the end of the object for concatenation. When it returns, <code class="language-plaintext highlighter-rouge">beg</code>
points just past the end of the new object, aligned to it. Later, <code class="language-plaintext highlighter-rouge">concat</code>
inspects <code class="language-plaintext highlighter-rouge">beg</code> to see if it can <em>extend in place</em>. That will be true if
nothing else has been allocated <em>at the front</em> in the meantime. That is,
we can allocate objects <em>at the end</em> — such as <a href="/blog/2023/09/30/">hash map nodes</a> —
while efficiently growing an object at the front through concatenation. If
it’s not true for whatever reason, concatenation still works, just with
reduced efficiency.</p>
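
<p>To see both cases side by side, here’s a minimal sketch. It condenses the
definitions above into byte-only stand-ins (a non-template <code class="language-plaintext highlighter-rouge">makefront</code> and
a <code class="language-plaintext highlighter-rouge">str</code> without the union), and the two checking functions,
<code class="language-plaintext highlighter-rouge">grows_in_place</code> and <code class="language-plaintext highlighter-rouge">copies_after_interruption</code>, are hypothetical names
introduced only for illustration:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct arena {
    char *beg;
    char *end;
};

// Simplified stand-in: allocates count bytes at the front of the arena.
uint8_t *makefront(ptrdiff_t count, arena *a)
{
    assert(count < a->end - a->beg);  // OOM
    uint8_t *r = (uint8_t *)a->beg;
    a->beg += count;
    return r;
}

struct str {
    uint8_t  *data = 0;
    ptrdiff_t len  = 0;

    str() = default;

    template<ptrdiff_t N>
    str(char const (&s)[N]) : data{(uint8_t *)s}, len{N-1} {}

    // Copies s to the front of the arena.
    str(arena *a, str s) : data{makefront(s.len, a)}, len{s.len}
    {
        for (ptrdiff_t i = 0; i < len; i++) {
            data[i] = s.data[i];
        }
    }
};

template<typename T>
T concat(arena *a, T head, T tail)
{
    if ((char *)(head.data+head.len) != a->beg) {
        head = T{a, head};  // head is not at the front: copy it there
    }
    head.len += T{a, tail}.len;
    return head;
}

// True if repeated appends never moved the string's data.
bool grows_in_place(arena scratch)
{
    str s = concat(&scratch, str{}, str{"/home/user"});
    uint8_t *start = s.data;
    s = concat(&scratch, s, str{"/.config"});
    s = concat(&scratch, s, str{"/rc"});
    return s.data == start && s.len == 10 + 8 + 3;
}

// True if an unrelated front allocation forced a copy of the head.
bool copies_after_interruption(arena scratch)
{
    str s = concat(&scratch, str{}, str{"abc"});
    uint8_t *start = s.data;
    makefront(1, &scratch);  // something else lands at the front
    s = concat(&scratch, s, str{"def"});
    return s.data != start && s.len == 6 && s.data[0] == 'a' && s.data[5] == 'f';
}
```

<p>In the second case the result is still correct. The head is simply copied
past the interruption before the tail is appended, which is the reduced
efficiency mentioned above.</p>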

<p>With that out of the way, the “copy” constructor is simple:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span><span class="o">::</span><span class="n">str</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">makefront</span><span class="o">&lt;</span><span class="kt">uint8_t</span><span class="o">&gt;</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="n">a</span><span class="p">);</span>
    <span class="n">len</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That’s everything we need to put it into action. For example, here’s a
function that deletes a file at a path built from a path template:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="nf">tocstr</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">str</span><span class="p">{</span><span class="s">"</span><span class="se">\0</span><span class="s">"</span><span class="p">}).</span><span class="n">data</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">bool</span> <span class="n">removeconfig</span><span class="p">(</span><span class="n">str</span> <span class="n">home</span><span class="p">,</span> <span class="n">str</span> <span class="n">program</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">str</span> <span class="n">path</span> <span class="o">=</span> <span class="p">{};</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">home</span><span class="p">);</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">str</span><span class="p">{</span><span class="s">"/.config/"</span><span class="p">});</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">program</span><span class="p">);</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">str</span><span class="p">{</span><span class="s">"/rc"</span><span class="p">});</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">unlink</span><span class="p">(</span><span class="n">tocstr</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">path</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First, <code class="language-plaintext highlighter-rouge">concat</code> does all the heavy lifting in a null-terminated “C string”
conversion function that operates in place if possible. In <code class="language-plaintext highlighter-rouge">removeconfig</code>
I construct a path from path components, starting from a zero-initialized
<em>null string</em>. In the first <code class="language-plaintext highlighter-rouge">concat</code>, this null string is “copied” into
the arena, laying a foundation for additional concatenations. Each path
component is copied in place, so unlike <a href="/blog/2021/07/30/">a dumb <code class="language-plaintext highlighter-rouge">strcat</code></a>, it’s not
quadratic.</p>

<p>Even more, notice it supports arbitrary path lengths. There’s no <code class="language-plaintext highlighter-rouge">PATH_MAX</code>,
<code class="language-plaintext highlighter-rouge">MAX_PATH</code>, etc.; the path grows into the arena as needed. No <a href="/blog/2024/02/05/">huge stack
variables</a> necessary, and the scratch arena automatically frees
the path on return. Fancier yet, imagine a variadic function that glues
path components together with the proper path delimiter, and it wouldn’t
involve <a href="/blog/2024/05/24/">a single, error-prone size calculation</a>.</p>

<p>The <code class="language-plaintext highlighter-rouge">str{}</code> business is unfortunate. The <code class="language-plaintext highlighter-rouge">char</code> array constructor normally
kicks in in these situations, but compilers can’t resolve the template
without an explicit <code class="language-plaintext highlighter-rouge">str</code> object. Perhaps there’s a workaround, but I’m
not yet savvy enough with C++ to figure it out. In the C version you’d
always need to wrap those literals in the string macro.</p>
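
<p>One possible workaround, sketched under the assumption that a small
forwarding overload per type is acceptable: a non-template <code class="language-plaintext highlighter-rouge">concat</code>
taking two <code class="language-plaintext highlighter-rouge">str</code> parameters. Nothing is deduced for it, so a string literal
converts implicitly through the array constructor. The definitions here are
condensed, byte-only stand-ins for the ones above, and <code class="language-plaintext highlighter-rouge">configpath</code> is a
hypothetical name used for the demonstration:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct arena {
    char *beg;
    char *end;
};

// Simplified stand-in: allocates count bytes at the front of the arena.
uint8_t *makefront(ptrdiff_t count, arena *a)
{
    assert(count < a->end - a->beg);  // OOM
    uint8_t *r = (uint8_t *)a->beg;
    a->beg += count;
    return r;
}

struct str {
    uint8_t  *data = 0;
    ptrdiff_t len  = 0;

    str() = default;

    template<ptrdiff_t N>
    str(char const (&s)[N]) : data{(uint8_t *)s}, len{N-1} {}

    // Copies s to the front of the arena.
    str(arena *a, str s) : data{makefront(s.len, a)}, len{s.len}
    {
        for (ptrdiff_t i = 0; i < len; i++) {
            data[i] = s.data[i];
        }
    }
};

template<typename T>
T concat(arena *a, T head, T tail)
{
    if ((char *)(head.data+head.len) != a->beg) {
        head = T{a, head};
    }
    head.len += T{a, tail}.len;
    return head;
}

// Hypothetical forwarding overload. Both parameters are plain str, so a
// literal argument is converted by the char array constructor.
str concat(arena *a, str head, str tail)
{
    return concat<str>(a, head, tail);  // explicit <str> selects the template
}

// Builds "<home>/.config/<program>/rc" with bare string literals.
str configpath(arena *a, str home, str program)
{
    str path = {};
    path = concat(a, path, home);
    path = concat(a, path, "/.config/");
    path = concat(a, path, program);
    path = concat(a, path, "/rc");
    return path;
}
```

<p>The cost is one such overload for each type that should accept literals,
so it trades a little boilerplate at the definition for less noise at every
call site.</p>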

<h3 id="extending-concatenation">Extending concatenation</h3>

<p>The “operator” can be extended by defining more overloads. For example, to
concatenate 32-bit integers to a string:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span> <span class="nf">concat</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">str</span> <span class="n">s</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint8_t</span>  <span class="n">buf</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">end</span> <span class="o">=</span> <span class="n">buf</span> <span class="o">+</span> <span class="n">countof</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">beg</span> <span class="o">=</span> <span class="n">end</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">neg</span> <span class="o">=</span> <span class="n">x</span><span class="o">&lt;</span><span class="mi">0</span> <span class="o">?</span> <span class="n">x</span> <span class="o">:</span> <span class="o">-</span><span class="n">x</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="o">*--</span><span class="n">beg</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">-</span> <span class="n">neg</span><span class="o">%</span><span class="mi">10</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">neg</span> <span class="o">/=</span> <span class="mi">10</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*--</span><span class="n">beg</span> <span class="o">=</span> <span class="sc">'-'</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="p">{</span><span class="n">beg</span><span class="p">,</span> <span class="n">end</span><span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>
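
<p>The conversion accumulates digits from a non-positive copy of the input
because <code class="language-plaintext highlighter-rouge">-x</code> overflows for the most negative <code class="language-plaintext highlighter-rouge">int32_t</code>, while every positive
value has a representable negative. Since division truncates toward zero,
the remainder lands in the range -9 to 0. Here’s a standalone sketch of the
same loop without the arena. The names <code class="language-plaintext highlighter-rouge">format32</code> and <code class="language-plaintext highlighter-rouge">formats_as</code> are
hypothetical helpers introduced for this check:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: formats x backwards from end and returns the
// first byte written. The caller must leave 11 bytes of room before end.
uint8_t *format32(uint8_t *end, int32_t x)
{
    uint8_t *beg = end;
    int32_t  neg = x<0 ? x : -x;           // always <= 0, cannot overflow
    do {
        *--beg = (uint8_t)('0' - neg%10);  // neg%10 is in [-9, 0]
    } while (neg /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    assert(end - beg <= 11);  // "-2147483648" is the longest result
    return beg;
}

// Hypothetical check: does x format exactly as the expected text?
bool formats_as(int32_t x, char const *want)
{
    uint8_t   buf[16];
    uint8_t  *end = buf + 16;
    uint8_t  *beg = format32(end, x);
    ptrdiff_t i   = 0;
    for (; beg+i < end && want[i]; i++) {
        if (beg[i] != (uint8_t)want[i]) {
            return false;
        }
    }
    return beg+i == end && !want[i];
}
```

<p>Negating in the other direction, toward positive values, would break on
exactly one input, which makes it an easy bug to miss in testing.</p>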

<p>Now we can, say, construct a randomly-generated temporary path:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span> <span class="n">path</span> <span class="o">=</span> <span class="p">{};</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">tempdir</span><span class="p">);</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">str</span><span class="p">{</span><span class="s">"/temp"</span><span class="p">});</span>
<span class="kt">int32_t</span> <span class="n">id</span> <span class="o">=</span> <span class="n">rand32</span><span class="p">(</span><span class="o">&amp;</span><span class="n">rng</span><span class="p">);</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
</code></pre></div></div>

<p>Keep adding more definitions like this and you’ll have something like, or
complementing, <a href="/blog/2023/02/13/">buffered output</a>. It doesn’t stop there. Code points
concatenated as UTF-8:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span> <span class="nf">concat</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">str</span> <span class="n">s</span><span class="p">,</span> <span class="kt">char32_t</span> <span class="n">rune</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">enum</span> <span class="p">{</span> <span class="n">REPLACEMENT_CHARACTER</span> <span class="o">=</span> <span class="mh">0xfffd</span> <span class="p">};</span>
    <span class="k">if</span> <span class="p">((</span><span class="n">rune</span><span class="o">&gt;=</span><span class="mh">0xd800</span> <span class="o">&amp;&amp;</span> <span class="n">rune</span><span class="o">&lt;=</span><span class="mh">0xdfff</span><span class="p">)</span> <span class="o">||</span> <span class="n">rune</span><span class="o">&gt;</span><span class="mh">0x10ffff</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">rune</span> <span class="o">=</span> <span class="n">REPLACEMENT_CHARACTER</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kt">uint8_t</span>  <span class="n">buf</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">end</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">rune</span> <span class="o">&lt;</span> <span class="mh">0x80</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">rune</span><span class="p">;</span>
        <span class="n">end</span> <span class="o">=</span> <span class="n">buf</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">rune</span> <span class="o">&lt;</span> <span class="mh">0x800</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span>  <span class="p">(</span><span class="n">rune</span> <span class="o">&gt;&gt;</span>  <span class="mi">6</span><span class="p">)</span>         <span class="o">|</span> <span class="mh">0xc0</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">rune</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">|</span> <span class="mh">0x80</span><span class="p">;</span>
        <span class="n">end</span> <span class="o">=</span> <span class="n">buf</span> <span class="o">+</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">rune</span> <span class="o">&lt;</span> <span class="mh">0x10000</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span>  <span class="p">(</span><span class="n">rune</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">)</span>         <span class="o">|</span> <span class="mh">0xe0</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">rune</span> <span class="o">&gt;&gt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">|</span> <span class="mh">0x80</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">rune</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">|</span> <span class="mh">0x80</span><span class="p">;</span>
        <span class="n">end</span> <span class="o">=</span> <span class="n">buf</span> <span class="o">+</span> <span class="mi">3</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span>  <span class="p">(</span><span class="n">rune</span> <span class="o">&gt;&gt;</span> <span class="mi">18</span><span class="p">)</span>         <span class="o">|</span> <span class="mh">0xf0</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">rune</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">|</span> <span class="mh">0x80</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">rune</span> <span class="o">&gt;&gt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">|</span> <span class="mh">0x80</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">rune</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">|</span> <span class="mh">0x80</span><span class="p">;</span>
        <span class="n">end</span> <span class="o">=</span> <span class="n">buf</span> <span class="o">+</span> <span class="mi">4</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="p">{</span><span class="n">buf</span><span class="p">,</span> <span class="n">end</span><span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That composes well for general UTF-8 handling. For example, to ingest
Win32 strings (arguments, paths, etc.):</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span> <span class="nf">convert</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">,</span> <span class="kt">char16_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">str</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{};</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">char32_t</span> <span class="n">rune</span> <span class="o">=</span> <span class="n">decode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="p">);</span>
        <span class="n">r</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">rune</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
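
<p>The <code class="language-plaintext highlighter-rouge">decode</code> function isn’t shown. A minimal sketch of what it might look
like, assuming null-terminated UTF-16 input and substituting U+FFFD for
unpaired surrogates (both are assumptions made for this example, not
something stated above):</p>

```cpp
#include <assert.h>

// Hypothetical decoder: reads one code point from a null-terminated
// UTF-16 string and advances *s past it. Unpaired surrogates decode to
// U+FFFD, the replacement character.
char32_t decode(char16_t **s)
{
    char16_t *p  = *s;
    char32_t  hi = *p++;
    assert(hi);  // the caller stops at the terminator
    if (hi >= 0xd800 && hi <= 0xdbff) {
        char32_t lo = *p;
        if (lo >= 0xdc00 && lo <= 0xdfff) {
            p++;  // consume the low half of the pair
            hi = 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
        } else {
            hi = 0xfffd;  // high surrogate without its partner
        }
    } else if (hi >= 0xdc00 && hi <= 0xdfff) {
        hi = 0xfffd;      // stray low surrogate
    }
    *s = p;
    return hi;
}
```

<p>A lone high surrogate consumes only itself, so the unit after it is decoded
normally on the next call, and the terminator is never skipped.</p>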

<h3 id="beyond-strings">Beyond strings</h3>

<p>One of my most useful C++ templates has been a span structure:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">struct</span> <span class="nc">span</span> <span class="p">{</span>
    <span class="n">T</span>        <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="n">span</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="n">span</span><span class="p">(</span><span class="n">T</span> <span class="o">*</span><span class="n">beg</span><span class="p">,</span> <span class="n">T</span> <span class="o">*</span><span class="n">end</span><span class="p">)</span> <span class="o">:</span> <span class="n">data</span><span class="p">{</span><span class="n">beg</span><span class="p">},</span> <span class="n">len</span><span class="p">{</span><span class="n">end</span><span class="o">-</span><span class="n">beg</span><span class="p">}</span> <span class="p">{}</span>

    <span class="n">span</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">span</span><span class="p">);</span>  <span class="c1">// for concat</span>

    <span class="n">T</span> <span class="o">&amp;</span><span class="k">operator</span><span class="p">[](</span><span class="kt">ptrdiff_t</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">span::span</code> definition looks exactly like <code class="language-plaintext highlighter-rouge">str::str</code>. In fact, we
could nearly define strings as <code class="language-plaintext highlighter-rouge">uint8_t</code> spans:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="n">span</span><span class="o">&lt;</span><span class="kt">uint8_t</span><span class="o">&gt;</span> <span class="n">str</span><span class="p">;</span>  <span class="c1">// hypothetical</span>
</code></pre></div></div>

<p>Though I’ve found strings to be just special enough not to be worth it.</p>

<p>This <code class="language-plaintext highlighter-rouge">span</code> definition is now fleshed out sufficiently to use <code class="language-plaintext highlighter-rouge">concat</code>
with no additional definitions! However, outside of strings, concatenating
spans is unusual. More often we want to append individual elements. Again,
we can build on that core <code class="language-plaintext highlighter-rouge">concat</code> template:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="n">span</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">concat</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">span</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">s</span><span class="p">,</span> <span class="n">T</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">concat</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">span</span><span class="p">{</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">v</span><span class="o">+</span><span class="mi">1</span><span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now <code class="language-plaintext highlighter-rouge">span</code> is ready for 99% of its use cases. For example:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">span</span><span class="o">&lt;</span><span class="kt">int32_t</span><span class="o">&gt;</span> <span class="n">squares</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;=</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">squares</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">squares</span><span class="p">,</span> <span class="n">i</span><span class="o">*</span><span class="n">i</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>It’s often good enough, but it’s not ideal as a general purpose dynamic
array. Each append makes a trip through arena allocation, and this span
cannot efficiently shrink and then grow again. Sometimes we’d like to
track capacity, covering both those cases.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">struct</span> <span class="nc">list</span> <span class="p">{</span>
    <span class="n">T</span>        <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">cap</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="n">list</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="n">list</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">list</span><span class="p">);</span>  <span class="c1">// for concat</span>

    <span class="n">T</span> <span class="o">&amp;</span><span class="k">operator</span><span class="p">[](</span><span class="kt">ptrdiff_t</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Unfortunately <code class="language-plaintext highlighter-rouge">cap</code> is a curve ball that the core template can’t handle,
requiring a slightly more complex definition. Since concatenating whole
<code class="language-plaintext highlighter-rouge">list</code> objects is unusual, here is a definition for appending single elements:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="n">list</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">concat</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">list</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">s</span><span class="p">,</span> <span class="n">T</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">len</span> <span class="o">==</span> <span class="n">s</span><span class="p">.</span><span class="n">cap</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="o">+</span><span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="o">!=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">s</span> <span class="o">=</span> <span class="n">list</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">{</span><span class="n">a</span><span class="p">,</span> <span class="n">s</span><span class="p">};</span>
        <span class="p">}</span>
        <span class="kt">ptrdiff_t</span> <span class="n">extend</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">cap</span> <span class="o">?</span> <span class="n">s</span><span class="p">.</span><span class="n">cap</span> <span class="o">:</span> <span class="mi">4</span><span class="p">;</span>
        <span class="n">makefront</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">(</span><span class="n">extend</span><span class="p">,</span> <span class="n">a</span><span class="p">);</span>
        <span class="n">s</span><span class="p">.</span><span class="n">cap</span> <span class="o">+=</span> <span class="n">extend</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">s</span><span class="p">[</span><span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note how inside the <code class="language-plaintext highlighter-rouge">if</code> it’s basically the same core definition. As
before, this definition extends in place if possible, but otherwise
handles it correctly anyway. In addition to the above concerns, this <code class="language-plaintext highlighter-rouge">list</code>
is more suited to having multiple “open” dynamic arrays at once.</p>
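
<p>To experiment with the capacity-tracking idea in isolation, here is a
rough, self-contained sketch. It is not the article’s <code class="language-plaintext highlighter-rouge">concat</code>: the names
<code class="language-plaintext highlighter-rouge">alloc_front</code> and <code class="language-plaintext highlighter-rouge">append</code> are hypothetical stand-ins, the arena is
assumed to allocate from the front, and <code class="language-plaintext highlighter-rouge">T</code> is assumed trivially
copyable so that <code class="language-plaintext highlighter-rouge">memcpy</code> is a valid way to relocate elements.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct arena {
    char *beg;
    char *end;
};

// Hypothetical stand-in for a front allocator: aligns the bump pointer,
// checks capacity before computing anything, then bumps.
template<typename T>
T *alloc_front(arena *a, ptrdiff_t count)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = (ptrdiff_t)(-(uintptr_t)a->beg & (alignof(T) - 1));
    assert(count >= 0);
    assert(count < (a->end - a->beg - pad)/size);  // out of memory
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    return r;
}

template<typename T>
struct list {
    T        *data = 0;
    ptrdiff_t len  = 0;
    ptrdiff_t cap  = 0;
};

// Append one element, doubling capacity when full. Extends in place when
// the list already sits at the front of the arena, otherwise relocates.
template<typename T>
list<T> append(arena *a, list<T> s, T v)
{
    if (s.len == s.cap) {
        if ((char *)(s.data + s.len) != a->beg) {
            // Not at the arena front: copy the elements there first.
            T *copy = alloc_front<T>(a, s.len);
            if (s.len) {
                memcpy(copy, s.data, s.len*sizeof(T));  // assumes trivial T
            }
            s.data = copy;
            s.cap  = s.len;
        }
        ptrdiff_t extend = s.cap ? s.cap : 4;
        alloc_front<T>(a, extend);  // contiguous with s.data, so result unused
        s.cap += extend;
    }
    s.data[s.len++] = v;
    return s;
}
```

<p>Appending to two lists in alternation forces the relocation path, since
each list repeatedly finds the other sitting at the arena front. That is a
quick way to exercise both branches.</p>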

<p>This concatenative concept has been a useful way to think about a variety
of situations in order to solve them effectively with arena allocation.</p>

<p><strong>Update</strong>: NRK <a href="https://lists.sr.ht/~skeeto/public-inbox/%3Cane2ee7fpnyn3qxslygprmjw2yrvzppxuim25jvf7e6f5jgxbd@p7y6own2j3it%3E#%3C2qzyqky3jtv6w64vicwnkrwa7nb52uohuu625bc3zrkaoor6ml@v57pb72uozpy%3E">sharply points out</a> that “extend in place” as
expressed in <code class="language-plaintext highlighter-rouge">concat</code> is incompatible with the <a href="https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html"><code class="language-plaintext highlighter-rouge">alloc_size</code> and <code class="language-plaintext highlighter-rouge">malloc</code>
GCC function attributes</a>, which I’ve suggested in the past. While
considering how to mitigate this, we’ve also discovered that <code class="language-plaintext highlighter-rouge">alloc_size</code>
has always been fundamentally broken in GCC. Correct use is impossible,
and so <em>it must not be used</em>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Guidelines for computing sizes and subscripts</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/05/24/"/>
    <id>urn:uuid:df6214e0-e408-4254-bd65-49d64e06a93e</id>
    <updated>2024-05-24T22:25:10Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p>Occasionally we need to compute the size of an object that does not yet
exist, or a subscript <a href="https://research.google/blog/extra-extra-read-all-about-it-nearly-all-binary-searches-and-mergesorts-are-broken/">that may fall out of bounds</a>. It’s easy to miss
the edge cases where results overflow, creating a nasty, subtle bug, <a href="https://blog.carlana.net/post/2024/golang-slices-concat/">even
in the presence of type safety</a>. Ideally such computations happen in
specialized code, such as <em>inside</em> an allocator (<code class="language-plaintext highlighter-rouge">calloc</code>, <code class="language-plaintext highlighter-rouge">reallocarray</code>)
and not <em>outside</em> by the allocatee (i.e. <code class="language-plaintext highlighter-rouge">malloc</code>). Mitigations exist with
different trade-offs: arbitrary precision, or using a wider fixed integer
— i.e. 128-bit integers on 64-bit hosts. In the typical case, working only
with fixed size-type integers, I’ve come up with a set of guidelines to
avoid overflows in the edge cases.</p>

<ol>
  <li>Range check <em>before</em> computing a result. No exceptions.</li>
  <li>Do not cast unless you know <em>a priori</em> the operand is in range.</li>
  <li>Never mix unsigned and signed operands. <a href="https://www.youtube.com/watch?v=wvtFGa6XJDU">Prefer signed.</a> If you
need to convert an operand, see (2).</li>
  <li>Do not add unless you know <em>a priori</em> the result is in range.</li>
  <li>Do not multiply unless you know <em>a priori</em> the result is in range.</li>
  <li>Do not subtract unless you know <em>a priori</em> both signed operands
are non-negative. For unsigned, that the second operand is not larger
than the first (treat it like (4)).</li>
  <li>Do not divide unless you know <em>a priori</em> the denominator is positive.</li>
  <li>Make it correct first. Make it fast later, if needed.</li>
</ol>

<p>These guidelines are also useful when <em>reviewing</em> code, tracking in your
mind whether the invariants are held at each step. If not, you’ve likely
found a bug. If in doubt, use assertions to document and check invariants.
I compiled this list during code review, so for me that’s where it’s most
useful.</p>

<h3 id="range-check-then-compute">Range check, then compute</h3>

<p>This is not strictly necessary when overflow is well-defined, i.e. wraparound, but
it’s like defensive driving. It’s simpler and clearer to check with basic
arithmetic rather than reason from a wraparound, i.e. a negative result.
Checked math functions are fine, too, if you check the overflow boolean
before accessing the result.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// bad
len++;
if (len &lt;= 0) error();

// good
if (len == MAX) error();
len++;
</code></pre></div></div>
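
<p>For the checked math route, a minimal sketch might look like the
following. It assumes GCC or Clang, whose <code class="language-plaintext highlighter-rouge">__builtin_add_overflow</code> is a
compiler extension rather than standard C++, and the helper name
<code class="language-plaintext highlighter-rouge">increment</code> is made up for illustration. The point is only that the
overflow boolean is consulted before the result is stored anywhere.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns false and leaves *len untouched if the increment would overflow.
// __builtin_add_overflow is a GCC/Clang extension (assumed available).
static bool increment(ptrdiff_t *len)
{
    ptrdiff_t r;
    if (__builtin_add_overflow(*len, (ptrdiff_t)1, &r)) {
        return false;  // overflow flag checked before r is ever used
    }
    *len = r;
    return true;
}
```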

<h3 id="casting">Casting</h3>

<p>Casting from signed to unsigned, it’s as simple as knowing the value is
non-negative, which is likely if you’re following (1). If a negative size
has appeared, there’s already been a bug earlier in the program, and the
only reasonable course of action is to abort, not handle it like an error.</p>

<h3 id="addition">Addition</h3>

<p>To check if addition will overflow, subtract one of the operands from the
maximum value.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (b &gt; MAX - a) error();
r = a + b;
</code></pre></div></div>

<p>In pointer arithmetic addition, it’s a common mistake to compute the
result pointer then compare it to the bounds. If the check failed, then
the pointer <em>already</em> overflowed, i.e. undefined behavior. Major pieces
of software, <a href="https://sourcegraph.com/search?q=context:global+%22%3E+outend%22+repo:%5Egithub%5C.com/bminor/glibc%24+&amp;patternType=keyword&amp;sm=0">like glibc</a>, are riddled with such pointer overflows.
(Now that you’re aware of it, you’ll start noticing it everywhere. Sorry.)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// bad: never do this
beg += size;
if (beg &gt; end) error();
</code></pre></div></div>

<p>To do this correctly, <strong>check integers not pointers</strong>. Like before,
subtract before adding.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>available = end - beg;
if (size &gt; available) error();
beg += size;
</code></pre></div></div>

<p>Mind mixing signed and unsigned operands for the comparison operator (3),
e.g. an unsigned size on the left and signed difference on the right.</p>
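
<p>As a sketch of that last point, suppose the size arrives as an unsigned
<code class="language-plaintext highlighter-rouge">size_t</code> while the pointer difference is signed. The function name
<code class="language-plaintext highlighter-rouge">advance</code> and its signature are hypothetical. The signed difference is
asserted non-negative first, which is what makes the cast legitimate
under (2), and only then are two unsigned values compared.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Advance *beg by size bytes if that stays within [*beg, end].
// Returns false, leaving *beg untouched, if it would not fit.
static bool advance(char **beg, char *end, size_t size)
{
    ptrdiff_t available = end - *beg;
    assert(available >= 0);           // invariant: beg <= end
    if (size > (size_t)available) {   // cast is safe: known non-negative
        return false;
    }
    *beg += size;                     // computed only after the check
    return true;
}
```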

<h3 id="multiplication-and-division">Multiplication and division</h3>

<p>If you’re working this out on your own, multiplication seems tricky until
you’ve internalized a simple pattern. Just as we subtracted before adding,
we need to divide before multiplying. Divide the maximum value by one of
the operands:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (a&gt;0 &amp;&amp; b&gt;MAX/a) error();
r = a * b;
</code></pre></div></div>

<p>It’s often permitted for one or both to be zero, so mind divide-by-zero,
which is handled above by the first condition. Sometimes size must be
positive, e.g. the result of the <code class="language-plaintext highlighter-rouge">sizeof</code> operator in C, in which case we
should prefer it as the denominator.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>assert(size  &gt;  0);
assert(count &gt;= 0);
if (count &gt; MAX/size) error();
total = count * size;
</code></pre></div></div>

<p>With <a href="/blog/2023/09/27/">arena allocation</a> there are usually two concerns. First, will
it overflow when computing the total size, i.e. <code class="language-plaintext highlighter-rouge">count * size</code>? Second, is
the total size within the arena capacity? Naively that’s two checks, but
we can kill two birds with one stone: Check both at once by using the
current arena capacity as the maximum value when considering overflow.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (count &gt; (end - beg)/size) error();
total = count * size;
</code></pre></div></div>

<p>One condition pulling double duty.</p>
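
<p>Wrapped up as a function, the combined check could be sketched like
this. The name <code class="language-plaintext highlighter-rouge">total_size</code> and the <code class="language-plaintext highlighter-rouge">-1</code> error convention are
assumptions made for the example, not part of any particular allocator.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns count*size, or -1 if the product would overflow or would not
// fit in the space remaining between beg and end.
static ptrdiff_t total_size(char *beg, char *end, ptrdiff_t count, ptrdiff_t size)
{
    assert(size  >  0);
    assert(count >= 0);
    assert(beg <= end);
    if (count > (end - beg)/size) {
        return -1;  // one comparison rejects both overflow and overcommit
    }
    return count * size;  // in range: at most end - beg
}
```

<p>Because the remaining capacity can never exceed the maximum size value,
any <code class="language-plaintext highlighter-rouge">count</code> that passes this check also cannot overflow the product.</p>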

<h3 id="subtraction">Subtraction</h3>

<p>With signed sizes, the negative range is a long “runway” allowing a single
unchecked subtraction before overflow might occur. In essence, we were
exploiting this in order to check addition. The most common mistake with
unsigned subtraction is not accounting for overflow when going below zero.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// note: signed "i" only
for (i = end - stride; i &gt;= beg; i -= stride) ...
</code></pre></div></div>

<p>This loop will go awry if <code class="language-plaintext highlighter-rouge">i</code> is unsigned and <code class="language-plaintext highlighter-rouge">beg &lt;= stride</code>.</p>
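
<p>If the index truly must be unsigned, one way to restructure the loop is
to test the remaining distance before subtracting, in the spirit of (1).
This is only a sketch: <code class="language-plaintext highlighter-rouge">count_steps</code> is a hypothetical helper that
counts the iterations instead of doing real work, and it assumes
<code class="language-plaintext highlighter-rouge">beg &lt;= end</code> on entry.</p>

```cpp
#include <cassert>
#include <cstddef>

// Visits end-stride, end-2*stride, ... for as long as the index stays at
// or above beg, using only unsigned arithmetic. Returns the visit count.
static size_t count_steps(size_t beg, size_t end, size_t stride)
{
    assert(stride > 0);
    assert(beg <= end);
    size_t n = 0;
    // i >= beg always holds here, so i - beg cannot wrap around.
    for (size_t i = end; i - beg >= stride;) {
        i -= stride;  // safe: the condition proved i - stride >= beg
        n++;          // "visit" index i
    }
    return n;
}
```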

<p>In special cases we can get away with a second subtraction without an
overflow check if we know some properties of our operands. For example, my
arena allocators look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>padding = -beg &amp; (align - 1);
if (count &gt;= (end - beg - padding)/size) error();
</code></pre></div></div>

<p>That’s two subtractions in a row. However, <code class="language-plaintext highlighter-rouge">end - beg</code> describes the size
of a realized object, and <code class="language-plaintext highlighter-rouge">align</code> is a small constant (e.g. 2^(0–6)). It
could only overflow if the entirety of memory was occupied by the arena.</p>

<p>Bonus, advanced note: This check is actually pulling <em>triple duty</em>. Notice
that I used <code class="language-plaintext highlighter-rouge">&gt;=</code> instead of <code class="language-plaintext highlighter-rouge">&gt;</code>. The arena can’t fill exactly to the brim,
but it handles the extreme edge case where <code class="language-plaintext highlighter-rouge">count</code> is zero, the arena is
nearly full, but the bump pointer is unaligned. The result of subtracting
<code class="language-plaintext highlighter-rouge">padding</code> is negative, which rounds to zero by integer division, and would
pass a <code class="language-plaintext highlighter-rouge">&gt;</code> check. That wouldn’t be a problem except that aligning the bump
pointer would break the invariant <code class="language-plaintext highlighter-rouge">beg &lt;= end</code>.</p>
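
<p>The edge case is easier to see with concrete numbers. The sketch below
models addresses as plain integers. The helper names are hypothetical,
and the values (bump pointer at 13, end at 14, alignment and size of 8)
are chosen purely to trigger the situation described.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed to round beg up to a multiple of align (a power of two).
static ptrdiff_t padding_for(ptrdiff_t beg, ptrdiff_t align)
{
    return (ptrdiff_t)(-(uintptr_t)beg & (uintptr_t)(align - 1));
}

// The check as written above, using ">=": true means "reject".
static bool rejects_ge(ptrdiff_t beg, ptrdiff_t end, ptrdiff_t align,
                       ptrdiff_t size, ptrdiff_t count)
{
    return count >= (end - beg - padding_for(beg, align))/size;
}

// The same check with ">" instead, for comparison.
static bool rejects_gt(ptrdiff_t beg, ptrdiff_t end, ptrdiff_t align,
                       ptrdiff_t size, ptrdiff_t count)
{
    return count > (end - beg - padding_for(beg, align))/size;
}
```

<p>With <code class="language-plaintext highlighter-rouge">beg = 13</code>, <code class="language-plaintext highlighter-rouge">end = 14</code>, and <code class="language-plaintext highlighter-rouge">align = size = 8</code>, the padding
is 3 and the available space is -2, which divides to zero. A zero-count
request slips past the <code class="language-plaintext highlighter-rouge">&gt;</code> form, and aligning the bump pointer would
then move it to 16, past <code class="language-plaintext highlighter-rouge">end</code>. The <code class="language-plaintext highlighter-rouge">&gt;=</code> form rejects it.</p>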

<h3 id="try-it-for-yourself">Try it for yourself</h3>

<p>Next time you’re reviewing code that computes sizes or subscripts, bring
the list up and see how well it follows the guidelines. If it misses one,
try to contrive an input that causes an overflow. If it follows guidelines
and you can still contrive such an input, then perhaps the list could use
another item!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Speculations on arenas and custom strings in C++</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/04/14/"/>
    <id>urn:uuid:6b07a406-b303-4c2b-8afd-3e589b26eaa1</id>
    <updated>2024-04-14T00:39:18Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p><em>Update September 2025: This article <a href="/blog/2025/09/30/">has a followup</a> with
corrections.</em></p>

<p>My techniques with <a href="/blog/2023/09/27/">arena allocation</a> and <a href="/blog/2023/10/08/">strings</a> are
oriented around C. I’m always looking for a better way, and lately I’ve
been experimenting with building them using C++ features. What are the
trade-offs? Are the benefits worth the costs? In this article I lay out my
goals, review implementation possibilities, and discuss my findings.
Following along will require familiarity with those previous two articles.</p>

<!--more-->

<p>Some of C++ is beyond my mental capabilities, and so I cannot wield those
parts effectively. Other parts I <em>can</em> wrap my head around, but it
requires substantial effort and the inevitable mistakes are difficult to
debug. So a general goal is to minimize contact with that complexity, only
touching a few higher-value features that I can use confidently.</p>

<p>Existing practice is unimportant. I’ve seen where that goes. <a href="/blog/2023/02/11/">Like the C
standard library</a>, the C++ standard library offers me little. Its
concepts regarding ownership and memory management are irreconcilable
(move semantics, smart pointers, etc.), so I have to build from scratch
anyway. So absolutely no including C++ headers. The most valuable features
are built right into the language, so I won’t need to include library
definitions.</p>

<p>No <a href="https://www.youtube.com/watch?v=uHSLHvWFkto&amp;t=4386s"><code class="language-plaintext highlighter-rouge">public</code> or <code class="language-plaintext highlighter-rouge">private</code></a>. Still no <code class="language-plaintext highlighter-rouge">const</code> beyond what is required
to access certain features. This means I can toss out a bunch of keywords
like <code class="language-plaintext highlighter-rouge">class</code>, <code class="language-plaintext highlighter-rouge">friend</code>, etc. It eliminates noisy, repetitive code and
interfaces — getters, setters, separate <code class="language-plaintext highlighter-rouge">const</code> and non-<code class="language-plaintext highlighter-rouge">const</code> — which in
my experience means fewer defects.</p>

<p>No references beyond mandatory cases. References hide that an address
is being taken (or merely imply it, when it’s actually an expensive copy),
which is an annoying experience when reading unfamiliar C++. After all, for
arenas the explicit address-taking (permanent) or copying (scratch) is a
critical part of communicating the interfaces.</p>

<p>In theory <code class="language-plaintext highlighter-rouge">constexpr</code> could be useful, but it keeps falling short when I
try it out, so I’m ignoring it. I’ll elaborate in a moment.</p>

<p>Minimal template use. They blow up compile times and code size, they’re
noisy, and in practice they make debug builds (i.e. <code class="language-plaintext highlighter-rouge">-O0</code>) much slower
(typically ~10x) because there’s no optimization to clean up the mess.
I’ll only use them for a few foundational purposes, such as allocation.
(Though this article <em>is</em> about the fundamental stuff.)</p>

<p>No methods aside from limited use of operator overloads. I want to keep a
C style, plus methods just look ugly without references: <code class="language-plaintext highlighter-rouge">obj-&gt;func()</code> vs.
<code class="language-plaintext highlighter-rouge">func(obj)</code>. (Why are we still writing <code class="language-plaintext highlighter-rouge">-&gt;</code> in the 21st century?) Function
overloading can instead differentiate “methods.” Overloads are acceptable
in moderation, especially because I’m paying for it (symbol decoration)
whether or not I take advantage.</p>

<p>Finally, no exceptions of course. I assume <code class="language-plaintext highlighter-rouge">-fno-exceptions</code>, or the local
equivalent, is active.</p>

<h3 id="allocation">Allocation</h3>

<p>Let’s start with allocation. Since writing that previous article, I’ve
streamlined arena allocation in C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">byte</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="n">byte</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">arena</span><span class="p">;</span>

<span class="k">static</span> <span class="n">byte</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">size</span> <span class="n">objsize</span><span class="p">,</span> <span class="n">size</span> <span class="n">align</span><span class="p">,</span> <span class="n">size</span> <span class="n">count</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">count</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">size</span> <span class="n">pad</span> <span class="o">=</span> <span class="p">(</span><span class="n">uptr</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">count</span> <span class="o">&lt;</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">-</span> <span class="n">pad</span><span class="p">)</span><span class="o">/</span><span class="n">objsize</span><span class="p">);</span>  <span class="c1">// oom</span>
    <span class="k">return</span> <span class="n">memset</span><span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-=</span> <span class="n">objsize</span><span class="o">*</span><span class="n">count</span> <span class="o">+</span> <span class="n">pad</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">objsize</span><span class="o">*</span><span class="n">count</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(As needed, replace the second <code class="language-plaintext highlighter-rouge">assert</code> with whatever out of memory policy
is appropriate.) Then allocating, say, a <a href="/blog/2023/06/26/">10k-element hash table</a>
(i.e. to keep it <a href="/blog/2024/02/05/">off the stack</a>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">i16</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="n">i16</span><span class="p">,</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">);</span>
</code></pre></div></div>

<p>With C++, I initially tried <a href="https://en.cppreference.com/w/cpp/language/new#Placement_new">placement new</a> with the arena as the
“place” for the allocation:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="k">operator</span> <span class="nf">new</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// avoid this</span>
</code></pre></div></div>

<p>Then to create a single object:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">object</span> <span class="o">*</span><span class="n">o</span> <span class="o">=</span> <span class="k">new</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">)</span> <span class="n">object</span><span class="p">{};</span>
</code></pre></div></div>

<p>This exposes the constructor, but everything else about it is poor. It
relies on complex, finicky rules governing <code class="language-plaintext highlighter-rouge">new</code> overloads, especially for
alignment handling. It’s difficult to tell what’s happening, and it’s too
easy to make mistakes that compile. That doesn’t even count the mess that
is array <code class="language-plaintext highlighter-rouge">new[]</code>.</p>

<p>I soon learned it’s better to replace the <code class="language-plaintext highlighter-rouge">new</code> macro with a template,
which can actually see what it’s doing. I can’t call it <code class="language-plaintext highlighter-rouge">new</code> in C++, so I
settled on <code class="language-plaintext highlighter-rouge">make</code> instead:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">static</span> <span class="n">T</span> <span class="o">*</span><span class="nf">make</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">size</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">count</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">size</span> <span class="n">objsize</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">);</span>
    <span class="n">size</span> <span class="n">align</span>   <span class="o">=</span> <span class="k">alignof</span><span class="p">(</span><span class="n">T</span><span class="p">);</span>
    <span class="n">size</span> <span class="n">pad</span>     <span class="o">=</span> <span class="p">(</span><span class="n">uptr</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">count</span> <span class="o">&lt;</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">-</span> <span class="n">pad</span><span class="p">)</span><span class="o">/</span><span class="n">objsize</span><span class="p">);</span>  <span class="c1">// oom</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-=</span> <span class="n">objsize</span><span class="o">*</span><span class="n">count</span> <span class="o">+</span> <span class="n">pad</span><span class="p">;</span>
    <span class="n">T</span> <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="n">T</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">size</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">new</span> <span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">r</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="n">T</span><span class="p">{};</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then allocating that hash table becomes:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">i16</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="n">make</span><span class="o">&lt;</span><span class="n">i16</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="mi">10000</span><span class="p">);</span>
</code></pre></div></div>

<p>Or a single object, relying on the default argument:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">object</span> <span class="o">*</span><span class="n">o</span> <span class="o">=</span> <span class="n">make</span><span class="o">&lt;</span><span class="n">object</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
</code></pre></div></div>

<p>Thanks to placement new, used here merely to invoke the constructor, these
objects aren’t just zero-initialized, but value-initialized. The template can
only construct objects that accept an empty initializer, but in exchange
that unlocks some interesting possibilities:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">mat3</span> <span class="p">{</span>
    <span class="n">f32</span> <span class="n">data</span><span class="p">[</span><span class="mi">9</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
    <span class="p">};</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="nc">list</span> <span class="p">{</span>
    <span class="n">node</span>  <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">node</span> <span class="o">**</span><span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">head</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>When a zero-initialized state isn’t ideal, objects can still initialize to
a more useful state straight out of the arena. The second case is even
self-referencing, which is specifically supported through placement new.
Otherwise you’d need a specially written copy or move constructor.</p>
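<p>To make the self-referencing case concrete, here is a minimal sketch that
allocates a <code class="language-plaintext highlighter-rouge">list</code> straight out of an arena and appends to it through
<code class="language-plaintext highlighter-rouge">tail</code>. The two-pointer <code class="language-plaintext highlighter-rouge">arena</code> layout, the <code class="language-plaintext highlighter-rouge">node</code> fields, and the
<code class="language-plaintext highlighter-rouge">push</code> and <code class="language-plaintext highlighter-rouge">sumlist</code> helpers are assumptions made up for illustration, and
it pulls in <code class="language-plaintext highlighter-rouge">&lt;new&gt;</code> for brevity:</p>

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>  // placement new

typedef ptrdiff_t size;
typedef uintptr_t uptr;

// Assumed arena layout: allocations are carved off the end.
struct arena {
    char *beg;
    char *end;
};

template<typename T>
static T *make(arena *a, size count = 1)
{
    assert(count >= 0);
    size objsize = sizeof(T);
    size align   = alignof(T);
    size pad     = (uptr)a->end & (align - 1);
    assert(count < (a->end - a->beg - pad)/objsize);  // oom
    a->end -= objsize*count + pad;
    T *r = (T *)a->end;
    for (size i = 0; i < count; i++) {
        new ((void *)&r[i]) T{};
    }
    return r;
}

// Hypothetical node type for the example.
struct node {
    node *next  = 0;
    int   value = 0;
};

struct list {
    node  *head = 0;
    node **tail = &head;  // points into the object being constructed
};

// Hypothetical helper: append a value in constant time via tail.
static void push(list *l, arena *a, int value)
{
    node *n  = make<node>(a);
    n->value = value;
    *l->tail = n;
    l->tail  = &n->next;
}

// Hypothetical demo: build the list 1..n in a scratch arena and sum it.
static int sumlist(size n)
{
    static char mem[1<<12];
    arena scratch = {mem, mem + sizeof(mem)};

    list *l = make<list>(&scratch);
    assert(l->tail == &l->head);  // already usable, no init call needed

    for (size i = 1; i <= n; i++) {
        push(l, &scratch, (int)i);
    }

    int sum = 0;
    for (node *p = l->head; p; p = p->next) {
        sum += p->value;
    }
    return sum;
}
```

<p>Because <code class="language-plaintext highlighter-rouge">tail</code> is set up during construction in place, <code class="language-plaintext highlighter-rouge">push</code> needs no
special case for the empty list.</p>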

<p><code class="language-plaintext highlighter-rouge">make</code> could accept constructor arguments and perfect forward them to a
constructor. However, that’s too far into the dark arts for my comfort,
plus it requires a correct definition of <code class="language-plaintext highlighter-rouge">std::forward</code>. In practice that
means <code class="language-plaintext highlighter-rouge">#include</code>-ing it, and whatever comes in with it. Or ask an expert
capable of writing such a definition from scratch, though both are
probably too busy.</p>

<p><strong>Update 1</strong>: One of those experts, Jonathan Müller, kindly reached out to
say that <a href="https://www.foonathan.net/2020/09/move-forward/">a static cast is sufficient</a>. This is easy to do:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">,</span> <span class="k">typename</span> <span class="o">...</span><span class="nc">A</span><span class="p">&gt;</span>
<span class="k">static</span> <span class="n">T</span> <span class="o">*</span><span class="nf">make</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">size</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">A</span> <span class="o">&amp;&amp;</span><span class="p">...</span><span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
        <span class="k">new</span> <span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">r</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="n">T</span><span class="p">{(</span><span class="n">A</span> <span class="o">&amp;&amp;</span><span class="p">)</span><span class="n">args</span><span class="p">...};</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Update 2</strong>: I later realized that because I do not care about copy or
move semantics, I also don’t care about perfect forwarding. I can simply
expand the parameter pack without casting or <code class="language-plaintext highlighter-rouge">&amp;&amp;</code>. I also don’t want the
extra restrictions on braced initializer conversions, so better to use
parentheses with <code class="language-plaintext highlighter-rouge">new</code>.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">,</span> <span class="k">typename</span> <span class="o">...</span><span class="nc">A</span><span class="p">&gt;</span>
<span class="k">static</span> <span class="n">T</span> <span class="o">*</span><span class="nf">make</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">size</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">A</span> <span class="p">...</span><span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
        <span class="k">new</span> <span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">r</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="n">T</span><span class="p">(</span><span class="n">args</span><span class="p">...);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>One small gotcha: placement new doesn’t work out of the box, and you need
to provide a definition. That means including <code class="language-plaintext highlighter-rouge">&lt;new&gt;</code> or writing one out.
Fortunately it’s trivial, but the prototype must exactly match, including
<code class="language-plaintext highlighter-rouge">size_t</code>:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="k">operator</span> <span class="nf">new</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">p</span><span class="p">;</span> <span class="p">}</span>
</code></pre></div></div>

<p>Overall I feel the template is a small improvement over the macro.</p>

<h3 id="strings">Strings</h3>

<p>Recall my basic C string type, with a macro to wrap literals:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define countof(a)  (size)(sizeof(a) / sizeof(*(a)))
#define s8(s)       (s8){(u8 *)s, countof(s)-1}
</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">u8</span>  <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="n">size</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">s8</span><span class="p">;</span>
</code></pre></div></div>

<p>Since it doesn’t own the underlying buffer — region-based allocation has
already solved the ownership problem — this is what C++ long-windedly
calls a <code class="language-plaintext highlighter-rouge">std::string_view</code>. In C++ we won’t need the <code class="language-plaintext highlighter-rouge">countof</code> macro for
strings, but it’s still generally useful. Converting it to a template is
<em>theoretically</em> more robust (it rejects pointers), but comes with <a href="https://vittorioromeo.info/index/blog/debug_performance_cpp.html">a
non-zero cost</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">template</span><span class="o">&lt;</span><span class="kr">typename</span> <span class="n">T</span><span class="p">,</span> <span class="n">size</span> <span class="n">N</span><span class="o">&gt;</span>
<span class="n">size</span> <span class="nf">countof</span><span class="p">(</span><span class="n">T</span> <span class="p">(</span><span class="o">&amp;</span><span class="p">)[</span><span class="n">N</span><span class="p">])</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">N</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The reference (here a reference to an array) is unavoidable, so this is one
of the rare cases that calls for one. The same concept applies as an <code class="language-plaintext highlighter-rouge">s8</code> constructor to
replace the macro:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">s8</span> <span class="p">{</span>
    <span class="n">u8</span>  <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">size</span> <span class="n">len</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="n">s8</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="n">template</span><span class="o">&lt;</span><span class="n">size</span> <span class="n">N</span><span class="o">&gt;</span>
    <span class="n">s8</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="p">)[</span><span class="n">N</span><span class="p">])</span> <span class="o">:</span> <span class="n">data</span><span class="p">{(</span><span class="n">u8</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">},</span> <span class="n">len</span><span class="p">{</span><span class="n">N</span><span class="o">-</span><span class="mi">1</span><span class="p">}</span> <span class="p">{}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>I’ve explicitly asked to keep a default zero-initialized (empty) string
since it’s useful — and necessary to directly allocate strings using
<code class="language-plaintext highlighter-rouge">make</code>, e.g. an array of strings. <code class="language-plaintext highlighter-rouge">const</code> is required because string
literals are <code class="language-plaintext highlighter-rouge">const</code> in C++, but it’s immediately stripped off for the
sake of simplicity. The new constructor allows:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">s8</span> <span class="n">version</span> <span class="o">=</span> <span class="s">"1.2.3"</span><span class="p">;</span>
</code></pre></div></div>

<p>Or even <a href="/blog/2023/02/13/">more usefully</a>:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">void</span> <span class="nf">print</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>
    <span class="c1">// ...</span>
    <span class="n">print</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="s">"hello world</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
</code></pre></div></div>

<p>Define <code class="language-plaintext highlighter-rouge">operator==</code> and it’s more useful yet:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">b32</span> <span class="k">operator</span><span class="o">==</span><span class="p">(</span><span class="n">s8</span> <span class="n">s</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">return</span> <span class="n">len</span><span class="o">==</span><span class="n">s</span><span class="p">.</span><span class="n">len</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="o">!</span><span class="n">len</span> <span class="o">||</span> <span class="o">!</span><span class="n">memcmp</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">len</span><span class="p">));</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Now this works, and it’s cheap and fast even in debug builds:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">s8</span> <span class="n">key</span> <span class="o">=</span> <span class="p">...;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">key</span> <span class="o">==</span> <span class="s">"HOME"</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>
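<p>Putting the pieces so far together, a self-contained sketch of the
comparison might look like the following. The typedefs are my guesses at
the usual definitions, and <code class="language-plaintext highlighter-rouge">keyindex</code> is a hypothetical lookup invented
for the example:</p>

```cpp
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t     size;
typedef unsigned char u8;
typedef int           b32;

struct s8 {
    u8  *data = 0;
    size len  = 0;

    s8() = default;

    // Wraps a string literal, excluding its null terminator.
    template<size N>
    s8(const char (&s)[N]) : data{(u8 *)s}, len{N-1} {}

    b32 operator==(s8 s)
    {
        return len==s.len && (!len || !memcmp(data, s.data, len));
    }
};

// Hypothetical lookup: each literal converts implicitly to s8.
static int keyindex(s8 key)
{
    if (key == "HOME") return 0;
    if (key == "PATH") return 1;
    return -1;
}
```

<p>The literal on the right-hand side goes through the array-reference
constructor, so its length is known without a <code class="language-plaintext highlighter-rouge">strlen</code> call.</p>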

<p>That’s more ergonomic than the macro and comparison function. <code class="language-plaintext highlighter-rouge">operator[]</code>
also improves ergonomics, to subscript a string without going through the
<code class="language-plaintext highlighter-rouge">data</code> member:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">u8</span> <span class="o">&amp;</span><span class="k">operator</span><span class="p">[](</span><span class="n">size</span> <span class="n">i</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The reference is again necessary to make subscripts assignable. Since
<code class="language-plaintext highlighter-rouge">s8span</code> — make a string spanning two pointers — so often appears in my
programs, a constructor seems appropriate, too:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">s8</span><span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="n">beg</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">end</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">beg</span> <span class="o">&lt;=</span> <span class="n">end</span><span class="p">);</span>
        <span class="n">data</span> <span class="o">=</span> <span class="n">beg</span><span class="p">;</span>
        <span class="n">len</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">beg</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>By the way, these assertions I’ve been using are great for catching
mistakes quickly and early, and they complement <a href="/blog/2019/01/25/">fuzz testing</a>.</p>

<p>I’m not sold on it, but an idea for the future: C++23’s multi-index
<code class="language-plaintext highlighter-rouge">operator[]</code> as a slice operator:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">s8</span> <span class="k">operator</span><span class="p">[](</span><span class="n">size</span> <span class="n">beg</span><span class="p">,</span> <span class="n">size</span> <span class="n">end</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">beg</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">beg</span> <span class="o">&lt;=</span> <span class="n">end</span><span class="p">);</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">end</span> <span class="o">&lt;=</span> <span class="n">len</span><span class="p">);</span>
        <span class="k">return</span> <span class="p">{</span><span class="n">data</span><span class="o">+</span><span class="n">beg</span><span class="p">,</span> <span class="n">data</span><span class="o">+</span><span class="n">end</span><span class="p">};</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Then:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">s8</span> <span class="n">msg</span> <span class="o">=</span> <span class="s">"foo bar baz"</span><span class="p">;</span>
    <span class="n">msg</span> <span class="o">=</span> <span class="n">msg</span><span class="p">[</span><span class="mi">4</span><span class="p">,</span><span class="mi">7</span><span class="p">];</span>  <span class="c1">// msg = "bar"</span>
</code></pre></div></div>

<p>I could keep going with, say, iterators and such, but each will be more
specialized and less useful. (I don’t care about range-based <code class="language-plaintext highlighter-rouge">for</code> loops.)</p>

<h3 id="downside-static-initialization">Downside: static initialization</h3>

<p>The new string stuff is neat, but I hit a wall trying it out: These fancy
constructors do not reliably construct at compile time, <em>not even with a
<code class="language-plaintext highlighter-rouge">constexpr</code> qualifier</em> in two of the three major C++ implementations. A
static lookup table that contains a string is likely constructed at run
time in at least some builds. For example, this table:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">s8</span> <span class="n">keys</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="s">"foo"</span><span class="p">,</span> <span class="s">"bar"</span><span class="p">,</span> <span class="s">"baz"</span><span class="p">};</span>
</code></pre></div></div>

<p>Requires run-time construction in the real-world cases I care about, dragging
in C++ magic and linked runtime gunk. The constructor is therefore a strict
downgrade from the macro, which works perfectly in these lookup tables.
Once a non-default constructor is defined, I’ve been unable to find an
escape hatch back to the original, dumb, reliable behavior.</p>

<p><strong>Update</strong>: Jonathan Müller points out the reinterpret cast is forbidden
in a <code class="language-plaintext highlighter-rouge">constexpr</code> function, so it’s not required to happen at compile time.
After some thought, I’ve figured out a workaround using a union:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">s8</span> <span class="p">{</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="n">u8</span>         <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">cdata</span><span class="p">;</span>
    <span class="p">};</span>
    <span class="n">size</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="k">template</span><span class="o">&lt;</span><span class="n">size</span> <span class="n">N</span><span class="p">&gt;</span>
    <span class="k">constexpr</span> <span class="n">s8</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="p">)[</span><span class="n">N</span><span class="p">])</span> <span class="o">:</span> <span class="n">cdata</span><span class="p">{</span><span class="n">s</span><span class="p">},</span> <span class="n">len</span><span class="p">{</span><span class="n">N</span><span class="o">-</span><span class="mi">1</span><span class="p">}</span> <span class="p">{}</span>

    <span class="c1">// ...</span>
<span class="p">};</span>
</code></pre></div></div>

<p>In all three C++ implementations, in all configurations, this reliably
constructs strings at compile time. The other semantics are unchanged.</p>
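<p>One way to spot-check this is to declare the table <code class="language-plaintext highlighter-rouge">constexpr</code> and lean on
<code class="language-plaintext highlighter-rouge">static_assert</code>. This is only a sketch: the typedefs and the <code class="language-plaintext highlighter-rouge">keylen</code>
accessor are assumptions for the example, and it proves only that the
constructor is usable in a constant expression, not what a particular
compiler does with a plain <code class="language-plaintext highlighter-rouge">static</code> table. It reads back only <code class="language-plaintext highlighter-rouge">len</code> and
<code class="language-plaintext highlighter-rouge">cdata</code>, the union member the constructor actually initialized:</p>

```cpp
#include <stddef.h>

typedef ptrdiff_t     size;
typedef unsigned char u8;

struct s8 {
    union {
        u8         *data = 0;
        const char *cdata;
    };
    size len = 0;

    s8() = default;

    // No cast needed: the literal is stored through the const char member.
    template<size N>
    constexpr s8(const char (&s)[N]) : cdata{s}, len{N-1} {}
};

// A constexpr table must be constant-initialized or fail to compile.
static constexpr s8 keys[] = {"foo", "bar", "baz"};

static_assert(sizeof(keys)/sizeof(*keys) == 3, "three entries");
static_assert(keys[1].len == 3, "length excludes the terminator");
static_assert(keys[2].cdata[2] == 'z', "contents visible at compile time");

// Hypothetical accessor so the table is also exercised at run time.
static size keylen(int i)
{
    return keys[i].len;
}
```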

<h3 id="other-features">Other features</h3>

<p>Having a generic dynamic array would be handy, and more ergonomic than <a href="/blog/2023/10/05/">my
dynamic array macro</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">template</span><span class="o">&lt;</span><span class="kr">typename</span> <span class="n">T</span><span class="o">&gt;</span>
<span class="k">struct</span> <span class="n">slice</span> <span class="p">{</span>
    <span class="n">T</span>   <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">size</span> <span class="n">len</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">size</span> <span class="n">cap</span>  <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="n">slice</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>

    <span class="n">template</span><span class="o">&lt;</span><span class="n">size</span> <span class="n">N</span><span class="o">&gt;</span>
    <span class="n">slice</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">(</span><span class="n">T</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">a</span><span class="p">)[</span><span class="n">N</span><span class="p">])</span> <span class="o">:</span> <span class="n">data</span><span class="p">{</span><span class="n">a</span><span class="p">},</span> <span class="n">len</span><span class="p">{</span><span class="n">N</span><span class="p">},</span> <span class="n">cap</span><span class="p">{</span><span class="n">N</span><span class="p">}</span> <span class="p">{}</span>

    <span class="n">T</span> <span class="o">&amp;</span><span class="n">operator</span><span class="p">[](</span><span class="n">size</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span>
<span class="p">};</span>

<span class="n">template</span><span class="o">&lt;</span><span class="kr">typename</span> <span class="n">T</span><span class="o">&gt;</span>
<span class="n">slice</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">append</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="p">,</span> <span class="n">slice</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">T</span><span class="p">);</span>
</code></pre></div></div>

<p>On the other hand, <a href="/blog/2023/09/30/">hash maps are mostly solved</a>, so I wouldn’t
bother with a generic map.</p>

<p>Function overloads would simplify naming. For example, this in C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prints8</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>
<span class="n">printi32</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span><span class="p">);</span>
<span class="n">printf64</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">f64</span><span class="p">);</span>
<span class="n">printvec3</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">vec3</span><span class="p">);</span>
</code></pre></div></div>

<p>Would hide that stuff behind the scenes in the symbol decoration:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">print</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">i32</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">f64</span><span class="p">);</span>
<span class="n">print</span><span class="p">(</span><span class="n">bufout</span> <span class="o">*</span><span class="p">,</span> <span class="n">vec3</span><span class="p">);</span>
</code></pre></div></div>

<p>Same goes for a <code class="language-plaintext highlighter-rouge">hash()</code> function on different types.</p>

<p>C++ has better null pointer semantics than C. Addition or subtraction of
zero with a null pointer produces a null pointer, and subtracting null
pointers results in zero. This eliminates some boneheaded special case
checks required in C, though not all: <code class="language-plaintext highlighter-rouge">memcpy</code>, for instance, arbitrarily
still does not accept null pointers even in C++.</p>
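<p>As a small illustration of what that buys, here is a sketch that walks a
possibly-empty buffer without a null check. The <code class="language-plaintext highlighter-rouge">bytes</code> struct and
<code class="language-plaintext highlighter-rouge">countbyte</code> are hypothetical names made up for the example:</p>

```cpp
#include <stddef.h>

// Hypothetical pointer-and-length pair; the empty value is {null, 0}.
struct bytes {
    char     *data = 0;
    ptrdiff_t len  = 0;
};

// Count occurrences of c. In C++ this needs no special case for the
// empty value: null + 0 is null, so the loop body never runs.
static ptrdiff_t countbyte(bytes b, char c)
{
    ptrdiff_t n   = 0;
    char     *end = b.data + b.len;
    for (char *p = b.data; p != end; p++) {
        n += *p == c;
    }
    return n;
}

// Length between two pointers: null minus null is zero in C++.
static ptrdiff_t spanlen(char *beg, char *end)
{
    return end - beg;
}
```

<p>The same code in C would need an explicit check before forming
<code class="language-plaintext highlighter-rouge">b.data + b.len</code> on the empty value.</p>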

<h3 id="ultimately-worth-it">Ultimately worth it?</h3>

<p>The static data problem is a real bummer, but perhaps it’s worth it for
the other features. I still need to put it all to the test in a real,
sizable project.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Protecting paths in macro expansions by extending UTF-8</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/03/05/"/>
    <id>urn:uuid:92065961-7687-4618-bd78-e4442041f2e4</id>
    <updated>2024-03-05T03:15:12Z</updated>
    <category term="c"/><category term="trick"/>
    <content type="html">
      <![CDATA[<p>After a year I’ve finally come up with an elegant solution to a vexing
<a href="/blog/2023/01/18/">u-config</a> problem. The pkg-config format uses macros to generate build
flags through recursive expansion. Some flags embed file system paths, but
to the macro system it’s all strings. The output is also ultimately just
one big string, which the receiving shell <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_05">splits into fields</a>. If
a path contains spaces, or shell metacharacters, u-config must escape them
so that shells treat them as part of a token. But how can u-config itself
distinguish incidental spaces in paths from deliberate spaces between
flags? What about other shell metacharacters in paths? My solution is to
extend UTF-8 to encode metadata that survives macro expansion.</p>

<p>As usual, it helps to begin with a concrete example of the problem. The
following is a conventional <code class="language-plaintext highlighter-rouge">.pc</code> file much like you’d find on your own
system:</p>

<pre><code class="language-pc">prefix=/usr
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: Example
Version: 1.0
Description: An example .pc file
Cflags: -I${includedir}
Libs: -L${libdir} -lexample
</code></pre>

<p>It begins by defining the library’s installation prefix from which it
derives additional paths, which are finally used in the package fields
that generate build flags (<code class="language-plaintext highlighter-rouge">Cflags</code>, <code class="language-plaintext highlighter-rouge">Libs</code>). If I run u-config against
this configuration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pkg-config --cflags --libs example
-I/usr/include -L/usr/lib -lexample
</code></pre></div></div>

<p>Typically <code class="language-plaintext highlighter-rouge">prefix</code> is populated by the library’s build system, which knows
where the library is to be installed. In some situations that’s not
possible, and there is no opportunity to set <code class="language-plaintext highlighter-rouge">prefix</code> to a meaningful
path. In that case, pkg-config can automatically override it
(<code class="language-plaintext highlighter-rouge">--define-prefix</code>) with a path relative to the <code class="language-plaintext highlighter-rouge">.pc</code> file, making the
installation relocatable. This works quite well <a href="/blog/2020/09/25/">on Windows, where it’s
the default</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pkg-config --cflags --libs example
-IC:/Users/me/example/include -LC:/Users/me/example/lib -lexample
</code></pre></div></div>

<p>This just works… <em>so long as the path does not contain spaces</em>. If it does, the
output risks splitting into separate fields. The <code class="language-plaintext highlighter-rouge">.pc</code> format supports quoting to
control how such output is escaped. Regions between quotes are escaped in
the output so that they retain their spaces when field split in the shell.
If a <code class="language-plaintext highlighter-rouge">.pc</code> file author is careful, they’d write it with quotes:</p>

<pre><code class="language-pc">Cflags: -I"${includedir}"
Libs: -L"${libdir}" -lexample
</code></pre>

<p>The paths are carefully placed within <a href="/blog/2021/12/04/">quoted regions</a> so that they
come out properly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pkg-config --cflags example
-IC:/Program\ Files/example/include
</code></pre></div></div>

<p><em>Almost nobody writes their <code class="language-plaintext highlighter-rouge">.pc</code> files this way</em>! The convention is not
to quote. My original solution was to implicitly wrap <code class="language-plaintext highlighter-rouge">prefix</code> in quotes
on assignment, which fixes the vast majority of <code class="language-plaintext highlighter-rouge">.pc</code> files. That
effectively looks like this in the “virtual” <code class="language-plaintext highlighter-rouge">.pc</code> file:</p>

<pre><code class="language-pc">prefix="C:/Program Files/example"
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
</code></pre>

<p>So the important region is quoted, its spaces preserved. However, the
occasional library author actively supporting Windows inevitably runs into
this problem, and their system’s pkg-config implementation does not quote
<code class="language-plaintext highlighter-rouge">prefix</code>. They soon figure out explicit quoting and apply it, which then
undermines u-config’s implicit quoting. The quotes essentially cancel out:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"$includedir" -&gt; ""C:/Program Files/example"/include"
</code></pre></div></div>

<p>The quoted regions are inverted and nothing happens. Though this is a
small minority, the libraries that do this and the ones you’re likely to
use on Windows are correlated. I was stumped: How to support quoted and
unquoted <code class="language-plaintext highlighter-rouge">.pc</code> files simultaneously?</p>
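<p>To see the cancellation concretely, here is a minimal sketch of a quoting
pass. It assumes a deliberately simplified model, not u-config’s actual
parser: a double quote toggles the quoted state, quotes are dropped, and a
space is escaped only inside a quoted region. The name <code class="language-plaintext highlighter-rouge">quote_expand</code> is
hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Simplified, hypothetical model of .pc quoting: a double quote toggles
// the "quoted" state, quotes are dropped, and a space is escaped with a
// backslash only while inside a quoted region. Returns the output length.
size_t quote_expand(const char *in, char *out, size_t cap)
{
    size_t n = strlen(in);
    size_t len = 0;
    int quoted = 0;
    assert(cap >= 2*n + 1);  // worst case: every byte gets escaped
    for (size_t i = 0; i < n; i++) {
        if (in[i] == '"') {
            quoted = !quoted;
            continue;
        }
        if (in[i] == ' ' && quoted) {
            out[len++] = '\\';
        }
        out[len++] = in[i];
    }
    out[len] = 0;
    return len;
}
```

<p>With one layer of quotes the space lands inside a quoted region and is
escaped. With two layers the first pair closes immediately, the path falls
<em>outside</em> any quoted region, and its space passes through unescaped.</p>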

<h3 id="extending-utf-8">Extending UTF-8</h3>

<p>I recently had the thought: What if u-config somehow tracked which spans
of a string were paths? <code class="language-plaintext highlighter-rouge">prefix</code> is initially a path span, and u-config would track it
through macro expansion and concatenation. Soon after that I realized it’s
even simpler: <strong>Encode the spaces in a path as a value other than space</strong>,
but also a value that cannot appear in the input. Recall that <a href="/blog/2017/10/06/">certain
octets can never appear in UTF-8 text</a>: the 8 values whose highest 5
bits are set. Each would be the first octet of a 5-octet, or longer, code
point, but those are forbidden.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111xxx
</code></pre></div></div>
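<p>As a small illustration, that bit pattern can be tested with a single
mask. The helper names below are hypothetical and this is only a sketch,
not code from u-config:</p>

```c
#include <assert.h>

// True for the 8 octets 0xf8..0xff, i.e. the bit pattern 11111xxx.
int is_reserved_octet(unsigned char b)
{
    return (b & 0xf8) == 0xf8;
}

// Count reserved octets in a buffer. Valid UTF-8 input always yields
// zero, which is what frees these values up for private use.
int count_reserved(const unsigned char *buf, int len)
{
    assert(len >= 0);
    int n = 0;
    for (int i = 0; i < len; i++) {
        n += is_reserved_octet(buf[i]);
    }
    return n;
}
```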

<p>When paths enter the macro system, special characters are encoded as one
of these 8 values. They’re converted back to their original ASCII values
during output encoding, escaped. It doesn’t interact with the pkg-config
quoting mechanism, so there’s no quote cancellation. Both quoting cases
are supported equally.</p>

<p>For example, if space is mapped onto <code class="language-plaintext highlighter-rouge">\xff</code> (255), then:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>in:  C:/Program Files/foo    -&gt; C:/Program\xffFiles/foo
out: C:/Program\xffFiles/foo -&gt; C:/Program\ Files/foo
</code></pre></div></div>

<p>Which prints the same regardless of <code class="language-plaintext highlighter-rouge">${includedir}</code> or <code class="language-plaintext highlighter-rouge">"${includedir}"</code>.
Problem solved!</p>
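<p>A minimal sketch of that round trip, assuming the space-to-<code class="language-plaintext highlighter-rouge">\xff</code>
mapping from the example. The function names are hypothetical and the real
u-config works on its own string type rather than null-terminated
strings:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ENC_SPACE = 0xff };  // assumption: space maps to 0xff, as above

// Encode in place: every space in a path becomes the reserved octet.
void path_encode(char *path)
{
    for (unsigned char *p = (unsigned char *)path; *p; p++) {
        if (*p == ' ') {
            *p = ENC_SPACE;
        }
    }
}

// Decode for output: each reserved octet becomes an escaped space, and
// every other byte is copied through. Returns the output length.
size_t path_output(const char *in, char *out, size_t cap)
{
    unsigned char *o = (unsigned char *)out;
    size_t len = 0;
    assert(cap >= 2*strlen(in) + 1);  // worst case: every byte escaped
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (*p == ENC_SPACE) {
            o[len++] = '\\';
            o[len++] = ' ';
        } else {
            o[len++] = *p;
        }
    }
    o[len] = 0;
    return len;
}
```

<p>Because the encoded path contains no literal space, nothing between
encoding and output can mistake it for a flag separator.</p>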

<h3 id="more-metacharacters">More metacharacters</h3>

<p>That’s not the only complication. Outputs may <em>deliberately</em> include shell
metacharacters, though typically these are <a href="/blog/2017/08/20/">Makefile</a> fragments. For
example, the default value of <code class="language-plaintext highlighter-rouge">${pc_top_builddir}</code> is <code class="language-plaintext highlighter-rouge">$(top_builddir)</code>,
which <code class="language-plaintext highlighter-rouge">make</code> will later expand. While these characters are special to a
shell, and certainly special to <code class="language-plaintext highlighter-rouge">make</code>, they must not be escaped.</p>

<p>What if a path contains these characters? The pkg-config quoting mechanism
won’t help. It’s only concerned with spaces, and <code class="language-plaintext highlighter-rouge">$(...)</code> prints the same
quoted or not. As before, u-config must track provenance: whether or not
such characters originated from a path.</p>

<p>If <code class="language-plaintext highlighter-rouge">$PKG_CONFIG_TOP_BUILD_DIR</code> is set, then <code class="language-plaintext highlighter-rouge">pc_top_builddir</code> is set to
this environment variable, useful when the result isn’t processed by
<code class="language-plaintext highlighter-rouge">make</code>. In this case it’s a path, and <code class="language-plaintext highlighter-rouge">$(...)</code> ought to be escaped. Even
without <code class="language-plaintext highlighter-rouge">$</code> it must be quoted, because the parentheses would still invoke
a subshell. But who would put parentheses in a path? Lo and behold!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:/Program Files (x86)/example
</code></pre></div></div>

<p>Again, extending UTF-8 solves this as well: Encode <code class="language-plaintext highlighter-rouge">$</code>, <code class="language-plaintext highlighter-rouge">(</code>, and <code class="language-plaintext highlighter-rouge">)</code> in
paths using three of those forbidden octets, and escape them on the way
out, allowing unencoded instances to go straight through.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>in:  C:/Program\xffFiles\xff\xfdx86\xfe/example
out: C:/Program\ Files\ \(x86\)/example
</code></pre></div></div>
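<p>Extending the idea to several characters is a table lookup in each
direction. The following is a sketch under assumptions: the values for
space and parentheses follow the example above, <code class="language-plaintext highlighter-rouge">0xfc</code> for <code class="language-plaintext highlighter-rouge">$</code> is my
own pick for illustration, and the function names are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Four of the eight reserved octets. Space and parentheses follow the
// example above; 0xfc for '$' is an assumption.
static const unsigned char enc_from[] = {' ',  '$',  '(',  ')'};
static const unsigned char enc_to[]   = {0xff, 0xfc, 0xfd, 0xfe};

// Encode a path in place as it enters the macro system.
void path_encode(char *path)
{
    for (unsigned char *p = (unsigned char *)path; *p; p++) {
        for (int i = 0; i < 4; i++) {
            if (*p == enc_from[i]) {
                *p = enc_to[i];
                break;
            }
        }
    }
}

// Write the final output: encoded octets become backslash-escaped ASCII.
// Everything else, including a raw "$(...)", passes through untouched.
size_t output_encode(const char *in, char *out, size_t cap)
{
    unsigned char *o = (unsigned char *)out;
    size_t len = 0;
    assert(cap >= 2*strlen(in) + 1);  // worst case: every byte escaped
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        int i = 0;
        while (i < 4 && *p != enc_to[i]) {
            i++;
        }
        if (i < 4) {
            o[len++] = '\\';
            o[len++] = enc_from[i];
        } else {
            o[len++] = *p;
        }
    }
    o[len] = 0;
    return len;
}
```

<p>Only bytes that went through <code class="language-plaintext highlighter-rouge">path_encode</code> are escaped on the way out,
so the same character can be treated differently depending on where it
came from.</p>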

<p>This makes <code class="language-plaintext highlighter-rouge">pc_top_builddir</code> straightforward: default to a raw string,
otherwise a path-encoded environment variable (note: <code class="language-plaintext highlighter-rouge">s8</code> <a href="/blog/2023/10/08/">is a string
type</a> and <code class="language-plaintext highlighter-rouge">upsert</code> is <a href="/blog/2023/09/30/">a hash map</a>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">s8</span> <span class="n">top_builddir</span> <span class="o">=</span> <span class="n">s8</span><span class="p">(</span><span class="s">"$(top_builddir)"</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">envvar_set</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">top_builddir</span> <span class="o">=</span> <span class="n">s8pathencode</span><span class="p">(</span><span class="n">envvar</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">global</span><span class="p">,</span> <span class="n">s8</span><span class="p">(</span><span class="s">"pc_top_builddir"</span><span class="p">),</span> <span class="n">perm</span><span class="p">)</span> <span class="o">=</span> <span class="n">top_builddir</span><span class="p">;</span>
</code></pre></div></div>

<p>For a particularly wild case, consider deliberately using a <code class="language-plaintext highlighter-rouge">uname -m</code>
command substitution to construct a path, i.e. the path contains the
target machine architecture (<code class="language-plaintext highlighter-rouge">i686</code>, <code class="language-plaintext highlighter-rouge">x86_64</code>, etc.):</p>

<pre><code class="language-pc">Cflags: -I${prefix}/$(uname -m)/include
</code></pre>

<p>(Not that I condone such nonsense. This is merely a reality of real-world
<code class="language-plaintext highlighter-rouge">.pc</code> files.) With <code class="language-plaintext highlighter-rouge">prefix</code> automatically set as above, this will print:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-IC:/Program\ Files\ \(x86\)/example/$(uname -m)/include
</code></pre></div></div>

<p>Path parentheses are escaped because they came from a path, but command
substitution passes through because it came from the <code class="language-plaintext highlighter-rouge">.pc</code> source. Quite
cool!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>An improved chkstk function on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/02/05/"/>
    <id>urn:uuid:381be450-559c-4521-911a-ba524dca7b64</id>
    <updated>2024-02-05T17:56:05Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="rant"/>
    <content type="html">
      <![CDATA[<p>If you’ve spent much time developing with Mingw-w64 you’ve likely seen the
symbol <code class="language-plaintext highlighter-rouge">___chkstk_ms</code>, perhaps in an error message. It’s a little piece of
runtime provided by GCC via libgcc which ensures enough of the stack is
committed for the caller’s stack frame. The “function” uses a custom ABI
and is implemented in assembly. So is the subject of this article, a
slightly improved implementation soon to be included in <a href="/blog/2020/05/15/">w64devkit</a> as
libchkstk (<code class="language-plaintext highlighter-rouge">-lchkstk</code>).</p>

<p>The MSVC toolchain has an identical (x64) or similar (x86) function named
<code class="language-plaintext highlighter-rouge">__chkstk</code>. We’ll discuss that as well, and w64devkit will include x86 and
x64 implementations, useful when linking with MSVC object files. The new
x86 <code class="language-plaintext highlighter-rouge">__chkstk</code> in particular is also better than the MSVC definition.</p>

<p>A note on spelling: <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> is spelled with three underscores, and
<code class="language-plaintext highlighter-rouge">__chkstk</code> is spelled with two. On x86, <a href="https://learn.microsoft.com/en-us/cpp/build/reference/decorated-names#FormatC"><code class="language-plaintext highlighter-rouge">cdecl</code> functions</a> are
decorated with a leading underscore, and so may be rendered, e.g. in error
messages, with one fewer underscore. The true name is undecorated, and the
raw symbol name is identical on x86 and x64. Further complicating matters,
libgcc defines a <code class="language-plaintext highlighter-rouge">___chkstk</code> with three underscores. As far as I can tell,
this spelling arose from confusion regarding name decoration, but nobody’s
noticed for the past 28 years. libgcc’s x64 <code class="language-plaintext highlighter-rouge">___chkstk</code> is obviously and
badly broken, so I’m sure nobody has ever used it anyway, not even by
accident thanks to the misspelling. I’ll touch on that below.</p>

<p>When referring to a particular instance, I will use a specific spelling.
Otherwise the term “chkstk” refers to the family. If you’d like to skip
ahead to the source for libchkstk: <strong><a href="https://github.com/skeeto/w64devkit/blob/master/src/libchkstk.S"><code class="language-plaintext highlighter-rouge">libchkstk.S</code></a></strong>.</p>

<h3 id="a-gradually-committed-stack">A gradually committed stack</h3>

<p>The header of a Windows executable lists two stack sizes: a <em>reserve</em> size
and an initial <em>commit</em> size. The first is the largest the main thread
stack can grow, and the second is the amount <a href="https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc">committed</a> when the
program starts. A program gradually commits stack pages <em>as needed</em> up to
the reserve size. Binutils <code class="language-plaintext highlighter-rouge">objdump</code> option <code class="language-plaintext highlighter-rouge">-p</code> lists the sizes. Typical
output for a Mingw-w64 program:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve      0000000000200000
SizeOfStackCommit       0000000000001000
</code></pre></div></div>

<p>The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB
initially committed. With the Binutils linker, <code class="language-plaintext highlighter-rouge">ld</code>, you can set them at
link time using <code class="language-plaintext highlighter-rouge">--stack</code>. Via <code class="language-plaintext highlighter-rouge">gcc</code>, use <code class="language-plaintext highlighter-rouge">-Xlinker</code>. For example, to
reserve an 8MiB stack and commit half of it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Xlinker --stack=$((8&lt;&lt;20)),$((4&lt;&lt;20)) ...
</code></pre></div></div>

<p>MSVC <code class="language-plaintext highlighter-rouge">link.exe</code> similarly has <a href="https://learn.microsoft.com/en-us/cpp/build/reference/stack-stack-allocations"><code class="language-plaintext highlighter-rouge">/stack</code></a>.</p>

<p>The purpose of this mechanism is to avoid paying the <em>commit charge</em> for
unused stack. It made sense 30 years ago when stacks were a potentially
large portion of physical memory. These days it’s a rounding error and
silly we’re still dealing with it. Using the above options you can choose
to commit the entire stack up front, at which point a chkstk helper is no
longer needed (<a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59532"><code class="language-plaintext highlighter-rouge">-mno-stack-arg-probe</code></a>, <a href="https://learn.microsoft.com/en-us/cpp/build/reference/gs-control-stack-checking-calls"><code class="language-plaintext highlighter-rouge">/Gs2147483647</code></a>). This
requires link-time control of the main module, which isn’t always an
option, like when supplying a DLL for someone else to run.</p>

<p>The program grows the stack by touching the singular <a href="https://devblogs.microsoft.com/oldnewthing/20220203-00/?p=106215">guard page</a>
mapped between the committed and uncommitted portions of the stack. This
action triggers a page fault, and the default fault handler commits the
guard page and maps a new guard page just below. In other words, the stack
grows one page at a time, in order.</p>
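<p>A toy model may help make the mechanism concrete. This sketch uses
plain integers in place of real memory, all names are hypothetical, and
it ignores what Windows does specially near the very end of the
reservation:</p>

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_SIZE = 4096 };

// Toy model of a thread stack. Addresses at or above `committed` are
// usable, and the guard page is the single page just below it.
typedef struct {
    uintptr_t reserved;   // lowest address the stack may ever reach
    uintptr_t committed;  // lowest committed address (page-aligned)
} Stack;

// Simulate a memory access. Returns 1 on success, 0 on a crash.
int touch(Stack *s, uintptr_t addr)
{
    assert(s->committed % PAGE_SIZE == 0);
    assert(s->committed >= PAGE_SIZE);
    if (addr >= s->committed) {
        return 1;  // already committed, nothing happens
    }
    uintptr_t guard = s->committed - PAGE_SIZE;
    if (addr >= guard && guard >= s->reserved) {
        s->committed = guard;  // commit the guard page, guard moves down
        return 1;
    }
    return 0;  // below the guard page: access violation
}
```

<p>Touching the guard page moves <code class="language-plaintext highlighter-rouge">committed</code> down by exactly one page. An
access further down fails in this model, since no committed page or guard
page lies there yet.</p>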

<p>In most cases nothing special needs to happen. The guard page mechanism is
transparent and in the background. However, if a function stack frame
exceeds the page size then there’s a chance that it might leap over the
guard page, crashing the program. To prevent this, compilers insert a
chkstk call in the function prologue. Before local variable allocation,
chkstk walks down the stack — that is, towards lower addresses — nudging
the guard page with each step. (As a side effect it provides <a href="/blog/2017/06/21/">stack clash
protection</a> — the only security aspect of chkstk.) For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">callee</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">large</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">];</span>
    <span class="n">callee</span><span class="p">(</span><span class="n">large</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled with 64-bit <code class="language-plaintext highlighter-rouge">gcc -O</code>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">example:</span>
    <span class="nf">movl</span>    <span class="kc">$</span><span class="mi">1048616</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">call</span>    <span class="nv">___chkstk_ms</span>
    <span class="nf">subq</span>    <span class="o">%</span><span class="nb">rax</span><span class="p">,</span> <span class="o">%</span><span class="nb">rsp</span>
    <span class="nf">leaq</span>    <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rcx</span>
    <span class="nf">call</span>    <span class="nv">callee</span>
    <span class="nf">addq</span>    <span class="kc">$</span><span class="mi">1048616</span><span class="p">,</span> <span class="o">%</span><span class="nb">rsp</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>I used GCC, but this is practically identical to the code generated by
MSVC and Clang. Note the call to <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> in the function prologue
before allocating the stack frame (<code class="language-plaintext highlighter-rouge">subq</code>). Also note that it sets <code class="language-plaintext highlighter-rouge">eax</code>.
As a volatile register, this would normally accomplish nothing because
it’s done just before a function call, but recall that <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> has
a custom ABI. That’s the argument to chkstk. Further note that it uses
<code class="language-plaintext highlighter-rouge">rax</code> after the call returns. That’s not a value returned by chkstk; rather,
x64 <em>chkstk preserves all registers</em>, so it still holds the frame size.</p>

<p>Well, maybe. The official documentation says that registers <a href="https://learn.microsoft.com/en-us/cpp/build/prolog-and-epilog">r10 and r11
are volatile</a>, but that information conflicts with Microsoft’s own
implementation. Just in case, I choose a conservative interpretation that
all registers are preserved.</p>

<h3 id="implementing-chkstk">Implementing chkstk</h3>

<p>In a high level language, chkstk might look something like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NOTE: hypothetical implementation</span>
<span class="kt">void</span> <span class="nf">___chkstk_ms</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">frame_size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">volatile</span> <span class="kt">char</span> <span class="n">frame</span><span class="p">[</span><span class="n">frame_size</span><span class="p">];</span>  <span class="c1">// NOTE: variable-length array</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">frame_size</span> <span class="o">-</span> <span class="n">PAGE_SIZE</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">-=</span> <span class="n">PAGE_SIZE</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">frame</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// touch the guard page</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This wouldn’t work for a number of reasons, but if it did, <code class="language-plaintext highlighter-rouge">volatile</code>
would serve two purposes. First, forcing the side effect to occur. The
second is more subtle: The loop must happen in exactly this order, from
high to low. Without <code class="language-plaintext highlighter-rouge">volatile</code>, the stores in different iterations
would be independent of one another, and so a compiler could reverse the
loop direction.</p>

<p>The store can happen anywhere within the guard page, so it’s not necessary
to align <code class="language-plaintext highlighter-rouge">frame</code> to the page. Simply touching at least one byte per page
is enough. This is essentially the definition of libgcc <code class="language-plaintext highlighter-rouge">___chkstk_ms</code>.</p>

<p>How many iterations occur? In <code class="language-plaintext highlighter-rouge">example</code> above, the stack frame will be
around 1MiB (2<sup>20</sup>). With pages of 4KiB (2<sup>12</sup>) that’s
256 iterations. The loop happens unconditionally, meaning <em>every function
call</em> requires 256 iterations of this loop. Wouldn’t it be better if the
loop ran only as needed, i.e. the first time? MSVC x64 <code class="language-plaintext highlighter-rouge">__chkstk</code> skips
iterations if possible, and the same goes for my new <code class="language-plaintext highlighter-rouge">___chkstk_ms</code>. Much
like <a href="/blog/2022/02/18/#my-getcommandlinew">the command line string</a>, the low address of the current
thread’s guard page is accessible through the <a href="https://en.wikipedia.org/wiki/Win32_Thread_Information_Block">Thread Information
Block</a> (TIB). A chkstk can cheaply query this address, looping only
when the new frame reaches below it, typically the first time the stack
grows that deep. (<a href="/blog/2023/03/23/">In contrast to Linux</a>, a thread’s
stack is fundamentally managed by the operating system.)</p>
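<p>To check the iteration count, here is the same loop shape as the
hypothetical chkstk above with the store replaced by a counter. The
function name is made up for this sketch:</p>

```c
#include <assert.h>
#include <stddef.h>

enum { PAGE_SIZE = 4096 };

// Count the iterations of the unconditional probing loop without
// touching any memory.
ptrdiff_t probe_iterations(ptrdiff_t frame_size)
{
    assert(frame_size >= 0);
    ptrdiff_t n = 0;
    for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
        n++;
    }
    return n;
}
```

<p>Both an exact 1MiB frame and the 1048616-byte frame GCC requested for
<code class="language-plaintext highlighter-rouge">example</code> come out to 256 probes, paid on every call.</p>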

<p>Taking that into account, an improved algorithm:</p>

<ol>
  <li>Push registers that will be used</li>
  <li>Compute the low address of the new stack frame (F)</li>
  <li>Retrieve the low address of the committed stack (C)</li>
  <li>Go to 7</li>
  <li>Subtract the page size from C</li>
  <li>Touch memory at C</li>
  <li>If C &gt; F, go to 5</li>
  <li>Pop registers to restore them and return</li>
</ol>
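<p>The steps above can be modeled in C with plain integers standing in for
the stack pointer and the TIB field. This is only a sketch for counting
probes, with hypothetical names, and it leaves out the register
save/restore of steps 1 and 8:</p>

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_SIZE = 0x1000 };

// Model of steps 2-7. `sp` is the stack pointer and `*committed` plays
// the role of the low committed address read from the TIB. Returns the
// number of pages touched and updates `*committed` as the OS would.
int chkstk_model(uintptr_t sp, uintptr_t frame_size, uintptr_t *committed)
{
    assert(frame_size <= sp);       // no wraparound in this model
    uintptr_t f = sp - frame_size;  // 2. low address of the new frame (F)
    uintptr_t c = *committed;       // 3. low address of committed stack (C)
    int touched = 0;
    while (c > f) {                 // 7. entered via the jump in step 4
        assert(c >= PAGE_SIZE);
        c -= PAGE_SIZE;             // 5.
        touched++;                  // 6. the touch commits this page
    }
    *committed = c;
    return touched;
}
```

<p>With a 1048616-byte frame the first call touches 256 pages. A second
call at the same depth finds C already at or below F and touches none,
which is the fast path.</p>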

<p>An unconditional forward jump is a little unusual in pseudo-code, but
this closely matches my assembly. The loop causes page faults, and it’s
the slow, uncommon path. The common, fast path never executes 5–6. I
also chose smaller instructions in order to keep the function small and
reduce instruction cache pressure. My x64 implementation as of this
writing:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">___chkstk_ms:</span>
    <span class="nf">push</span> <span class="o">%</span><span class="nb">rax</span>              <span class="o">//</span> <span class="mi">1</span><span class="nv">.</span>
    <span class="nf">push</span> <span class="o">%</span><span class="nb">rcx</span>              <span class="o">//</span> <span class="mi">1</span><span class="nv">.</span>
    <span class="nf">neg</span>  <span class="o">%</span><span class="nb">rax</span>              <span class="o">//</span> <span class="mi">2</span><span class="nv">.</span> <span class="nb">rax</span> <span class="err">=</span> <span class="nv">frame</span> <span class="nv">low</span> <span class="nv">address</span>
    <span class="nf">add</span>  <span class="o">%</span><span class="nb">rsp</span><span class="p">,</span> <span class="o">%</span><span class="nb">rax</span>        <span class="o">//</span> <span class="mi">2</span><span class="nv">.</span> <span class="err">"</span>
    <span class="nf">mov</span>  <span class="o">%</span><span class="nb">gs</span><span class="p">:(</span><span class="mh">0x10</span><span class="p">),</span> <span class="o">%</span><span class="nb">rcx</span>  <span class="o">//</span> <span class="mi">3</span><span class="nv">.</span> <span class="nb">rcx</span> <span class="err">=</span> <span class="nv">stack</span> <span class="nv">low</span> <span class="nv">address</span>
    <span class="nf">jmp</span>  <span class="mi">1</span><span class="nv">f</span>                <span class="o">//</span> <span class="mi">4</span><span class="nv">.</span>
<span class="err">0:</span>  <span class="nf">sub</span>  <span class="kc">$</span><span class="mh">0x1000</span><span class="p">,</span> <span class="o">%</span><span class="nb">rcx</span>     <span class="o">//</span> <span class="mi">5</span><span class="nv">.</span>
    <span class="nf">test</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>      <span class="o">//</span> <span class="mi">6</span><span class="nv">.</span> <span class="nv">page</span> <span class="nv">fault</span> <span class="p">(</span><span class="nv">very</span> <span class="nv">slow</span><span class="err">!</span><span class="p">)</span>
<span class="err">1:</span>  <span class="nf">cmp</span>  <span class="o">%</span><span class="nb">rax</span><span class="p">,</span> <span class="o">%</span><span class="nb">rcx</span>        <span class="o">//</span> <span class="mi">7</span><span class="nv">.</span>
    <span class="nf">ja</span>   <span class="mb">0b</span>                <span class="o">//</span> <span class="mi">7</span><span class="nv">.</span>
    <span class="nf">pop</span>  <span class="o">%</span><span class="nb">rcx</span>              <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span>
    <span class="nf">pop</span>  <span class="o">%</span><span class="nb">rax</span>              <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span>
    <span class="nf">ret</span>                    <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span>
</code></pre></div></div>

<p>I’ve labeled each instruction with its corresponding pseudo-code. Step 6
is unusual among chkstk implementations: It’s not a <em>store</em>, but a <em>load</em>,
still sufficient to fault the page. That <code class="language-plaintext highlighter-rouge">test</code> instruction is just two
bytes, and unlike other two-byte options, doesn’t write garbage onto the
stack — which <em>would</em> be allowed — nor use an extra register. I searched
through single byte instructions that can page fault, all of which involve
implicit addressing through <code class="language-plaintext highlighter-rouge">rdi</code> or <code class="language-plaintext highlighter-rouge">rsi</code>, but they increment <code class="language-plaintext highlighter-rouge">rdi</code> or
<code class="language-plaintext highlighter-rouge">rsi</code>, and would require another instruction to correct it.</p>

<p>Because of the return address and two <code class="language-plaintext highlighter-rouge">push</code> operations, the low stack
frame address is technically <em>too low</em> by 24 bytes. That’s fine. If this
exhausts the stack, the program is really cutting it close and the stack
is too small anyway. I could be more precise — which, as we’ll soon see,
is required for x86 <code class="language-plaintext highlighter-rouge">__chkstk</code> — but it would cost an extra instruction
byte.</p>

<p>On x64, <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> and <code class="language-plaintext highlighter-rouge">__chkstk</code> have identical semantics, so name it
<code class="language-plaintext highlighter-rouge">__chkstk</code> — which I’ve done in libchkstk — and it works with MSVC. The
only practical difference between my chkstk and MSVC <code class="language-plaintext highlighter-rouge">__chkstk</code> is that
mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking
the optimization, is libgcc <code class="language-plaintext highlighter-rouge">___chkstk_ms</code>, weighing 50 bytes, or in
practice, due to an unfortunate Binutils default of padding sections, 64
bytes.</p>

<p>I’m no assembly guru, and I bet this can be even smaller without hurting
the fast path, but this is the best I could come up with at this time.</p>

<p><strong>Update</strong>: Stefan Kanthak, who has <a href="https://skanthak.homepage.t-online.de/msvcrt.html">extensively explored this
topic</a>, points out that large stack frame requests might overflow
my low frame address calculation at (2), effectively disabling the probe.
Such requests might occur from alloca calls or variable-length arrays
(VLAs) with untrusted sizes. As far as I’m concerned, such programs are
already broken, but it only cost a two-byte instruction to deal with it. I
have not changed this article, but the source in w64devkit <a href="https://github.com/skeeto/w64devkit/commit/50b343db">has been
updated</a>.</p>
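<p>The overflow is easy to reproduce with unsigned arithmetic. The sketch
below contrasts the wrapping subtraction with a saturating one. The
saturating variant is just one possible guard, shown as an assumption,
and is not necessarily the fix applied in w64devkit:</p>

```c
#include <assert.h>
#include <stdint.h>

// Low frame address as in the unguarded calculation. If frame_size is
// larger than sp, the result wraps around to a very high address.
uintptr_t frame_low_wrapping(uintptr_t sp, uintptr_t frame_size)
{
    return sp - frame_size;
}

// One possible guard (an assumption): clamp to zero so that the probe
// loop still runs and eventually faults at the end of the stack.
uintptr_t frame_low_saturating(uintptr_t sp, uintptr_t frame_size)
{
    uintptr_t low = frame_size > sp ? 0 : sp - frame_size;
    assert(low <= sp);
    return low;
}

// The probe loop only runs while the committed address is above the
// frame's low address.
int would_probe(uintptr_t committed, uintptr_t frame_low)
{
    return committed > frame_low;
}
```

<p>After wraparound the frame appears to sit <em>above</em> the committed
region, so the comparison fails and no page is ever probed.</p>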

<h3 id="32-bit-chkstk">32-bit chkstk</h3>

<p>On x86 <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> has identical semantics to x64. Mine is a copy-paste
of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC
was ahead of the curve on this design.</p>

<p>However, x86 <code class="language-plaintext highlighter-rouge">__chkstk</code> is <em>bonkers</em>. It not only commits the stack, but
also allocates the stack frame. That is, it returns with a different stack
pointer. The return pointer is initially <em>inside the new stack frame</em>, so
chkstk must retrieve it and return by other means. It must also precisely
compute the low frame address.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">__chkstk:</span>
    <span class="nf">push</span> <span class="o">%</span><span class="nb">ecx</span>               <span class="o">//</span> <span class="mi">1</span><span class="nv">.</span>
    <span class="nf">neg</span>  <span class="o">%</span><span class="nb">eax</span>               <span class="o">//</span> <span class="mi">2</span><span class="nv">.</span>
    <span class="nf">lea</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">esp</span><span class="p">,</span><span class="o">%</span><span class="nb">eax</span><span class="p">),</span> <span class="o">%</span><span class="nb">eax</span> <span class="o">//</span> <span class="mi">2</span><span class="nv">.</span>
    <span class="nf">mov</span>  <span class="o">%</span><span class="nb">fs</span><span class="p">:(</span><span class="mh">0x08</span><span class="p">),</span> <span class="o">%</span><span class="nb">ecx</span>   <span class="o">//</span> <span class="mi">3</span><span class="nv">.</span>
    <span class="nf">jmp</span>  <span class="mi">1</span><span class="nv">f</span>                 <span class="o">//</span> <span class="mi">4</span><span class="nv">.</span>
<span class="err">0:</span>  <span class="nf">sub</span>  <span class="kc">$</span><span class="mh">0x1000</span><span class="p">,</span> <span class="o">%</span><span class="nb">ecx</span>      <span class="o">//</span> <span class="mi">5</span><span class="nv">.</span>
    <span class="nf">test</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="p">(</span><span class="o">%</span><span class="nb">ecx</span><span class="p">)</span>       <span class="o">//</span> <span class="mi">6</span><span class="nv">.</span> <span class="nv">page</span> <span class="nv">fault</span> <span class="p">(</span><span class="nv">very</span> <span class="nv">slow</span><span class="err">!</span><span class="p">)</span>
<span class="err">1:</span>  <span class="nf">cmp</span>  <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="o">%</span><span class="nb">ecx</span>         <span class="o">//</span> <span class="mi">7</span><span class="nv">.</span>
    <span class="nf">ja</span>   <span class="mb">0b</span>                 <span class="o">//</span> <span class="mi">7</span><span class="nv">.</span>
    <span class="nf">pop</span>  <span class="o">%</span><span class="nb">ecx</span>               <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span>
    <span class="nf">xchg</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="o">%</span><span class="nb">esp</span>         <span class="o">//</span> <span class="nv">?.</span> <span class="nb">al</span><span class="nv">locate</span> <span class="nv">frame</span>
    <span class="nf">jmp</span>  <span class="o">*</span><span class="p">(</span><span class="o">%</span><span class="nb">eax</span><span class="p">)</span>            <span class="o">//</span> <span class="mi">8</span><span class="nv">.</span> <span class="nv">return</span>
</code></pre></div></div>

<p>The main differences are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">eax</code> is treated as volatile, so it is not saved</li>
  <li>The low frame address is precisely computed with <code class="language-plaintext highlighter-rouge">lea</code> (2)</li>
  <li>The frame is allocated at step (?) by swapping F and the stack pointer</li>
  <li>Post-swap F now points at the return address, so jump through it</li>
</ul>
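
<p>To make the <code class="language-plaintext highlighter-rouge">lea</code> arithmetic concrete, here is a rough C model of the stack pointer bookkeeping in steps 1 and 2. It is only a sketch: <code class="language-plaintext highlighter-rouge">chkstk_low_frame</code> and <code class="language-plaintext highlighter-rouge">entry_esp</code> are hypothetical names, and the real routine works on registers rather than function arguments.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

// Model of steps 1-2. entry_esp is the stack pointer on entry to
// __chkstk (pointing at the return address), size is the request in eax.
static uint32_t chkstk_low_frame(uint32_t entry_esp, uint32_t size)
{
    uint32_t esp = entry_esp - 4;  // push %ecx
    uint32_t eax = 0 - size;       // neg  %eax
    eax = 8 + esp + eax;           // lea  8(%esp,%eax), %eax
    return eax;                    // same as entry_esp + 4 - size
}
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">entry_esp</code> at <code class="language-plaintext highlighter-rouge">0x00100000</code> and a request of <code class="language-plaintext highlighter-rouge">0x2000</code> bytes, the model yields <code class="language-plaintext highlighter-rouge">0x000fe004</code>: the caller’s stack pointer just before the call, lowered by the frame size.</p>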

<p>MSVC x86 <code class="language-plaintext highlighter-rouge">__chkstk</code> does not query the TIB (3), and so unconditionally
runs the loop. So there’s an advantage to my implementation besides size.</p>

<p>libgcc x86 <code class="language-plaintext highlighter-rouge">___chkstk</code> has this behavior, and so it’s also a suitable
<code class="language-plaintext highlighter-rouge">__chkstk</code> aside from the misspelling. Strangely, libgcc x64 <code class="language-plaintext highlighter-rouge">___chkstk</code>
<em>also</em> allocates the stack frame, which is never how chkstk was supposed
to work on x64. I can only conclude it’s never been used.</p>

<h3 id="optimization-in-practice">Optimization in practice</h3>

<p>Does the skip-the-loop optimization matter in practice? Consider a
function using a large-ish, stack-allocated array, perhaps to process
<a href="/blog/2023/08/23/">environment variables</a> or <a href="https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation">long paths</a>, each of which max out
around 64KiB.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">path_contains</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">wchar_t</span> <span class="n">var</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">15</span><span class="p">];</span>
    <span class="n">GetEnvironmentVariableW</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">var</span><span class="p">,</span> <span class="n">countof</span><span class="p">(</span><span class="n">var</span><span class="p">));</span>
    <span class="c1">// ... search for path in var ...</span>
<span class="p">}</span>

<span class="kt">int64_t</span> <span class="nf">getfilesize</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">wchar_t</span> <span class="n">wide</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">15</span><span class="p">];</span>
    <span class="n">MultiByteToWideChar</span><span class="p">(</span><span class="n">CP_UTF8</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">wide</span><span class="p">,</span> <span class="n">countof</span><span class="p">(</span><span class="n">wide</span><span class="p">));</span>
    <span class="c1">// ... look up file size via wide path ...</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">path_contains</span><span class="p">(</span><span class="s">L"PATH"</span><span class="p">,</span> <span class="s">L"c:</span><span class="se">\\</span><span class="s">windows</span><span class="se">\\</span><span class="s">system32"</span><span class="p">))</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>

    <span class="kt">int64_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">getfilesize</span><span class="p">(</span><span class="s">"π.txt"</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Each call to these functions with such large local arrays is also a call
to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely
detectable in a benchmark. If the function touches the file system, which
is likely when processing paths, then chkstk doesn’t matter at all. My
starting example had a 1MiB array, or 256 chkstk iterations. That starts
to become measurable, though it’s also pushing the limits. At that point
you <a href="/blog/2023/09/27/">ought to be using a scratch arena</a>.</p>
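
<p>Those iteration counts fall out of the probe loop itself. Below is a minimal C model of that loop, assuming 4KiB pages. The names <code class="language-plaintext highlighter-rouge">probe_count</code>, <code class="language-plaintext highlighter-rouge">limit</code>, and <code class="language-plaintext highlighter-rouge">low</code> are hypothetical, and the model only counts probes rather than touching memory.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

// Model of the chkstk loop: lower the stack limit one page at a time
// until it covers the low frame address, counting the probes.
static int probe_count(uintptr_t limit, uintptr_t low)
{
    int n = 0;
    while (limit &gt; low) {  // cmp/ja: stop once the frame is covered
        limit -= 0x1000;   // step down one page
        n++;               // the real loop touches the page here
    }
    return n;
}
</code></pre></div></div>

<p>Starting from a page-aligned limit, a 64KiB frame costs 16 probes and a 1MiB frame costs 256. If the stack limit is already at or below the low frame address, the count is zero, which is the skip-the-loop case.</p>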

<p>So ultimately after writing an improved <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> I could only
measure a tiny difference in contrived programs, and none in any real
application. Though there’s still one more benefit I haven’t yet
mentioned…</p>

<h3 id="the-first-thing-we-do-lets-kill-all-the-lawyers">“The first thing we do, let’s <a href="/blog/2023/06/22/#119-henry-vi">kill all the lawyers</a>”.</h3>

<p>My original motivation for this project wasn’t the optimization — which I
didn’t even discover until after I had started — but <em>licensing</em>. I hate
software licenses, and the <a href="/blog/2023/01/18/">tools I’ve written for w64devkit</a>
are dedicated to the public domain. Both source <em>and</em> binaries (as
distributed). I can do so because <a href="/blog/2023/02/15/">I don’t link runtime components</a>,
not even libgcc. Not <a href="/blog/2023/05/31/">even header files</a>. Every byte of code in those
binaries is my work or the work of my collaborators.</p>

<p>Every once in a while <code class="language-plaintext highlighter-rouge">___chkstk_ms</code> rears its ugly head, and I have to
make a decision. Do I re-work my code to avoid it? Do I take the reins of
the linker and disable stack probes? I haven’t necessarily allocated a
large local array: A bit of luck with function inlining can combine
several smaller stack frames into one that’s just large enough to require
chkstk.</p>

<p>Since libgcc falls under the <a href="https://www.gnu.org/licenses/gcc-exception-3.1.html">GCC Runtime Library Exception</a>, if it’s
linked into my program through an “Eligible Compilation Process” — which I
believe includes w64devkit — then the GPL-licensed functions embedded in
my binary are legally siloed and the GPL doesn’t infect the rest of the
program. These bits are still GPL in isolation, and if someone were to
copy them out of the program then they’d be normal GPL code again. In
other words, it’s not a 100% public domain binary if libgcc was linked!</p>

<p>(If some FSF lawyer says I’m wrong, then this is an escape hatch through
which anyone can scrub the GPL from GCC runtime code, and then ignore the
runtime exception entirely.)</p>

<p>MSVC is worse. Hardly anyone follows its license, but fortunately for most
the license is practically unenforced. Its chkstk, which currently resides
in a loose <code class="language-plaintext highlighter-rouge">chkstk.obj</code>, falls into what Microsoft calls “Distributable
Code.” Its license requires “external end users to agree to terms that
protect the Distributable Code.” In other words, if you compile a program
with MSVC, you’re required to have a EULA including the relevant terms
from the Visual Studio license. You’re not legally permitted to distribute
software in the manner of w64devkit — no installer, just a portable zip
distribution — if that software has been built with MSVC.  At least not
without special care, which nobody takes. (Don’t worry, I won’t tell.)</p>

<h3 id="how-to-use-libchkstk">How to use libchkstk</h3>

<p>To avoid libgcc entirely you need <code class="language-plaintext highlighter-rouge">-nostdlib</code>. Otherwise it’s implicitly
offered to the linker, and you’d need to manually check if it picked up
code from libgcc. If <code class="language-plaintext highlighter-rouge">ld</code> complains about a missing chkstk, use <code class="language-plaintext highlighter-rouge">-lchkstk</code>
to get a definition. If you use <code class="language-plaintext highlighter-rouge">-lchkstk</code> when it’s not needed, nothing
happens, so it’s safe to always include.</p>

<p>I also recently added <a href="https://github.com/skeeto/w64devkit/blob/master/src/libmemory.c">a libmemory</a> to w64devkit, providing tiny,
public domain definitions of <code class="language-plaintext highlighter-rouge">memset</code>, <code class="language-plaintext highlighter-rouge">memcpy</code>, <code class="language-plaintext highlighter-rouge">memmove</code>, <code class="language-plaintext highlighter-rouge">memcmp</code>, and
<code class="language-plaintext highlighter-rouge">strlen</code>. All compilers fabricate calls to these five functions even if
you don’t call them yourself, which is how they were selected. (Not
because I like them. <a href="/blog/2023/02/11/">I really don’t.</a>) If a <code class="language-plaintext highlighter-rouge">-nostdlib</code> build
complains about these, too, then add <code class="language-plaintext highlighter-rouge">-lmemory</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -nostdlib ... -lchkstk -lmemory
</code></pre></div></div>

<p>In MSVC the equivalent option is <code class="language-plaintext highlighter-rouge">/nodefaultlib</code>, after which you may see
missing chkstk errors, and perhaps more. <code class="language-plaintext highlighter-rouge">libchkstk.a</code> is compatible with
MSVC, and <code class="language-plaintext highlighter-rouge">link.exe</code> doesn’t care that the extension is <code class="language-plaintext highlighter-rouge">.a</code> rather than
<code class="language-plaintext highlighter-rouge">.lib</code>, so supply it at link time. Same goes for <code class="language-plaintext highlighter-rouge">libmemory.a</code> if you need
any of those, too.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl ... /link /nodefaultlib libchkstk.a libmemory.a
</code></pre></div></div>

<p>While I despise licenses, I still take them seriously in the software I
distribute. With libchkstk I have another tool to get it under control.</p>

<hr />

<p>Big thanks to Felipe Garcia for reviewing and correcting mistakes in this
article before it was published!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Two handy GDB breakpoint tricks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/01/28/"/>
    <id>urn:uuid:e56cce3b-8e70-497b-a13a-e609bacdde88</id>
    <updated>2024-01-28T21:56:07Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>Over the past couple months I’ve discovered a couple of handy tricks for
working with GDB breakpoints. I figured these out on my own, and I’ve not
seen either discussed elsewhere, so I really ought to share them.</p>

<h3 id="continuable-assertions">Continuable assertions</h3>

<p>The <code class="language-plaintext highlighter-rouge">assert</code> macro in typical C implementations <a href="/blog/2022/06/26/">leaves a lot to be
desired</a>, as do <code class="language-plaintext highlighter-rouge">raise</code> and <code class="language-plaintext highlighter-rouge">abort</code>, so I’ve suggested
alternative definitions that behave better under debuggers:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define assert(c)  while (!(c)) __builtin_trap()
#define assert(c)  while (!(c)) __builtin_unreachable()
#define assert(c)  while (!(c)) *(volatile int *)0 = 0
</span></code></pre></div></div>

<p>Each serves a slightly <a href="/blog/2023/10/08/#macros">different purpose</a> but still has the most
important property: Immediately halt the program <em>directly on the defect</em>.
None have an occasionally useful secondary property: Optionally allow the
program to continue through the defect. If the program reaches the body of
any of these macros then there is no reliable continuation. Even manually
nudging the instruction pointer over the assertion isn’t enough. Compilers
assume that the program cannot continue through the condition and generate
code accordingly.</p>

<p>The MSVC ecosystem has a solution for this on x86: <code class="language-plaintext highlighter-rouge">int3</code>. The portable
name is <code class="language-plaintext highlighter-rouge">__debugbreak</code>, a name <a href="/blog/2022/07/31/">I’ve borrowed elsewhere</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define assert(c)  do if (!(c)) __debugbreak(); while (0)
</span></code></pre></div></div>

<p>On x86 it inserts an <code class="language-plaintext highlighter-rouge">int3</code> instruction, which fires an interrupt,
trapping in the attached debugger, or otherwise abnormally terminating the
program. Because it’s an interrupt, it’s expected that the program might
continue. It even leaves the instruction pointer on the next instruction.
As of this writing, GCC has no matching intrinsic, but Clang recently
added <code class="language-plaintext highlighter-rouge">__builtin_debugtrap</code>. In GCC you need some less portable inline
assembly: <code class="language-plaintext highlighter-rouge">asm ("int3")</code>.</p>

<p>However, regardless of how you get an <code class="language-plaintext highlighter-rouge">int3</code> in your program, GDB does not
currently understand it. The problem is that feature I mentioned: The
instruction pointer does not point at the <code class="language-plaintext highlighter-rouge">int3</code> but the next instruction.
This confuses GDB, causing it to break in the wrong places, possibly even
in the wrong scope. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">int3_assert</span><span class="p">(...);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">int3</code> at the very end of the loop, GDB will break at the <em>top</em> of
the next loop iteration, because that’s where the instruction pointer
lands by the time GDB is involved. It’s a similar story when placed at the
end of a function, leaving GDB to break in the caller. To resolve this, we
need the instruction pointer to still be “inside” the breakpoint after the
interrupt fires. Easy! Add a <code class="language-plaintext highlighter-rouge">nop</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define breakpoint()  asm ("int3; nop")
</span></code></pre></div></div>

<p>This behaves beautifully, eliminating all the problems GDB has with a
plain <code class="language-plaintext highlighter-rouge">int3</code>. Not only is this a solid basis for a continuable assertion,
it’s also useful as a fast conditional breakpoint, where conventional
conditional breakpoints are far too slow.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">1000000000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="cm">/* rare condition */</span><span class="p">)</span> <span class="n">breakpoint</span><span class="p">();</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
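
<p>Combining the two pieces gives a continuable assertion. The following is only a sketch for GCC and Clang: it is spelled <code class="language-plaintext highlighter-rouge">__asm__</code> so that it also builds under strict <code class="language-plaintext highlighter-rouge">-std=c99</code>, and the <code class="language-plaintext highlighter-rouge">__builtin_trap</code> fallback for non-x86 targets is an assumption of this sketch (it halts, but is not continuable).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#if defined(__i386__) || defined(__x86_64__)
#  define breakpoint()  __asm__ volatile ("int3; nop")
#else
#  define breakpoint()  __builtin_trap()  // assumed fallback, not continuable
#endif

// Trap in the debugger on failure, but allow execution to resume
#define assert(c)  do if (!(c)) breakpoint(); while (0)
</code></pre></div></div>

<p>When such an assertion fires under GDB, <code class="language-plaintext highlighter-rouge">continue</code> should resume the program just past the failed check.</p>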

<p>Could GDB handle <code class="language-plaintext highlighter-rouge">int3</code> better? Yes! Visual Studio, for instance, does not
require the <code class="language-plaintext highlighter-rouge">nop</code> instruction. As far as I know there is no ARM equivalent
compatible with GDB (or even LLDB). The closest instruction, <code class="language-plaintext highlighter-rouge">brk #0x1</code>,
does not behave as needed.</p>

<h3 id="named-positions">Named positions</h3>

<p>GDB’s built-in user interface understands three classes of breakpoint
positions: symbols, context-free line numbers, and absolute addresses.
When you set some breakpoints and (re)start a program under GDB, each kind
of breakpoint is handled differently:</p>

<ul>
  <li>
    <p>Resolve each symbol, placing a breakpoint on its run-time address.</p>
  </li>
  <li>
    <p>Map each file+lineno tuple to a run-time address, and place a breakpoint
on that address. If the line does not exist (i.e. the file is shorter),
skip it.</p>
  </li>
  <li>
    <p>Place breakpoints exactly on each absolute address. If it’s not a mapped
address, don’t start the program.</p>
  </li>
</ul>

<p>The first is the best case because it adapts to program changes. Modify
the code, recompile, and the breakpoint generally remains where you want
it.</p>

<p>The third is the least useful. These breakpoints rarely survive across
rebuilds, and sometimes not even across reruns.</p>

<p>The second is in the middle between useful and useless. If you edit the
source file which has the breakpoint — likely, because you placed the
breakpoint there for a reason — chances are high that the line number is
no longer correct. Instead it drifts, requiring manual replacement. This
is tedious and GDB ought to do better. Think that’s unreasonable? The
Visual Studio debugger does exactly that <a href="https://lists.sr.ht/~skeeto/public-inbox/%3C2d3d7662a361ddd049f7dc65b94cecdd%40disroot.org%3E#%3C20240112210447.mxhvo7bg4mjp4jyz@nullprogram.com%3E">quite effectively</a> through
external code edits! GDB front ends tend to handle it better, especially
when they’re also the code editor and so directly observe all edits.</p>

<p>As a workaround we can get the first kind by temporarily <em>naming</em> a line
number. This requires editing the source, but remember, the very reason we
need it is because the source in question is actively changing. How to
name a line? C and C++ labels give a name to a program position:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">example</span><span class="p">(</span><span class="kt">double</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="p">...)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="nl">loop:</span>  <span class="c1">// named position at the start of the loop</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The name <code class="language-plaintext highlighter-rouge">loop</code> is local to <code class="language-plaintext highlighter-rouge">example</code>, but the qualified <code class="language-plaintext highlighter-rouge">example:loop</code> is
a global name, as suitable as any other symbol. I could, say, reliably
trace the progress of this loop despite changes to its position in the
source.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) dprintf example:loop,"nums[%d] = %g\n",i,nums[i]
</code></pre></div></div>

<p>One downside is dealing with <code class="language-plaintext highlighter-rouge">-Wunused-label</code> (enabled by <code class="language-plaintext highlighter-rouge">-Wall</code>), and so
I’ve considered disabling the warning in <a href="/blog/2023/04/29/">my defaults</a>. <strong>Update</strong>:
Matthew Fernandez pointed out that the <code class="language-plaintext highlighter-rouge">unused</code> label attribute eliminates
the warning, solving my problem:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="nl">loop:</span> <span class="n">__attribute</span><span class="p">((</span><span class="n">unused</span><span class="p">))</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>More often I use an assembly label, usually named <code class="language-plaintext highlighter-rouge">b</code> for convenience:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">asm</span> <span class="p">(</span><span class="s">"b:"</span><span class="p">);</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Like <code class="language-plaintext highlighter-rouge">int3</code>, sometimes it’s necessary to give it a <code class="language-plaintext highlighter-rouge">nop</code> so that GDB has
something on which to break. “Enabling” it at any time is quick:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) b b
</code></pre></div></div>

<p>Because it’s not <a href="https://sourceware.org/binutils/docs/as/Global.html"><code class="language-plaintext highlighter-rouge">.globl</code></a>, it’s a local symbol, and I can place up to
one per translation unit, all covered by the same GDB breakpoint item
(less useful than it sounds). I haven’t actually checked, but I probably
more often use <code class="language-plaintext highlighter-rouge">dprintf</code> with such named lines than actual breakpoints.</p>

<p>If you have similar tips and tricks of your own, I’d like to learn about
them!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>So you want custom allocator support in your C library</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/12/17/"/>
    <id>urn:uuid:1ffa33fe-c701-4cf7-b8fb-6c30a14497a3</id>
    <updated>2023-12-17T17:52:26Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=38675379">on Hacker News</a> and <a href="https://old.reddit.com/r/C_Programming/comments/18ks5qg/">on reddit</a>.</em></p>

<p>Users of mature C libraries conventionally get to choose how memory is
allocated — that is, when it <a href="/blog/2018/06/10/">cannot be avoided entirely</a>. The C
standard never laid down a convention — <a href="/blog/2023/02/11/">perhaps for the better</a> —
so each library re-invents an allocator interface. Not all are created
equal, and most repeat a few fundamental mistakes. Often the interface is
merely a token effort, to check off that it’s “supported” without actual
consideration to its use. This article describes the critical features of
a practical allocator interface, and demonstrates why they’re important.</p>

<!--more-->

<p>Before diving into the details, here’s the checklist for library authors:</p>

<ol>
  <li>All allocation functions accept a user-defined context pointer.</li>
  <li>The “free” function accepts the original allocation size.</li>
  <li>The “realloc” function accepts both old and new size.</li>
</ol>

<h3 id="context-pointer">Context pointer</h3>

<p>The standard library allocator keeps its state in global variables. This
makes for a simple interface, but comes with significant performance and
complexity costs. These costs likely motivate custom allocator use in the
first place, in which case slavishly duplicating the standard interface is
essentially the worst possible option. Unfortunately this is typical:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define LIB_MALLOC  malloc
#define LIB_FREE    free
</span></code></pre></div></div>

<p>I could observe the library’s allocations, and I could swap in a library
functionally equivalent to the standard library allocator (jemalloc,
mimalloc, etc.), but that’s about it. Better than nothing, I suppose, but
only just so. Function pointer callbacks are slightly better:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">malloc</span><span class="p">)(</span><span class="kt">size_t</span><span class="p">);</span>
    <span class="kt">void</span>  <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="p">}</span> <span class="n">allocator</span><span class="p">;</span>

<span class="n">session</span> <span class="o">*</span><span class="nf">session_new</span><span class="p">(...,</span> <span class="n">allocator</span><span class="p">);</span>
</code></pre></div></div>

<p>At least I could use different allocators at different times, and there
are even <a href="/blog/2017/01/08/">tricks to bind a context pointer</a> to the callback. It
also works when the library is dynamically linked.</p>

<p>Either case barely qualifies as custom allocator support, and they’re
useless when it matters most. Only a small ingredient is needed to make
these interfaces useful: a context pointer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NOTE: Better, but still not great</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">malloc</span><span class="p">)(</span><span class="kt">size_t</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
    <span class="kt">void</span>  <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
    <span class="kt">void</span>   <span class="o">*</span><span class="n">ctx</span><span class="p">;</span>
<span class="p">}</span> <span class="n">allocator</span><span class="p">;</span>
</code></pre></div></div>

<p>Users can choose <em>from where</em> the library will allocate at a given time.
It liberates the allocator from global variables (or janky workarounds),
and multithreading woes. The default can still hook up to the standard
library through stubs that fit these interfaces.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">lib_malloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ctx</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">lib_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ctx</span><span class="p">;</span>
    <span class="n">free</span><span class="p">(</span><span class="n">ptr</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="n">allocator</span> <span class="n">lib_allocator</span> <span class="o">=</span> <span class="p">{</span><span class="n">lib_malloc</span><span class="p">,</span> <span class="n">lib_free</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>
</code></pre></div></div>

<p>Note that the context pointer came after the “standard” arguments. All
things being equal, “extra” arguments should go after standard ones. But
don’t sweat it! In the most common calling conventions this allows stub
implementations to be merely an unconditional jump. It’s <em>as though</em> the
stubs are a kind of subtype of the original functions.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">lib_malloc:</span>
        <span class="nf">jmp</span> <span class="nv">malloc</span>
<span class="nl">lib_free:</span>
        <span class="nf">jmp</span> <span class="nv">free</span>
</code></pre></div></div>

<p>Typically the decision is completely arbitrary, and so this minutia tips
the balance.</p>

<h4 id="context-pointer-example">Context pointer example</h4>

<p>So what’s the big deal? It means we can trivially plug in, say, a <a href="/blog/2023/09/27/">tiny
arena allocator</a>. To demonstrate, consider this fictional string
set and partial JSON API, each of which supports a custom allocator. For
simplicity — I’m attempting to balance substance and brevity — they share
an allocator interface. (Note: Because <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf">subscripts and sizes should be
signed</a>, and we’re now breaking away from the standard library
allocator, I will use <code class="language-plaintext highlighter-rouge">ptrdiff_t</code> for the rest of the examples.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">malloc</span><span class="p">)(</span><span class="kt">ptrdiff_t</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
    <span class="kt">void</span>  <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
    <span class="kt">void</span>   <span class="o">*</span><span class="n">ctx</span><span class="p">;</span>
<span class="p">}</span> <span class="n">allocator</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">set</span> <span class="n">set</span><span class="p">;</span>
<span class="n">set</span>  <span class="o">*</span><span class="nf">set_new</span><span class="p">(</span><span class="n">allocator</span> <span class="o">*</span><span class="p">);</span>
<span class="n">set</span>  <span class="o">*</span><span class="nf">set_free</span><span class="p">(</span><span class="n">set</span> <span class="o">*</span><span class="p">);</span>
<span class="n">bool</span>  <span class="nf">set_add</span><span class="p">(</span><span class="n">set</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">json</span> <span class="n">json</span><span class="p">;</span>
<span class="n">json</span>     <span class="o">*</span><span class="nf">json_load</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">allocator</span> <span class="o">*</span><span class="p">);</span>
<span class="n">json</span>     <span class="o">*</span><span class="nf">json_free</span><span class="p">(</span><span class="n">json</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">ptrdiff_t</span> <span class="nf">json_length</span><span class="p">(</span><span class="n">json</span> <span class="o">*</span><span class="p">);</span>
<span class="n">json</span>     <span class="o">*</span><span class="nf">json_subscript</span><span class="p">(</span><span class="n">json</span> <span class="o">*</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">i</span><span class="p">);</span>
<span class="n">json</span>     <span class="o">*</span><span class="nf">json_getfield</span><span class="p">(</span><span class="n">json</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">field</span><span class="p">);</span>
<span class="kt">double</span>    <span class="nf">json_getnumber</span><span class="p">(</span><span class="n">json</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">char</span>     <span class="o">*</span><span class="nf">json_getstring</span><span class="p">(</span><span class="n">json</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">set</code> and <code class="language-plaintext highlighter-rouge">json</code> objects retain a copy of the <code class="language-plaintext highlighter-rouge">allocator</code> object for all
allocations made through that object. Given nothing, they default to the
standard library using the pass-through definitions above. Used together
with the standard library allocator:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">double</span> <span class="n">sum</span><span class="p">;</span>
    <span class="n">bool</span>   <span class="n">ok</span><span class="p">;</span>
<span class="p">}</span> <span class="n">sum_result</span><span class="p">;</span>

<span class="n">sum_result</span> <span class="nf">sum_unique</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sum_result</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">json</span> <span class="o">*</span><span class="n">namevals</span> <span class="o">=</span> <span class="n">json_load</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">namevals</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">r</span><span class="p">;</span>  <span class="c1">// parse error</span>
    <span class="p">}</span>

    <span class="kt">ptrdiff_t</span> <span class="n">arraylen</span> <span class="o">=</span> <span class="n">json_length</span><span class="p">(</span><span class="n">namevals</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">arraylen</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">json_free</span><span class="p">(</span><span class="n">namevals</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">r</span><span class="p">;</span>  <span class="c1">// not an array</span>
    <span class="p">}</span>

    <span class="n">set</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="n">set_new</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">arraylen</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">json</span> <span class="o">*</span><span class="n">element</span> <span class="o">=</span> <span class="n">json_subscript</span><span class="p">(</span><span class="n">namevals</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="n">json</span> <span class="o">*</span><span class="n">name</span>    <span class="o">=</span> <span class="n">json_getfield</span><span class="p">(</span><span class="n">element</span><span class="p">,</span> <span class="s">"name"</span><span class="p">);</span>
        <span class="n">json</span> <span class="o">*</span><span class="n">value</span>   <span class="o">=</span> <span class="n">json_getfield</span><span class="p">(</span><span class="n">element</span><span class="p">,</span> <span class="s">"value"</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">name</span> <span class="o">||</span> <span class="o">!</span><span class="n">value</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">set_free</span><span class="p">(</span><span class="n">seen</span><span class="p">);</span>
            <span class="n">json_free</span><span class="p">(</span><span class="n">namevals</span><span class="p">);</span>
            <span class="k">return</span> <span class="n">r</span><span class="p">;</span>  <span class="c1">// invalid element</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">set_add</span><span class="p">(</span><span class="n">seen</span><span class="p">,</span> <span class="n">json_getstring</span><span class="p">(</span><span class="n">name</span><span class="p">)))</span> <span class="p">{</span>
            <span class="n">r</span><span class="p">.</span><span class="n">sum</span> <span class="o">+=</span> <span class="n">json_getnumber</span><span class="p">(</span><span class="n">value</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="n">set_free</span><span class="p">(</span><span class="n">seen</span><span class="p">);</span>
    <span class="n">json_free</span><span class="p">(</span><span class="n">namevals</span><span class="p">);</span>
    <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which given as JSON input:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"foo"</span><span class="p">,</span><span class="w"> </span><span class="nl">"value"</span><span class="p">:</span><span class="w">  </span><span class="mi">123</span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bar"</span><span class="p">,</span><span class="w"> </span><span class="nl">"value"</span><span class="p">:</span><span class="w">  </span><span class="mi">456</span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"foo"</span><span class="p">,</span><span class="w"> </span><span class="nl">"value"</span><span class="p">:</span><span class="w"> </span><span class="mi">1000</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<p>Would return <code class="language-plaintext highlighter-rouge">579.0</code>. Because it’s using standard library allocation, it
must carefully clean up before returning. There’s also no out-of-memory
handling because, in practice, programs typically do not get to observe
and respond to the standard allocator running out of memory.</p>

<p>We can improve and simplify it with an arena allocator:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>    <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span>    <span class="o">*</span><span class="n">end</span><span class="p">;</span>
    <span class="kt">jmp_buf</span> <span class="o">*</span><span class="n">oom</span><span class="p">;</span>
<span class="p">}</span> <span class="n">arena</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">arena_malloc</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">arena</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">available</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">alignment</span> <span class="o">=</span> <span class="o">-</span><span class="n">size</span> <span class="o">&amp;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">&gt;</span> <span class="n">available</span><span class="o">-</span><span class="n">alignment</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">longjmp</span><span class="p">(</span><span class="o">*</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">oom</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-=</span> <span class="n">size</span> <span class="o">+</span> <span class="n">alignment</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">arena_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// nothing to do (yet!)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’m allocating from the end rather than the beginning because it will make
a later change simpler. Applying that to the function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sum_result</span> <span class="nf">sum_unique</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sum_result</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>

    <span class="n">allocator</span> <span class="n">a</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">a</span><span class="p">.</span><span class="n">malloc</span> <span class="o">=</span> <span class="n">arena_malloc</span><span class="p">;</span>
    <span class="n">a</span><span class="p">.</span><span class="n">free</span> <span class="o">=</span> <span class="n">arena_free</span><span class="p">;</span>
    <span class="n">a</span><span class="p">.</span><span class="n">ctx</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">;</span>

    <span class="n">json</span> <span class="o">*</span><span class="n">namevals</span> <span class="o">=</span> <span class="n">json_load</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">a</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">namevals</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">r</span><span class="p">;</span>  <span class="c1">// parse error</span>
    <span class="p">}</span>

    <span class="kt">ptrdiff_t</span> <span class="n">arraylen</span> <span class="o">=</span> <span class="n">json_length</span><span class="p">(</span><span class="n">namevals</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">arraylen</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">r</span><span class="p">;</span>  <span class="c1">// not an array</span>
    <span class="p">}</span>

    <span class="n">set</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="n">set_new</span><span class="p">(</span><span class="o">&amp;</span><span class="n">a</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">arraylen</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">json</span> <span class="o">*</span><span class="n">element</span> <span class="o">=</span> <span class="n">json_subscript</span><span class="p">(</span><span class="n">namevals</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="n">json</span> <span class="o">*</span><span class="n">name</span>    <span class="o">=</span> <span class="n">json_getfield</span><span class="p">(</span><span class="n">element</span><span class="p">,</span> <span class="s">"name"</span><span class="p">);</span>
        <span class="n">json</span> <span class="o">*</span><span class="n">value</span>   <span class="o">=</span> <span class="n">json_getfield</span><span class="p">(</span><span class="n">element</span><span class="p">,</span> <span class="s">"value"</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">name</span> <span class="o">||</span> <span class="o">!</span><span class="n">value</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">r</span><span class="p">;</span>  <span class="c1">// invalid element</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">set_add</span><span class="p">(</span><span class="n">seen</span><span class="p">,</span> <span class="n">json_getstring</span><span class="p">(</span><span class="n">name</span><span class="p">)))</span> <span class="p">{</span>
            <span class="n">r</span><span class="p">.</span><span class="n">sum</span> <span class="o">+=</span> <span class="n">json_getnumber</span><span class="p">(</span><span class="n">value</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Calls to <code class="language-plaintext highlighter-rouge">set_free</code> and <code class="language-plaintext highlighter-rouge">json_free</code> are no longer necessary because the
arena automatically frees these on any return, in O(1). I almost feel bad
the library authors bothered to write them! It also handles allocation
failure without introducing it to <code class="language-plaintext highlighter-rouge">sum_unique</code>. We may even deliberately
restrict the memory available to this function — perhaps because the input
is untrusted, and we want to quickly abort denial-of-service attacks — by
giving it a small arena, relying on out-of-memory to reject pathological
inputs.</p>
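<p>For the caller's side of that arrangement, here is a minimal sketch. It
repeats the <code class="language-plaintext highlighter-rouge">arena</code> definitions so that it stands alone. The function
<code class="language-plaintext highlighter-rouge">try_with_budget</code>, the 1 KiB budget, and the allocation loop standing in
for library work are hypothetical, chosen only for illustration.</p>

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef struct {
    char    *beg;
    char    *end;
    jmp_buf *oom;
} arena;

// Same bump allocator as above: carve from the end, jump out on exhaustion.
static void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

// Hypothetical caller: makes count allocations of size bytes inside a
// 1 KiB budget. Returns 1 on success, 0 if the budget ran out.
static int try_with_budget(int count, ptrdiff_t size)
{
    assert(count >= 0 && size >= 0);
    char buf[1<<10];
    jmp_buf oom;
    arena scratch = {buf, buf + sizeof(buf), &oom};
    if (setjmp(oom)) {
        return 0;  // arrived here via longjmp: out of memory
    }
    for (int i = 0; i < count; i++) {
        arena_malloc(size, &scratch);  // stand-in for library work
    }
    return 1;  // returning releases everything in scratch
}
```

<p>Shrinking the buffer tightens the budget, and the out-of-memory path
never appears in the code doing the allocating.</p>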

<p>There are so many possibilities unlocked by the context pointer.</p>

<h3 id="provide-the-original-allocation-size-when-freeing">Provide the original allocation size when freeing</h3>

<p>When an application frees an object it always has the original, requested
allocation size on hand. After all, it’s a necessary condition to use the
object correctly. In the simplest case it’s the size of the freed object’s
type: a static quantity. If it’s an array, then it’s a multiple of the
tracked capacity: a dynamic quantity. In any case the size is either known
statically or tracked dynamically by the application.</p>

<p>Yet <code class="language-plaintext highlighter-rouge">free()</code> does not accept a size, meaning that the allocator must track
the information redundantly! That’s a needless burden on custom
allocators, and with a bit of care a library can lift it.</p>

<p>This was noticed in C++, and WG21 added <a href="https://isocpp.org/files/papers/n3778.html">sized deallocation</a> in
C++14. It’s now the default on two of the three major implementations (and
probably not the two you’d guess). In other words, object size is so
readily available that it can mostly be automated away. Notable exception:
<code class="language-plaintext highlighter-rouge">operator new[]</code> and <code class="language-plaintext highlighter-rouge">operator delete[]</code> with trivial destructors. With
non-trivial destructors, <code class="language-plaintext highlighter-rouge">operator new[]</code> must track the array length for
its own purposes <em>on top of libc bookkeeping</em>. In other words, array
allocations have their size stored in at least three different places! C23
later gained <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm">a similar <code class="language-plaintext highlighter-rouge">free_sized</code></a>.</p>

<p>That means the “free” interface should look like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">lib_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
</code></pre></div></div>

<p>And calls inside the library might look like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lib_free</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">),</span> <span class="n">ctx</span><span class="p">);</span>
<span class="n">lib_free</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">a</span><span class="p">)</span><span class="o">*</span><span class="n">len</span><span class="p">,</span> <span class="n">ctx</span><span class="p">);</span>
</code></pre></div></div>

<p>Now that <code class="language-plaintext highlighter-rouge">arena_free</code> has size information, it can free an allocation if
it was the most recent:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">arena_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">arena</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ptr</span> <span class="o">==</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">ptrdiff_t</span> <span class="n">alignment</span> <span class="o">=</span> <span class="o">-</span><span class="n">size</span> <span class="o">&amp;</span> <span class="mi">15</span><span class="p">;</span>
        <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">+=</span> <span class="n">size</span> <span class="o">+</span> <span class="n">alignment</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the library allocates short-lived objects to compute some value, then
discards in reverse order, the memory can be reused. The arena doesn’t
have to do anything special. The library merely needs to share its
knowledge with the allocator.</p>
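<p>To make the ordering requirement concrete, here is a small sketch that
measures how much of the arena stays occupied after two temporaries are
discarded. The helper <code class="language-plaintext highlighter-rouge">lifo_leftover</code> and its sizes are hypothetical; the
allocator functions are the ones above, repeated so the block stands
alone.</p>

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef struct {
    char    *beg;
    char    *end;
    jmp_buf *oom;
} arena;

static void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

static void arena_free(void *ptr, ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    if (ptr == a->end) {
        ptrdiff_t alignment = -size & 15;
        a->end += size + alignment;
    }
}

// Hypothetical measurement: allocate 4 then 64 bytes, free both, and
// return the bytes still occupied (-1 if the arena ran out).
static ptrdiff_t lifo_leftover(int reverse)
{
    char buf[256];
    jmp_buf oom;
    arena a = {buf, buf + sizeof(buf), &oom};
    if (setjmp(oom)) {
        return -1;
    }
    char *top = a.end;
    void *x = arena_malloc(4, &a);
    void *y = arena_malloc(64, &a);
    assert(x && y);
    if (reverse) {
        arena_free(y, 64, &a);  // most recent first: both reclaimed
        arena_free(x, 4, &a);
    } else {
        arena_free(x, 4, &a);   // not the most recent: a no-op
        arena_free(y, 64, &a);
    }
    return top - a.end;
}
```

<p>Freed in reverse order, nothing is left over. Freed in allocation order,
the first object's 16 padded bytes stay occupied until the arena itself
goes away.</p>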

<p>Beyond arena allocation, an allocator could use the size to locate the
allocation’s size class and, say, push it onto a freelist of its size
class. <a href="https://www.youtube.com/watch?v=LIb3L4vKZ7U">Size-class freelists compose well with arenas</a>, and an
implementation is short and simple when the caller of “free” communicates
object size.</p>
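<p>A rough sketch of that idea, under my own assumptions: power-of-two
size classes starting at 16 bytes, a null return instead of a
<code class="language-plaintext highlighter-rouge">longjmp</code> on exhaustion, and a backing buffer that is 16-byte aligned.
The names <code class="language-plaintext highlighter-rouge">pool</code>, <code class="language-plaintext highlighter-rouge">pool_malloc</code>, and <code class="language-plaintext highlighter-rouge">pool_free</code> are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>

enum { NCLASSES = 8 };  // block sizes 16, 32, ... 2048 bytes

typedef struct freenode {
    struct freenode *next;
} freenode;

typedef struct {
    char     *beg;
    char     *end;
    freenode *lists[NCLASSES];  // one freelist per size class
} pool;

// Smallest class whose block size (16<<c) holds size bytes, or -1.
static int size_class(ptrdiff_t size)
{
    assert(size >= 0);
    for (int c = 0; c < NCLASSES; c++) {
        if (size <= (ptrdiff_t)16<<c) {
            return c;
        }
    }
    return -1;
}

static void *pool_malloc(ptrdiff_t size, void *ctx)
{
    pool *p = ctx;
    int c = size_class(size);
    if (c < 0) {
        return 0;  // too large for any class
    }
    if (p->lists[c]) {
        freenode *n = p->lists[c];  // reuse a freed block
        p->lists[c] = n->next;
        return n;
    }
    ptrdiff_t block = (ptrdiff_t)16 << c;
    if (block > p->end - p->beg) {
        return 0;  // arena exhausted
    }
    return p->end -= block;
}

// The caller-supplied size is all that's needed to find the freelist.
static void pool_free(void *ptr, ptrdiff_t size, void *ctx)
{
    pool *p = ctx;
    int c = size_class(size);
    if (!ptr || c < 0) {
        return;
    }
    freenode *n = ptr;
    n->next = p->lists[c];
    p->lists[c] = n;
}

// Hypothetical walk-through. Returns 1 if blocks are reused per class.
static int pool_demo(void)
{
    static _Alignas(16) char buf[1024];
    pool p = {buf, buf + sizeof(buf), {0}};
    void *a = pool_malloc(24, &p);   // 32-byte class
    void *b = pool_malloc(100, &p);  // 128-byte class
    pool_free(a, 24, &p);
    void *c = pool_malloc(30, &p);   // 32-byte class: reuses a's block
    pool_free(b, 100, &p);
    void *d = pool_malloc(24, &p);   // 32-byte list is empty: fresh block
    return a && b && c == a && d != a && d != b;
}
```

<p>There is no per-allocation header anywhere. The size passed to "free" is
the only bookkeeping.</p>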

<p>Another idea: During testing, use a debug allocator that tracks object
size and validates the reported size against its own bookkeeping. This can
help catch mistakes sooner.</p>
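<p>One way such a debug allocator might look, as a sketch layered on the
standard allocator. The names <code class="language-plaintext highlighter-rouge">debug_ctx</code>, <code class="language-plaintext highlighter-rouge">debug_malloc</code>, and
<code class="language-plaintext highlighter-rouge">debug_free</code> are hypothetical, and counting mismatches rather than
aborting is my own choice so that the result can be inspected.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical bookkeeping shared through the context pointer.
typedef struct {
    ptrdiff_t live;        // allocations not yet freed
    ptrdiff_t mismatches;  // frees that reported the wrong size
} debug_ctx;

// Header stored in front of each allocation. The union pads it to the
// strictest fundamental alignment so the user pointer stays aligned.
typedef union {
    ptrdiff_t   size;
    max_align_t align;
} debug_header;

static void *debug_malloc(ptrdiff_t size, void *ctx)
{
    debug_ctx *d = ctx;
    assert(size >= 0);
    debug_header *h = malloc(sizeof(*h) + (size_t)size);
    if (!h) {
        return 0;
    }
    h->size = size;  // remember what was requested
    d->live++;
    return h + 1;
}

static void debug_free(void *ptr, ptrdiff_t size, void *ctx)
{
    debug_ctx *d = ctx;
    if (!ptr) {
        return;
    }
    debug_header *h = (debug_header *)ptr - 1;
    if (h->size != size) {
        d->mismatches++;  // a stricter version would abort right here
    }
    d->live--;
    free(h);
}

// Hypothetical exercise: one correct free and one with the wrong size.
static debug_ctx debug_demo(void)
{
    debug_ctx d = {0};
    int  *p = debug_malloc(sizeof(*p), &d);
    long *a = debug_malloc(sizeof(*a)*10, &d);
    debug_free(p, sizeof(*p), &d);    // matches the request
    debug_free(a, sizeof(*a)*9, &d);  // off by one element: flagged
    return d;
}
```

<p>A nonzero <code class="language-plaintext highlighter-rouge">live</code> count at exit also doubles as a cheap leak check.</p>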

<h3 id="provide-the-old-size-when-resizing-an-allocation">Provide the old size when resizing an allocation</h3>

<p>Resizing an allocation requires a lot from an allocator, and it should be
avoided if possible. At the very least it cannot be done <em>at all</em> without
knowing the original allocation size. An allocator can’t simply no-op it
like it can with “free.” With the standard library interface, allocators
have no choice but to redundantly track object sizes when “realloc” is
required.</p>

<p>So, just as with “free,” the allocator should be given the old object
size!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">lib_realloc</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">old</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">new</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
</code></pre></div></div>

<p>At the very least, an allocator could implement “realloc” with “malloc”
and <code class="language-plaintext highlighter-rouge">memcpy</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">arena_realloc</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">old</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">new</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">new</span> <span class="o">&gt;</span> <span class="n">old</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="n">arena_malloc</span><span class="p">(</span><span class="n">new</span><span class="p">,</span> <span class="n">ctx</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">memcpy</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">ptr</span><span class="p">,</span> <span class="n">old</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Of the three checklist items, this is the most neglected. Exercise for the
reader: The last-allocated object <em>can</em> be resized in place, instead using
<code class="language-plaintext highlighter-rouge">memmove</code>. If this is frequently expected, allocate from the front, adjust
<code class="language-plaintext highlighter-rouge">arena_free</code> as needed, and extend the allocation in place <a href="/blog/2023/10/05/#addendum-extend-the-last-allocation">as discussed in a
previous addendum</a>, without any copying.</p>

<h3 id="real-world-examples">Real world examples</h3>

<p>Let’s examine real world examples to see how well they fit the checklist.
First up is <a href="https://troydhanson.github.io/uthash/userguide.html#_hooks">uthash</a>, a popular, easy-to-use, intrusive hash table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define uthash_malloc(sz) my_malloc(sz)
#define uthash_free(ptr, sz) my_free(ptr)
</span></code></pre></div></div>

<p>No “realloc” so it trivially checks (3). It optionally provides the old
size to “free” which checks (2). However it misses (1) which is the most
important, greatly limiting its usefulness.</p>

<p>Next is the venerable <a href="https://www.zlib.net/manual.html">zlib</a>. It has function pointers with these
prototypes on its <code class="language-plaintext highlighter-rouge">z_stream</code> object.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">zlib_malloc</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">items</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">size</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">zlib_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">);</span>
</code></pre></div></div>

<p>The context pointer checks (1), and I can confirm from experience that
it’s genuinely useful with a custom allocator. No “realloc” so it passes
(3) automatically. It misses (2), but in practice this hardly matters: It
allocates everything up front, and frees at the very end, meaning a no-op
“free” is quite sufficient.</p>
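
<p>To illustrate, here is a minimal sketch of a bump allocator behind those
two prototypes. The <code class="language-plaintext highlighter-rouge">arena</code> type and the 16-byte alignment are
assumptions made for the example, not anything zlib prescribes.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical bump arena; not part of zlib.
typedef struct {
    char *beg;
    char *end;
} arena;

// Same shape as the zlib allocation callback. ctx is the opaque pointer
// stored on the stream, here assumed to point at an arena.
static void *zlib_malloc(void *ctx, unsigned items, unsigned size)
{
    arena *a = ctx;
    ptrdiff_t avail = a->end - a->beg;
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & 15);
    // Widen before multiplying so items*size cannot wrap
    unsigned long long total = (unsigned long long)items * size;
    if (pad > avail || total > (unsigned long long)(avail - pad)) {
        return 0;  // out of memory
    }
    char *r = a->beg + pad;
    a->beg = r + total;
    return r;
}

// No-op: everything is released at once when the arena is reset.
static void zlib_free(void *ctx, void *ptr)
{
    (void)ctx;
    (void)ptr;
}
```

<p>In a real program these would be assigned to the <code class="language-plaintext highlighter-rouge">zalloc</code> and
<code class="language-plaintext highlighter-rouge">zfree</code> fields of the <code class="language-plaintext highlighter-rouge">z_stream</code>, with the arena pointer in
<code class="language-plaintext highlighter-rouge">opaque</code>, before initializing the stream.</p>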

<p>Finally there’s the <a href="https://www.lua.org/manual/5.4/manual.html#lua_Alloc">Lua programming language</a> with this economical,
single-function interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">lua_Alloc</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">old</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">new</span><span class="p">);</span>
</code></pre></div></div>

<p>It packs all three allocator functions into one function. It includes a
context pointer (1), a free size (2), and two realloc sizes (3). It’s a
simple allocator’s best friend!</p>
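
<p>Here is a sketch of how an arena might sit behind that single function.
The <code class="language-plaintext highlighter-rouge">arena</code> type, the function name, and the fixed 16-byte alignment
are assumptions for illustration. One detail from the Lua manual matters:
when <code class="language-plaintext highlighter-rouge">ptr</code> is null the old size is a type tag rather than a size,
so it must not be used as a copy length.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical bump arena; not part of Lua.
typedef struct {
    char *beg;
    char *end;
} arena;

static void *arena_lua_alloc(void *ctx, void *ptr, size_t old, size_t new)
{
    arena *a = ctx;
    if (new == 0) {
        return 0;        // "free": nothing to do in a bump arena
    }
    if (!ptr) {
        old = 0;         // fresh allocation: old is a type tag, ignore it
    } else if (new <= old) {
        return ptr;      // shrinking must not fail: keep the block
    }
    ptrdiff_t avail = a->end - a->beg;
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & 15);
    if (pad > avail || new > (size_t)(avail - pad)) {
        return 0;        // out of memory
    }
    char *r = a->beg + pad;
    a->beg = r + new;
    if (old) {
        memcpy(r, ptr, old);
    }
    return r;
}
```

<p>A function like this would be handed to <code class="language-plaintext highlighter-rouge">lua_newstate</code> along with
the arena pointer as its user data.</p>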

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>My personal C coding style as of late 2023</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/10/08/"/>
    <id>urn:uuid:60db7343-43f1-469f-9e9a-8af4d4c46b5a</id>
    <updated>2023-10-08T23:30:57Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=37815674">on Hacker News</a> and <a href="https://old.reddit.com/r/C_Programming/comments/173e0vn/">on reddit</a>.</em></p>

<p>This has been a ground-breaking year for my C skills, and paradigm shifts
in my technique have provoked me to reconsider my habits and coding style.
It’s been my largest personal style change in years, so I’ve decided to
take a snapshot of its current state and my reasoning. These changes have
produced significant productive and organizational benefits, so while most
is certainly subjective, it likely includes a few objective improvements.
I’m not saying everyone should write C this way, and when I contribute
code to a project I follow their local style. This is about what works
well for me.</p>

<!--more-->

<h3 id="primitive-types">Primitive types</h3>

<p>Starting with the fundamentals, I’ve been using short names for primitive
types. The resulting clarity was more than I had expected, and it’s made
my code more enjoyable to review. These names appear frequently throughout
a program, so conciseness pays. Also, now that I’ve gone without, <code class="language-plaintext highlighter-rouge">_t</code>
suffixes are more visually distracting than I had realized.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">uint8_t</span>   <span class="n">u8</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">char16_t</span>  <span class="n">c16</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">int32_t</span>   <span class="n">b32</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">int32_t</span>   <span class="n">i32</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">uint32_t</span>  <span class="n">u32</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">uint64_t</span>  <span class="n">u64</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">float</span>     <span class="n">f32</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">double</span>    <span class="n">f64</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">uintptr_t</span> <span class="n">uptr</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">char</span>      <span class="n">byte</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">size_t</span>    <span class="n">usize</span><span class="p">;</span>
</code></pre></div></div>

<p>Some people prefer an <code class="language-plaintext highlighter-rouge">s</code> prefix for signed types. I prefer <code class="language-plaintext highlighter-rouge">i</code>, plus as
you’ll see, I have other designs for <code class="language-plaintext highlighter-rouge">s</code>. For sizes, <code class="language-plaintext highlighter-rouge">isize</code> would be more
consistent, and wouldn’t hog the identifier, but <a href="/blog/2023/09/27/">signed sizes are the
way</a> and so I want them in a place of privilege. <code class="language-plaintext highlighter-rouge">usize</code> is niche,
mainly for interacting with external interfaces where it might matter.</p>

<p><code class="language-plaintext highlighter-rouge">b32</code> is a “32-bit boolean” and communicates intent. I could use <code class="language-plaintext highlighter-rouge">_Bool</code>,
but I’d rather stick to a natural word size and stay away from its weird
semantics. To beginners it might seem like “wasting memory” by using a
32-bit boolean, but in practice that’s never the case. It’s either in a
register (return value, local variable) or would be padded anyway (struct
field). When it actually matters, I pack booleans into a <code class="language-plaintext highlighter-rouge">flags</code> variable,
and a 1-byte boolean is rarely important.</p>
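
<p>For the case where space does matter, packing might look something like
the following sketch. The <code class="language-plaintext highlighter-rouge">widget</code> type, the flag names, and the helper
functions are hypothetical, invented only for this example.</p>

```c
#include <stdint.h>

typedef int32_t  b32;
typedef uint32_t u32;

// Hypothetical flag bits, one per boolean
enum {
    FLAG_VISIBLE = 1 << 0,
    FLAG_DIRTY   = 1 << 1,
    FLAG_LOCKED  = 1 << 2,
};

// Up to 32 booleans share a single 32-bit field
typedef struct {
    u32 flags;
} widget;

static b32 hasflag(widget w, u32 f)
{
    return (w.flags & f) != 0;
}

static void setflag(widget *w, u32 f)
{
    w->flags |= f;
}

static void clearflag(widget *w, u32 f)
{
    w->flags &= ~f;
}
```

<p>Individual queries still return a plain <code class="language-plaintext highlighter-rouge">b32</code>, so the packing stays an
implementation detail of the structure.</p>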

<p>While UTF-16 might seem niche, it’s a necessary evil when dealing with
Win32, so <code class="language-plaintext highlighter-rouge">c16</code> (“16-bit character”) has made a frequent appearance. I
could have based it on <code class="language-plaintext highlighter-rouge">uint16_t</code>, but putting the name <code class="language-plaintext highlighter-rouge">char16_t</code> in its
“type hierarchy” communicates to debuggers, particularly GDB, that for
display purposes these variables hold character data. Officially Win32
uses a type named <code class="language-plaintext highlighter-rouge">wchar_t</code>, but I like being explicit about UTF-16.</p>

<p><code class="language-plaintext highlighter-rouge">u8</code> is for octets, usually UTF-8 data. It’s distinct from <code class="language-plaintext highlighter-rouge">byte</code>, which
represents raw memory and is a special <em>aliasing</em> type. In theory these
can be distinct types with differing semantics, though I’m not aware of
any implementation that does so (yet?). For now it’s about intent.</p>

<p>What about systems that don’t support fixed width types? That’s academic,
and far too much time has been wasted worrying about it. That includes
time wasted on typing out <code class="language-plaintext highlighter-rouge">int_fast32_t</code> and similar nonsense. Virtually
no existing software would actually work correctly on such systems — I’m
certain nobody’s <em>testing</em> it after all — so it seems nobody else cares
either.</p>

<p>I don’t intend to use these names in isolation, such as in code snippets
(outside of this article). If I did, examples would require the <code class="language-plaintext highlighter-rouge">typedefs</code>
to give readers the complete context. That’s not worth extra explanation.
Even in the most recent articles I’ve used <code class="language-plaintext highlighter-rouge">ptrdiff_t</code> instead of <code class="language-plaintext highlighter-rouge">size</code>.</p>

<h3 id="macros">Macros</h3>

<p>Next, some “standard” macros:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define countof(a)    (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s)   (countof(s) - 1)
#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)
</span></code></pre></div></div>

<p>While I still prefer <code class="language-plaintext highlighter-rouge">ALL_CAPS</code> for constants, I’ve adopted lowercase for
function-like macros because it’s nicer to read. They don’t have the same
namespace problems as other macro definitions: I can have a macro named
<code class="language-plaintext highlighter-rouge">new()</code> and also variables and fields named <code class="language-plaintext highlighter-rouge">new</code> because they don’t look
like function calls.</p>
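
<p>The <code class="language-plaintext highlighter-rouge">new</code> macro depends on an <code class="language-plaintext highlighter-rouge">alloc</code> function that isn’t shown
here. As a sketch, assuming a simple two-pointer arena and an
abort-on-failure policy (both assumptions for this example), the pieces
might fit together like so:</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef ptrdiff_t size;
typedef uintptr_t uptr;
typedef char      byte;

#define countof(a)    (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s)   (countof(s) - 1)
#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)

// Assumed arena layout: a region of free memory from beg to end
typedef struct {
    byte *beg;
    byte *end;
} arena;

// Returns zeroed, aligned storage for count objects, or aborts.
// align must be a power of two.
static void *alloc(arena *a, size objsize, size align, size count)
{
    size avail = a->end - a->beg;
    size pad   = (size)(-(uptr)a->beg & (uptr)(align - 1));
    // Dividing instead of multiplying avoids integer overflow
    if (count < 0 || pad > avail || count > (avail - pad)/objsize) {
        abort();
    }
    size total = count * objsize;
    byte *p = a->beg + pad;
    a->beg += pad + total;
    return memset(p, 0, total);
}
```

<p>With that in place, <code class="language-plaintext highlighter-rouge">new(&amp;scratch, int, 4)</code> reads as “four zeroed
ints from this arena,” and <code class="language-plaintext highlighter-rouge">lengthof("hello")</code> is 5, excluding the
terminator.</p>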

<p>For GCC and Clang, my favorite <code class="language-plaintext highlighter-rouge">assert</code> macro now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define assert(c)  while (!(c)) __builtin_unreachable()
</span></code></pre></div></div>

<p>It has useful properties beyond <a href="/blog/2022/06/26/">the usual benefits</a>:</p>

<ul>
  <li>
    <p>It does not require separate definitions for debug and release builds.
Instead it’s controlled by the presence of Undefined Behavior Sanitizer
(UBSan), which is already present/absent in these circumstances. That
includes <a href="/blog/2019/01/25/">fuzz testing</a>.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">libubsan</code> provides a diagnostic printout with a file and line number.</p>
  </li>
  <li>
    <p>In release builds it turns into a practical optimization hint.</p>
  </li>
</ul>

<p>To enable assertions in release builds, put UBSan in trap mode with
<code class="language-plaintext highlighter-rouge">-fsanitize-trap</code> and then enable at least <code class="language-plaintext highlighter-rouge">-fsanitize=unreachable</code>. In
theory this can also be done with <code class="language-plaintext highlighter-rouge">-funreachable-traps</code>, but as of this
writing it’s been broken for the past few GCC releases.</p>

<h3 id="parameters-and-functions">Parameters and functions</h3>

<p>No <code class="language-plaintext highlighter-rouge">const</code>. It serves no practical role in optimization, and <strong>I cannot
recall an instance where it caught, or would have caught, a mistake</strong>. I
held out for a while as prototype documentation, but on reflection I found
that good parameter names were sufficient. Dropping <code class="language-plaintext highlighter-rouge">const</code> has made me
noticeably more productive by reducing cognitive load and eliminating
visual clutter. I now believe its inclusion in C was a costly mistake.</p>

<p>(One small exception: I still like it as a hint to place static tables in
read-only memory closer to the code. I’ll cast away the <code class="language-plaintext highlighter-rouge">const</code> if needed.
This is only of minor importance.)</p>

<p>Literal <code class="language-plaintext highlighter-rouge">0</code> for null pointers. Short and sweet. This is not new, but a
style I’ve used for about 7 years now, and it has appeared all over my
writing since. There are some theoretical edge cases where it may cause
defects, and lots of <a href="https://ljabl.com/nullptr.xhtml">ink has been spilled</a> on the subject, but
after a couple 100K lines of code I’ve yet to see it happen.</p>

<p><code class="language-plaintext highlighter-rouge">restrict</code> when necessary, but better to organize code so that it’s not,
e.g. don’t write to “out” parameters in loops, or don’t use out parameters
at all (more on that momentarily). I don’t bother with <code class="language-plaintext highlighter-rouge">inline</code> because I
compile everything as one translation unit anyway.</p>

<p><code class="language-plaintext highlighter-rouge">typedef</code> all structures. I used to shy away from it, but eliminating the
<code class="language-plaintext highlighter-rouge">struct</code> keyword makes code easier to read. If it’s a recursive structure,
use a forward declaration immediately above so that such fields can use
the short name:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">map</span> <span class="n">map</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">map</span> <span class="p">{</span>
    <span class="n">map</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="c1">// ...</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Declare all functions <code class="language-plaintext highlighter-rouge">static</code> except for entry points. Again, with
everything compiled as a single translation unit there’s no reason to do
otherwise. It was probably a mistake for C not to default to <code class="language-plaintext highlighter-rouge">static</code>,
though I don’t have a strong opinion on the matter. With the clutter
eliminated through short types, no <code class="language-plaintext highlighter-rouge">const</code>, no <code class="language-plaintext highlighter-rouge">struct</code>, etc. <strong>functions
fit comfortably on the same line as their return type</strong>. I used to break
them apart so that the function name began on its own line, but that’s no
longer necessary.</p>

<p>In my writing I sometimes omit <code class="language-plaintext highlighter-rouge">static</code> to simplify, and because outside
the context of a complete program it’s mostly irrelevant. However, I will
use it below to emphasize this style.</p>

<p>For a while I capitalized type names as that effectively put them in a kind
of namespace apart from variables and functions, but I eventually stopped.
I may try this idea in a different way in the future.</p>

<h3 id="strings">Strings</h3>

<p>One of my most productive changes this year has been the total rejection
of null terminated strings — another of those terrible mistakes — and the
embrace of <a href="/blog/2023/01/18/#implementation-highlights">this basic string type</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define s8(s) (s8){(u8 *)s, lengthof(s)}
</span><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">u8</span>  <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="n">size</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">s8</span><span class="p">;</span>
</code></pre></div></div>

<p>I’ve used a few names for it, but this is my favorite. The <code class="language-plaintext highlighter-rouge">s</code> is for
string, and the <code class="language-plaintext highlighter-rouge">8</code> is for UTF-8 or <code class="language-plaintext highlighter-rouge">u8</code>. The <code class="language-plaintext highlighter-rouge">s8</code> macro (sometimes just
spelled <code class="language-plaintext highlighter-rouge">S</code>) wraps a C string literal, making a <code class="language-plaintext highlighter-rouge">s8</code> string out of it. A
<code class="language-plaintext highlighter-rouge">s8</code> is handled like a <a href="/blog/2019/06/30/">fat pointer</a>, passed and returned by copy.
<code class="language-plaintext highlighter-rouge">s8</code> makes for a great function prefix, unlike <code class="language-plaintext highlighter-rouge">str</code>, since the standard
reserves function names beginning with it. Some examples:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">s8</span>   <span class="nf">s8span</span><span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="n">b32</span>  <span class="nf">s8equals</span><span class="p">(</span><span class="n">s8</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>
<span class="k">static</span> <span class="n">size</span> <span class="nf">s8compare</span><span class="p">(</span><span class="n">s8</span><span class="p">,</span> <span class="n">s8</span><span class="p">);</span>
<span class="k">static</span> <span class="n">u64</span>  <span class="nf">s8hash</span><span class="p">(</span><span class="n">s8</span><span class="p">);</span>
<span class="k">static</span> <span class="n">s8</span>   <span class="nf">s8trim</span><span class="p">(</span><span class="n">s8</span><span class="p">);</span>
<span class="k">static</span> <span class="n">s8</span>   <span class="nf">s8clone</span><span class="p">(</span><span class="n">s8</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Then when combined with the macro:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">s8equals</span><span class="p">(</span><span class="n">tagname</span><span class="p">,</span> <span class="n">s8</span><span class="p">(</span><span class="s">"body"</span><span class="p">)))</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
</code></pre></div></div>
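
<p>For a sense of how small these functions can be, here is a sketch of
three of them. The bodies are illustrative guesses at reasonable
behavior, not the definitions from any particular program, and treating
every byte at or below space as whitespace in <code class="language-plaintext highlighter-rouge">s8trim</code> is an assumption.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t   u8;
typedef int32_t   b32;
typedef ptrdiff_t size;

#define countof(a)   (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s)  (countof(s) - 1)

#define s8(s) (s8){(u8 *)s, lengthof(s)}
typedef struct {
    u8  *data;
    size len;
} s8;

// Build a string from a half-open pointer range
static s8 s8span(u8 *beg, u8 *end)
{
    s8 r = {0};
    r.data = beg;
    r.len  = end - beg;
    return r;
}

// Byte-wise equality; the length check avoids memcmp on empty strings
static b32 s8equals(s8 a, s8 b)
{
    if (a.len != b.len) {
        return 0;
    }
    return !a.len || !memcmp(a.data, b.data, a.len);
}

// Strip leading and trailing bytes <= ' ' without copying
static s8 s8trim(s8 s)
{
    if (!s.len) {
        return s;
    }
    u8 *beg = s.data;
    u8 *end = s.data + s.len;
    for (; beg < end && beg[0]  <= ' '; beg++) {}
    for (; beg < end && end[-1] <= ' '; end--) {}
    return s8span(beg, end);
}
```

<p>None of these allocate: <code class="language-plaintext highlighter-rouge">s8trim</code> returns a view into the same
buffer, which is part of what makes the fat pointer convenient.</p>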

<p>You might be tempted to use a flexible array member to pack the size and
array together as one allocation. Tried it. Its inflexibility is totally
not worth whatever benefits it might have. Consider, for instance, how
you’d create such a string out of a literal, and how it would be used.</p>

<p>A few times I’ve thought, “This program is simple enough that I don’t need
a string type for this data.” That thought is nearly always wrong. Having
it available helps me think more clearly, and makes for simpler programs.
(C++ got it only a few years ago with <code class="language-plaintext highlighter-rouge">std::string_view</code> and <code class="language-plaintext highlighter-rouge">std::span</code>.)</p>

<p>It has a natural UTF-16 counterpart, <code class="language-plaintext highlighter-rouge">s16</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define s16(s) (s16){u##s, lengthof(u##s)}
</span><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">c16</span> <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="n">size</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">s16</span><span class="p">;</span>
</code></pre></div></div>

<p>I’m not entirely sold on gluing <code class="language-plaintext highlighter-rouge">u</code> to the literal in the macro, versus
writing it out on the string literal.</p>

<h3 id="more-structures">More structures</h3>

<p>Another change has been preferring structure returns instead of out
parameters. It’s effectively a multiple value return, though without
destructuring. A great organizational change. For example, this function
returns two values, a parse result and a status:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">i32</span> <span class="n">value</span><span class="p">;</span>
    <span class="n">b32</span> <span class="n">ok</span><span class="p">;</span>
<span class="p">}</span> <span class="n">i32parsed</span><span class="p">;</span>

<span class="k">static</span> <span class="n">i32parsed</span> <span class="nf">i32parse</span><span class="p">(</span><span class="n">s8</span><span class="p">);</span>
</code></pre></div></div>

<p>Worried about the “extra copying?” Have no fear, because in practice
calling conventions turn this into a hidden, <code class="language-plaintext highlighter-rouge">restrict</code>-qualified out
parameter — if it’s not inlined such that any return value overhead would
be irrelevant anyway. With this return style I’m less tempted to use
in-band signals like special null returns to indicate errors, which is
less clear.</p>

<p>It’s also led to a style of defining a zero-initialized return value at
the top of the function, i.e. <code class="language-plaintext highlighter-rouge">ok</code> is false, and then using it for all
<code class="language-plaintext highlighter-rouge">return</code> statements. On error, it can bail out with an immediate return.
The success path sets <code class="language-plaintext highlighter-rouge">ok</code> to true before the return.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">i32parsed</span> <span class="nf">i32parse</span><span class="p">(</span><span class="n">s8</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">i32parsed</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">size</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">u8</span> <span class="n">digit</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="sc">'0'</span><span class="p">;</span>
        <span class="c1">// ...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">overflow</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">r</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="n">value</span><span class="o">*</span><span class="mi">10</span> <span class="o">+</span> <span class="n">digit</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
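
<p>The listing above elides the validation and the overflow test. One
possible way to fill them in is sketched below, restricted to unsigned
decimal input. Rejecting the empty string is an assumption, as is the
particular form of the overflow check.</p>

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t   u8;
typedef int32_t   b32;
typedef int32_t   i32;
typedef ptrdiff_t size;

typedef struct {
    u8  *data;
    size len;
} s8;

typedef struct {
    i32 value;
    b32 ok;
} i32parsed;

static i32parsed i32parse(s8 s)
{
    i32parsed r = {0};
    if (!s.len) {
        return r;  // assumption: empty input is an error
    }
    for (size i = 0; i < s.len; i++) {
        // Bytes below '0' wrap around to large values, so a single
        // comparison rejects everything that is not a digit
        u8 digit = s.data[i] - '0';
        if (digit > 9) {
            return r;
        }
        // Would value*10 + digit exceed INT32_MAX?
        if (r.value > (INT32_MAX - digit)/10) {
            return r;
        }
        r.value = r.value*10 + digit;
    }
    r.ok = 1;
    return r;
}
```

<p>On failure <code class="language-plaintext highlighter-rouge">value</code> holds whatever was accumulated so far, so callers
should check <code class="language-plaintext highlighter-rouge">ok</code> before using it.</p>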

<p>Aside from static data, I’ve also moved away from initializers except the
conventional zero initializer. (Notable exception: <code class="language-plaintext highlighter-rouge">s8</code> and <code class="language-plaintext highlighter-rouge">s16</code> macros.)
This includes designated initializers. Instead I’ve been initializing with
assignments. For example, this <a href="/blog/2023/02/13/">buffered output</a> “constructor”:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">u8</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="n">i32</span> <span class="n">len</span><span class="p">;</span>
    <span class="n">i32</span> <span class="n">cap</span><span class="p">;</span>
    <span class="n">i32</span> <span class="n">fd</span><span class="p">;</span>
    <span class="n">b32</span> <span class="n">err</span><span class="p">;</span>
<span class="p">}</span> <span class="n">u8buf</span><span class="p">;</span>

<span class="k">static</span> <span class="n">u8buf</span> <span class="nf">newu8buf</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">,</span> <span class="n">i32</span> <span class="n">cap</span><span class="p">,</span> <span class="n">i32</span> <span class="n">fd</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">u8buf</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">r</span><span class="p">.</span><span class="n">buf</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">u8</span><span class="p">,</span> <span class="n">cap</span><span class="p">);</span>
    <span class="n">r</span><span class="p">.</span><span class="n">cap</span> <span class="o">=</span> <span class="n">cap</span><span class="p">;</span>
    <span class="n">r</span><span class="p">.</span><span class="n">fd</span>  <span class="o">=</span> <span class="n">fd</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I like how this reads, but it also eliminates a cognitive burden: The
assignments are separated by sequence points, giving them an explicit
order. It doesn’t matter here, but in other cases it does:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">example</span> <span class="n">e</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">randname</span><span class="p">(</span><span class="o">&amp;</span><span class="n">rng</span><span class="p">),</span>
        <span class="p">.</span><span class="n">age</span>  <span class="o">=</span> <span class="n">randage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">rng</span><span class="p">),</span>
        <span class="p">.</span><span class="n">seat</span> <span class="o">=</span> <span class="n">randseat</span><span class="p">(</span><span class="o">&amp;</span><span class="n">rng</span><span class="p">),</span>
    <span class="p">};</span>
</code></pre></div></div>

<p>There are 6 possible values for <code class="language-plaintext highlighter-rouge">e</code> from the same seed. I like no longer
thinking about these possibilities.</p>
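
<p>Written with assignments, the same initialization has exactly one
outcome per seed. In this sketch the generator, the field types, and the
three helper functions are stand-ins invented for the example.</p>

```c
#include <stdint.h>

typedef int32_t  i32;
typedef uint64_t u64;

typedef struct {
    i32 name;  // index into a hypothetical name table
    i32 age;
    i32 seat;
} example;

// Stand-in generator: a simple 64-bit LCG step
static u64 rand64(u64 *rng)
{
    *rng = *rng*0x3243f6a8885a308d + 1;
    return *rng;
}

static i32 randname(u64 *rng) { return (i32)(rand64(rng) >> 40) % 1000; }
static i32 randage(u64 *rng)  { return (i32)(rand64(rng) >> 40) % 100;  }
static i32 randseat(u64 *rng) { return (i32)(rand64(rng) >> 40) % 300;  }

static example newexample(u64 *rng)
{
    example e = {0};
    e.name = randname(rng);  // always the first draw
    e.age  = randage(rng);   // always the second
    e.seat = randseat(rng);  // always the third
    return e;
}
```

<p>Each statement ends at a sequence point, so the draws happen in source
order regardless of compiler.</p>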

<h3 id="odds-and-ends">Odds and ends</h3>

<p>Prefer <code class="language-plaintext highlighter-rouge">__attribute</code> to <code class="language-plaintext highlighter-rouge">__attribute__</code>. The <code class="language-plaintext highlighter-rouge">__</code> suffix is excessive and
unnecessary.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">malloc</span><span class="p">,</span> <span class="n">warn_unused_result</span><span class="p">))</span>
</code></pre></div></div>

<p>For Win32 systems programming, which typically only requires a modest
number of declarations and definitions, rather than include <code class="language-plaintext highlighter-rouge">windows.h</code>,
<a href="/blog/2023/05/31/">write the prototypes out by hand</a> using custom types. It reduces
build times, declutters namespaces, and interfaces more cleanly with the
program (no more <code class="language-plaintext highlighter-rouge">DWORD</code>/<code class="language-plaintext highlighter-rouge">BOOL</code>/<code class="language-plaintext highlighter-rouge">ULONG_PTR</code>, but <code class="language-plaintext highlighter-rouge">u32</code>/<code class="language-plaintext highlighter-rouge">b32</code>/<code class="language-plaintext highlighter-rouge">uptr</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define W32(r) __declspec(dllimport) r __stdcall
</span><span class="n">W32</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>   <span class="n">ExitProcess</span><span class="p">(</span><span class="n">u32</span><span class="p">);</span>
<span class="n">W32</span><span class="p">(</span><span class="n">i32</span><span class="p">)</span>    <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">u32</span><span class="p">);</span>
<span class="n">W32</span><span class="p">(</span><span class="n">byte</span> <span class="o">*</span><span class="p">)</span> <span class="n">VirtualAlloc</span><span class="p">(</span><span class="n">byte</span> <span class="o">*</span><span class="p">,</span> <span class="n">usize</span><span class="p">,</span> <span class="n">u32</span><span class="p">,</span> <span class="n">u32</span><span class="p">);</span>
<span class="n">W32</span><span class="p">(</span><span class="n">b32</span><span class="p">)</span>    <span class="n">WriteConsoleA</span><span class="p">(</span><span class="n">uptr</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="p">,</span> <span class="n">u32</span><span class="p">,</span> <span class="n">u32</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="n">W32</span><span class="p">(</span><span class="n">b32</span><span class="p">)</span>    <span class="n">WriteConsoleW</span><span class="p">(</span><span class="n">uptr</span><span class="p">,</span> <span class="n">c16</span> <span class="o">*</span><span class="p">,</span> <span class="n">u32</span><span class="p">,</span> <span class="n">u32</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>For inline assembly, treat the outer parentheses like braces, put a space
before the opening parenthesis, just like <code class="language-plaintext highlighter-rouge">if</code>, and start each constraint
line with its colon.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">u64</span> <span class="nf">rdtscp</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">u32</span> <span class="n">hi</span><span class="p">,</span> <span class="n">lo</span><span class="p">;</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"rdtscp"</span>
        <span class="o">:</span> <span class="s">"=d"</span><span class="p">(</span><span class="n">hi</span><span class="p">),</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">lo</span><span class="p">)</span>
        <span class="o">:</span>
        <span class="o">:</span> <span class="s">"cx"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">u64</span><span class="p">)</span><span class="n">hi</span><span class="o">&lt;&lt;</span><span class="mi">32</span> <span class="o">|</span> <span class="n">lo</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s surely a lot more to my style than this, but unlike the above,
those details haven’t changed this year. To see most of the mentioned
items in action in a small program, see <a href="https://github.com/skeeto/scratch/blob/master/misc/wordhist.c"><code class="language-plaintext highlighter-rouge">wordhist.c</code></a>, one of my
testing grounds for <a href="/blog/2023/09/30/">hash-tries</a>, or for a slightly larger program,
<a href="https://github.com/skeeto/scratch/blob/master/misc/asmint.c"><code class="language-plaintext highlighter-rouge">asmint.c</code></a>, a mini programming language implementation.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A simple, arena-backed, generic dynamic array for C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/10/05/"/>
    <id>urn:uuid:0c5f55d1-ca7c-4897-97ef-8a539e03bf34</id>
    <updated>2023-10-05T23:05:57Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>Previously I presented an <a href="/blog/2023/09/30/">arena-friendly hash map</a> applicable to any
programming language where one might use arena allocation. In this third
article I present a generic, arena-backed dynamic array. The details are
specific to C, as the most appropriate mechanism depends on the language
(e.g. templates, generics). Just as in the previous two articles, the goal
is to demonstrate an idea so simple that a full implementation fits on one
terminal pager screen — a <em>concept</em> rather than a <em>library</em>.</p>

<p>Unlike a hash map or linked list, a dynamic array — a data buffer with a
size that varies during run time — is more difficult to square with arena
allocation. They’re contiguous by definition, and we cannot resize objects
in the middle of an arena, i.e. <code class="language-plaintext highlighter-rouge">realloc</code>. So while convenient, they come
with trade-offs. At least until they stop growing, dynamic arrays are more
appropriate for shorter-lived, temporary contexts, where you would use a
scratch arena. On average they consume about twice the memory of a fixed
array of the same size.</p>

<p>As before, I begin with a motivating example of its use. The guts of the
generic dynamic array implementation are tucked away in a <code class="language-plaintext highlighter-rouge">push()</code> macro,
which is essentially the entire interface.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int32_t</span>  <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">;</span>
<span class="p">}</span> <span class="n">int32s</span><span class="p">;</span>

<span class="n">int32s</span> <span class="nf">fibonacci</span><span class="p">(</span><span class="kt">int32_t</span> <span class="n">max</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">int32_t</span> <span class="n">init</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">};</span>
    <span class="n">int32s</span> <span class="n">fib</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">fib</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">init</span><span class="p">;</span>
    <span class="n">fib</span><span class="p">.</span><span class="n">len</span> <span class="o">=</span> <span class="n">fib</span><span class="p">.</span><span class="n">cap</span> <span class="o">=</span> <span class="n">countof</span><span class="p">(</span><span class="n">init</span><span class="p">);</span>

    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">a</span> <span class="o">=</span> <span class="n">fib</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">fib</span><span class="p">.</span><span class="n">len</span><span class="o">-</span><span class="mi">2</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">b</span> <span class="o">=</span> <span class="n">fib</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">fib</span><span class="p">.</span><span class="n">len</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="o">+</span><span class="n">b</span> <span class="o">&gt;</span> <span class="n">max</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">fib</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">fib</span><span class="p">,</span> <span class="n">perm</span><span class="p">)</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Anyone familiar with Go will quickly notice a pattern: <code class="language-plaintext highlighter-rouge">int32s</code> looks an
awful lot like a <a href="https://go.dev/blog/slices-intro">Go <em>slice</em></a>. That was indeed my inspiration, and
there is enough context that you could infer <a href="/blog/2019/06/30/">similar semantics</a>. I
will even call these “slice headers.” Initially I tried a design based on
<a href="https://github.com/nothings/stb/blob/master/deprecated/stretchy_buffer.txt">stretchy buffers</a>, but I liked neither the macros nor the ergonomics.</p>

<p>I wouldn’t write a <code class="language-plaintext highlighter-rouge">fibonacci</code> this way in practice, but it’s useful for
highlighting certain features. Of particular note:</p>

<ul>
  <li>
    <p>The dynamic array initially wraps a static array, yet I can append to it
as though it were a dynamic allocation. If I don’t append at all, it
still works. (Though of course the caller then shouldn’t modify the
elements.)</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">push()</code> operates on any object which is <em>slice-shaped</em>. That is, it has
a pointer field named <code class="language-plaintext highlighter-rouge">data</code>, a <code class="language-plaintext highlighter-rouge">ptrdiff_t</code> length field named <code class="language-plaintext highlighter-rouge">len</code>, a
<code class="language-plaintext highlighter-rouge">ptrdiff_t</code> capacity field named <code class="language-plaintext highlighter-rouge">cap</code>, and all in that order.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">push()</code> evaluates to a pointer to the newly-pushed element. In my
example I immediately dereference and assign a value.</p>
  </li>
  <li>
    <p>An element is zero-initialized the first time it’s pushed. I say “first
time” because you can truncate an array by reducing <code class="language-plaintext highlighter-rouge">len</code>, and “pushing”
afterward will simply reveal the original elements.</p>
  </li>
  <li>
    <p>The name <code class="language-plaintext highlighter-rouge">int32s</code> is intended to evoke plurality. I’ll use this
convention again in a moment.</p>
  </li>
  <li>
    <p>The arena passed to <code class="language-plaintext highlighter-rouge">push()</code> is only used if the array needs to grow.
The new backing array will be allocated out of this arena regardless of
the original backing array.</p>
  </li>
  <li>
    <p>Resizes always change the backing array address, and the old array
remains valid. This is also just like slices in Go.</p>
  </li>
  <li>
    <p>Despite the name <code class="language-plaintext highlighter-rouge">perm</code>, I expect it points to the caller’s scratch
arena. It’s “permanent” only relative to the <code class="language-plaintext highlighter-rouge">fibonacci</code> call. Otherwise
I might build the array in a scratch arena, then create a final copy in
a permanent arena.</p>
  </li>
</ul>

<p>For a slightly more realistic example: rendering triangles. Suppose we
need data in array format for OpenGL, but we don’t know the number of
vertices ahead of time. A dynamic array is convenient, especially if we
discard the array as soon as OpenGL is done with it. We could build up
entire scenes like this for each display frame.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">GLfloat</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">;</span>
<span class="p">}</span> <span class="n">GLvert</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">GLvert</span>   <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">;</span>
<span class="p">}</span> <span class="n">GLverts</span><span class="p">;</span>

<span class="kt">void</span> <span class="nf">renderobj</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">GLverts</span> <span class="n">vs</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">objparser</span> <span class="n">parser</span> <span class="o">=</span> <span class="n">newobjparser</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(...)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vs</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">)</span> <span class="o">=</span> <span class="n">nextvert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">parser</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">glVertexPointer</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">GL_FLOAT</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">vs</span><span class="p">.</span><span class="n">data</span><span class="p">);</span>
    <span class="n">glDrawArrays</span><span class="p">(</span><span class="n">GL_TRIANGLES</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">vs</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As before, <code class="language-plaintext highlighter-rouge">GLverts</code> is slice-shaped. This time it’s zero-initialized,
which is a valid empty dynamic array. As with maps, that means any object
with such a field comes with a ready-to-use empty dynamic array. Putting
it together, here’s an example that gradually appends vertices to named
dynamic arrays, randomly accessed by string name:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">map</span> <span class="n">map</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">map</span> <span class="p">{</span>
    <span class="n">map</span>    <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">str</span>     <span class="n">name</span><span class="p">;</span>
    <span class="n">GLverts</span> <span class="n">verts</span><span class="p">;</span>
<span class="p">};</span>

<span class="n">GLverts</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">map</span> <span class="o">**</span><span class="p">,</span> <span class="n">str</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// from the last article</span>

<span class="n">map</span> <span class="o">*</span><span class="nf">example</span><span class="p">(...,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">map</span> <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(...)</span> <span class="p">{</span>
        <span class="n">str</span> <span class="n">name</span> <span class="o">=</span> <span class="p">...;</span>
        <span class="n">GLvert</span> <span class="n">v</span> <span class="o">=</span> <span class="p">...;</span>
        <span class="n">GLverts</span> <span class="o">*</span><span class="n">vs</span> <span class="o">=</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">m</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>
        <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="n">vs</span><span class="p">,</span> <span class="n">perm</span><span class="p">)</span> <span class="o">=</span> <span class="n">v</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">m</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That’s what Go would call <code class="language-plaintext highlighter-rouge">map[str][]vert</code>, but allocated entirely out of
an arena. Ever thought C could do this so simply and conveniently? The
memory allocator (~15 lines), map (~30 lines), dynamic array (~30 lines),
constructors (0 lines), and destructors (0 lines) that power this total to
~75 lines of zero-dependency code!</p>

<h3 id="implementation-details">Implementation details</h3>

<p>I despise macro abuse, and programs substantially implemented in macros
are annoying. They’re difficult to understand and debug. A good dynamic
array implementation will require a macro, and one of my goals was to keep
it as simple and minimal as possible. The macro’s job is to:</p>

<ol>
  <li>Check the capacity and maybe grow the array via function call.</li>
  <li>Smuggle type information (i.e. <code class="language-plaintext highlighter-rouge">sizeof</code>) to that function.</li>
  <li>Compute a pointer of the proper type to the new element.</li>
</ol>

<p>Here’s what I came up with:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define push(s, arena) \
    ((s)-&gt;len &gt;= (s)-&gt;cap \
        ? grow(s, sizeof(*(s)-&gt;data), arena), \
          (s)-&gt;data + (s)-&gt;len++ \
        : (s)-&gt;data + (s)-&gt;len++)
</span></code></pre></div></div>

<p>The macro will be used as an expression, so it cannot use statements like
<code class="language-plaintext highlighter-rouge">if</code>. The condition is therefore a ternary operator. If it’s full, it
calls the supporting <code class="language-plaintext highlighter-rouge">grow</code> function. In either case, it computes the
result from <code class="language-plaintext highlighter-rouge">data</code>. In particular, note that the <code class="language-plaintext highlighter-rouge">grow</code> branch uses a
comma operator to <em>sequence</em> growth before pointer derivation, as <code class="language-plaintext highlighter-rouge">grow</code>
will change the value of <code class="language-plaintext highlighter-rouge">data</code> as a side effect.</p>

<p>To be generic, the <code class="language-plaintext highlighter-rouge">grow</code> function uses <code class="language-plaintext highlighter-rouge">memcpy</code>-based type punning:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">grow</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">slice</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="p">{</span>
        <span class="kt">void</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
        <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
        <span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">replica</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">replica</span><span class="p">,</span> <span class="n">slice</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">replica</span><span class="p">));</span>

    <span class="n">replica</span><span class="p">.</span><span class="n">cap</span> <span class="o">=</span> <span class="n">replica</span><span class="p">.</span><span class="n">cap</span> <span class="o">?</span> <span class="n">replica</span><span class="p">.</span><span class="n">cap</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">align</span> <span class="o">=</span> <span class="mi">16</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="n">alloc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">size</span><span class="p">,</span> <span class="n">align</span><span class="p">,</span> <span class="n">replica</span><span class="p">.</span><span class="n">cap</span><span class="p">);</span>
    <span class="n">replica</span><span class="p">.</span><span class="n">cap</span> <span class="o">*=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">replica</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">replica</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">size</span><span class="o">*</span><span class="n">replica</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">replica</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>

    <span class="n">memcpy</span><span class="p">(</span><span class="n">slice</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">replica</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">replica</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The slice header is copied over a local replica, avoiding conflicts with
strict aliasing. This is the archetype slice header. It still requires
that different pointers have identical memory representation. That’s
virtually always true, and certainly true anywhere I’d use an arena.</p>
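
<p>As a minimal sketch of the punning step in isolation, the following copies the header out of any slice-shaped object so that generic code can inspect it. The names <code class="language-plaintext highlighter-rouge">header</code>, <code class="language-plaintext highlighter-rouge">floats</code>, <code class="language-plaintext highlighter-rouge">getheader</code>, and <code class="language-plaintext highlighter-rouge">unused</code> are hypothetical and exist only for illustration. Like <code class="language-plaintext highlighter-rouge">grow</code>, it assumes all object pointers share one representation.</p>

```c
#include <stddef.h>
#include <string.h>

// Archetype header: assumed to have the same layout as every
// slice-shaped struct (pointer, then length, then capacity).
typedef struct {
    void     *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} header;

// Hypothetical slice type used only for this illustration.
typedef struct {
    float    *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} floats;

// Copy the header out of any slice-shaped object. The object is only
// read as bytes, never through an lvalue of an incompatible type.
static header getheader(void *slice)
{
    header h;
    memcpy(&h, slice, sizeof(h));
    return h;
}

// Generic query: how many elements fit before the next resize?
static ptrdiff_t unused(void *slice)
{
    header h = getheader(slice);
    return h.cap - h.len;
}
```

<p>Given a <code class="language-plaintext highlighter-rouge">floats</code> with length 3 and capacity 8, <code class="language-plaintext highlighter-rouge">unused()</code> evaluates to 5 without knowing the element type.</p>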

<p>If the capacity was zero, it behaves as though it was one, and so, through
doubling, zero-capacity arrays become capacity-2 arrays on the first push.
Doubling <code class="language-plaintext highlighter-rouge">cap</code> directly would first require an overflow check. It’s
better to let <code class="language-plaintext highlighter-rouge">alloc</code>, whose definition, you may recall, included an
overflow check, handle size overflow so that it can invoke the out-of-memory
policy. So <code class="language-plaintext highlighter-rouge">grow</code> instead doubles the <em>object size</em>. This is a small
constant (i.e. from <code class="language-plaintext highlighter-rouge">sizeof</code>), so doubling it is always safe.</p>
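
<p>To trace that capacity sequence concretely, here is a self-contained sketch that assembles the pieces. The <code class="language-plaintext highlighter-rouge">arena</code> struct and <code class="language-plaintext highlighter-rouge">alloc</code> are stand-ins modeled on the earlier articles, with <code class="language-plaintext highlighter-rouge">abort()</code> as an assumed out-of-memory policy, and <code class="language-plaintext highlighter-rouge">capafter</code> is a hypothetical helper. <code class="language-plaintext highlighter-rouge">grow</code> and <code class="language-plaintext highlighter-rouge">push</code> follow the definitions above.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Stand-in arena: a region of free memory between beg and end.
typedef struct {
    char *beg;
    char *end;
} arena;

// Zero-initialized bump allocation. The division-based check rejects
// any size*count that would overflow or exceed the remaining space.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();  // assumed out-of-memory policy for this sketch
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

static void grow(void *slice, ptrdiff_t size, arena *a)
{
    struct {
        void     *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } replica;
    memcpy(&replica, slice, sizeof(replica));

    // Treat zero capacity as one, then double the object size rather
    // than the count so that alloc's own check covers overflow.
    replica.cap = replica.cap ? replica.cap : 1;
    void *data = alloc(a, 2*size, 16, replica.cap);
    replica.cap *= 2;
    if (replica.len) {
        memcpy(data, replica.data, size*replica.len);
    }
    replica.data = data;

    memcpy(slice, &replica, sizeof(replica));
}

#define push(s, arena) \
    ((s)->len >= (s)->cap \
        ? grow(s, sizeof(*(s)->data), arena), \
          (s)->data + (s)->len++ \
        : (s)->data + (s)->len++)

typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} int32s;

// Push n values onto an empty array and report the final capacity.
// The arena is passed by copy, so the caller's arena is unchanged.
static ptrdiff_t capafter(int n, arena scratch)
{
    int32s vals = {0};
    for (int i = 0; i < n; i++) {
        *push(&vals, &scratch) = i;
    }
    return vals.cap;
}
```

<p>With a small buffer as the arena, <code class="language-plaintext highlighter-rouge">capafter(1, a)</code> evaluates to 2, <code class="language-plaintext highlighter-rouge">capafter(3, a)</code> to 4, and <code class="language-plaintext highlighter-rouge">capafter(5, a)</code> to 8.</p>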

<p>Copying over old data includes a special check for zero-length inputs,
because, <a href="/blog/2023/02/11/">quite frustratingly</a>, <code class="language-plaintext highlighter-rouge">memcpy</code> does not accept null even
when the length is zero. I check for zero length instead of null so that
it’s more sensitive to defects. If the pointer is null with a non-zero
length, it will trip Undefined Behavior Sanitizer, or at least crash the
program, rather than silently skip copying.</p>

<p>Finally the updated replica is copied over the original slice header,
updating it with the new <code class="language-plaintext highlighter-rouge">data</code> pointer and capacity. The original backing
array is untouched but is no longer referenced through this slice header.
Old slice headers will continue to function with the old backing array,
such as when the arena is reset to a point where the dynamic array was
smaller.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">int32s</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">)</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// resize: cap=2</span>
    <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">)</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">)</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>  <span class="c1">// resize: cap=4</span>
    <span class="p">{</span>
        <span class="n">arena</span> <span class="n">tmp</span> <span class="o">=</span> <span class="n">scratch</span><span class="p">;</span>  <span class="c1">// scoped arena</span>
        <span class="n">int32s</span> <span class="n">extended</span> <span class="o">=</span> <span class="n">vals</span><span class="p">;</span>
        <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">extended</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">tmp</span><span class="p">)</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
        <span class="o">*</span><span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">extended</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">tmp</span><span class="p">)</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>  <span class="c1">// resize: cap=8</span>
        <span class="n">example</span><span class="p">(</span><span class="n">extended</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="c1">// vals still works, cap=4, extension freed</span>
</code></pre></div></div>

<p>In practice, a dynamic array leaves behind old backing arrays whose total
size adds up to just shy of the current array capacity. For example, if the
current capacity is 16, the old arrays have sizes 2+4+8 = 14.</p>
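
<p>That arithmetic generalizes to any power-of-two capacity. A small sketch, where <code class="language-plaintext highlighter-rouge">leftover</code> is a hypothetical helper and every resize is assumed to have allocated a fresh backing array starting from a zero-initialized header:</p>

```c
#include <stddef.h>

// Total elements held by abandoned backing arrays once the array has
// reached capacity cap (a power of two, at least 2). The old arrays
// have capacities 2, 4, ..., cap/2, which sum to cap - 2.
static ptrdiff_t leftover(ptrdiff_t cap)
{
    ptrdiff_t total = 0;
    for (ptrdiff_t c = 2; c < cap; c *= 2) {
        total += c;
    }
    return total;
}
```

<p>So <code class="language-plaintext highlighter-rouge">leftover(16)</code> is 14, and in general the abandoned arrays occupy two elements fewer than the live one.</p>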

<p>If you’re worried about misuse, such as slice header fields being in the
wrong order, a couple of assertions can quickly catch such mistakes at run
time, typically under the lightest of testing. In fact, I planned for this
by using the more-sensitive <code class="language-plaintext highlighter-rouge">len&gt;=cap</code> instead of just <code class="language-plaintext highlighter-rouge">len==cap</code>, so that
it would direct execution towards assertions in <code class="language-plaintext highlighter-rouge">grow</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">assert</span><span class="p">(</span><span class="n">replica</span><span class="p">.</span><span class="n">len</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">replica</span><span class="p">.</span><span class="n">cap</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">replica</span><span class="p">.</span><span class="n">len</span> <span class="o">&lt;=</span> <span class="n">replica</span><span class="p">.</span><span class="n">cap</span><span class="p">);</span>
</code></pre></div></div>

<p>This also demonstrates another benefit of <a href="https://web.archive.org/web/20151009055354/http://oss.sgi.com/archives/ogl-sample/2005-07/msg00003.html">signed sizes</a>: Exactly half
the range is invalid and so defects tend to quickly trip these assertions.</p>
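
<p>A brief sketch of that effect, using hypothetical names <code class="language-plaintext highlighter-rouge">validheader</code> and <code class="language-plaintext highlighter-rouge">buggylen</code>: a length computed with its operands swapped comes out negative, so the same conditions as the assertions reject it immediately.</p>

```c
#include <stddef.h>

// The three assertion conditions expressed as a predicate.
static int validheader(ptrdiff_t len, ptrdiff_t cap)
{
    return len >= 0 && cap >= 0 && len <= cap;
}

// A typical defect: operands swapped when computing a length. With a
// signed result the value is negative and lands in the invalid half of
// the range. An unsigned result would wrap to a huge positive value.
static ptrdiff_t buggylen(char *beg, char *end)
{
    return beg - end;  // bug: should be end - beg
}
```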

<h3 id="alignment">Alignment</h3>

<p>Alignment is unfortunately fixed, and I picked a “safe” value of 16. In my
<code class="language-plaintext highlighter-rouge">new()</code> macro I used <code class="language-plaintext highlighter-rouge">_Alignof</code> to pass type information to <code class="language-plaintext highlighter-rouge">alloc</code>. <a href="https://groups.google.com/g/comp.std.c/c/v5hsWOu5vSw">Due
to an oversight</a>, unlike <code class="language-plaintext highlighter-rouge">sizeof</code>, <code class="language-plaintext highlighter-rouge">_Alignof</code> cannot be applied
to expressions, and so it cannot be used in dynamic arrays. GCC and Clang
support <code class="language-plaintext highlighter-rouge">_Alignof</code> on expressions just like <code class="language-plaintext highlighter-rouge">sizeof</code>, as it’s such an
obvious idea, but Microsoft chose to strictly follow the oversight in the
standard. To support MSVC, I’ve deliberately limited the capabilities of
<code class="language-plaintext highlighter-rouge">push</code>. If that doesn’t matter, fixing it is easy:</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">--- a/example.c
</span><span class="gi">+++ b/example.c
</span><span class="p">@@ -2,3 +2,3 @@</span>
     ((s)-&gt;len &gt;= (s)-&gt;cap \
<span class="gd">-        ? grow(s, sizeof(*(s)-&gt;data), arena), \
</span><span class="gi">+        ? grow(s, sizeof(*(s)-&gt;data), _Alignof(*(s)-&gt;data), arena), \
</span>           (s)-&gt;data + (s)-&gt;len++ \
<span class="p">@@ -6,3 +6,3 @@</span>
 
<span class="gd">-static void grow(void *slice, ptrdiff_t size, arena *a)
</span><span class="gi">+static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
</span> {
<span class="p">@@ -16,3 +16,2 @@</span>
     replica.cap = replica.cap ? replica.cap : 1;
<span class="gd">-    ptrdiff_t align = 16;
</span>     void *data = alloc(a, 2*size, align, replica.cap);
</code></pre></div></div>

<p>Though while you’re at it, if you’re already using extensions you might
want to switch <code class="language-plaintext highlighter-rouge">push</code> to a <a href="https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html">statement expression</a> so that the slice
header <code class="language-plaintext highlighter-rouge">s</code> does not get evaluated more than once — i.e. so that <code class="language-plaintext highlighter-rouge">upsert()</code>
in my example above could be used inside the <code class="language-plaintext highlighter-rouge">push()</code> expression.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define push(s, a) ({ \
    typeof(s) s_ = (s); \
    typeof(a) a_ = (a); \
    if (s_-&gt;len &gt;= s_-&gt;cap) { \
        grow(s_, sizeof(*s_-&gt;data), _Alignof(*s_-&gt;data), a_); \
    } \
    s_-&gt;data + s_-&gt;len++; \
})
</span></code></pre></div></div>
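
<p>To see why multiple evaluation matters, consider this standard C sketch. The macro <code class="language-plaintext highlighter-rouge">next</code> is a simplified stand-in with the same shape as the non-growing path of the ternary <code class="language-plaintext highlighter-rouge">push()</code>, and <code class="language-plaintext highlighter-rouge">lookup</code> is a hypothetical placeholder for a side-effecting call such as <code class="language-plaintext highlighter-rouge">upsert()</code>. Neither is part of the actual implementation.</p>

```c
#include <stddef.h>

typedef struct {
    int      *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} ints;

static int lookups;  // counts how often the slice argument is evaluated

// Hypothetical stand-in for an expensive or side-effecting lookup.
static ints *lookup(ints *s)
{
    lookups++;
    return s;
}

// Same shape as the non-growing path of the ternary push(): the
// argument s appears twice, so it is evaluated twice.
#define next(s) ((s)->data + (s)->len++)

// Append one element through the macro and report the call count.
static int demo(void)
{
    int buf[4] = {0};
    ints s = {buf, 0, 4};
    lookups = 0;
    *next(lookup(&s)) = 1;
    return lookups;
}
```

<p>Here <code class="language-plaintext highlighter-rouge">demo()</code> returns 2, not 1: one append ran the lookup twice. The full ternary macro mentions <code class="language-plaintext highlighter-rouge">s</code> even more often, which is what the statement expression avoids by binding it once to <code class="language-plaintext highlighter-rouge">s_</code>.</p>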

<p>So far this approach to dynamic arrays has been useful on a number of
occasions, and I’m quite happy with the results. As with arena-friendly
hash maps, I’ve no doubt they’ll become a staple in my C programs.</p>

<h3 id="addendum-extend-the-last-allocation">Addendum: extend the last allocation</h3>

<p>Dennis Schön suggests checking whether the array ends at the next arena
allocation and, if so, extending the array into the arena in place. <code class="language-plaintext highlighter-rouge">grow()</code>
already has the necessary information on hand, so it needs only the
additional check:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">grow</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">slice</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">align</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="p">{</span>
        <span class="kt">char</span>     <span class="o">*</span><span class="n">data</span><span class="p">;</span>
        <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
        <span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">replica</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">replica</span><span class="p">,</span> <span class="n">slice</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">replica</span><span class="p">));</span>

    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">replica</span><span class="p">.</span><span class="n">data</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">replica</span><span class="p">.</span><span class="n">cap</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">replica</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">alloc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">size</span><span class="p">,</span> <span class="n">align</span><span class="p">,</span> <span class="n">replica</span><span class="p">.</span><span class="n">cap</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">==</span> <span class="n">replica</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="n">size</span><span class="o">*</span><span class="n">replica</span><span class="p">.</span><span class="n">cap</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">alloc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">replica</span><span class="p">.</span><span class="n">cap</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="n">alloc</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">size</span><span class="p">,</span> <span class="n">align</span><span class="p">,</span> <span class="n">replica</span><span class="p">.</span><span class="n">cap</span><span class="p">);</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">replica</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">size</span><span class="o">*</span><span class="n">replica</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
        <span class="n">replica</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">replica</span><span class="p">.</span><span class="n">cap</span> <span class="o">*=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">slice</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">replica</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">replica</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Because that’s yet another check for null, I’ve split it out into an
independent third case:</p>

<ol>
  <li>If the data pointer is null, make an initial allocation.</li>
  <li>If the array ends at the next arena allocation, extend it.</li>
  <li>Otherwise allocate a fresh array and copy.</li>
</ol>

<p>Not <em>quite</em> as simple, but it improves the most common case.</p>
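
<p>To see the second case in action, here is a minimal, self-contained sketch.
It assumes a bump allocator shaped like the <code class="language-plaintext highlighter-rouge">alloc</code> from the arena
article, and it swaps the <code class="language-plaintext highlighter-rouge">push</code> macro for a hypothetical typed helper,
<code class="language-plaintext highlighter-rouge">push_int</code>. The <code class="language-plaintext highlighter-rouge">count_moves</code> function is likewise only for
illustration: it reports how many times the backing array changed address.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Assumed bump allocator: aligns, bumps, zeroes. Aborts when out of memory.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();
    }
    char *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

// The same three cases as the grow() listed above.
static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
    struct {
        char     *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } replica;
    memcpy(&replica, slice, sizeof(replica));

    if (!replica.data) {
        replica.cap = 1;
        replica.data = alloc(a, 2*size, align, replica.cap);
    } else if (a->beg == replica.data + size*replica.cap) {
        alloc(a, size, 1, replica.cap);  // extend in place
    } else {
        void *data = alloc(a, 2*size, align, replica.cap);
        memcpy(data, replica.data, size*replica.len);
        replica.data = data;
    }

    replica.cap *= 2;
    memcpy(slice, &replica, sizeof(replica));
}

typedef struct {
    int      *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} ints;

// Hypothetical typed stand-in for the push() macro.
static int *push_int(ints *s, arena *a)
{
    if (s->len >= s->cap) {
        grow(s, sizeof(int), _Alignof(int), a);
    }
    assert(s->len < s->cap);
    return s->data + s->len++;
}

// Append n values (each equal to its own index) and count how many times
// the backing array moved to a new address.
static int count_moves(ints *s, int n, arena *a)
{
    int moves = 0;
    for (int i = 0; i < n; i++) {
        int *before = s->data;
        int value = (int)s->len;
        *push_int(s, a) = value;
        moves += before && before != s->data;
    }
    return moves;
}
```

<p>As long as nothing else is allocated from the arena between pushes, the
array should never move. One unrelated allocation in between forces a single
copy at the next growth, after which in-place extension resumes.</p>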

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>An easy-to-implement, arena-friendly hash map</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/09/30/"/>
    <id>urn:uuid:4a457832-7d23-4dab-80f2-31f683379d7b</id>
    <updated>2023-09-30T23:18:40Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>My last article had <a href="/blog/2023/09/27/">tips for arena allocation</a>. This next
article demonstrates a technique for building bespoke hash maps that
compose nicely with arena allocation. In addition, they’re fast, simple,
and automatically scale to any problem that could reasonably be solved
with an in-memory hash map. To avoid resizing — both to better support
arenas and to simplify implementation — they have slightly above average
memory requirements. The design, which we’re calling a <em>hash-trie</em>, is the
result of <a href="https://nrk.neocities.org/articles/hash-trees-and-tries">fruitful collaboration with NRK</a>, whose sibling article
includes benchmarks. It’s my new favorite data structure, and has proven
incredibly useful. With a couple well-placed acquire/release atomics, we
can even turn it into a <em>lock-free concurrent hash map</em>.</p>

<p>I’ve written before about <a href="/blog/2022/08/08/">MSI hash tables</a>, a simple, <em>very</em> fast
map that can be quickly implemented from scratch as needed, tailored to
the problem at hand. The trade-off is that one must know the upper bound
<em>a priori</em> in order to size the base array. Scaling up requires resizing
the array — an impedance mismatch with arena allocation. Search trees
scale better, as there’s no underlying array, but tree balancing tends to
be finicky and complex, unsuitable to rapid, on-demand implementation.
<strong>We want the ease of an MSI hash table with the scaling of a tree.</strong></p>

<p>I’ll motivate the discussion with example usage. Suppose we have an array
of pointer+length strings, as defined last time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">uint8_t</span>  <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">str</span><span class="p">;</span>
</code></pre></div></div>

<p>And we need a function that removes duplicates in place, but (for the
moment) we’re not worried about preserving order. This could be done
naively in quadratic time. Smarter is to sort, then look for runs.
Instead, I’ve used a hash map to track seen strings. It maps <code class="language-plaintext highlighter-rouge">str</code> to
<code class="language-plaintext highlighter-rouge">bool</code>, and it is represented as type <code class="language-plaintext highlighter-rouge">strmap</code> and one insert+lookup
function, <code class="language-plaintext highlighter-rouge">upsert</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Insert/get bool value for given str key.</span>
<span class="n">bool</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">strmap</span> <span class="o">**</span><span class="p">,</span> <span class="n">str</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">ptrdiff_t</span> <span class="nf">unique</span><span class="p">(</span><span class="n">str</span> <span class="o">*</span><span class="n">strings</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">strmap</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">count</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">bool</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">seen</span><span class="p">,</span> <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">b</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// previously seen (discard)</span>
            <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">strings</span><span class="p">[</span><span class="o">--</span><span class="n">len</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="c1">// newly-seen (keep)</span>
            <span class="n">count</span><span class="o">++</span><span class="p">;</span>
            <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In particular, note:</p>

<ul>
  <li>
    <p>A null pointer is an empty hash map and initialization is trivial. As
discussed in the last article, one of my arena allocation principles is
default zero-initialization. Put together, that means any data structure
containing a map comes with a ready-to-use, empty map.</p>
  </li>
  <li>
    <p>The map is allocated out of the scratch arena so it’s automatically
freed upon any return. It’s as care-free as garbage collection.</p>
  </li>
  <li>
    <p>The map directly uses strings in the input array as keys, without making
copies nor worrying about ownership. Arenas own objects, not references.
If I wanted to carve out some fixed keys ahead of time, I could even
insert static strings.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">upsert</code> returns a pointer to a value. That is, a pointer into the map.
This is not strictly required, but usually makes for a simple interface.
When an entry is new, this value will be false (zero-initialized).</p>
  </li>
</ul>

<p>So, what is this wonderful data structure? Here’s the basic shape:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">hashmap</span> <span class="n">hashmap</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">hashmap</span> <span class="p">{</span>
    <span class="n">hashmap</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">keytype</span>  <span class="n">key</span><span class="p">;</span>
    <span class="n">valtype</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">child</code> and <code class="language-plaintext highlighter-rouge">key</code> fields are essential to the map. Adding a <code class="language-plaintext highlighter-rouge">child</code>
to any data structure turns it into a hash map over whatever field you
choose as the key. In other words, a hash-trie can serve as an <em>intrusive
hash map</em>. In several programs I’ve combined intrusive lists and hash maps
to create an insert-ordered hash map. Going the other direction, omitting
<code class="language-plaintext highlighter-rouge">value</code> turns it into a hash set. (Which is what <code class="language-plaintext highlighter-rouge">unique</code> <em>really</em> needs!)</p>

<p>As you probably guessed, this hash-trie is a 4-ary tree. It can easily be
2-ary (leaner but slower) or 8-ary (bigger and usually no faster), but
4-ary strikes a good balance, if a bit bulky. In the example above,
<code class="language-plaintext highlighter-rouge">keytype</code> would be <code class="language-plaintext highlighter-rouge">str</code> and <code class="language-plaintext highlighter-rouge">valtype</code> would be <code class="language-plaintext highlighter-rouge">bool</code>. The most general
form of <code class="language-plaintext highlighter-rouge">upsert</code> looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">valtype</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">hashmap</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">keytype</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">m</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">perm</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">hashmap</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This will take some unpacking. The first argument is a pointer to a
pointer. That’s the destination for any newly-allocated element. As it
travels down the tree, this points into the parent’s <code class="language-plaintext highlighter-rouge">child</code> array. If
it points to null, then it’s an empty tree which, by definition, does not
contain the key.</p>

<p>We need two “methods” for keys: <code class="language-plaintext highlighter-rouge">hash</code> and <code class="language-plaintext highlighter-rouge">equals</code>. The hash function
should return a uniformly distributed integer. As is usually the case,
less uniform fast hashes generally do better than highly-uniform slow
hashes. For hash maps under ~100K elements a 32-bit hash is fine, but
larger maps should use a 64-bit hash state and result. Hash collisions
revert to linear, linked list performance and, per the birthday paradox,
that will happen often with 32-bit hashes on large hash maps.</p>
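
<p>To put rough numbers on the birthday effect: among <em>n</em> keys there are
n(n-1)/2 pairs, and with an ideal uniform hash each pair collides with
probability 2<sup>-bits</sup>. The sketch below computes that estimate. The
function name is hypothetical, and real hashes are not perfectly uniform, so
treat the results as ballpark figures.</p>

```c
#include <assert.h>

// Expected number of key pairs sharing a full hash value, assuming n keys
// hashed uniformly into 2^bits values: n(n-1)/2 pairs, each colliding with
// probability 2^-bits.
static double expected_collisions(double n, int bits)
{
    assert(n >= 0 && bits >= 0);
    double space = 1;
    for (int i = 0; i < bits; i++) {
        space *= 2;
    }
    return n*(n - 1)/2/space;
}
```

<p>With a 32-bit hash this estimates roughly one colliding pair at 100K keys
and over a hundred at a million keys, while a 64-bit hash stays negligible at
that size.</p>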

<p>If you’re worried about pathological inputs, add a seed parameter to
<code class="language-plaintext highlighter-rouge">upsert</code> and <code class="language-plaintext highlighter-rouge">hash</code>. Or maybe even use the address <code class="language-plaintext highlighter-rouge">m</code> as a seed. The
specifics depend on your security model. It’s not an issue for most hash
maps, so I don’t demonstrate it here.</p>

<p>The top two bits of the hash are used to select a branch. These tend to be
higher quality for <a href="/blog/2018/07/31/">multiplicative hash functions</a>. At each level
two bits are shifted out. This is what gives it its name: a <em>trie</em> of the
<em>hash bits</em>. Though it’s un-trie-like in the way it deposits elements at
the first empty spot. To make it 2-ary or 8-ary, use 1 or 3 bits at a
time.</p>
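
<p>As a small sketch of that branch selection, the hypothetical helper below
takes the top <code class="language-plaintext highlighter-rouge">bits</code> bits as the child index and then shifts them out.
With <code class="language-plaintext highlighter-rouge">bits</code> set to 2 it does the same work as the <code class="language-plaintext highlighter-rouge">h&gt;&gt;62</code> and
<code class="language-plaintext highlighter-rouge">h &lt;&lt;= 2</code> pair in <code class="language-plaintext highlighter-rouge">upsert</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

// Take the top `bits` bits of *h as a child index, then shift them out.
// bits = 1, 2, or 3 corresponds to a 2-ary, 4-ary, or 8-ary trie.
static int next_branch(uint64_t *h, int bits)
{
    assert(bits >= 1 && bits <= 8);
    int i = (int)(*h >> (64 - bits));
    *h <<= bits;
    return i;
}
```

<p>For example, a hash whose top byte is <code class="language-plaintext highlighter-rouge">0x1b</code> (binary 00011011) walks
children 0, 1, 2, 3 in a 4-ary trie, and children 0, 6, 6, 0 in an 8-ary
trie.</p>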

<p>I initially tried a <a href="/blog/2019/11/19/">Multiplicative Congruential Generator</a> (MCG) to
select the next branch at each trie level, instead of bit shifting, but
NRK noticed it was consistently slower than shifting.</p>

<p>While “delete” could be handled using gravestones, many deletes would not
work well. After all, the underlying allocator is an arena. A combination
of uniformly distributed branching and no deletion means that rebalancing
is unnecessary. This is what grants it its simplicity!</p>

<p>If no arena is provided, it reverts to a lookup and returns null when the
key is not found. It allows one function to flexibly serve both modes. In
<code class="language-plaintext highlighter-rouge">unique</code>, pure lookups are unneeded, so this condition could be skipped in
its <code class="language-plaintext highlighter-rouge">strmap</code>.</p>
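
<p>Here is a minimal sketch of both modes on one function, specialized to
<code class="language-plaintext highlighter-rouge">int64_t</code> keys and <code class="language-plaintext highlighter-rouge">int</code> counts. The <code class="language-plaintext highlighter-rouge">intmap</code>,
<code class="language-plaintext highlighter-rouge">tally</code>, and <code class="language-plaintext highlighter-rouge">lookup</code> names are hypothetical, the integer hash is
just one multiply, and the allocator is an assumed stand-in for the one from
the arena article.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Assumed bump allocator: aligns, bumps, zeroes. Aborts when out of memory.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();
    }
    char *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

typedef struct intmap intmap;
struct intmap {
    intmap *child[4];
    int64_t key;
    int     value;
};

// With an arena: insert if missing. With a null arena: pure lookup.
static int *upsert(intmap **m, int64_t key, arena *perm)
{
    uint64_t h = (uint64_t)key * 1111111111111111111u;
    for (; *m; h <<= 2) {
        if (key == (*m)->key) {
            return &(*m)->value;
        }
        m = &(*m)->child[h>>62];
    }
    if (!perm) {
        return 0;  // lookup mode: key is absent
    }
    *m = alloc(perm, sizeof(intmap), _Alignof(intmap), 1);
    (*m)->key = key;
    return &(*m)->value;
}

// Insert mode: new entries start at zero, so incrementing just works.
static void tally(intmap **m, const int64_t *keys, ptrdiff_t len, arena *perm)
{
    for (ptrdiff_t i = 0; i < len; i++) {
        int *count = upsert(m, keys[i], perm);
        assert(count);  // insert mode always yields an entry
        ++*count;
    }
}

// Lookup mode: no arena, so the map is never modified.
static int lookup(intmap *m, int64_t key)
{
    int *count = upsert(&m, key, 0);
    return count ? *count : 0;
}
```

<p>Note that <code class="language-plaintext highlighter-rouge">tally</code> leans on zero-initialization: it never needs to
ask whether an entry is new.</p>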

<p>Sometimes it’s useful to return the entire <code class="language-plaintext highlighter-rouge">hashmap</code> object itself rather
than an internal pointer, particularly when it’s intrusive. Use whichever
works best for the situation. Regardless, exploit zero-initialization to
detect newly-allocated elements when possible.</p>

<p>In some cases we may deep copy the key in its arena before inserting it
into the map. The provided key may be a temporary (e.g. <code class="language-plaintext highlighter-rouge">sprintf</code>) which
the map outlives, and the caller doesn’t want to allocate a longer-lived
key unless it’s needed. It’s all part of tailoring the map to the problem,
which we can do because it’s so short and simple!</p>
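
<p>A sketch of that copy-on-insert variation, under the same assumptions
about the allocator as before. The <code class="language-plaintext highlighter-rouge">intern</code> name is hypothetical, and
the hash is the one used in the next section. It returns the map’s own copy
of the key, allocating only when the key is new.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Assumed bump allocator: aligns, bumps, zeroes. Aborts when out of memory.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();
    }
    char *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

typedef struct strset strset;
struct strset {
    strset *child[4];
    str     key;
};

static uint64_t hash(str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}

static int equals(str a, str b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}

// Return the set's own copy of key. The bytes are copied into perm only
// on first insertion, so a temporary key costs nothing if already present.
static str intern(strset **m, str key, arena *perm)
{
    assert(key.len >= 0);
    for (uint64_t h = hash(key); *m; h <<= 2) {
        if (equals(key, (*m)->key)) {
            return (*m)->key;
        }
        m = &(*m)->child[h>>62];
    }
    *m = alloc(perm, sizeof(strset), _Alignof(strset), 1);
    (*m)->key.data = alloc(perm, 1, 1, key.len);
    (*m)->key.len = key.len;
    if (key.len) {
        memcpy((*m)->key.data, key.data, key.len);
    }
    return (*m)->key;
}
```

<p>The caller may overwrite or discard its temporary buffer right after the
call, since the map never points into it.</p>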

<h3 id="fleshing-it-out">Fleshing it out</h3>

<p>Putting it all together, <code class="language-plaintext highlighter-rouge">unique</code> could look like the following, with
<code class="language-plaintext highlighter-rouge">strmap</code>/<code class="language-plaintext highlighter-rouge">upsert</code> renamed to <code class="language-plaintext highlighter-rouge">strset</code>/<code class="language-plaintext highlighter-rouge">ismember</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash</span><span class="p">(</span><span class="n">str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mh">0x100</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span> <span class="o">^=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="n">h</span> <span class="o">*=</span> <span class="mi">1111111111111111111u</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">bool</span> <span class="nf">equals</span><span class="p">(</span><span class="n">str</span> <span class="n">a</span><span class="p">,</span> <span class="n">str</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="o">==</span><span class="n">b</span><span class="p">.</span><span class="n">len</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">memcmp</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">strset</span> <span class="n">strset</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">strset</span> <span class="p">{</span>
    <span class="n">strset</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">str</span>     <span class="n">key</span><span class="p">;</span>
<span class="p">};</span>

<span class="n">bool</span> <span class="nf">ismember</span><span class="p">(</span><span class="n">strset</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">str</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">m</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">strset</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">ptrdiff_t</span> <span class="nf">unique</span><span class="p">(</span><span class="n">str</span> <span class="o">*</span><span class="n">strings</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">strset</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">count</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">ismember</span><span class="p">(</span><span class="o">&amp;</span><span class="n">seen</span><span class="p">,</span> <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">strings</span><span class="p">[</span><span class="o">--</span><span class="n">len</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">count</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The FNV-style hash multiplier is 19 ones, my favorite prime. I don’t bother with
an xorshift finalizer because the bits are used most-significant first.
Exercise for the reader: Support retaining the original input order using
an intrusive linked list on <code class="language-plaintext highlighter-rouge">strset</code>.</p>
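
<p>One possible take on that exercise, as a sketch rather than a reference
solution: add a <code class="language-plaintext highlighter-rouge">next</code> link to each node, append newly inserted nodes
to a list, then write the keys back in list order. The <code class="language-plaintext highlighter-rouge">insert</code> and
<code class="language-plaintext highlighter-rouge">unique_ordered</code> names are hypothetical, and the allocator is an assumed
stand-in for the one from the arena article.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Assumed bump allocator: aligns, bumps, zeroes. Aborts when out of memory.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();
    }
    char *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

typedef struct strset strset;
struct strset {
    strset *child[4];
    strset *next;  // intrusive list link, in insertion order
    str     key;
};

static uint64_t hash(str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}

static int equals(str a, str b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}

// Return the new node if key was inserted, or null if already present.
static strset *insert(strset **m, str key, arena *perm)
{
    for (uint64_t h = hash(key); *m; h <<= 2) {
        if (equals(key, (*m)->key)) {
            return 0;
        }
        m = &(*m)->child[h>>62];
    }
    *m = alloc(perm, sizeof(strset), _Alignof(strset), 1);
    (*m)->key = key;
    return *m;
}

static ptrdiff_t unique_ordered(str *strings, ptrdiff_t len, arena scratch)
{
    assert(len >= 0);
    strset  *seen = 0;
    strset  *head = 0;
    strset **tail = &head;
    for (ptrdiff_t i = 0; i < len; i++) {
        strset *node = insert(&seen, strings[i], &scratch);
        if (node) {
            *tail = node;  // append to the insertion-order list
            tail = &node->next;
        }
    }
    ptrdiff_t count = 0;
    for (strset *n = head; n; n = n->next) {
        strings[count++] = n->key;
    }
    return count;
}
```

<p>For this particular function a simple compacting loop would preserve order
too. The list earns its keep when the set outlives the call and something
needs to iterate it in insertion order.</p>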

<h3 id="relative-pointers">Relative pointers?</h3>

<p>As mentioned, four pointers per entry — 32 bytes on 64-bit hosts — makes
these hash-tries a bit heavier than average. It’s not an issue for smaller
hash maps, but has practical consequences for huge hash maps.</p>

<p>In an attempt to address this, I experimented with <a href="https://www.youtube.com/watch?v=Z0tsNFZLxSU">relative pointers</a>
(example: <a href="https://github.com/skeeto/scratch/blob/master/misc/markov.c"><code class="language-plaintext highlighter-rouge">markov.c</code></a>). That is, instead of pointers I use signed
integers whose value indicates an offset <em>relative to itself</em>. Because
relative pointers can only refer to nearby memory, a custom allocator is
imperative, and arenas fit the bill perfectly. Range can be extended by
exploiting memory alignment. In particular, 32-bit relative pointers can
reference up to 8GiB in either direction. Zero is reserved to represent a
null pointer, and relative pointers cannot refer to themselves.</p>

<p>As a bonus, data structures built out of relative pointers are <em>position
independent</em>. A collection of them — perhaps even a whole arena — can be
dumped out to, say, a file, loaded back at a different position, then
continue to operate as-is. Very cool stuff.</p>

<p>Using 32-bit relative pointers on 64-bit hosts cuts the hash-trie overhead
in half, to 16 bytes. With an arena no larger than 8GiB, such pointers are
guaranteed to work. No object is ever too far away. It’s a compounding
effect, too. Smaller map nodes means a larger number of them are in reach
of a relative pointer. Also very cool.</p>
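
<p>To make the idea concrete, here is a minimal sketch of a 32-bit relative
pointer. It is not taken from <code class="language-plaintext highlighter-rouge">markov.c</code>. The <code class="language-plaintext highlighter-rouge">relptr</code>,
<code class="language-plaintext highlighter-rouge">relset</code>, and <code class="language-plaintext highlighter-rouge">relget</code> names are hypothetical, and it assumes
every target is at least 4-byte aligned so that offsets can be stored in
4-byte units, which is where the 8GiB reach comes from.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Signed offset from the relptr's own address, counted in 4-byte units.
// Zero is reserved for null.
typedef int32_t relptr;

static void relset(relptr *r, void *target)
{
    if (!target) {
        *r = 0;
        return;
    }
    ptrdiff_t delta = (char *)target - (char *)r;
    assert(delta != 0);      // cannot refer to itself
    assert(delta % 4 == 0);  // relies on 4-byte alignment
    assert(delta/4 >= INT32_MIN && delta/4 <= INT32_MAX);
    *r = (relptr)(delta/4);
}

static void *relget(relptr *r)
{
    return *r ? (char *)r + (ptrdiff_t)*r * 4 : 0;
}

// A list node using a relative link: 8 bytes rather than 16.
typedef struct {
    relptr  next;
    int32_t value;
} node;

static int sum(node *n)
{
    int total = 0;
    for (; n; n = relget(&n->next)) {
        total += n->value;
    }
    return total;
}
```

<p>Because each link is relative to its own location, copying a whole array
of linked nodes to another address leaves the links valid, which is the
position independence described above.</p>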

<p>However, as far as I know, no generally available programming language
implementation supports this concept well enough to put into practice. You
could implement relative pointers with language extension facilities, such
as C++ operator overloads, but <em>no tools will understand them</em> — a major
bummer. You can no longer use a debugger to examine such structures, and
it’s just not worth that cost. If only arena allocation was more popular…</p>

<h3 id="as-a-concurrent-hash-map">As a concurrent hash map</h3>

<p>For the finale, let’s convert <code class="language-plaintext highlighter-rouge">upsert</code> into a concurrent, lock-free hash
map. That is, multiple threads can call <code class="language-plaintext highlighter-rouge">upsert</code> concurrently on the same
map. Each thread must still have its own arena, probably a per-thread arena,
so allocation involves no implicit locking.</p>

<p>The structure itself requires no changes! Instead we need two atomic
operations: atomic load (acquire), and atomic compare-and-exchange
(acquire/release). They operate only on <code class="language-plaintext highlighter-rouge">child</code> array elements and the
tree root. To illustrate I will use <a href="https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html">GCC atomics</a>, also supported by
Clang.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">valtype</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">map</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">keytype</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">map</span> <span class="o">*</span><span class="n">n</span> <span class="o">=</span> <span class="n">__atomic_load_n</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">__ATOMIC_ACQUIRE</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">n</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">perm</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="n">arena</span> <span class="n">rollback</span> <span class="o">=</span> <span class="o">*</span><span class="n">perm</span><span class="p">;</span>
            <span class="n">map</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">map</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
            <span class="n">new</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="kt">int</span> <span class="n">pass</span> <span class="o">=</span> <span class="n">__ATOMIC_RELEASE</span><span class="p">;</span>
            <span class="kt">int</span> <span class="n">fail</span> <span class="o">=</span> <span class="n">__ATOMIC_ACQUIRE</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">__atomic_compare_exchange_n</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">n</span><span class="p">,</span> <span class="n">new</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">pass</span><span class="p">,</span> <span class="n">fail</span><span class="p">))</span> <span class="p">{</span>
                <span class="k">return</span> <span class="o">&amp;</span><span class="n">new</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="o">*</span><span class="n">perm</span> <span class="o">=</span> <span class="n">rollback</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">n</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">n</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="n">n</span><span class="o">-&gt;</span><span class="n">child</span> <span class="o">+</span> <span class="p">(</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First an atomic load retrieves the current node. If there is no such node,
then attempt to insert one using atomic compare-and-exchange. The <a href="/blog/2014/09/02/">ABA
problem</a> is not an issue thanks again to lack of deletion: Once set,
a pointer never changes. Before allocating a node, take a snapshot of the
arena so that the allocation can be reverted on failure. If another thread
got there first, continue tumbling down the tree <em>as though a null was
never observed</em>.</p>

<p>On compare-and-swap failure, it turns into an acquire load, just as it
began. On success, it’s a release store, synchronizing with acquire loads
on other threads.</p>

<p>The <code class="language-plaintext highlighter-rouge">key</code> field does not require atomics because it’s synchronized by the
compare-and-swap. That is, the assignment will happen before the node is
inserted, and keys do not change after insertion. The same goes for any
zeroing done by the arena.</p>

<p><strong>Loads and stores through the returned pointer are the caller’s
responsibility.</strong> These likely require further synchronization. If
<code class="language-plaintext highlighter-rouge">valtype</code> is a shared counter then an atomic increment is sufficient. In
other cases, <code class="language-plaintext highlighter-rouge">upsert</code> should probably be modified to accept an initial
value to be assigned alongside the key so that the entire key/value pair
is inserted atomically. Alternatively, <a href="/blog/2022/05/14/">break it into two steps</a>. The
details depend on the needs of the program.</p>
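
<p>As a rough sketch of the shared-counter case: the version below assumes
<code class="language-plaintext highlighter-rouge">uint64_t</code> keys, <code class="language-plaintext highlighter-rouge">int</code> values, a stand-in multiplicative hash, a
minimal bump-allocating arena, and a hypothetical <code class="language-plaintext highlighter-rouge">bump</code> helper. None of
these choices are fixed by the technique, and the atomics are the GCC and
Clang builtins used above.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Assumed minimal arena: a pair of pointers, bump allocation, abort on OOM.
typedef struct {
    char *beg;
    char *end;
} arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}
#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)

typedef struct map map;
struct map {
    map     *child[4];
    uint64_t key;
    int      value;  // a shared counter in this sketch
};

// Stand-in hash (an assumption): multiplication pushes entropy into the
// high bits, which are the bits upsert consumes.
static uint64_t hash(uint64_t key)
{
    return key * 1111111111111111111u;
}

static int *upsert(map **m, uint64_t key, arena *perm)
{
    for (uint64_t h = hash(key);; h <<= 2) {
        map *n = __atomic_load_n(m, __ATOMIC_ACQUIRE);
        if (!n) {
            if (!perm) {
                return 0;  // lookup only, key not present
            }
            arena rollback = *perm;
            map *node = new(perm, map, 1);
            node->key = key;
            int pass = __ATOMIC_RELEASE;
            int fail = __ATOMIC_ACQUIRE;
            if (__atomic_compare_exchange_n(m, &n, node, 0, pass, fail)) {
                return &node->value;
            }
            *perm = rollback;  // lost the race: n now holds the winner
        }
        if (n->key == key) {
            return &n->value;
        }
        m = n->child + (h>>62);
    }
}

// Hypothetical caller: atomically increment the counter stored under key
// and return the new count. Each thread must pass its own arena.
static int bump(map **m, uint64_t key, arena *perm)
{
    int *counter = upsert(m, key, perm);
    return __atomic_add_fetch(counter, 1, __ATOMIC_RELAXED);
}
```

<p>The zero-initialized value doubles as the initial count, so no separate
initialization step is needed. The arena itself is not synchronized, which
is why each thread would bring its own.</p>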

<p>On small trees there will be much contention near the root of the tree during
inserts. Fortunately, a contentious tree will not stay small for long! The
hash function will spread threads around a large tree, generally keeping
them off each other’s toes.</p>

<p>A complete demo you can try yourself: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/concurrent-hash-trie.c"><code class="language-plaintext highlighter-rouge">concurrent-hash-trie.c</code></a></strong>.
It returns a value pointer like above, and store/load is synchronized by
the thread join. Each thread is given a per-thread subarena allocated out
of the main arena, and the final tree is built from these subarenas.</p>

<p>For a practical example: a <a href="https://github.com/skeeto/scratch/blob/master/misc/rainbow.c"><strong>multithreaded rainbow table</strong></a> to find
hash function collisions. Threads are synchronized solely through atomics
in the shared hash-trie.</p>

<p>A complete fast, concurrent, lock-free hash map in under 30 lines of C
sounds like a sweet deal to me!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Arena allocator tips and tricks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/09/27/"/>
    <id>urn:uuid:46b2ee54-9169-4070-ad5d-aa0e2700a65e</id>
    <updated>2023-09-27T03:58:59Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=37670740">on Hacker News</a>.</em></p>

<p>Over the past year I’ve refined my approach to <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">arena allocation</a>.
With practice, it’s effective, simple, and fast; typically as easy to use
as garbage collection but without the costs. Depending on need, an
allocator can weigh just 7–25 lines of code — perfect when <a href="/blog/2023/02/15/">lacking a
runtime</a>. With the core details of my own technique settled, now is a
good time to document and share lessons learned. This is certainly not the
only way to approach arena allocation, but these are practices I’ve worked
out to simplify programs and reduce mistakes.</p>

<!--more-->

<p>An arena is a memory buffer and an offset into that buffer, initially
zero. To allocate an object, grab a pointer at the offset, advance the
offset by the size of the object, and return the pointer. There’s a little
more to it, such as ensuring alignment and availability. We’ll get to
that. Objects are not freed individually. Instead, groups of allocations
are freed at once by restoring the offset to an earlier value. Without
individual lifetimes, you don’t need to write destructors, nor do your
programs need to walk data structures at run time to take them apart. You
also no longer need to worry about memory leaks.</p>

<p>A minority of programs inherently require general purpose allocation, at
least in part, that linear allocation cannot fulfill. This includes, for
example, most programming language runtimes. If you like arenas, avoid
accidentally creating such a situation through an over-flexible API that
allows callers to assume you have general purpose allocation underneath.</p>

<p>To get warmed up, here’s my style of arena allocation in action that shows
off multiple features:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">uint8_t</span>  <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">str</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">strlist</span> <span class="n">strlist</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">strlist</span> <span class="p">{</span>
    <span class="n">strlist</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="n">str</span>      <span class="n">item</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">str</span> <span class="n">head</span><span class="p">;</span>
    <span class="n">str</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span> <span class="n">strpair</span><span class="p">;</span>

<span class="c1">// Defined elsewhere</span>
<span class="kt">void</span>    <span class="nf">towidechar</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">ptrdiff_t</span><span class="p">,</span> <span class="n">str</span><span class="p">);</span>
<span class="n">str</span>     <span class="nf">loadfile</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>
<span class="n">strpair</span> <span class="nf">cut</span><span class="p">(</span><span class="n">str</span><span class="p">,</span> <span class="kt">uint8_t</span><span class="p">);</span>

<span class="n">strlist</span> <span class="o">*</span><span class="nf">getlines</span><span class="p">(</span><span class="n">str</span> <span class="n">path</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">max_path</span> <span class="o">=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">15</span><span class="p">;</span>
    <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">wpath</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="o">&amp;</span><span class="n">scratch</span><span class="p">,</span> <span class="kt">wchar_t</span><span class="p">,</span> <span class="n">max_path</span><span class="p">);</span>
    <span class="n">towidechar</span><span class="p">(</span><span class="n">wpath</span><span class="p">,</span> <span class="n">max_path</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>

    <span class="n">strpair</span> <span class="n">pair</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">pair</span><span class="p">.</span><span class="n">tail</span> <span class="o">=</span> <span class="n">loadfile</span><span class="p">(</span><span class="n">wpath</span><span class="p">,</span> <span class="n">perm</span><span class="p">);</span>

    <span class="n">strlist</span> <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">strlist</span> <span class="o">**</span><span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">head</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">pair</span><span class="p">.</span><span class="n">tail</span><span class="p">.</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pair</span> <span class="o">=</span> <span class="n">cut</span><span class="p">(</span><span class="n">pair</span><span class="p">.</span><span class="n">tail</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">);</span>
        <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">strlist</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="p">(</span><span class="o">*</span><span class="n">tail</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">item</span> <span class="o">=</span> <span class="n">pair</span><span class="p">.</span><span class="n">head</span><span class="p">;</span>
        <span class="n">tail</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">tail</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Take note of these details, each to be later discussed in detail:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">getlines</code> takes two arenas, “permanent” and “scratch”. The former is
for objects that will be returned to the caller. The latter is for
temporary objects whose lifetime ends when the function returns. They
have stack lifetimes just like local variables.</p>
  </li>
  <li>
    <p>Objects are not explicitly freed. Instead, <strong>all allocations from a
scratch arena are implicitly freed upon return</strong>. This would include
error return paths automatically.</p>
  </li>
  <li>
    <p>The <strong>scratch arena is passed by copy</strong> — i.e. a copy of the “header”
not the <em>memory region</em> itself. Allocating only changes the local copy,
and so cannot survive the return. The semantics are obvious to callers,
so they’re less likely to get mixed up.</p>
  </li>
  <li>
    <p>While <code class="language-plaintext highlighter-rouge">wpath</code> could be an automatic local variable, it’s relatively
large for the stack, so it’s allocated out of the scratch arena. A
scratch arena safely permits large, dynamic allocations that would never
be safe on the stack. In other words, <strong>a sane <a href="https://man7.org/linux/man-pages/man3/alloca.3.html"><code class="language-plaintext highlighter-rouge">alloca</code></a>!</strong>
Same for variable-length arrays (VLAs). A scratch arena means you’ll
never be tempted to use either of these terrible ideas.</p>
  </li>
  <li>
    <p>The second parameter to <code class="language-plaintext highlighter-rouge">new</code> is a type, so it’s obviously a macro. As
you will see momentarily, this is not some complex macro magic, just a
convenience one-liner. There is no implicit cast, and you will get a
compiler diagnostic if the type is incorrect.</p>
  </li>
  <li>
    <p>Despite all the allocation, there is not a single <code class="language-plaintext highlighter-rouge">sizeof</code> operator nor
size computation. That’s because <strong>size computations are a major source
of defects.</strong> That job is handled by specialized code.</p>
  </li>
  <li>
    <p><strong>Allocation failures are not communicated by a null return</strong>. Lifting
this burden greatly simplifies programs. Instead such errors are handled
non-locally by the arena.</p>
  </li>
  <li>
    <p>All allocations are <strong>zero-initialized by default</strong>. This makes for
simpler, less error-prone programs. When that’s too expensive, this can
become an opt-out without changing the default.</p>
  </li>
</ul>

<p>See also <a href="/blog/2023/01/18/">u-config</a>.</p>

<h3 id="an-arena-implementation">An arena implementation</h3>

<p>An arena suitable for most cases can be this simple:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span><span class="p">;</span>
<span class="p">}</span> <span class="n">arena</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">align</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">count</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">padding</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">ptrdiff_t</span> <span class="n">available</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">end</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">-</span> <span class="n">padding</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">available</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">count</span> <span class="o">&gt;</span> <span class="n">available</span><span class="o">/</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">abort</span><span class="p">();</span>  <span class="c1">// one possible out-of-memory policy</span>
    <span class="p">}</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">+</span> <span class="n">padding</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">beg</span> <span class="o">+=</span> <span class="n">padding</span> <span class="o">+</span> <span class="n">count</span><span class="o">*</span><span class="n">size</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">count</span><span class="o">*</span><span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Yup, just a pair of pointers! When allocating, all sizes are signed <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf">just
as they ought to be</a>. Unsigned sizes are another historically
common source of defects, and offer no practical advantages in return.</p>

<p>The <code class="language-plaintext highlighter-rouge">align</code> parameter allows the arena to handle any unusual alignments,
something that’s surprisingly difficult to do with libc. It’s difficult to
appreciate its usefulness until it’s convenient.</p>

<p>The <code class="language-plaintext highlighter-rouge">uintptr_t</code> business may look unusual if you’ve never come across it
before. To align <code class="language-plaintext highlighter-rouge">beg</code>, we need to compute the number of bytes to advance
the address (<code class="language-plaintext highlighter-rouge">padding</code>) until the alignment evenly divides the address.
The modulo with <code class="language-plaintext highlighter-rouge">align</code> computes how many bytes the address lies past the
last aligned address:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>extra = addr % align
</code></pre></div></div>

<p>We can’t operate numerically on an address like this, so in the code we
first convert to <code class="language-plaintext highlighter-rouge">uintptr_t</code>. Alignment is always a power of two, which
notably excludes zero, so no worrying about division by zero. That also
means we can compute modulo by subtracting one and masking with AND:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>extra = addr &amp; (align - 1)
</code></pre></div></div>

<p>However, we want the number of bytes to advance to the next alignment,
which is the inverse:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>padding = -addr &amp; (align - 1)
</code></pre></div></div>

<p>Add the <code class="language-plaintext highlighter-rouge">uintptr_t</code> cast and you have the code in <code class="language-plaintext highlighter-rouge">alloc</code>.</p>
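
<p>To make the arithmetic concrete, here is the padding expression pulled
out into a hypothetical helper, <code class="language-plaintext highlighter-rouge">padding_for</code> (a name invented for this
illustration). It assumes, as above, that <code class="language-plaintext highlighter-rouge">align</code> is a power of two.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Bytes to skip so that addr + padding is a multiple of align.
// Assumes align is a power of two. Negation is done on the unsigned
// type, so it wraps rather than overflowing.
static ptrdiff_t padding_for(uintptr_t addr, ptrdiff_t align)
{
    return (ptrdiff_t)(-addr & (uintptr_t)(align - 1));
}
```

<p>For example, address 13 with 8-byte alignment yields a padding of 3,
advancing to 16, while an already-aligned address yields zero.</p>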

<p>The <code class="language-plaintext highlighter-rouge">if</code> tests if there’s enough memory and simultaneously for overflow on
<code class="language-plaintext highlighter-rouge">size*count</code>. If either fails, it invokes the out-of-memory policy, which
in this case is <code class="language-plaintext highlighter-rouge">abort</code>. I strongly recommend, at least when testing,
always having <em>something</em> in place that, at minimum, aborts when allocation
fails, even when you think it cannot happen. It’s easy to use more memory
than you anticipate, and you want a reliable signal when it happens.</p>
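
<p>The reason for dividing rather than multiplying is that <code class="language-plaintext highlighter-rouge">count*size</code> could
itself overflow before the comparison, which is undefined for signed
integers. A minimal sketch of the same test in isolation, using a
hypothetical <code class="language-plaintext highlighter-rouge">fits</code> helper and assuming a positive <code class="language-plaintext highlighter-rouge">size</code> and non-negative
<code class="language-plaintext highlighter-rouge">count</code>:</p>

```c
#include <stddef.h>
#include <stdint.h>

// Returns 1 if count objects of the given size fit in available bytes.
// Dividing avoids ever computing count*size, so a huge count cannot
// overflow. Assumes size > 0 and count >= 0.
static int fits(ptrdiff_t available, ptrdiff_t size, ptrdiff_t count)
{
    return available >= 0 && count <= available/size;
}
```

<p>A request for <code class="language-plaintext highlighter-rouge">PTRDIFF_MAX</code> eight-byte objects is rejected here even
though its byte size is not representable.</p>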

<p>An alternative policy is to <a href="/blog/2023/02/12/">longjmp to a “handler”</a>, which with
GCC and Clang doesn’t even require runtime support. In that case add a
<code class="language-plaintext highlighter-rouge">jmp_buf</code> to the arena:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">char</span>  <span class="o">*</span><span class="n">beg</span><span class="p">;</span>
    <span class="kt">char</span>  <span class="o">*</span><span class="n">end</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">**</span><span class="kt">jmp_buf</span><span class="p">;</span>
<span class="p">}</span> <span class="n">arena</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(...)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="k">if</span> <span class="p">(</span><span class="cm">/* out of memory */</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">__builtin_longjmp</span><span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="kt">jmp_buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="n">bool</span> <span class="nf">example</span><span class="p">(...,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">5</span><span class="p">];</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">__builtin_setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">scratch</span><span class="p">.</span><span class="kt">jmp_buf</span> <span class="o">=</span> <span class="kt">jmp_buf</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">example</code> returns failure to the caller if it runs out of memory, without
needing to check individual allocations and, thanks to the implicit free
of scratch arenas, without needing to clean up. If callees receiving the
scratch arena don’t set their own <code class="language-plaintext highlighter-rouge">jmp_buf</code>, they’ll return here, too. In
a real program you’d probably wrap the <code class="language-plaintext highlighter-rouge">setjmp</code> setup in a macro.</p>

<p>Suppose zeroing is too expensive or unnecessary in some cases. Add a flag
to opt out:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(...,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="k">return</span> <span class="n">flags</span><span class="o">&amp;</span><span class="n">NOZERO</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">total</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Similarly, perhaps there’s a critical moment where you’re holding a
non-memory resource (lock, file handle), or you don’t want allocation
failure to be fatal. In either case, it’s important that the out-of-memory
policy isn’t invoked. You could request a “soft” failure with another
flag, and then do the usual null pointer check:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(...,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="k">if</span> <span class="p">(</span><span class="cm">/* out of memory */</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">SOFTFAIL</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Most non-trivial programs will probably have at least one of these flags.</p>
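
<p>Putting both fragments together, one possible complete <code class="language-plaintext highlighter-rouge">alloc</code> with flags
might look like the following. The flag values and the <code class="language-plaintext highlighter-rouge">total</code> variable are
assumptions filled in for illustration.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum {
    NOZERO   = 1<<0,  // skip zero-initialization (assumed value)
    SOFTFAIL = 1<<1,  // return null instead of aborting (assumed value)
};

typedef struct {
    char *beg;
    char *end;
} arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align,
                   ptrdiff_t count, int flags)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        if (flags & SOFTFAIL) {
            return 0;  // caller does the usual null pointer check
        }
        abort();
    }
    ptrdiff_t total = count * size;  // cannot overflow after the check
    char *p = a->beg + padding;
    a->beg = p + total;
    return flags&NOZERO ? p : memset(p, 0, total);
}
```

<p>A soft failure leaves the arena untouched, so the caller can fall back to
a smaller request or release its resource and report the error.</p>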

<p>In case it wasn’t obvious, allocating an arena is simple:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">arena</span> <span class="nf">newarena</span><span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">cap</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">arena</span> <span class="n">a</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">a</span><span class="p">.</span><span class="n">beg</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">cap</span><span class="p">);</span>
    <span class="n">a</span><span class="p">.</span><span class="n">end</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">beg</span> <span class="o">?</span> <span class="n">a</span><span class="p">.</span><span class="n">beg</span><span class="o">+</span><span class="n">cap</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or make a direct allocation from the operating system, e.g. <code class="language-plaintext highlighter-rouge">mmap</code>,
<code class="language-plaintext highlighter-rouge">VirtualAlloc</code>. Typically arena lifetime is the whole program, so you
don’t need to worry about freeing it. (Since you’re using arenas, you can
also turn off any memory leak checkers while you’re at it.)</p>

<p>If you need more arenas then you can always allocate smaller ones out of
the first! In multi-threaded applications, each thread may have at least
its own scratch arena.</p>
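
<p>Carving out a smaller arena is itself just an allocation. A minimal
sketch, assuming the two-pointer <code class="language-plaintext highlighter-rouge">arena</code> and <code class="language-plaintext highlighter-rouge">alloc</code> from above and a
hypothetical <code class="language-plaintext highlighter-rouge">subarena</code> helper with an arbitrarily chosen 16-byte alignment:</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

// Hypothetical helper: reserve cap bytes of the parent as a new arena.
// The 16-byte alignment is an arbitrary choice for this sketch.
static arena subarena(arena *parent, ptrdiff_t cap)
{
    arena sub = {0};
    sub.beg = alloc(parent, 1, 16, cap);
    sub.end = sub.beg + cap;
    return sub;
}
```

<p>Each thread could then be handed its own <code class="language-plaintext highlighter-rouge">subarena</code> result at startup, after
which threads allocate without touching the parent.</p>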

<h3 id="the-new-macro">The <code class="language-plaintext highlighter-rouge">new</code> macro</h3>

<p>I’ve shown <code class="language-plaintext highlighter-rouge">alloc</code>, but few parts of the program should be calling it
directly. Instead they have a macro to automatically handle the details. I
call mine <code class="language-plaintext highlighter-rouge">new</code>, though of course if you’re writing C++ you’ll need to
pick another name (<code class="language-plaintext highlighter-rouge">make</code>? <code class="language-plaintext highlighter-rouge">PushStruct</code>?):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)
</span></code></pre></div></div>

<p>The cast is an extra compile-time check, especially useful for avoiding
mistakes in levels of indirection. It also keeps normal code from directly
using the <code class="language-plaintext highlighter-rouge">sizeof</code> operator, which is easy to misuse. If you added a
<code class="language-plaintext highlighter-rouge">flags</code> parameter, pass in zero for this common case. Keep in mind that
the goal of this macro is to make common allocation simple and robust.</p>

<p>Often you’ll allocate single objects, and so the count is 1. If you think
that’s ugly, you could make a variadic version of <code class="language-plaintext highlighter-rouge">new</code> that fills in common
defaults. In fact, that’s partly why I put <code class="language-plaintext highlighter-rouge">count</code> last!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define new(...)            newx(__VA_ARGS__,new4,new3,new2)(__VA_ARGS__)
#define newx(a,b,c,d,e,...) e
#define new2(a, t)          (t *)alloc(a, sizeof(t), _Alignof(t), 1, 0)
#define new3(a, t, n)       (t *)alloc(a, sizeof(t), _Alignof(t), n, 0)
#define new4(a, t, n, f)    (t *)alloc(a, sizeof(t), _Alignof(t), n, f)
</span></code></pre></div></div>

<p>Not quite so simple, but it optionally makes for more streamlined code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">thing</span> <span class="o">*</span><span class="n">t</span>   <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">thing</span><span class="p">);</span>
<span class="n">thing</span> <span class="o">*</span><span class="n">ts</span>  <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">thing</span><span class="p">,</span> <span class="mi">1000</span><span class="p">);</span>
<span class="kt">char</span>  <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="kt">char</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">NOZERO</span><span class="p">);</span>
</code></pre></div></div>

<p>Side note: If <code class="language-plaintext highlighter-rouge">sizeof</code> should be avoided, what about array lengths? That’s
part of the problem! Hardly ever do you want the <em>size</em> of an array, but
rather the <em>number of elements</em>. That includes <code class="language-plaintext highlighter-rouge">char</code> arrays where this
happens to be the same number. So instead, define a <code class="language-plaintext highlighter-rouge">countof</code> macro that
uses <code class="language-plaintext highlighter-rouge">sizeof</code> to compute the value you actually want. I like to have this
whole collection:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define sizeof(x)    (ptrdiff_t)sizeof(x)
#define countof(a)   (sizeof(a) / sizeof(*(a)))
#define lengthof(s)  (countof(s) - 1)
</span></code></pre></div></div>

<p>Yes, you can convert <code class="language-plaintext highlighter-rouge">sizeof</code> into a macro like this! It won’t expand
recursively and bottoms out as an operator. <code class="language-plaintext highlighter-rouge">countof</code> also, of course,
produces a less error-prone signed count so users don’t fumble around with
<code class="language-plaintext highlighter-rouge">size_t</code>. <code class="language-plaintext highlighter-rouge">lengthof</code> statically produces null-terminated string length.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">msg</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"hello world"</span><span class="p">;</span>
<span class="n">write</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="n">lengthof</span><span class="p">(</span><span class="n">msg</span><span class="p">));</span>

<span class="cp">#define MSG "hello world"
</span><span class="n">write</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">MSG</span><span class="p">,</span> <span class="n">lengthof</span><span class="p">(</span><span class="n">MSG</span><span class="p">));</span>
</code></pre></div></div>
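
<p>A rough sketch of why the signed result matters, using a hypothetical <code class="language-plaintext highlighter-rouge">sum_reversed</code> helper. The macros go after the includes so that the standard headers never see the redefined <code class="language-plaintext highlighter-rouge">sizeof</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

#define sizeof(x)    (ptrdiff_t)sizeof(x)
#define countof(a)   (sizeof(a) / sizeof(*(a)))
#define lengthof(s)  (countof(s) - 1)

// The result is still an integer constant expression.
static_assert(lengthof("hello") == 5, "computed at compile time");

// Hypothetical helper: sum an array from the last element to the first.
static int sum_reversed(void)
{
    static const int values[] = {1, 2, 3, 4};
    int sum = 0;
    // With an unsigned count, i >= 0 is always true and the loop would
    // never terminate. With a signed count it just works.
    for (ptrdiff_t i = countof(values) - 1; i >= 0; i--) {
        sum += values[i];
    }
    return sum;
}
```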

<h3 id="enhance-alloc-with-attributes">Enhance <code class="language-plaintext highlighter-rouge">alloc</code> with attributes</h3>

<p>At least for GCC and Clang, we can further improve <code class="language-plaintext highlighter-rouge">alloc</code> with three
function attributes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">malloc</span><span class="p">,</span> <span class="n">alloc_size</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">alloc_align</span><span class="p">(</span><span class="mi">3</span><span class="p">)))</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(...);</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">malloc</code> indicates that the pointer returned by <code class="language-plaintext highlighter-rouge">alloc</code> does not alias any
existing object. Enables some significant optimizations that are otherwise
blocked, most often by breaking potential loop-carried dependencies.</p>
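
<p>A small sketch of the kind of loop this affects. The <code class="language-plaintext highlighter-rouge">counts</code> struct and <code class="language-plaintext highlighter-rouge">alloc_ints</code> are hypothetical stand-ins for arena allocation, and whether a particular compiler actually hoists the load depends on version and flags.</p>

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int *values;
    int  len;
} counts;

// Hypothetical stand-in for an arena allocator: a bump pointer over a
// static pool. The attribute promises the result aliases nothing else.
__attribute((malloc))
static int *alloc_ints(int n)
{
    static int pool[1<<10];
    static int used;
    assert(n >= 0 && n <= (1<<10) - used);
    int *p = pool + used;
    used += n;
    return p;
}

static void fill(counts *c)
{
    c->values = alloc_ints(c->len);
    // Absent the attribute, a store through c->values[i] might modify
    // c->len (both are int), so the compiler may have to reload c->len
    // on every iteration. Knowing the new pointer aliases nothing, it is
    // free to keep the length in a register.
    for (int i = 0; i < c->len; i++) {
        c->values[i] = i;
    }
}
```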

<p><code class="language-plaintext highlighter-rouge">alloc_size</code> tracks the allocation size for compile-time diagnostics and
run-time assertions (<a href="https://gcc.gnu.org/onlinedocs/gcc/Object-Size-Checking.html"><code class="language-plaintext highlighter-rouge">__builtin_object_size</code></a>). This generally
requires a non-zero optimization level. In other words, you will get
compiler warnings about some out-of-bounds accesses of arena objects, and
with Undefined Behavior Sanitizer you’ll get run-time bounds checking.
It’s a great <a href="/blog/2019/01/25/">complement to fuzzing</a>.</p>

<p><strong>Update June 2024</strong>: I’ve learned that <a href="https://lists.sr.ht/~skeeto/public-inbox/%3Cane2ee7fpnyn3qxslygprmjw2yrvzppxuim25jvf7e6f5jgxbd@p7y6own2j3it%3E"><code class="language-plaintext highlighter-rouge">alloc_size</code> is fundamentally
broken</a> since its <a href="https://gcc.gnu.org/gcc-4.3/changes.html">introduction in GCC 4.3.0 (March 2008)</a>.
Correct use is impossible, and existing instances all rely on luck. In
certain cases, such as function inlining, the pointer information is lost,
and GCC may generate invalid code based on stale data.</p>

<p>In theory <code class="language-plaintext highlighter-rouge">alloc_align</code> may also allow better code generation, but I’ve
yet to observe a case. Consider it optional and low-priority. I mention it
only for completeness.</p>

<h3 id="arena-size-and-growth">Arena size and growth</h3>

<p>How large an arena should you allocate? The simple answer: As much as is
necessary for the program to successfully complete. Usually the cost of
untouched arena memory is low or even zero. Most programs should probably
have an upper limit, at which point they assume something has gone wrong.
Arenas allow this case to be handled gracefully, simplifying recovery and
paving the way for continued operation.</p>

<p>While a sufficient answer for most cases, it’s unsatisfying. There’s a
common assumption that programs should increase their memory usage as much
as needed and let the operating system respond if it’s too much. However,
if you’ve ever tried this yourself, you probably noticed that mainstream
operating systems don’t handle it well. The typical results are system
instability — thrashing, drivers crashing — possibly necessitating a
reboot.</p>

<p>If you insist on this route, on 64-bit hosts you can reserve a gigantic
virtual address space and gradually commit memory as needed. On Linux that
means leaning on overcommit by allocating the largest arena possible at
startup, which will automatically commit through use. <a href="/blog/2019/12/29/">Use <code class="language-plaintext highlighter-rouge">MADV_FREE</code> to
decommit.</a></p>
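
<p>On Linux that might look roughly like the following. The function names are hypothetical, and <code class="language-plaintext highlighter-rouge">MAP_NORESERVE</code> is my assumption about how to lean on overcommit for a very large mapping.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Reserve cap bytes of address space. Pages are committed by the kernel
// only when first touched. Returns a zeroed arena on failure.
static arena reserve(ptrdiff_t cap)
{
    arena a = {0};
    void *p = mmap(0, cap, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if (p != MAP_FAILED) {
        a.beg = p;
        a.end = a.beg + cap;
    }
    return a;
}

// After resetting the arena, hint that [beg, end) is no longer needed.
// beg must be page-aligned. The kernel may reclaim the pages lazily, and
// until the next write they read as either the old contents or zeros.
static int decommit(char *beg, char *end)
{
    return madvise(beg, end - beg, MADV_FREE);
}
```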

<p>On Windows, <code class="language-plaintext highlighter-rouge">VirtualAlloc</code> handles reserve and commit separately. In
addition to the allocation offset, you need a commit offset. Then expand
the committed region ahead of the allocation offset as it grows. If you
ever manually reset the allocation offset, you could decommit as well, or
at least <code class="language-plaintext highlighter-rouge">MEM_RESET</code>. At some point commit may fail, which should then
trigger the out-of-memory policy, but the system is probably in poor shape
by that point — i.e. use an abort policy to release it all quickly.</p>

<h3 id="pointer-laundering-filthy-hack">Pointer laundering (filthy hack)</h3>

<p>While allocations out of an arena don’t require individual error checks,
allocating the arena itself at startup requires error handling. It would
be nice if the arena could be allocated out of <code class="language-plaintext highlighter-rouge">.bss</code> and punt that job to
the loader. While you <em>could</em> make a big, global <code class="language-plaintext highlighter-rouge">char[]</code> array to back
your arena, it’s technically not permitted (strict aliasing). A “clean”
<code class="language-plaintext highlighter-rouge">.bss</code> region could be obtained with a bit of assembly — <a href="https://sourceware.org/binutils/docs/as/Comm.html"><code class="language-plaintext highlighter-rouge">.comm</code></a>
plus assembly to get the address into C without involving an array. I
wanted a more portable solution, so I came up with this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">arena</span> <span class="nf">getarena</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">char</span> <span class="n">mem</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">28</span><span class="p">];</span>
    <span class="n">arena</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">r</span><span class="p">.</span><span class="n">beg</span> <span class="o">=</span> <span class="n">mem</span><span class="p">;</span>
    <span class="n">asm</span> <span class="p">(</span><span class="s">""</span> <span class="o">:</span> <span class="s">"+r"</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">beg</span><span class="p">));</span>  <span class="c1">// launder the pointer</span>
    <span class="n">r</span><span class="p">.</span><span class="n">end</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="n">beg</span> <span class="o">+</span> <span class="n">countof</span><span class="p">(</span><span class="n">mem</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">asm</code> accepts a pointer and returns a pointer (<code class="language-plaintext highlighter-rouge">"+r"</code>). The compiler
cannot “see” that it’s actually empty, and so returns the same pointer.
The arena will be backed by <code class="language-plaintext highlighter-rouge">mem</code>, but by laundering the address through
<code class="language-plaintext highlighter-rouge">asm</code>, I’ve disconnected the pointer from its origin. As far as the compiler
is concerned, this is some foreign, assembly-provided pointer, not a
pointer into <code class="language-plaintext highlighter-rouge">mem</code>. It can’t optimize away <code class="language-plaintext highlighter-rouge">mem</code> because it’s been given
to a mysterious assembly black box.</p>

<p>While inappropriate for a real project, I think it’s a neat trick.</p>

<h3 id="arena-friendly-container-data-structures">Arena-friendly container data structures</h3>

<p>In my initial example I used a linked list to store lines. This data
structure is great with arenas. It only takes a few lines of code to
implement a linked list on top of an arena, and no “destroy” code is
needed. Simple.</p>
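
<p>For a sense of scale, here is one way such a list might look. This is a sketch, not the code from the initial example: the <code class="language-plaintext highlighter-rouge">line</code> and <code class="language-plaintext highlighter-rouge">lines</code> types, the simplified four-argument <code class="language-plaintext highlighter-rouge">alloc</code>, and <code class="language-plaintext highlighter-rouge">demo</code> are all assumptions for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Simplified bump allocator (no flags), zero-initializing, aborting
// when the arena is exhausted.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
    ptrdiff_t available = a->end - a->beg - padding;
    if (available < 0 || count > available/size) {
        abort();
    }
    void *p = a->beg + padding;
    a->beg += padding + count*size;
    return memset(p, 0, count*size);
}

typedef struct line line;
struct line {
    line      *next;
    char      *text;  // not null-terminated
    ptrdiff_t  len;
};

typedef struct {
    line  *head;
    line **tail;  // where the next node pointer gets stored
} lines;

static void append(lines *ls, arena *a, const char *s)
{
    line *n = alloc(a, sizeof(line), _Alignof(line), 1);
    n->len  = strlen(s);
    n->text = alloc(a, 1, 1, n->len);
    memcpy(n->text, s, n->len);
    *ls->tail = n;
    ls->tail  = &n->next;
}

// Builds a three-node list and returns the node count.
static int demo(void)
{
    ptrdiff_t cap = 1 << 12;
    char *mem = malloc(cap);
    if (!mem) {
        return -1;
    }
    arena scratch = {mem, mem + cap};

    lines ls = {0};
    ls.tail = &ls.head;
    append(&ls, &scratch, "first");
    append(&ls, &scratch, "second");
    append(&ls, &scratch, "third");

    int count = 0;
    for (line *n = ls.head; n; n = n->next) {
        count++;
    }
    free(mem);  // releases every node and string at once
    return count;
}
```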

<p>What about <a href="/blog/2023/09/30/">arena-backed associative arrays</a>? Or <a href="/blog/2023/10/05/">arena-backed
dynamic arrays</a>? See these follow-up articles for details!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to link identical function names from different DLLs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/08/27/"/>
    <id>urn:uuid:265f121d-9418-4eb6-929f-a125264d0f2a</id>
    <updated>2023-08-27T01:46:31Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>For the typical DLL function call you declare the function prototype (via
header file), you inform the link editor (<code class="language-plaintext highlighter-rouge">ld</code>, <code class="language-plaintext highlighter-rouge">link</code>) that the DLL
exports a symbol with that name (import library), it matches the declared
name with this export, and it becomes an import in your program’s import
table. What happens when two different DLLs export the same symbol? The
link editor will pick the first found. But what if you want to use <em>both</em>
exports? If they have the same name, how could the program or the link editor
distinguish them? In this article I’ll demonstrate a technique to resolve
this by creating a program which links with and directly uses two
different C runtimes (CRTs) simultaneously.</p>

<p>In <a href="https://learn.microsoft.com/en-us/windows/win32/debug/pe-format">PE executable images</a>, an import isn’t just a symbol, but a tuple
of DLL name and symbol. For human display, a tuple is typically formatted
with an exclamation point delimiter, as in <code class="language-plaintext highlighter-rouge">msvcrt.dll!malloc</code>, though
sometimes without the <code class="language-plaintext highlighter-rouge">.dll</code> suffix. You’ve likely seen this in stack
traces. Because it’s a tuple and not just a symbol, it’s possible to refer
to, and import, the same symbol from different DLLs. Contrast that with
ELF, which has a list of shared objects, and a separate list of symbols,
with the dynamic linker pairing them up at load time. That permits cool
tricks like <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code>, but for the same reason loading is less
predictable.</p>

<p>Windows comes with several CRTs, and various libraries and applications
use one or another (<a href="/blog/2023/02/15/">or none</a>) depending on how they were built. As
C standard library implementations they export mostly the same symbols,
<code class="language-plaintext highlighter-rouge">malloc</code>, <code class="language-plaintext highlighter-rouge">printf</code>, etc. With imports as tuples, it’s not so unusual for
an application to load multiple CRTs at once. Typically coexistence is
transitive. That is, a module does not directly access both CRTs but
depends on modules that use different CRTs. One module calls, say,
<code class="language-plaintext highlighter-rouge">msvcrt.dll!malloc</code>, and another module calls <code class="language-plaintext highlighter-rouge">ucrtbase.dll!malloc</code>. With
DLL-qualified symbols, this is sound so long as modules don’t cross the
streams, e.g. an allocation in one module must not be freed in the other.
Libraries in this ecosystem must avoid exposing their CRT through their
interfaces, such as expecting the library’s caller to <code class="language-plaintext highlighter-rouge">free()</code> objects:
The caller might not have access to the right <code class="language-plaintext highlighter-rouge">free</code>!</p>

<p>Contrast again with the unix ecosystem generally, where a process can only
load one libc and everyone is expected to share. Libraries commonly expect
callers to <code class="language-plaintext highlighter-rouge">free()</code> their objects (e.g. <a href="https://tiswww.case.edu/php/chet/readline/readline.html#Basic-Behavior">libreadline</a>, <a href="https://man.archlinux.org/man/xcb-requests.3.en">xcb</a>),
blending their interface with libc.</p>

<p>Suppose you’re in such a situation where, due to unix-oriented libraries,
your application must use functions from two different CRTs at once. One
might have been compiled with Mingw-w64 and linked with MSVCRT, and the
other compiled with MSVC and linked with UCRT. We need to call <code class="language-plaintext highlighter-rouge">malloc</code>
and <code class="language-plaintext highlighter-rouge">free</code> in each, but they have the same name. What a pickle!</p>

<p>There’s an obvious, and probably most common, solution: <a href="https://learn.microsoft.com/en-us/windows/win32/dlls/run-time-dynamic-linking">run-time dynamic
linking</a>. Use load-time linking on one CRT, and LoadLibrary on the
other CRT with GetProcAddress to obtain function pointers. However, it’s
possible to do this entirely with load-time linking!</p>

<h3 id="a-malloc-by-any-other-name-would-allocate-as-well">A malloc by any other name would allocate as well</h3>

<p>Think about it a moment and you might wonder: If the names are the same,
how can I pick which I’m calling? The tuple representation won’t work
because <code class="language-plaintext highlighter-rouge">!</code> cannot appear in an identifier, which is, after all, why it
was chosen. The trick is that we’re going to <em>rename</em> one of them! To
demonstrate, I’ll use <a href="/blog/2020/09/25/">my Windows development kit</a>, <a href="https://github.com/skeeto/w64devkit">w64devkit</a>, a
Mingw-w64 distribution that links MSVCRT. I’m going to use UCRT as the
second CRT to access <code class="language-plaintext highlighter-rouge">ucrtbase.dll!malloc</code>.</p>

<p>I can choose whatever valid identifier I’d like, so I’m going to pick
<code class="language-plaintext highlighter-rouge">ucrt_malloc</code>. This will <a href="/blog/2021/05/31/">require a declaration</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">ucrt_malloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>If I stop here and try to use it, of course it won’t work:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ld: undefined reference to `__imp_ucrt_malloc'
</code></pre></div></div>

<p>The linker hasn’t yet been informed of the change in management. For that
we’ll need an import library. I’ll define one using a <a href="https://sourceware.org/binutils/docs/binutils/def-file-format.html">.def file</a>,
which I’ll name <code class="language-plaintext highlighter-rouge">ucrtbase.def</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LIBRARY ucrtbase.dll
EXPORTS
ucrt_malloc == malloc
</code></pre></div></div>

<p>The last line says that this library has the symbol <code class="language-plaintext highlighter-rouge">ucrt_malloc</code>, but
that it should be imported as <code class="language-plaintext highlighter-rouge">malloc</code>. This line is the lynchpin to the
whole scheme. Note: The double equals is important, as a single equals
sign means something different.  Next, use <code class="language-plaintext highlighter-rouge">dlltool</code> to build the import
library:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dlltool -d ucrtbase.def -l ucrtbase.lib
</code></pre></div></div>

<p>The equivalent MSVC tool is <a href="https://learn.microsoft.com/en-us/cpp/build/reference/overview-of-lib"><code class="language-plaintext highlighter-rouge">lib</code></a>, but as far as I know it cannot
quite do this sort of renaming. However, MSVC <code class="language-plaintext highlighter-rouge">link</code> will work just fine
with this <code class="language-plaintext highlighter-rouge">dlltool</code>-created import library. The name <code class="language-plaintext highlighter-rouge">ucrtbase.lib</code>, while
obvious, is irrelevant. It’s that <code class="language-plaintext highlighter-rouge">LIBRARY</code> line that ties it to the DLL.
My test source file looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">ucrt_malloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">msvcrt</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">)};</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">ucrt</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="n">ucrt_malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">ucrt_malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">ucrt_malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">)};</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It compiles successfully:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -g3 -o main.exe main.c ucrtbase.lib
</code></pre></div></div>

<p>I can see the two <code class="language-plaintext highlighter-rouge">malloc</code> imports with <code class="language-plaintext highlighter-rouge">objdump</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p main.exe
...
DLL Name: msvcrt.dll
...
844a	 1021  malloc
...
DLL Name: ucrtbase.dll
847e	    1  malloc
</code></pre></div></div>

<p>It loads and runs successfully, too:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gdb main.exe
Reading symbols from main.exe...
(gdb) break 9
Breakpoint 1 at 0x1400013cd: file main.c, line 9.
(gdb) run
Thread 1 hit Breakpoint 1, main () at main.c:9
9           return 0;
(gdb) p msvcrt
$1 = {0xd06a30, 0xd06a70, 0xd06ab0}
(gdb) p ucrt
$2 = {0x6e9490, 0x6eb7c0, 0x6eb800}
</code></pre></div></div>

<p>The pointer addresses confirm that these are two distinct allocators.
Perhaps you’re wondering what happens if I cross the streams?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">ucrt_malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The MSVCRT allocator justifiably panics over the bad pointer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -g3 -o chaos.exe chaos.c ucrtbase.lib
$ gdb -ex run chaos.exe
Starting program: chaos.exe
warning: HEAP[chaos.exe]:
warning: Invalid address specified to RtlFreeHeap
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffc42c369af in ntdll!RtlRegisterSecureMemoryCacheCallback ()
(gdb)
</code></pre></div></div>

<p>While you’re probably not supposed to meddle with <code class="language-plaintext highlighter-rouge">ucrtbase.dll</code> like
this, the general principle of export renames is reasonable. I don’t
expect I’ll ever need to do it, but I like that I have the option.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Everything you never wanted to know about Win32 environment blocks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/08/23/"/>
    <id>urn:uuid:3e73a0bb-fc27-4da2-9ae9-fab773a759d0</id>
    <updated>2023-08-23T21:51:10Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>In an effort to avoid <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/ProgrammingViaSuperstition">programming by superstition</a>, I did a deep dive
into the Win32 “environment block,” the data structure holding a process’s
environment variables, in order to better understand it. Along the way I
discovered implied and undocumented behaviors. (The <em>environment block</em>
must not be confused with the <a href="https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/pebteb/peb/index.htm">Process Environment Block</a> (PEB)
which is different.) Because I cannot possibly retain all the quirky
details in my head for long, I’m writing them down for future reference. I
ran my tests on different Windows versions as far back as Windows XP SP3
in order to fill in gaps where documentation is ambiguous, incomplete, or
wrong. Overall conclusion: Correct, direct manipulation of an environment
block is impossible <em>in the general case</em> due to under-specified and
incorrect documentation. This has important consequences mainly for
programming language runtimes.</p>

<p>Win32 has two interfaces for interacting with environment variables:</p>

<ol>
  <li><a href="https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-getenvironmentvariable">GetEnvironmentVariable</a> and <a href="https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setenvironmentvariable">SetEnvironmentVariable</a></li>
  <li><a href="https://learn.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getenvironmentstringsw">GetEnvironmentStrings</a> and <a href="https://learn.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-freeenvironmentstringsw">FreeEnvironmentStrings</a></li>
</ol>

<p>The first, which I’ll call get/set, is the easy interface, with Windows
doing all the searching and sorting on your behalf. It’s also the only
supported interface through which a process can manipulate its own
variables. It has no function for enumerating variables.</p>

<p>The second, which I’ll call get/free, allocates a <em>copy of</em> the
environment block. Calls to get/set do not modify existing copies.
Similarly, manipulating this block has no effect on the environment as
viewed through get/set. In other words, it’s <em>read only</em>. We can enumerate
our environment variables by walking the environment block. As I will
discuss below, enumeration is its only consistently useful purpose!</p>

<p>Technically it’s possible to access the actual environment block through
undocumented fields in the PEB. It’s the same content as returned by
get/free except that it’s not a copy. It cannot be accessed safely, so I’m
ignoring this route.</p>

<p>The environment block format is a null-terminated block of null-terminated
strings:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>keyA=a\0keyBB=bb\0keyCCC=ccc\0\0
</code></pre></div></div>

<p>Each string <del>begins with a character other than <code class="language-plaintext highlighter-rouge">=</code> and</del> contains at
least one <code class="language-plaintext highlighter-rouge">=</code>. In my tests this rule was strictly enforced by Windows, and
I could not construct an environment block that broke this rule. This list
is usually, but not always, sorted. It may contain repeated variables, but
they’re always assigned the same value, which is also strictly enforced by
Windows.</p>
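
<p>Walking the block takes only a few lines. In this sketch the function names are hypothetical, and the name/value split assumes a leading <code class="language-plaintext highlighter-rouge">=</code> belongs to the name, as with the hidden per-drive entries like <code class="language-plaintext highlighter-rouge">=C:</code>. On Windows the block would come from GetEnvironmentStringsW and be released with FreeEnvironmentStringsW, but the walk itself is plain C.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

// Count the entries in an environment block. The walk stops at the
// first zero-length string, which is the block terminator.
static ptrdiff_t env_count(const wchar_t *env)
{
    ptrdiff_t n = 0;
    for (; *env; env += wcslen(env) + 1) {
        n++;
    }
    return n;
}

// Length of the name portion of one "NAME=value" entry. The search for
// the delimiter skips the first character so "=C:=C:\tmp" splits into
// the name "=C:" and the value "C:\tmp". Returns 0 if no later '=' is
// found.
static ptrdiff_t env_namelen(const wchar_t *entry)
{
    const wchar_t *eq = *entry ? wcschr(entry + 1, L'=') : 0;
    return eq ? eq - entry : 0;
}
```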

<p><del>The get/free interface has no “set” function, and a process cannot set
its own environment block to a custom buffer.</del> (Update: Stefan Kanthak
points out <a href="https://learn.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-setenvironmentstringsw">SetEnvironmentStringsW</a>. I missed it because it was only
officially documented a few months before this article was written.) There
<em>is</em> one interface where a process gets to provide a raw environment
block: <a href="https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessw">CreateProcess</a>. That is, a parent can construct one for its
children.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">wchar_t</span> <span class="n">env</span><span class="p">[]</span> <span class="o">=</span> <span class="s">L"HOME=C:</span><span class="se">\\</span><span class="s">Users</span><span class="se">\\</span><span class="s">me</span><span class="se">\0</span><span class="s">PATH=C:</span><span class="se">\\</span><span class="s">bin;C:</span><span class="se">\\</span><span class="s">Windows</span><span class="se">\0</span><span class="s">"</span><span class="p">;</span>
    <span class="n">CreateProcessW</span><span class="p">(</span><span class="s">L"example.exe"</span><span class="p">,</span> <span class="p">...,</span> <span class="n">env</span><span class="p">,</span> <span class="p">...);</span>
</code></pre></div></div>

<p>Windows imposes some rules upon this environment block:</p>

<ul>
  <li>
    <p><del>If an element begins with <code class="language-plaintext highlighter-rouge">=</code> or does not contain <code class="language-plaintext highlighter-rouge">=</code>, CreateProcess
fails.</del></p>
  </li>
  <li>
    <p>Repeated variables are modified to match the first instance. If you’re
potentially overriding using a duplicate, put the override first.</p>
  </li>
  <li>
    <p>Some cases of bad formatting become memory access violations.</p>
  </li>
</ul>

<p>As usual for Win32, there are <a href="https://simonsapin.github.io/wtf-8/">no rules against ill-formed UTF-16</a>,
and I could <a href="/blog/2022/02/18/">always pass</a> such “UTF-16” through into the child
environment block. Keep that in mind even when using the get/set
interface.</p>

<p>The SetEnvironmentVariable documentation gives a maximum variable size:</p>

<blockquote>
  <p>The maximum size of a user-defined environment variable is 32,767
characters. There is no technical limitation on the size of the
environment block.</p>
</blockquote>

<p>At least on more recent versions of Windows, my experiments proved exactly
the opposite. There is no limit on user-defined environment variables,
but environment blocks are limited to 2GiB, for both 32-bit and 64-bit
processes. I could even create such huge environments in <a href="https://learn.microsoft.com/en-us/cpp/build/reference/largeaddressaware-handle-large-addresses">large address
aware</a> 32-bit processes, though the interfaces are prone to error due
to allocation problems.</p>

<p>There’s one special case where CreateProcess is illogical, and it’s
certainly a case of confusion within its implementation. <strong>An environment
block is not allowed to be empty.</strong> An empty environment is represented as
a block containing one empty (zero length) element. That is, two null
terminators in a row. It’s the one case where an environment block may
contain an element without a <code class="language-plaintext highlighter-rouge">=</code>. The <em>logical</em> empty environment block
would be just one null terminator, to terminate the block itself, because
it contains no variables. You can safely pretend that’s the case when
parsing an environment block, as this special case is superfluous.</p>

<p>However, CreateProcess partially enforces this silly, unnecessary special
case! If an environment block begins with a null terminator, the next
character <em>must be in a mapped memory region</em> because it will read this
character. If it’s not mapped, the result is a memory access violation.
Its actual value doesn’t matter, and CreateProcess will treat it as though
it was another null terminator. Surely someone at Microsoft would have
noticed by now that this behavior makes no sense, but I guess it’s kept
for backwards compatibility?</p>
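
<p>The parsing side of this can be sketched in portable C. This is only a
minimal sketch, not part of any Win32 interface: <code class="language-plaintext highlighter-rouge">env_count</code> is a
hypothetical helper, and <code class="language-plaintext highlighter-rouge">uint16_t</code> stands in for the 16-bit <code class="language-plaintext highlighter-rouge">wchar_t</code> of
Windows so that it builds anywhere. It treats the block as ending at the
first empty element, so the superfluous second terminator of an empty
block is never read.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Count the elements in a UTF-16 environment block. Each element is a
// null-terminated string, and one more null terminator ends the block.
// An empty block is treated as a lone terminator: whatever follows it
// is never examined.
static ptrdiff_t env_count(const uint16_t *block)
{
    assert(block);
    ptrdiff_t count = 0;
    while (*block) {
        while (*block) {
            block++;  // skip to the end of this element
        }
        block++;      // step over the element's terminator
        count++;
    }
    return count;
}
```

<p>Given <code class="language-plaintext highlighter-rouge">A=1</code> and <code class="language-plaintext highlighter-rouge">B=2</code> it counts two elements, and given a block that
starts with a null terminator it counts zero without touching the next
character.</p>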

<p>The CreateProcess documentation says that “the system uses a sorted
environment” but this made no difference in my tests. The word “must”
appears in this sentence, but it’s unclear if it applies to sorting, or
even outside the special case being discussed. GetEnvironmentVariable
works fine on an unsorted environment block. SetEnvironmentVariable
maintains sorting, but given an unsorted block it goes somewhere in the
middle, probably wherever a bisection happens to land. Perhaps look-ups in
sorted blocks are faster, but environment blocks are so small — <del>a
maximum of 32K characters</del> (Update: only true for ANSI) — that, in
practice, it really does not matter.</p>

<p>Suppose you’re meticulous and want to sort your environment block before
spawning a process. How do you go about it? There’s the rub: The official
documentation is incomplete! The <a href="https://learn.microsoft.com/en-us/windows/win32/procthread/changing-environment-variables">Changing Environment Variables</a>
page says:</p>

<blockquote>
  <p>All strings in the environment block must be sorted alphabetically by
name. The sort is case-insensitive, Unicode order, without regard to
locale.</p>
</blockquote>

<p>What do they mean by “case-insensitive” sort? Does “Unicode order” mean
<a href="https://www.unicode.org/Public/15.0.0/ucd/CaseFolding.txt">case folding</a>? A reasonable guess, but no, that’s not how get/set
works. Besides, how does “Unicode order” apply to ill-formed UTF-16?
Worse, get/set sorting is certainly not “Unicode order” even outside of
case-insensitivity! For example, <code class="language-plaintext highlighter-rouge">U+1F31E</code> (SUN WITH FACE) sorts ahead of
<code class="language-plaintext highlighter-rouge">U+FF01</code> (FULLWIDTH EXCLAMATION MARK) because the former encodes in UTF-16
as <code class="language-plaintext highlighter-rouge">U+D83C U+DF1E</code>. Maybe it’s case-insensitive only in ASCII? Nope, π
(<code class="language-plaintext highlighter-rouge">U+03C0</code>) and Π (<code class="language-plaintext highlighter-rouge">U+03A0</code>) are considered identical. Windows uses some
kind of case-insensitive, but not case-<em>folded</em>, undocumented early 1990s
UCS-2 sorting logic for environment variables.</p>

<p><strong>Update</strong>: John Doty <a href="https://lists.sr.ht/~skeeto/public-inbox/%3Cc2a4c4d7-95cc-48a4-8047-c79b55eba261%40app.fastmail.com%3E">suspects</a> the <a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-rtlcompareunicodestring">RtlCompareUnicodeString</a>
function for sorting. It <a href="https://github.com/skeeto/scratch/blob/master/misc/envsort.c">lines up perfectly with get/set</a> for
all possible inputs.</p>
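
<p>To make the surrogate example concrete, here is a small sketch that
encodes code points as UTF-16 and compares them by code unit. Both
function names are made up for illustration. It demonstrates only the
code unit ordering, and makes no attempt at the case-insensitive part of
what get/set does.</p>

```c
#include <assert.h>
#include <stdint.h>

// Encode a code point as UTF-16, returning the number of code units
// written (1 or 2). Assumes cp is no larger than U+10FFFF.
static int utf16_encode(uint16_t out[2], uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    // high surrogate
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  // low surrogate
    return 2;
}

// Compare two code points by their UTF-16 code units rather than by
// their code point values. Returns <0, 0, or >0 like strcmp.
static int utf16_order(uint32_t a, uint32_t b)
{
    uint16_t ua[2] = {0}, ub[2] = {0};
    int na = utf16_encode(ua, a);
    int nb = utf16_encode(ub, b);
    for (int i = 0; i < na && i < nb; i++) {
        if (ua[i] != ub[i]) {
            return ua[i] < ub[i] ? -1 : +1;
        }
    }
    return na - nb;
}
```

<p>Here <code class="language-plaintext highlighter-rouge">utf16_order(0x1F31E, 0xFF01)</code> is negative because the leading
surrogate <code class="language-plaintext highlighter-rouge">0xD83C</code> is less than <code class="language-plaintext highlighter-rouge">0xFF01</code>, even though the code points
compare the other way around.</p>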

<p>Without better guidance, the only reliable way to “correctly” sort an
environment block is to build it with get/set, then retrieve the result
with get/free. The algorithm looks like:</p>

<ol>
  <li>Get a copy of the environment with GetEnvironmentStrings.</li>
  <li>Walk the environment and call SetEnvironmentVariable on each name with
a null pointer as the value. This clears out the environment.</li>
  <li>Call SetEnvironmentVariable for each variable in the new environment.</li>
  <li>Get a sorted copy of the new environment with GetEnvironmentStrings.</li>
</ol>

<p>Unfortunately that’s all global state, so you can only construct one new
environment block at a time.</p>

<p>If you know all your variable names ahead of time, then none of this is a
problem. Determine what Windows thinks the order should be, then use that
in your program when constructing the environment block. It’s the <em>general
case</em> where this is a challenge, such as a language runtime designed to
operate on arbitrary environment variables with behavior congruent to the
rest of the system.</p>

<p>There are similar issues with looking up variables in an environment
block. How does case-insensitivity work? Sorting is “without regard to
locale” but what about when comparing variable names? The documentation
doesn’t say. When enumerating variables using get/free, you might read
what get/set considers to be duplicates, though at least values will
always agree with get/set, i.e. they’re aliases of one variable. Windows
maintains that invariant in my tests. The above algorithm would also
delete these duplicates.</p>

<p>For example, if someone passed you a “dirty” environment with duplicates,
or that was unsorted, this would clean it up in a way that allows get/free
to be traversed in order without duplicates.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">GetEnvironmentStringsW</span><span class="p">();</span>

    <span class="c1">// Clear out the environment</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="o">*</span><span class="n">var</span><span class="p">;)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">wcslen</span><span class="p">(</span><span class="n">var</span><span class="p">);</span>
        <span class="kt">size_t</span> <span class="n">split</span> <span class="o">=</span> <span class="n">wcscspn</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="s">L"="</span><span class="p">);</span>
        <span class="n">var</span><span class="p">[</span><span class="n">split</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">SetEnvironmentVariableW</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">var</span><span class="p">[</span><span class="n">split</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'='</span><span class="p">;</span>
        <span class="n">var</span> <span class="o">+=</span> <span class="n">len</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Restore the original variables</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">var</span> <span class="o">=</span> <span class="n">env</span><span class="p">;</span> <span class="o">*</span><span class="n">var</span><span class="p">;)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">wcslen</span><span class="p">(</span><span class="n">var</span><span class="p">);</span>
        <span class="kt">size_t</span> <span class="n">split</span> <span class="o">=</span> <span class="n">wcscspn</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="s">L"="</span><span class="p">);</span>
        <span class="n">var</span><span class="p">[</span><span class="n">split</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">SetEnvironmentVariableW</span><span class="p">(</span><span class="n">var</span><span class="p">,</span> <span class="n">var</span><span class="o">+</span><span class="n">split</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="n">var</span> <span class="o">+=</span> <span class="n">len</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">FreeEnvironmentStringsW</span><span class="p">(</span><span class="n">env</span><span class="p">);</span>
</code></pre></div></div>

<p>On the second pass, SetEnvironmentVariableW will gobble up all the
duplicates.</p>

<p>As a final note, the CreateProcess page had said this <a href="https://web.archive.org/web/20180110151515/http://msdn.microsoft.com/en-us/library/ms682425(VS.85).aspx">up until February
2023</a> about the environment block parameter:</p>

<blockquote>
  <p>If this parameter is <code class="language-plaintext highlighter-rouge">NULL</code> and the environment block of the parent
process contains Unicode characters, you must also ensure that
<code class="language-plaintext highlighter-rouge">dwCreationFlags</code> includes <code class="language-plaintext highlighter-rouge">CREATE_UNICODE_ENVIRONMENT</code>.</p>
</blockquote>

<p>That seems to indicate it’s virtually always wrong to call CreateProcess
without that flag — that is, Windows will trash the child’s environment
unless this flag is passed — which is a bonkers default. Fortunately this
appears to be wrong, which is probably why the documentation was finally
corrected (after several decades). Omitting this flag was fine under all
my tests, and I was unable to produce surprising behavior on any system.</p>

<p>In summary:</p>

<ul>
  <li>Prefer get/set for all operations except enumeration</li>
  <li>Environment blocks are not necessarily sorted</li>
  <li>Repeated variables are forced to the value of the first instance</li>
  <li>Variables may contain ill-formed UTF-16</li>
  <li>Empty environment blocks have a superfluous special case</li>
  <li><del>Entries cannot begin with <code class="language-plaintext highlighter-rouge">=</code></del></li>
  <li>Entries must contain at least one <code class="language-plaintext highlighter-rouge">=</code></li>
  <li>Sort order is ambiguous, so you cannot reliably do it yourself</li>
  <li>Case-insensitivity of names is ambiguous, so rely on get/set</li>
  <li><code class="language-plaintext highlighter-rouge">CREATE_UNICODE_ENVIRONMENT</code> necessary only for non-null environment</li>
</ul>

<p><strong>Update September 2024</strong>: Correction from Kasper Brandt <a href="https://lists.sr.ht/~skeeto/public-inbox/%3C098b0421-af0e-46fb-8921-2a4e76f5a361@app.fastmail.com%3E">regarding
variables beginning with <code class="language-plaintext highlighter-rouge">=</code></a>. I misunderstood how it was parsed and
came to the wrong conclusion.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>"Once" one-time concurrent initialization with an integer</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/07/31/"/>
    <id>urn:uuid:523b07ef-efc5-4d8a-a3e3-682f4c296161</id>
    <updated>2023-07-31T23:00:41Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>We’ve previously discussed <a href="/blog/2022/03/13/">integer barriers</a>, <a href="/blog/2022/05/14/">integer queues</a>, and
<a href="/blog/2022/10/05/">integer wait groups</a> as tiny concurrency utilities. Next let’s tackle
“once” initialization, i.e. <a href="https://man7.org/linux/man-pages/man3/pthread_once.3p.html"><code class="language-plaintext highlighter-rouge">pthread_once</code></a>, using an integer.
We’ll need only three basic atomic operations — store, load, and increment
— and futex wait/wake. It will be zero-initialized and the entire source
small enough to fit on an old-fashioned terminal display. The interface
will also get an overhaul, more to my own tastes.</p>

<p>If you’d like to skip ahead: <a href="https://github.com/skeeto/scratch/blob/master/misc/once.c"><strong><code class="language-plaintext highlighter-rouge">once.c</code></strong></a></p>

<p>What’s the purpose? Suppose a concurrent program requires initialization,
but has no definite moment to do so. Threads are already in motion, and
it’s unpredictable which will arrive first, and when. It might be because
this part of the program is loaded lazily, or initialization is expensive
and only done lazily as needed. A “once” object is a control allowing the
first arrival to initialize, and later arrivals to wait until
initialization is done.</p>

<p>The pthread version has this interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pthread_once_t</span> <span class="n">once</span> <span class="o">=</span> <span class="n">PTHREAD_ONCE_INIT</span><span class="p">;</span>
<span class="kt">int</span> <span class="nf">pthread_once</span><span class="p">(</span><span class="n">pthread_once_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">init</span><span class="p">)(</span><span class="kt">void</span><span class="p">));</span>
</code></pre></div></div>

<p>It’s deliberately quite limited, and <a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_once.html">the specification</a> refers to
it merely as “dynamic package initialization.” That is, it’s strictly for
initializing global package data, not individual objects, and a “once”
object must be a static variable, not dynamically allocated. Also note the
lack of context pointer for the callback. No pthread implementation I
examined was actually so restricted, but the specification is written for
the least common denominator, and the interface is clearly designed
against more general use.</p>

<p>An example of lazy static table initialization for <a href="https://github.com/skeeto/prng64-shootout/blob/master/blowfish.c">a cipher</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Blowfish subkey tables (constants)</span>
<span class="k">static</span> <span class="kt">uint32_t</span> <span class="n">blowfish_p</span><span class="p">[</span><span class="mi">20</span><span class="p">];</span>
<span class="k">static</span> <span class="kt">uint32_t</span> <span class="n">blowfish_s</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>
<span class="k">static</span> <span class="n">pthread_once_t</span> <span class="n">once</span> <span class="o">=</span> <span class="n">PTHREAD_ONCE_INIT</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... populate blowfish_p and blowfish_s with pi ...</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">blowfish_encrypt</span><span class="p">(</span><span class="k">struct</span> <span class="n">blowfish</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">pthread_once</span><span class="p">(</span><span class="o">&amp;</span><span class="n">once</span><span class="p">,</span> <span class="n">init</span><span class="p">);</span>
    <span class="c1">// ... lookups into blowfish_p and blowfish_s ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">pthread_once</code> allows <code class="language-plaintext highlighter-rouge">blowfish_encrypt</code> to be called concurrently (on
different context objects). The first call populates lookup tables and
others wait as needed. A good <code class="language-plaintext highlighter-rouge">pthread_once</code> will speculate initialization
has already completed and make that the fast path. The tables do not
require locks or atomics because <code class="language-plaintext highlighter-rouge">pthread_once</code> establishes a
synchronization edge: initialization <em>happens-before</em> the return from
<code class="language-plaintext highlighter-rouge">pthread_once</code>.</p>

<p>Go’s <code class="language-plaintext highlighter-rouge">sync.Once</code> has a similar interface:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">o</span> <span class="o">*</span><span class="n">Once</span><span class="p">)</span> <span class="n">Do</span><span class="p">(</span><span class="n">f</span> <span class="k">func</span><span class="p">())</span>
</code></pre></div></div>

<p>It’s more flexible and not restricted to global data, but retains the
callback interface.</p>

<h3 id="a-new-once-interface">A new “once” interface</h3>

<p>Callbacks are clunky, especially without closures, so in my re-imagining I
wanted to remove them from the interface. Instead I broke out entry and
exit. The in-between takes the place of the callback and it runs in its
original context.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">do_once</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">once_done</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>This is similar to breaking “push” and “pop” each into two steps in my
concurrent queue. <code class="language-plaintext highlighter-rouge">do_once</code> returns true if initialization is required,
otherwise it returns false <em>after</em> initialization has completed, i.e. it
blocks. The initializing thread signals that initialization is complete by
calling <code class="language-plaintext highlighter-rouge">once_done</code>. As mentioned, the “once” object would be
zero-initialized. Reworking the above example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Blowfish subkey tables (constants)</span>
<span class="k">static</span> <span class="kt">uint32_t</span> <span class="n">blowfish_p</span><span class="p">[</span><span class="mi">20</span><span class="p">];</span>
<span class="k">static</span> <span class="kt">uint32_t</span> <span class="n">blowfish_s</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">once</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="kt">void</span> <span class="nf">blowfish_encrypt</span><span class="p">(</span><span class="k">struct</span> <span class="n">blowfish</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">do_once</span><span class="p">(</span><span class="o">&amp;</span><span class="n">once</span><span class="p">))</span> <span class="p">{</span>
        <span class="c1">// ... populate blowfish_p and blowfish_s with pi ...</span>
        <span class="n">once_done</span><span class="p">(</span><span class="o">&amp;</span><span class="n">once</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="c1">// ... lookups into blowfish_p and blowfish_s ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It gets more interesting when taken beyond global initialization. Here
each object is lazily initialized by the first thread to use it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">once</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="p">}</span> <span class="n">Thing</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">expensive_init</span><span class="p">(</span><span class="n">Thing</span> <span class="o">*</span><span class="p">,</span> <span class="kt">ptrdiff_t</span><span class="p">);</span>

<span class="k">static</span> <span class="kt">double</span> <span class="nf">compute</span><span class="p">(</span><span class="n">Thing</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">index</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">do_once</span><span class="p">(</span><span class="o">&amp;</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">once</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">expensive_init</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">index</span><span class="p">);</span>
        <span class="n">once_done</span><span class="p">(</span><span class="o">&amp;</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">once</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">Thing</span> <span class="o">*</span><span class="n">things</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="mi">1000000</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">Thing</span><span class="p">));</span>
    <span class="cp">#pragma omp parallel for
</span>    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">iterations</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">ptrdiff_t</span> <span class="n">which</span> <span class="o">=</span> <span class="n">random_access</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
        <span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="n">compute</span><span class="p">(</span><span class="o">&amp;</span><span class="n">things</span><span class="p">[</span><span class="n">which</span><span class="p">],</span> <span class="n">which</span><span class="p">);</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="implementation-details">Implementation details</h3>

<p>A “once” object must express at least these three states:</p>

<ol>
  <li>Uninitialized</li>
  <li>Undergoing initialization</li>
  <li>Initialized</li>
</ol>

<p>To support zero-initialization, (1) must map to zero. A thread observing
(1) must successfully transition to (2) before attempting to initialize. A
thread observing (2) must wait for a transition to (3). Observing (3) is
the fast path, and the implementation should optimize for it.</p>

<p>The trickiest part is the state transition from (1) to (2). If multiple
threads are attempting the transition concurrently, only one should “win”.
The obvious choice is a <a href="/blog/2014/09/02/">compare-and-swap</a> atomic, which will fail if
another thread has already made the transition. However, with a more
careful selection of state representation, we can do this with just an
atomic increment!</p>

<p>The secret sauce: (2) will be <strong>any positive value</strong> and (3) will be <strong>any
negative value</strong>. The “winner” is the thread that increments from zero to
one. Other threads that also observed zero will increment to a different
value, after which they behave as though they did not observe (1) in the
first place.</p>

<p>I chose shorthand names for the three atomic and two futex operations.
Each can be defined with a single line of code — the atomics with
<a href="https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html">compiler intrinsics</a> and the futex with system calls, as they
interact with the system scheduler. (See the “four elements” of <a href="/blog/2022/10/05/">the wait
group article</a>.) Technically it will still work correctly if the futex
calls are no-ops, though it would waste time spinning on the slow path. In
a real program you’d probably use less pithy names.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span>  <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">store</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">int</span>  <span class="nf">incr</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>
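
<p>As a rough sketch of those one-liners, assuming GCC or Clang on Linux:
the memory orderings below (acquire loads, release stores) are an
assumption rather than a transcription of <code class="language-plaintext highlighter-rouge">once.c</code>, and on other
systems the futex calls fall back to no-ops, which spins on the slow path
as described above.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#ifdef __linux__
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#endif

static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

static void store(int *p, int v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

// Returns the value after the increment.
static int incr(int *p)
{
    return __atomic_add_fetch(p, 1, __ATOMIC_ACQ_REL);
}

// Sleep only if *p still equals expect, otherwise return immediately.
static void wait(int *p, int expect)
{
#ifdef __linux__
    syscall(SYS_futex, p, FUTEX_WAIT, expect, NULL, NULL, 0);
#else
    (void)p; (void)expect;  // no-op fallback: callers spin
#endif
}

// Wake every thread waiting on p.
static void wake(int *p)
{
#ifdef __linux__
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
#else
    (void)p;  // no-op fallback
#endif
}
```

<p>Note that <code class="language-plaintext highlighter-rouge">wait</code> may return spuriously or because the value already
changed, so callers must re-check the value in a loop.</p>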

<p>From here it’s useful to work backwards, starting with <code class="language-plaintext highlighter-rouge">once_done</code>,
because there’s an important detail, another secret sauce ingredient:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">once_done</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">once</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">store</span><span class="p">(</span><span class="n">once</span><span class="p">,</span> <span class="n">INT_MIN</span><span class="p">);</span>
    <span class="n">wake</span><span class="p">(</span><span class="n">once</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Recall that the “initialized” state (3) is negative. We don’t just pick
any arbitrary negative, especially not the obvious -1, but <em>the most
negative value</em>. Keep that in mind. Once set, wake up any waiters. Since
this is the slow path, we don’t care to avoid the system call if there are
no waiters. Now <code class="language-plaintext highlighter-rouge">do_once</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">do_once</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">once</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">load</span><span class="p">(</span><span class="n">once</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">r</span> <span class="o">=</span> <span class="n">incr</span><span class="p">(</span><span class="n">once</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">r</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">wait</span><span class="p">(</span><span class="n">once</span><span class="p">,</span> <span class="n">r</span><span class="p">);</span>
        <span class="n">r</span> <span class="o">=</span> <span class="n">load</span><span class="p">(</span><span class="n">once</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First, check for the fast path. If we’re already in state (3), return
immediately. If <code class="language-plaintext highlighter-rouge">do_once</code> will be placed in a separate translation unit
from the caller, we might extract this check such that it can be inlined
at the call site. Once initialization has settled, nobody will be mutating
<code class="language-plaintext highlighter-rouge">*once</code>, so this will be a fast, uncontended atomic load, though mind your
cache lines for false sharing.</p>
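
<p>One way to arrange that split, as a sketch only: the name
<code class="language-plaintext highlighter-rouge">do_once_slow</code> is hypothetical, the atomics are spelled out with GCC
builtins, and the futex wait is replaced by a plain reload to keep the
example self-contained. Everything is <code class="language-plaintext highlighter-rouge">static</code> here so it fits in one
file, though in practice only the small wrapper would live in a header.</p>

```c
#include <assert.h>
#include <limits.h>

// Slow path, imagined as living in its own translation unit. Waiters
// spin on a reload here instead of sleeping on a futex.
static _Bool do_once_slow(int *once)
{
    int r = __atomic_load_n(once, __ATOMIC_ACQUIRE);
    if (r == 0) {
        r = __atomic_add_fetch(once, 1, __ATOMIC_ACQ_REL);
        if (r == 1) {
            return 1;  // won the race: caller must initialize
        }
    }
    while (r > 0) {
        r = __atomic_load_n(once, __ATOMIC_ACQUIRE);
    }
    return 0;
}

static void once_done(int *once)
{
    // Only the winner calls this, so the state must be positive.
    assert(__atomic_load_n(once, __ATOMIC_RELAXED) > 0);
    __atomic_store_n(once, INT_MIN, __ATOMIC_RELEASE);
}

// Fast path, small enough to inline at every call site: one atomic
// load and a sign test once initialization has settled.
static inline _Bool do_once(int *once)
{
    if (__atomic_load_n(once, __ATOMIC_ACQUIRE) < 0) {
        return 0;
    }
    return do_once_slow(once);
}
```

<p>The slow path repeats the load, which costs little since it only runs
before initialization has settled.</p>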

<p>If we’re in state (1), try to transition to state (2). If we incremented
to 1, we won so tell the caller to initialize. Otherwise continue as
though we never saw state (1). There’s an important subtlety easy to miss:
Initialization may have already completed before the increment. That is,
<code class="language-plaintext highlighter-rouge">*once</code> may have been negative for the increment! Fortunately since we
chose <code class="language-plaintext highlighter-rouge">INT_MIN</code> in <code class="language-plaintext highlighter-rouge">once_done</code>, it will <em>stay negative</em>. (Assuming you
have fewer than 2 billion threads contending <code class="language-plaintext highlighter-rouge">*once</code>. Ha!) So it’s vital to
check <code class="language-plaintext highlighter-rouge">r</code> again for negative after the increment, hence <code class="language-plaintext highlighter-rouge">while</code> instead of
<code class="language-plaintext highlighter-rouge">do while</code>.</p>
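
<p>That interleaving can be replayed deterministically on a plain <code class="language-plaintext highlighter-rouge">int</code>,
no threads required. The helper <code class="language-plaintext highlighter-rouge">late_increment</code> is hypothetical and
merely walks through the steps in order: a loser observes zero, the
winner finishes and stores its “done” value, and only then does the
loser’s increment land.</p>

```c
#include <assert.h>
#include <limits.h>

// Replay one interleaving on a plain int and return the counter after
// the loser's late increment. done_value is what once_done would store.
static int late_increment(int done_value)
{
    assert(done_value < 0);
    int once = 0;
    int seen = once;    // loser: load observes state (1)
    once += 1;          // winner: increments 0 -> 1, begins initializing
    once = done_value;  // winner: once_done
    if (seen == 0) {
        once += 1;      // loser: the late increment
    }
    return once;
}
```

<p>With <code class="language-plaintext highlighter-rouge">INT_MIN</code> the result is <code class="language-plaintext highlighter-rouge">INT_MIN + 1</code>, still negative. With <code class="language-plaintext highlighter-rouge">-1</code>
the result is zero, which puts the object back in state (1), so the next
arrival would initialize a second time.</p>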

<p>Losers continuing to increment <code class="language-plaintext highlighter-rouge">*once</code> may interfere with the futex wait,
but, again, this is the slow path so that’s fine. Eventually we will wake
up and observe (3), then give control back to the caller.</p>

<p>That’s all there is to it. If you haven’t already, check out the source
including tests for Windows and Linux: <a href="https://github.com/skeeto/scratch/blob/master/misc/once.c"><strong><code class="language-plaintext highlighter-rouge">once.c</code></strong></a>. Suggested
experiments to try, particularly under a debugger:</p>

<ul>
  <li>Change <code class="language-plaintext highlighter-rouge">INT_MIN</code> to <code class="language-plaintext highlighter-rouge">-1</code>.</li>
  <li>Change <code class="language-plaintext highlighter-rouge">while (r &gt; 0) { ... }</code> to <code class="language-plaintext highlighter-rouge">do { ... } while (r &gt; 0);</code>.</li>
  <li>Comment out the futex system calls. (Note: will be very slow without
also reducing <code class="language-plaintext highlighter-rouge">NTHREADS</code>.)</li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Solving "Two Sum" in C with a tiny hash table</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/06/26/"/>
    <id>urn:uuid:5d15318f-6915-4f72-8690-74a84d43d2f7</id>
    <updated>2023-06-26T19:38:18Z</updated>
    <category term="c"/><category term="go"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I came across a question: How does one efficiently solve <a href="https://leetcode.com/problems/two-sum/">Two Sum</a> in C?
There’s a naive quadratic time solution, but also an amortized linear time
solution using a hash table. Without a built-in or standard library hash
table, the latter sounds onerous. However, a <a href="/blog/2022/08/08/">mask-step-index table</a>,
a hash table construction suitable for many problems, requires only a few
lines of code. This approach is useful even when a standard hash table is
available, because by <a href="https://vimeo.com/644068002">exploiting the known problem constraints</a>, it
beats typical generic hash table performance by an order of magnitude
(<a href="https://gist.github.com/skeeto/7119cf683662deae717c0d4e79ebf605">demo</a>).</p>

<p>The Two Sum exercise, restated:</p>

<blockquote>
  <p>Given an integer array and target, return the distinct indices of two
elements that sum to the target.</p>
</blockquote>

<p>In particular, the solution doesn’t find elements, but their indices. The
exercise also constrains input ranges — important but easy to overlook:</p>

<ul>
  <li>2 &lt;= <code class="language-plaintext highlighter-rouge">count</code> &lt;= 10<sup>4</sup></li>
  <li>-10<sup>9</sup> &lt;= <code class="language-plaintext highlighter-rouge">nums[i]</code> &lt;= 10<sup>9</sup></li>
  <li>-10<sup>9</sup> &lt;= <code class="language-plaintext highlighter-rouge">target</code> &lt;= 10<sup>9</sup></li>
</ul>

<p>Notably, indices fit in a 16-bit integer with lots of room to spare. In
fact, they fit in a 14-bit address space (16,384) with still plenty of
headroom. Elements fit in a signed 32-bit integer, and we can add and
subtract elements without overflow, if just barely. The last constraint
isn’t redundant, but it’s not readily exploitable either.</p>
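
<p>To see how close “just barely” is, here is a minimal sketch that checks
the extreme complements against the <code class="language-plaintext highlighter-rouge">int32_t</code> limits. The macro and function
names are my own for illustration. Only the bounds come from the exercise.</p>

```c
#include <stdint.h>

// Element bounds taken from the constraints listed above.
#define ELEM_MIN (-1000000000)
#define ELEM_MAX (+1000000000)

// Largest and smallest values that target - nums[i] can take, computed
// in 64 bits so the check itself cannot overflow.
static int64_t complement_max(void) { return (int64_t)ELEM_MAX - ELEM_MIN; }
static int64_t complement_min(void) { return (int64_t)ELEM_MIN - ELEM_MAX; }

// Nonzero if every possible complement is representable as int32_t.
static int complement_fits_int32(void)
{
    return complement_max() <= INT32_MAX && complement_min() >= INT32_MIN;
}
```

<p>The widest complement is 2,000,000,000, which leaves about 147 million of
slack below <code class="language-plaintext highlighter-rouge">INT32_MAX</code>.</p>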

<p>The naive solution is to linearly search the array for the complement.
With nested loops, it’s obviously quadratic time. At 10k elements, we
expect an abysmal 25M comparisons on average.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int16_t</span> <span class="n">count</span> <span class="o">=</span> <span class="p">...;</span>
<span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span> <span class="o">=</span> <span class="p">...;</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">target</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// found</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">nums</code> array is “keyed” by index. It would be better to also have the
inverse mapping: key on elements to obtain the <code class="language-plaintext highlighter-rouge">nums</code> index. Then for each
element we could compute the complement and find its index, if any, using
this second mapping.</p>

<p>The input range is finite, so an inverse map is simple. Allocate an array,
one element per integer in range, and store the index there. However, the
input range is 2 billion, and even with 16-bit indices that’s a 4GB array.
Feasible on 64-bit hosts, but wasteful. The exercise is certainly designed
to make it so. This array would be very sparse, with at most 10,000 of its
2 billion elements populated, about 0.0005%. That’s a hint: Associative arrays are
far more appropriate for representing such sparse mappings. That is, a
hash table.</p>
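
<p>The numbers in that paragraph are easy to reproduce. This is a small
sketch with hypothetical function names, assuming one <code class="language-plaintext highlighter-rouge">int16_t</code> slot per
possible element value:</p>

```c
#include <stdint.h>

// Number of distinct element values: -10^9 through 10^9 inclusive.
static int64_t value_range(void) { return 2000000001LL; }

// Bytes needed for a direct-indexed inverse map, one int16_t per value.
static int64_t dense_bytes(void)
{
    return value_range() * (int64_t)sizeof(int16_t);
}

// Upper bound on the populated fraction, in parts per million, given at
// most 10,000 input elements.
static double max_fill_ppm(void)
{
    return 10000 * 1e6 / (double)value_range();
}
```

<p>That works out to roughly 4GB of storage with no more than about 5 slots
per million ever in use.</p>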

<p>Using Go’s built-in hash table:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TwoSumWithMap</span><span class="p">(</span><span class="n">nums</span> <span class="p">[]</span><span class="kt">int32</span><span class="p">,</span> <span class="n">target</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">seen</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="kt">int32</span><span class="p">]</span><span class="kt">int16</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">num</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nums</span> <span class="p">{</span>
        <span class="n">complement</span> <span class="o">:=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">num</span>
        <span class="k">if</span> <span class="n">j</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">seen</span><span class="p">[</span><span class="n">complement</span><span class="p">];</span> <span class="n">ok</span> <span class="p">{</span>
            <span class="k">return</span> <span class="kt">int</span><span class="p">(</span><span class="n">j</span><span class="p">),</span> <span class="n">i</span><span class="p">,</span> <span class="no">true</span>
        <span class="p">}</span>
        <span class="n">seen</span><span class="p">[</span><span class="n">num</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int16</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In essence, the hash table folds the sparse 2 billion element array onto a
smaller array, with collision resolution when elements inevitably land in
the same slot. For this exercise, that small array could be as small as
10,000 elements because that’s the most we’d ever need to track. For
folding the large key space onto the smaller, we could use modulo. For
collision resolution, we could keep walking the table.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">10000</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>

<span class="c1">// Find or insert nums[index].</span>
<span class="kt">int16_t</span> <span class="nf">lookup</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">index</span><span class="p">)</span>
<span class="p">{</span>
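    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">nums</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">%</span> <span class="mi">10000</span><span class="p">;</span>  <span class="c1">// cast: % on a negative element would yield a negative index</span>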
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// empty slot</span>
            <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// insert biased index</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">nums</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">j</span><span class="p">;</span>  <span class="c1">// match found</span>
        <span class="p">}</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10000</span><span class="p">;</span>  <span class="c1">// keep looking</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Take note of a few details:</p>

<ol>
  <li>
    <p>An empty slot is zero, and an empty table is a zero-initialized array.
Since zero is a valid value, and all values are non-negative, it biases
values by 1 in the table.</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">nums</code> array is part of the table structure, necessary for lookups.
<strong>The two mappings — element-by-index and index-by-element — share
structure.</strong></p>
  </li>
  <li>
    <p>It uses <em>open addressing</em> with <em>linear probing</em>, and so walks the table
until it either finds the element or hits an empty slot.</p>
  </li>
  <li>
    <p>The “hash” function is modulo. If inputs are not random, they’ll tend
to bunch up in the table. Combined with linear probing, that makes for
lots of collisions. For the worst case, imagine sequentially ordered inputs.</p>
  </li>
  <li>
    <p>Sometimes the table will almost completely fill, and lookups will be no
better than the linear scans of the naive solution.</p>
  </li>
  <li>
    <p>Most subtle of all: This hash table is not enough for the exercise. The
keyed-on element may not even be in <code class="language-plaintext highlighter-rouge">nums</code>, and when lookup fails, that
element is not inserted in the table. Instead, a different element is
inserted. The conventional solution has at least two hash table
lookups. <strong>In the Go code, it’s <code class="language-plaintext highlighter-rouge">seen[complement]</code> for lookups and
<code class="language-plaintext highlighter-rouge">seen[num]</code> for inserts.</strong></p>
  </li>
</ol>

<p>To solve (4) we’ll use a hash function to more uniformly distribute
elements in the table. We’ll also probe the table in a random-ish order
that depends on the key. In practice there will be little bunching even
for non-random inputs.</p>

<p>To solve (5) we’ll use a larger table: 2<sup>14</sup> or 16,384 elements.
This has breathing room, and with a power of two we can use a fast mask
instead of a slow division (though in practice, compilers usually
implement division by a constant denominator with modular multiplication).</p>
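
<p>As a quick sanity check on the mask trick, the sketch below compares
masking against the remainder operator for a 2<sup>14</sup>-slot table. The
names are mine, and the multiplier is borrowed only as a cheap way to
generate varied test values.</p>

```c
#include <stdint.h>

#define TABLE_LEN (1u << 14)  // 16,384 slots, a power of two

// Fold a hash onto the table with a mask.
static uint32_t fold_mask(uint32_t hash) { return hash & (TABLE_LEN - 1); }

// The same fold using the remainder operator.
static uint32_t fold_mod(uint32_t hash) { return hash % TABLE_LEN; }

// Nonzero if both folds agree across a spread of sample inputs.
static int folds_agree(void)
{
    uint32_t h = 1;
    for (int n = 0; n < 100000; n++) {
        if (fold_mask(h) != fold_mod(h)) {
            return 0;
        }
        h = h*489183053u + 1;  // next test value, unsigned wraparound
    }
    return 1;
}
```

<p>The equivalence only holds for unsigned operands and a power-of-two table
length, which is why the table grows to 16,384 rather than staying at
10,000.</p>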

<p>To solve (6) we’ll key complements together under the same key. It looks
for the complement, but on failure it inserts the current element in the
empty slot. In other words, <strong>this solution will only need a single hash
table lookup per element!</strong></p>

<p>Laying down some groundwork:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int16_t</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">ok</span><span class="p">;</span>
<span class="p">}</span> <span class="n">TwoSum</span><span class="p">;</span>

<span class="n">TwoSum</span> <span class="nf">twosum</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TwoSum</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">seen</code> array is a 32KiB hash table large enough for all inputs, small
enough that it can be a local variable. In the loop:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="kt">int32_t</span> <span class="n">complement</span> <span class="o">=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">key</span> <span class="o">=</span> <span class="n">complement</span><span class="o">&gt;</span><span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">?</span> <span class="n">complement</span> <span class="o">:</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">key</span> <span class="o">*</span> <span class="mi">489183053u</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">mask</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">13</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
</code></pre></div></div>

<p>Compute the complement, then apply a “max” operation to derive a key. Any
commutative operation works, though obviously addition would be a poor
choice. XOR is similar enough to addition to cause many collisions. Multiplication
works well, and is probably better if the ternary produces a branch.</p>
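
<p>A small sketch makes the key derivation concrete. Both members of a
matching pair must land on the same key, and addition shows why a poor
operation defeats the purpose. The function names are hypothetical.</p>

```c
#include <stdint.h>

// "max" key: the larger of the element and its complement.
static int32_t key_max(int32_t num, int32_t target)
{
    int32_t complement = target - num;
    return complement > num ? complement : num;
}

// Addition as the key operation: the result is always target, so every
// element in the input shares one key and collides.
static int32_t key_add(int32_t num, int32_t target)
{
    int32_t complement = target - num;
    return num + complement;
}
```

<p>For a target of 9, the elements 2 and 7 both produce the key 7 under
“max”, while an unrelated element such as 3 produces 6. Under addition all
three produce 9.</p>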

<p>The hash function is multiplication with <a href="/blog/2019/11/19/">a randomly-chosen prime</a>.
As we’ll see in a moment, <code class="language-plaintext highlighter-rouge">step</code> will also add-shift the hash before use.
The initial index will be the bottom 14 bits of this hash. For <code class="language-plaintext highlighter-rouge">step</code>,
recall from the MSI article that it must be odd so that every slot is
eventually probed. I shift out 13 bits and then override the 14th bit, so
<code class="language-plaintext highlighter-rouge">step</code> effectively skips over the 14 bits used for the initial table
index.</p>
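
<p>The odd-step requirement can be checked empirically. The following sketch
counts how many distinct slots a probe sequence touches in one full lap of a
2<sup>14</sup>-slot table. The helper name and the tracking array are
assumptions for illustration.</p>

```c
#include <stdint.h>
#include <string.h>

#define EXP 14
#define LEN (1u << EXP)

// Count the distinct slots visited in LEN probes, stepping the same way
// as the lookup loop: add step, then mask.
static unsigned slots_visited(uint32_t hash, unsigned step)
{
    static unsigned char hit[LEN];
    memset(hit, 0, sizeof(hit));
    unsigned mask = LEN - 1;
    unsigned count = 0;
    unsigned i = hash;
    for (unsigned n = 0; n < LEN; n++) {
        i = (i + step) & mask;
        count += !hit[i];  // count a slot only on its first visit
        hit[i] = 1;
    }
    return count;
}
```

<p>Any odd step covers all 16,384 slots, while an even step such as 2 only
ever reaches half of them.</p>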

<p>I used <code class="language-plaintext highlighter-rouge">unsigned</code> because I don’t really care about the width of the hash
table index, but more importantly, I want defined overflow from all the
bit twiddling, even in the face of implicit promotion. As a bonus, it can
help in reasoning about indirection: <code class="language-plaintext highlighter-rouge">seen</code> indices are <code class="language-plaintext highlighter-rouge">unsigned</code>, <code class="language-plaintext highlighter-rouge">nums</code>
indices are <code class="language-plaintext highlighter-rouge">int16_t</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// bias and insert</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">j</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
</code></pre></div></div>

<p>The step is added before using the index the first time, helping to
scatter the start point and reduce collisions. If it’s an empty slot,
insert the <em>current</em> element, not the complement — which wouldn’t be
possible anyway. Unlike conventional solutions, this doesn’t require
another hash and lookup. If it finds the complement, problem solved,
otherwise keep going.</p>

<p>Putting it all together, it’s only slightly longer than solutions using a
generic hash table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TwoSum</span> <span class="nf">twosum</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TwoSum</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">complement</span> <span class="o">=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">key</span> <span class="o">=</span> <span class="n">complement</span><span class="o">&gt;</span><span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">?</span> <span class="n">complement</span> <span class="o">:</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">key</span> <span class="o">*</span> <span class="mi">489183053u</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">mask</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">13</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// bias and insert</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">j</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
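
<p>To try it out, here is a self-contained sketch that repeats the function
so it compiles on its own and adds a checker. The <code class="language-plaintext highlighter-rouge">valid</code> helper is
hypothetical, my own addition for testing, and assumes inputs within the
exercise constraints.</p>

```c
#include <stdint.h>

typedef struct {
    int16_t i, j;
    _Bool ok;
} TwoSum;

// Same function as above, repeated so this example stands alone.
static TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
    TwoSum r = {0};
    int16_t seen[1<<14] = {0};
    for (int16_t n = 0; n < count; n++) {
        int32_t complement = target - nums[n];
        int32_t key = complement>nums[n] ? complement : nums[n];
        uint32_t hash = key * 489183053u;
        unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
        unsigned step = hash>>13 | 1;
        for (unsigned i = hash;;) {
            i = (i + step) & mask;
            int16_t j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = n + 1;  // bias and insert
                break;
            } else if (nums[j] == complement) {
                r.i = j;
                r.j = n;
                r.ok = 1;
                return r;
            }
        }
    }
    return r;
}

// Hypothetical checker: nonzero if r names two distinct, in-range
// indices whose elements sum to target.
static int valid(TwoSum r, int32_t *nums, int16_t count, int32_t target)
{
    return r.ok && r.i != r.j
        && r.i >= 0 && r.i < count
        && r.j >= 0 && r.j < count
        && nums[r.i] + nums[r.j] == target;
}
```

<p>With the input <code class="language-plaintext highlighter-rouge">{2, 7, 11, 15}</code> and a target of 9, the first element is
inserted under key 7 and the second finds it under the same key, so the
result is indices 0 and 1.</p>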

<p>Applying this technique to Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TwoSumWithBespoke</span><span class="p">(</span><span class="n">nums</span> <span class="p">[]</span><span class="kt">int32</span><span class="p">,</span> <span class="n">target</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">var</span> <span class="n">seen</span> <span class="p">[</span><span class="m">1</span> <span class="o">&lt;&lt;</span> <span class="m">14</span><span class="p">]</span><span class="kt">int16</span>
    <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">num</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nums</span> <span class="p">{</span>
        <span class="n">complement</span> <span class="o">:=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">num</span>
        <span class="n">hash</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="n">num</span> <span class="o">*</span> <span class="n">complement</span> <span class="o">*</span> <span class="m">489183053</span><span class="p">)</span>
        <span class="n">mask</span> <span class="o">:=</span> <span class="nb">len</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span>
        <span class="n">step</span> <span class="o">:=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="m">13</span> <span class="o">|</span> <span class="m">1</span>
        <span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="n">hash</span><span class="p">;</span> <span class="p">;</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span>
            <span class="n">j</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="c">// unbias</span>
            <span class="k">if</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="m">0</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int16</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">+</span> <span class="m">1</span> <span class="c">// bias</span>
                <span class="k">break</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span> <span class="p">{</span>
                <span class="k">return</span> <span class="n">j</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="no">true</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With Go 1.20 this is an order of magnitude faster than <code class="language-plaintext highlighter-rouge">map[int32]int16</code>,
which isn’t surprising. I used multiplication as the key operator because,
in my first take, Go produced a branch for the “max” operation — at a 25%
performance penalty on random inputs.</p>

<p>A full-featured, generic hash table may be overkill for your problem, and
a bit of hashed indexing with collision resolution over a small array
might be sufficient. The problem constraints might open up such shortcuts.</p>
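
<p>For readers who think in C, here is a minimal sketch of that idea applied to two-sum. It is modelled on the visible Go code rather than taken from it: the name <code class="language-plaintext highlighter-rouge">two_sum</code>, the table size, and the multiplier are my own choices, and it assumes at most 16,384 inputs whose magnitudes (and the target’s) are no greater than 10<sup>9</sup>, so that the subtraction cannot overflow.</p>

```c
#include <stdint.h>
#include <string.h>

enum { EXP = 15 };  // 1<<15 slots; assumes count <= 1<<14 so the table stays half empty

// Finds indices a < b with nums[a] + nums[b] == target. Returns 1 if found.
// Assumes |nums[i]| and |target| are at most 1e9 (no overflow in the subtraction).
static int two_sum(const int32_t *nums, int count, int32_t target, int *a, int *b)
{
    static int16_t seen[1 << EXP];  // index biased by 1; zero means empty slot
    memset(seen, 0, sizeof(seen));
    uint32_t mask = ((uint32_t)1 << EXP) - 1;

    for (int n = 0; n < count; n++) {
        int32_t complement = target - nums[n];
        // Symmetric key: a number and its complement probe the same slots.
        uint32_t key  = (uint32_t)nums[n] * (uint32_t)complement;
        uint32_t hash = key * 0x9e3779b1u;  // arbitrary odd multiplier
        for (uint32_t i = hash >> (32 - EXP);; i = (i + 1) & mask) {
            int j = seen[i] - 1;  // unbias
            if (j < 0) {
                seen[i] = (int16_t)(n + 1);  // bias
                break;
            } else if (nums[j] == complement) {
                *a = j;
                *b = n;
                return 1;
            }
        }
    }
    return 0;
}
```

<p>Because the key is symmetric, a single probe sequence serves as both lookup and insert. The table is <code class="language-plaintext highlighter-rouge">static</code>, so this sketch is not reentrant.</p>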

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Hand-written Windows API prototypes: fast, flexible, and tedious</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/05/31/"/>
    <id>urn:uuid:35b44114-7ad2-422b-9eaf-dc37e7eaaf97</id>
    <updated>2023-05-31T01:38:31Z</updated>
    <category term="win32"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>I love fast builds, and for years I’ve been bothered by the build penalty
for translation units including <code class="language-plaintext highlighter-rouge">windows.h</code>. This header has an enormous
number of definitions and declarations and so, for C programs, it tends to
dominate the build time of those translation units. Most programs,
especially systems software, only need a tiny portion of it. For example,
when compiling <a href="/blog/2023/01/18/">u-config</a> with GCC, two thirds of the debug build was
spent processing <code class="language-plaintext highlighter-rouge">windows.h</code> just for <a href="https://github.com/skeeto/u-config/blob/e6ebb9b/miniwin32.h">4 types, 16 definitions, and 16
prototypes</a>.</p>

<p>To give a sense of the numbers, here’s <code class="language-plaintext highlighter-rouge">empty.c</code>, which does nothing but
include <code class="language-plaintext highlighter-rouge">windows.h</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -E empty.c | grep -vc '^$'
82041
</code></pre></div></div>

<p>With <a href="https://github.com/skeeto/w64devkit">w64devkit</a> this takes my system ~450ms to compile with GCC:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time gcc -c empty.c
real    0m 0.45s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>Compiling an actually empty source file takes ~10ms, so it really is
spending practically all that time processing headers. MSVC is a faster
compiler, and this extends to processing an even larger <code class="language-plaintext highlighter-rouge">windows.h</code> that
crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real    0m 0.12s
user    0m 0.09s
sys     0m 0.01s
</code></pre></div></div>

<p>That’s just low enough to be tolerable, but I’d like the situation with
GCC to be better. Defining <code class="language-plaintext highlighter-rouge">WIN32_LEAN_AND_MEAN</code> reduces the number of
included headers, which has a significant effect:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real    0m 0.30s
user    0m 0.00s
sys     0m 0.00s

$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real    0m 0.07s
user    0m 0.01s
sys     0m 0.01s
</code></pre></div></div>

<h3 id="precompiled-headers">Precompiled headers</h3>

<p>The official solution is precompiled headers. Put all the system header
includes, <a href="/blog/2023/01/08/">or similar</a>, into a dedicated header, then compile that
header into a special format. For example, <code class="language-plaintext highlighter-rouge">headers.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define WIN32_LEAN_AND_MEAN
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">main.c</code> includes <code class="language-plaintext highlighter-rouge">windows.h</code> through this header:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"headers.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I ask <a href="https://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html">GCC to compile <code class="language-plaintext highlighter-rouge">headers.h</code></a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc headers.h
</code></pre></div></div>

<p>It produces <code class="language-plaintext highlighter-rouge">headers.h.gch</code>. When a source includes <code class="language-plaintext highlighter-rouge">headers.h</code>, GCC first
searches for an appropriate <code class="language-plaintext highlighter-rouge">.gch</code>. Not only must the name match, but so
must all the definitions at the moment of inclusion: <code class="language-plaintext highlighter-rouge">headers.h</code> should
always be the first included header, otherwise it may not work. Now when I
compile <code class="language-plaintext highlighter-rouge">main.c</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time gcc -c main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>Much better! MSVC has a conventional name for this header recognizable to
every Visual Studio user: <code class="language-plaintext highlighter-rouge">stdafx.h</code>. It works a bit differently, and I’ve
never used it myself, but I trust it has similar results.</p>

<p>Precompiled headers require some extra steps that vary by toolchain. Can
we do better? That depends on your definition of “better!”</p>

<h3 id="artisan-handcrafted-prototypes">Artisan, handcrafted prototypes</h3>

<p>As mentioned, systems software tends to need only a few declarations:
open, read, write, stat, etc. What if I wrote these out manually? A bit
tedious, but it doesn’t require special precompiled header handling. It
also creates some new possibilities. To illustrate, a <a href="/blog/2023/02/15/">CRT-free</a>
“hello world” program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">DWORD</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This takes my system half a second to compile — quite long to produce just
26 assembly instructions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.50s
user    0m 0.00s
sys     0m 0.00s
$ ./hello.exe
Hello, world!
</code></pre></div></div>

<p>The program requires prototypes only for GetStdHandle and WriteFile, a
definition for <code class="language-plaintext highlighter-rouge">STD_OUTPUT_HANDLE</code>, and some typedefs. Starting with the
easy stuff, the definition and <a href="https://learn.microsoft.com/en-us/windows/win32/winprog/windows-data-types">types look like this</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define STD_OUTPUT_HANDLE ((DWORD)-11)
</span>
<span class="k">typedef</span> <span class="kt">int</span> <span class="n">BOOL</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="n">HANDLE</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">DWORD</span><span class="p">;</span>
</code></pre></div></div>

<p>By the way, here’s a cheat code for quickly finding preprocessor
definitions, faster than looking them up elsewhere:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo '#include &lt;windows.h&gt;' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)
</code></pre></div></div>

<p>Did you catch the pattern? It’s <code class="language-plaintext highlighter-rouge">-10 - fd</code>, where <code class="language-plaintext highlighter-rouge">fd</code> is the conventional
unix file descriptor number: a kind of mnemonic.</p>
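
<p>To make the mnemonic concrete, here is a tiny sketch. The helper name <code class="language-plaintext highlighter-rouge">std_handle_id</code> is hypothetical, not part of any API; it only restates the arithmetic from the definitions above.</p>

```c
// Hypothetical helper: maps a unix file descriptor (0, 1, 2) to the
// matching STD_*_HANDLE value using the -10 - fd pattern.
static int std_handle_id(int fd)
{
    return -10 - fd;
}
```

<p>So descriptor 0 yields -10 (input), 1 yields -11 (output), and 2 yields -12 (error).</p>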

<p>Prototypes are a little trickier, especially if you care about 32-bit. The
Windows API uses the “stdcall” calling convention, which is distinct from
the “cdecl” calling convention on x86, though the same on x64. Of course,
you must already be aware of this merely using the API, as your own
callbacks must usually be stdcall themselves. Further, API functions are
<a href="/blog/2021/05/31/">DLL imports</a> and should be declared as such. Putting it together,
here’s GetStdHandle:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="n">HANDLE</span> <span class="kr">__stdcall</span> <span class="nf">GetStdHandle</span><span class="p">(</span><span class="n">DWORD</span><span class="p">);</span>
</code></pre></div></div>

<p>This works with both Mingw-w64 and MSVC. MSVC requires <code class="language-plaintext highlighter-rouge">__stdcall</code> between
the return type and function name, so don’t get clever about it. If you
only care about GCC then you can declare both at once using attributes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span> <span class="nf">GetStdHandle</span><span class="p">(</span><span class="n">DWORD</span><span class="p">)</span>
    <span class="n">__attribute__</span><span class="p">((</span><span class="n">dllimport</span><span class="p">,</span><span class="n">stdcall</span><span class="p">));</span>
</code></pre></div></div>

<p>I like to hide all this behind a macro, with a “table” of all my imports
listed just below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define W32(r) __declspec(dllimport) r __stdcall
</span><span class="n">W32</span><span class="p">(</span><span class="n">HANDLE</span><span class="p">)</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">DWORD</span><span class="p">);</span>
<span class="n">W32</span><span class="p">(</span><span class="n">BOOL</span><span class="p">)</span>   <span class="n">WriteFile</span><span class="p">(</span><span class="n">HANDLE</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="n">DWORD</span><span class="p">,</span> <span class="n">DWORD</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>In WriteFile you may have noticed I’m taking shortcuts. The “official”
definition uses an ugly pointer typedef, <code class="language-plaintext highlighter-rouge">LPCVOID</code>, instead of pointer
syntax, but I skipped that type definition. I also replaced the last
argument, an <code class="language-plaintext highlighter-rouge">OVERLAPPED</code> pointer, with a generic pointer. I only need to
pass null. I can keep sanding it down to something more ergonomic:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">W32</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span>    <span class="n">WriteFile</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>That’s how I typically write these prototypes. I dropped the <code class="language-plaintext highlighter-rouge">const</code>
because it doesn’t help me. I used signed sizes because I like them better
and it’s <a href="/blog/2023/02/13/">what I’m usually holding</a> at the call site. But doesn’t
changing the signedness potentially break compatibility? It makes no
difference to any practical ABI: It’s passed the same way. In general,
signedness is a matter for <em>operators</em>, and only some of them — mainly
comparisons (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, etc.) and division. It’s a similar story for
pointers starting with the 32-bit era, so I can choose whatever pointer
types are convenient.</p>
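
<p>A small, portable sketch of that claim, assuming a two’s complement target (true of every Windows platform). The function names are made up for illustration. The bits handed to the callee are the same either way; only certain operators, such as comparison, interpret them differently.</p>

```c
#include <string.h>

// Returns 1 if a signed value and its unsigned conversion share the same bits.
static int same_bits(int s)
{
    unsigned u = (unsigned)s;
    return memcmp(&s, &u, sizeof(s)) == 0;
}

// Comparison is one of the operators where signedness changes the result.
static int less_signed(int a, int b)             { return a < b; }
static int less_unsigned(unsigned a, unsigned b) { return a < b; }
```

<p>Passing <code class="language-plaintext highlighter-rouge">-11</code> as an <code class="language-plaintext highlighter-rouge">int</code> or as <code class="language-plaintext highlighter-rouge">(DWORD)-11</code> therefore puts the same value in the argument slot.</p>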

<p>In general, I can do anything I want so long as I know my compiler will
produce an appropriate function call. These are not standard functions,
like <code class="language-plaintext highlighter-rouge">printf</code> or <code class="language-plaintext highlighter-rouge">memcpy</code>, which are implemented in part by the compiler
itself, but foreign functions. It’s no different than teaching <a href="/blog/2018/05/27/">an
FFI</a> how to make a call. This is also, in essence, how OpenGL and
Vulkan work, with applications <a href="https://www.khronos.org/opengl/wiki/OpenGL_Loading_Library">defining the API for themselves</a>.</p>

<p>Considering all this, my new hello world:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define W32(r) __declspec(dllimport) r __stdcall
</span><span class="n">W32</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>
<span class="n">W32</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span>    <span class="n">WriteFile</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You know, there’s a kind of beauty to a program that requires no external
definitions. It builds quickly and produces a binary bit-for-bit identical
to the original:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real    0m 0.03s
user    0m 0.00s
sys     0m 0.00s
</code></pre></div></div>

<p>I’ve also been using this to patch over API rough edges. For example,
<a href="https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsarecvfrom">WSARecvFrom</a> takes <a href="https://learn.microsoft.com/en-us/windows/win32/api/winsock2/ns-winsock2-wsaoverlapped">WSAOVERLAPPED</a>, but <a href="https://learn.microsoft.com/en-us/windows/win32/api/ioapiset/nf-ioapiset-getqueuedcompletionstatus">GetQueuedCompletionStatus</a>
takes <a href="https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-overlapped">OVERLAPPED</a>. These types are explicitly compatible, and only
defined separately for annoying technical reasons. I must use the same
overlapped object with both APIs at once, meaning I would normally need
ugly pointer casts on my Winsock calls, or vice versa with I/O completion
ports. But because I’m writing all these definitions myself, I can define
a common overlapped structure for both!</p>
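
<p>As a rough sketch of that last idea, one structure could stand in for both types. The name <code class="language-plaintext highlighter-rouge">Overlapped</code>, its field names, and the simplified parameter types are my own assumptions for illustration, in the spirit of the hand-written prototypes above. The layout is intended to mirror the documented <code class="language-plaintext highlighter-rouge">OVERLAPPED</code>, so verify it against your headers before relying on it.</p>

```c
#include <stdint.h>

// One definition shared by Winsock and I/O completion port calls.
// Intended to mirror the documented OVERLAPPED layout; the official
// union over Offset/OffsetHigh is flattened into two 32-bit fields.
typedef struct {
    uintptr_t internal;
    uintptr_t internal_high;
    uint32_t  offset;
    uint32_t  offset_high;
    void     *event;
} Overlapped;

#ifdef _WIN32
#define W32(r) __declspec(dllimport) r __stdcall
// Both prototypes take the same pointer type, so call sites need no casts.
W32(int) WSARecvFrom(uintptr_t, void *, int, int *, int *, void *, int *,
                     Overlapped *, void *);
W32(int) GetQueuedCompletionStatus(void *, int *, uintptr_t *,
                                   Overlapped **, int);
#endif
```

<p>A size check is a cheap safeguard: the structure should occupy three pointers plus eight bytes, which is 32 bytes on 64-bit targets and 20 bytes on 32-bit targets.</p>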

<p>Perhaps you’re worried that this would be too fragile. Well, as a legacy
software aficionado, I enjoy <a href="/blog/2018/04/13/">building and running my programs on old
platforms</a>. So far these programs still work properly <a href="https://winworldpc.com/library/">going back
30 years</a> to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag,
it’s always been a bug (now long fixed) in the old operating system, not
in my programs or these prototypes. So, in effect, this technique has
worked well for the past 30 years!</p>

<p>Writing out these definitions is a bit of a chore, but after paying that
price I’ve been quite happy with the results. I will likely continue doing
it in the future, at least for non-graphical applications.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>My favorite C compiler flags during development</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/04/29/"/>
    <id>urn:uuid:a90f3f5b-b4c3-4153-ac8e-6cdbf235f44b</id>
    <updated>2023-04-29T22:55:25Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=35758898">on Hacker News</a> and <a href="https://old.reddit.com/r/C_Programming/comments/133bjlp">on reddit</a>.</em></p>

<p>The major compilers have an <a href="https://man7.org/linux/man-pages/man1/gcc.1.html">enormous number of knobs</a>. Most are
highly specialized, but others are generally useful even if uncommon. For
warnings, the venerable <code class="language-plaintext highlighter-rouge">-﻿Wall -﻿Wextra</code> is a good start, but
you can do better by tweaking this warning set. This article covers
high-value development-time options in GCC, Clang, and MSVC that ought
to get more consideration.</p>

<!--more-->

<p>There’s an irony that the more you use these options, the less useful they
become. Given a reasonable workflow, they are a harsh mistress: a fast,
tight feedback loop that quickly breaks the habits that cause warnings and
errors. It’s a kind of self-improvement, where eventually most findings
will be false positives. With the heuristics internalized, you will be able
to spot the same issues just by reading code, a handy skill during code review.</p>

<h3 id="static-warnings">Static warnings</h3>

<p>Traditionally, C and C++ compilers are by default conservative with
warnings. Unless configured otherwise, they only warn about the most
egregious issues where they’re highly confident. That’s too conservative. For
<code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">clang</code>, the first order of business is turning on more warnings
with <strong><code class="language-plaintext highlighter-rouge">-﻿Wall</code></strong>. Despite the name, this doesn’t actually enable all
warnings. (<code class="language-plaintext highlighter-rouge">clang</code> has <code class="language-plaintext highlighter-rouge">-﻿Weverything</code> which does literally this, but
trust me, you don’t want it.) However, that still falls short, and you’re
better served enabling <em>extra</em> warnings with <strong><code class="language-plaintext highlighter-rouge">-﻿Wextra</code></strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -Wall -Wextra ...
</code></pre></div></div>

<p>That should be the baseline on any new project, and closer to what these
compilers should do by default. Not using these means leaving value on the
table. If you come across such a project, there’s a good chance you can
find bugs statically just by using this baseline. Some warnings only occur
at higher <a href="https://www.openwall.com/lists/musl/2023/05/22/2/1">optimization levels</a>, so leave these on for your release
builds, too.</p>

<p>For MSVC, including <code class="language-plaintext highlighter-rouge">clang-cl</code>, a similar baseline is <strong><code class="language-plaintext highlighter-rouge">/W4</code></strong>, though it
goes a bit far, warning about use of unary minus on unsigned types
(C4146), and sign conversions (C4245). If you’re <a href="/blog/2023/02/15/">using a CRT</a>, also
disable the bogus and irresponsible “security” warnings. Putting it
together, the warning baseline becomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /W4 /wd4146 /wd4245 /D_CRT_SECURE_NO_WARNINGS ...
</code></pre></div></div>

<p>As for <code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">clang</code>, I dislike unused parameter warnings, so I often
turn them off, at least while I’m working: <strong><code class="language-plaintext highlighter-rouge">-﻿Wno-unused-parameter</code></strong>.
Rarely is it a defect to not use a parameter. It’s common for a function
to fit a fixed prototype but not need all its parameters (e.g. <code class="language-plaintext highlighter-rouge">WinMain</code>).
Were it up to me, this would not be part of <code class="language-plaintext highlighter-rouge">-﻿Wextra</code>.</p>
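
<p>A minimal sketch of the situation, using a made-up callback interface (<code class="language-plaintext highlighter-rouge">visit_fn</code>, <code class="language-plaintext highlighter-rouge">count_if</code>, and <code class="language-plaintext highlighter-rouge">is_positive</code> are hypothetical names). The callback must accept a context pointer to match the prototype even though it has no use for one, which is exactly what the warning flags.</p>

```c
#include <stddef.h>

// Hypothetical callback type: every visitor receives a context pointer.
typedef int (*visit_fn)(int value, void *ctx);

// Has no use for ctx, but must accept it to fit visit_fn.
// With -Wextra this draws an unused parameter warning.
static int is_positive(int value, void *ctx)
{
    return value > 0;
}

// Counts the elements for which the callback returns nonzero.
static int count_if(const int *vals, size_t len, visit_fn fn, void *ctx)
{
    int count = 0;
    for (size_t i = 0; i < len; i++) {
        count += fn(vals[i], ctx) != 0;
    }
    return count;
}
```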

<p>I also dislike unused function warnings: <strong><code class="language-plaintext highlighter-rouge">-﻿Wno-unused-function</code></strong>.
I can’t say this is wrong for the baseline since, in most cases, ultimately
I do want to know if there are unused functions, e.g. to be deleted. But
while I’m working it’s usually noise.</p>

<p>If I’m <a href="/blog/2017/03/01/">working with OpenMP</a>, I may also disable warnings about
unknown pragmas: <strong><code class="language-plaintext highlighter-rouge">-﻿Wno-unknown-pragmas</code></strong>. One cool feature of
OpenMP is that the typical case gracefully degrades to single-threaded
behavior when not enabled. That is, compiling without <code class="language-plaintext highlighter-rouge">-﻿fopenmp</code>.
I’ll test both ways to ensure I get deterministic results, or just to ease
debugging, and I don’t want warnings when it’s disabled. It’s fine for the
baseline to have this warning, but sometimes it’s a poor match.</p>
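
<p>To illustrate the graceful degradation, here is a small sketch with a made-up function, <code class="language-plaintext highlighter-rouge">sum_squares</code>. Built with <code class="language-plaintext highlighter-rouge">-fopenmp</code> the loop is divided among threads. Built without it the pragma is ignored, apart from the warning discussed above, and the loop runs serially with the same result.</p>

```c
// Sums the squares of 0..n-1. The reduction clause gives each thread a
// private partial sum, so the integer result matches the serial build.
static long long sum_squares(int n)
{
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (long long)i * i;
    }
    return sum;
}
```

<p>Integer addition is exact, so the result is deterministic in both builds. With floating point, the order of the reduction can change the rounding, which is one reason to test both ways.</p>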

<p>When working with single-precision floats, perhaps on games or graphics,
it’s easy to accidentally introduce promotion to double precision, which
can hurt performance. It could be neglecting an <code class="language-plaintext highlighter-rouge">f</code> suffix on a constant
or using <code class="language-plaintext highlighter-rouge">sin</code> instead of <code class="language-plaintext highlighter-rouge">sinf</code>. Use <strong><code class="language-plaintext highlighter-rouge">-﻿Wdouble-promotion</code></strong> to
catch such mistakes. Honestly, this is important enough that it should go
into the baseline.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PI 3.141592653589793
</span><span class="kt">float</span> <span class="n">degs</span> <span class="o">=</span> <span class="p">...;</span>
<span class="kt">float</span> <span class="n">rads</span> <span class="o">=</span> <span class="n">degs</span> <span class="o">*</span> <span class="n">PI</span> <span class="o">/</span> <span class="mi">180</span><span class="p">;</span>  <span class="c1">// warns about promotion</span>
</code></pre></div></div>

<p>It can be awkward around variadic functions, particularly <code class="language-plaintext highlighter-rouge">printf</code>, which
cannot receive <code class="language-plaintext highlighter-rouge">float</code> arguments, and so implicitly converts. You’ll need
an explicit cast to disable the warning. I imagine this is the main reason
the warning is not part of <code class="language-plaintext highlighter-rouge">-﻿Wextra</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="p">...;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%.17g</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">x</span><span class="p">);</span>
</code></pre></div></div>
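
<p>Returning to the degrees example above, one way to address the promotion warning is to keep the constant in single precision. This is a sketch of one possible fix, with <code class="language-plaintext highlighter-rouge">deg2rad</code> as a hypothetical name.</p>

```c
#define PI 3.14159265f  // f suffix keeps the constant single precision

// Every operand is float or an int converted to float, so nothing is
// promoted to double.
static float deg2rad(float degs)
{
    return degs * PI / 180;
}
```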

<p>Finally, an advanced option: <strong><code class="language-plaintext highlighter-rouge">-﻿Wconversion -Wno-sign-conversion</code></strong>.
It warns about implicit conversions that may result in data loss. Sign
conversions do not have data loss, the implicit conversions are useful,
and in my experience they’re not a source of defects, so I disable that
part using the second flag (like MSVC <code class="language-plaintext highlighter-rouge">/wd4245</code>). The important warning
here is truncation of size values, warning about unsound uses of sizes and
subscripts. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NOTE: would be declared/defined via windows.h</span>
<span class="k">typedef</span> <span class="kt">uint32_t</span> <span class="n">DWORD</span><span class="p">;</span>
<span class="n">BOOL</span> <span class="nf">WriteFile</span><span class="p">(</span><span class="n">HANDLE</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="n">DWORD</span><span class="p">,</span> <span class="n">DWORD</span> <span class="o">*</span><span class="p">,</span> <span class="n">OVERLAPPED</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">void</span> <span class="nf">logmsg</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">msg</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">err</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_ERROR_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">out</span><span class="p">;</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">out</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// len truncation warning</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On 64-bit targets, it will warn about truncating the 64-bit <code class="language-plaintext highlighter-rouge">len</code> for the
32-bit parameter. To dismiss the warning, you must either address it by
using a loop to <a href="/blog/2023/02/13/">call <code class="language-plaintext highlighter-rouge">WriteFile</code> multiple times</a>, or acknowledge the
truncation with an explicit cast and accept the consequences. In this case
I may know from context it’s impossible for the program to even construct
such a large message, so I’d use an assertion and truncate.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">logmsg</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">msg</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">err</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_ERROR_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">out</span><span class="p">;</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">len</span> <span class="o">&lt;=</span> <span class="mh">0xffffffff</span><span class="p">);</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="p">(</span><span class="n">DWORD</span><span class="p">)</span><span class="n">len</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">out</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You might consider changing the interface instead:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">logmsg</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">msg</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">len</span><span class="p">);</span>
</code></pre></div></div>

<p>That probably passes the buck and doesn’t solve the underlying problem.
The caller may be holding a <code class="language-plaintext highlighter-rouge">size_t</code> length, so the truncation happens
there instead. Or maybe you keep propagating this change backwards until
it, say, dissipates on a known constant. <code class="language-plaintext highlighter-rouge">-﻿Wconversion</code> leads to
these ripple effects that improve the overall program, which is why I
like it.</p>

<p>The catch is that the above warning only happens for 64-bit targets. So
you might miss it. The inverse is true in other cases. This is one area
where <a href="/blog/2021/08/21/">cross-architecture testing</a> can pay off.</p>

<p>Unfortunately since this warning is off the beaten path, it seems like it
doesn’t quite get the attention it could use. It warns about simple cases
where truncation has been explicitly handled/avoided. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="p">...;</span>
<span class="kt">char</span> <span class="n">digit</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">+</span> <span class="n">x</span><span class="o">%</span><span class="mi">10</span><span class="p">;</span>  <span class="c1">// false warning</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">'0'</code> is a known constant. The operation <code class="language-plaintext highlighter-rouge">x%10</code> has a known range (-9
to 9). Therefore the addition result has a known range, and all results
can be represented in a <code class="language-plaintext highlighter-rouge">char</code>. Yet it still warns. This often comes up
dealing with character data like this.</p>
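
<p>One way to quiet this particular case is an explicit cast once the range is known. A minimal sketch, where the function name <code class="language-plaintext highlighter-rouge">lastdigit</code> is hypothetical and non-negative input is an assumption enforced by the assertion:</p>

```c
#include <assert.h>

// Return the ASCII character for the last decimal digit of x.
// For non-negative x, x%10 is in the range 0 to 9, so the sum always
// fits in a char and the cast documents that the narrowing is intended.
static char lastdigit(int x)
{
    assert(x >= 0);
    return (char)('0' + x%10);
}
```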

<p>In my <code class="language-plaintext highlighter-rouge">logmsg</code> fix I had used an assertion to check that no truncation
actually occurred. But wouldn’t it be nice if the compiler could generate
that for us somehow? That brings us to dynamic checks.</p>

<h3 id="dynamic-run-time-checks">Dynamic run-time checks</h3>

<p>Sanitizers have been around for nearly a decade but are still criminally
underused. They insert run-time assertions into programs at the flip of a
switch, typically at a modest performance cost — less than the cost of a
debug build. All three major compilers support at least one sanitizer on
all targets. In most cases, failing to use them is practically the same as
not even trying to find defects. Every beginner tutorial ought to be using
sanitizers <em>from page 1</em> where they teach how to compile a program with
<code class="language-plaintext highlighter-rouge">gcc</code>. (That this is universally <em>not</em> the case, and that these same
tutorials also do not begin with teaching a debugger, is a major, on-going
education failure.)</p>

<p>There are multiple different sanitizers with lots of overlap, but Address
Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan) are the most
general. They are compatible with each other and form a solid, general
baseline. To use address sanitizer, at both compile and link time do:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -fsanitize=address ...
</code></pre></div></div>

<p>It’s even spelled the same way in MSVC. It’s needed at link time because
it includes a runtime component. When working properly it’s aware of all
allocations and checks all memory accesses that might be out of bounds,
producing a run-time error if that occurs. It’s not always appropriate,
but most projects that can use it probably should.</p>

<p>UBSan is enabled similarly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -fsanitize=undefined ...
</code></pre></div></div>

<p>It adds checks around operations that might be undefined, emitting a
run-time error if it occurs. It has an optional runtime component to
produce a helpful diagnostic. You can instead insert a trap instruction,
which is how I prefer to use it: <strong><code class="language-plaintext highlighter-rouge">-﻿fsanitize-trap=undefined</code></strong>.
(Until recently it was <strong><code class="language-plaintext highlighter-rouge">-﻿fsanitize-undefined-trap-on-error</code></strong>.)
This works on platforms where the UBSan runtime is unsupported. Some
instrumentation is only inserted at higher optimization levels.</p>

<p>For me, the most useful UBSan check is signed overflow — e.g. computing
the wrong result — and it’s instrumentation I miss when not working in C.
In programs where this might be an issue, combine it <a href="/blog/2019/01/25/">with a fuzzer</a>
to search for inputs that cause overflows. This is yet another argument in
favor of <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf">signed sizes</a>, as UBSan can detect such overflows. (Yes,
UBSan optionally instruments unsigned overflow, too, but then you must
somehow distinguish <a href="/blog/2019/11/19/">intentional</a> from <a href="/blog/2017/07/19/">unintentional</a>
overflow.)</p>
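
<p>To illustrate the kind of check involved, here is a hand-written version for a single signed addition. This is only a sketch: the helper name <code class="language-plaintext highlighter-rouge">size_add</code> is hypothetical, and <code class="language-plaintext highlighter-rouge">__builtin_add_overflow</code> is a GCC and Clang extension rather than standard C:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Add two signed sizes without risking undefined behavior.
// Returns 1 and stores the sum in *sum on success, 0 on overflow.
static int size_add(ptrdiff_t a, ptrdiff_t b, ptrdiff_t *sum)
{
    assert(sum);
    // The builtin returns true when the mathematical result does not
    // fit in *sum, which is the same condition UBSan would report.
    return !__builtin_add_overflow(a, b, sum);
}
```

<p>With UBSan enabled, a plain <code class="language-plaintext highlighter-rouge">a + b</code> gets comparable coverage without any changes to the source.</p>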

<p>On Linux, ASan and UBSan strangely do not have <a href="/blog/2022/06/26/">debugger-oriented
defaults</a>. Fortunately that’s easy to address with a couple of
environment variables, which cause them to break on error instead of
uselessly exiting:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">ASAN_OPTIONS</span><span class="o">=</span><span class="nv">abort_on_error</span><span class="o">=</span>1:halt_on_error<span class="o">=</span>1
<span class="nb">export </span><span class="nv">UBSAN_OPTIONS</span><span class="o">=</span><span class="nv">abort_on_error</span><span class="o">=</span>1:halt_on_error<span class="o">=</span>1
</code></pre></div></div>

<p>Also, when compiling you can combine sanitizers like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -fsanitize=address,undefined ...
</code></pre></div></div>

<p>As of this writing, MSVC does not have UBSan, but it does have a similar
feature, <a href="https://learn.microsoft.com/en-us/cpp/build/reference/rtc-run-time-error-checks">run-time error checks</a>. Three sub-flags (<code class="language-plaintext highlighter-rouge">c</code>, <code class="language-plaintext highlighter-rouge">s</code>, <code class="language-plaintext highlighter-rouge">u</code>)
enable different checks, and <strong><code class="language-plaintext highlighter-rouge">/RTCcsu</code></strong> turns them all on. The <code class="language-plaintext highlighter-rouge">c</code> flag
generates the assertion I had manually written with <code class="language-plaintext highlighter-rouge">-﻿Wconversion</code>,
and traps any truncation at run time. There’s nothing quite like this in
UBSan! It’s so extreme that it’s compatible with neither standard runtime
libraries (fortunately <a href="/blog/2023/02/11/">not a big deal</a>) nor with ASan.</p>

<p>Caveat: Explicit casts aren’t enough. You must actually truncate variables
using a mask in order to pass the check. For example, to accept truncation
in the <code class="language-plaintext highlighter-rouge">logmsg</code> function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">WriteFile</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="n">len</span><span class="o">&amp;</span><span class="mh">0xffffffff</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">out</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
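
<p>The two intents, checked narrowing and deliberate truncation, can be kept apart with small helpers. A sketch with hypothetical names (<code class="language-plaintext highlighter-rouge">narrow32</code>, <code class="language-plaintext highlighter-rouge">low32</code>), assuming the mask is what marks the truncation as intentional, as described above:</p>

```c
#include <assert.h>
#include <stdint.h>

// Checked narrowing: the caller believes the value already fits, so
// assert it. The mask keeps the truncation explicit in the value
// itself rather than only in the cast.
static uint32_t narrow32(uint64_t len)
{
    assert(len <= 0xffffffff);
    return (uint32_t)(len & 0xffffffff);
}

// Deliberate truncation: keep only the low 32 bits, whatever the input.
static uint32_t low32(uint64_t len)
{
    return (uint32_t)(len & 0xffffffff);
}
```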

<p>Thread Sanitizer (TSan) is occasionally useful for finding — or, more
often, <em>proving</em> the presence of — data races. It has a runtime component
and so must be used at compile time and link time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -fsanitize=thread ...
</code></pre></div></div>

<p>Unfortunately it only works in a narrow context. The target must use
pthreads, not C11 threads, OpenMP, nor <a href="/blog/2023/03/23/">direct cloning</a>. It must
only synchronize through code that was compiled with TSan. That means no
synchronization <a href="/blog/2022/10/03/">through system calls</a>, especially no futexes. Most
non-trivial programs do not meet the criteria.</p>

<h3 id="debug-information">Debug information</h3>

<p>Another common mistake in tutorials is using plain old <code class="language-plaintext highlighter-rouge">-﻿g</code> instead
of <strong><code class="language-plaintext highlighter-rouge">-﻿g3</code></strong> (read: “debug level 3”). That’s like using <code class="language-plaintext highlighter-rouge">-﻿O</code>
instead of <code class="language-plaintext highlighter-rouge">-﻿O3</code>. It adds a lot more debug information to the
output, particularly enums and macros. The extra information is useful and
you’re better off having it!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc ... -g3 ...
</code></pre></div></div>

<p>All the major build systems — CMake, Autotools, Meson, etc. — get this
wrong in their standard debug configurations. Producing a fully-featured
debug build from these systems is a constant battle for me. Often it’s
easier to ignore the build system entirely and <code class="language-plaintext highlighter-rouge">cc -g3 **/*.c</code> (plus
sanitizers, etc.).</p>

<p>(Short term note: GCC 11, released in March 2021, switched to DWARF5 by
default. However, GDB could not access the extra <code class="language-plaintext highlighter-rouge">-﻿g3</code> debug
information in DWARF5 until GDB 13, released February 2023. If you have a
toolchain from that two year window — except <a href="https://github.com/skeeto/w64devkit">mine</a> because I patched
it — then you may also need <code class="language-plaintext highlighter-rouge">-﻿gdwarf-4</code> to switch back to DWARF4.)</p>

<p>What about <code class="language-plaintext highlighter-rouge">-﻿Og</code>? In theory it enables optimizations that do not
interfere with debugging, and potentially some additional warnings. In
practice I still get far too many “optimized out” messages from GDB when I
use it, so I don’t bother. Fortunately C is such a simple language that
debug builds are nearly as fast as release builds anyway.</p>

<p>On MSVC I like having debug information embedded in binaries, as GCC does,
which is done using <strong><code class="language-plaintext highlighter-rouge">/Z7</code></strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl ... /Z7 ...
</code></pre></div></div>

<p>Though I certainly understand the value of separate debug information,
<code class="language-plaintext highlighter-rouge">/Zi</code>, in some cases. Sometimes I wish the GNU toolchain made this easier.</p>

<h3 id="summary">Summary</h3>

<p>My personal rigorous baseline for development using <code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">clang</code>
looks like this (all platforms):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -g3 -Wall -Wextra -Wconversion -Wdouble-promotion \
     -Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion \
     -fsanitize=undefined -fsanitize-trap ...
</code></pre></div></div>

<p>While ASan is great for quickly reviewing and evaluating other people’s
projects, I don’t find it useful for my own programs. I avoid that class
of defects through smarter paradigms (region-based allocation, no null
terminated strings, etc.). I also prefer UBSan’s trap instruction mode over
its diagnostic runtime, as it behaves better under debuggers.</p>

<p>For <code class="language-plaintext highlighter-rouge">cl</code> and <code class="language-plaintext highlighter-rouge">clang-cl</code>, my personal baseline looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /Z7 /W4 /wd4146 /wd4245 /RTCcsu ...
</code></pre></div></div>

<p>I don’t normally need <code class="language-plaintext highlighter-rouge">/D_CRT_SECURE_NO_WARNINGS</code> since I don’t use a CRT
anyway.</p>

<p><strong>Update</strong>: Peter0x44 points out <code class="language-plaintext highlighter-rouge">-D_GLIBCXX_DEBUG</code> if you’re working in
C++ with libstdc++, including on Windows with Mingw-w64. I agree, this is
an excellent option! ASan does not “see” C++ containers, and it fills in
some of those gaps.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Practical libc-free threading on Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/03/23/"/>
    <id>urn:uuid:631a8107-2eef-420b-9594-752e6f013048</id>
    <updated>2023-03-23T05:32:41Z</updated>
    <category term="c"/><category term="optimization"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re <a href="/blog/2023/02/15/">not using a C runtime</a> on Linux, and instead you’re
programming against its system call API. It’s long-term and stable after
all. <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">Memory management</a> and <a href="/blog/2023/02/13/">buffered I/O</a> are easily
solved, but a lot of software benefits from concurrency. It would be nice
to also have thread spawning capability. This article will demonstrate a
simple, practical, and robust approach to spawning and managing threads
using only raw system calls. It only takes about a dozen lines of C,
including a few inline assembly instructions.</p>

<p>The catch is that there’s no way to avoid using a bit of assembly. Neither
the <code class="language-plaintext highlighter-rouge">clone</code> nor <code class="language-plaintext highlighter-rouge">clone3</code> system calls have threading semantics compatible
with C, so you’ll need to paper over it with a bit of inline assembly per
architecture. This article will focus on x86-64, but the basic concept
should work on all architectures supported by Linux. The <a href="https://man7.org/linux/man-pages/man2/clone.2.html">glibc <code class="language-plaintext highlighter-rouge">clone(2)</code>
wrapper</a> fits a C-compatible interface on top of the raw system call,
but we won’t be using it here.</p>

<p>Before diving in, the complete, working demo: <a href="https://github.com/skeeto/scratch/blob/master/misc/stack_head.c"><strong><code class="language-plaintext highlighter-rouge">stack_head.c</code></strong></a></p>

<h3 id="the-clone-system-call">The clone system call</h3>

<p>On Linux, threads are spawned using the <code class="language-plaintext highlighter-rouge">clone</code> system call with semantics
like the classic unix <code class="language-plaintext highlighter-rouge">fork(2)</code>. One process goes in, two processes come
out in nearly the same state. For threads, those processes share almost
everything and differ only by two registers: the return value — zero in
the new thread — and stack pointer. Unlike typical thread spawning APIs,
the application does not supply an entry point. It only provides a stack
for the new thread. The simple form of the raw clone API looks something
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">clone</span><span class="p">(</span><span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>
</code></pre></div></div>

<p>Sounds kind of elegant, but it has an annoying problem: The new thread
begins life in the <em>middle</em> of a function without any established stack
frame. Its stack is a blank slate. It’s not ready to do anything except
jump to a function prologue that will set up a stack frame. So besides the
assembly for the system call itself, it also needs more assembly to get
the thread into a C-compatible state. In other words, <strong>a generic system
call wrapper cannot reliably spawn threads</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">brokenclone</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">threadentry</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">long</span> <span class="n">r</span> <span class="o">=</span> <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_clone</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">stack</span><span class="p">);</span>
    <span class="c1">// DANGER: new thread may access non-existent stack frame here</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">threadentry</span><span class="p">(</span><span class="n">arg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For odd historical reasons, each architecture’s <code class="language-plaintext highlighter-rouge">clone</code> has a slightly
different interface. The newer <code class="language-plaintext highlighter-rouge">clone3</code> unifies these differences, but it
suffers from the same thread spawning issue above, so it’s not helpful
here.</p>

<h3 id="the-stack-header">The stack “header”</h3>

<p>I <a href="/blog/2015/05/15/">figured out a neat trick eight years ago</a> which I continue to use
today. The parent and child threads are in nearly identical states when
the new thread starts, but the immediate goal is to diverge. As noted, one
difference is their stack pointers. To diverge their execution, we could
make their execution depend on the stack. An obvious choice is to push
different return pointers on their stacks, then let the <code class="language-plaintext highlighter-rouge">ret</code> instruction
do the work.</p>

<p>Carefully preparing the new stack ahead of time is the key to everything,
and there’s a straightforward technique that I like to call the <code class="language-plaintext highlighter-rouge">stack_head</code>,
a structure placed at the high end of the new stack. Its first element
must be the entry point pointer, and this entry point will receive a
pointer to its own <code class="language-plaintext highlighter-rouge">stack_head</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>The structure must have 16-byte alignment on all architectures. I used an
attribute to help keep this straight, and it can help when using <code class="language-plaintext highlighter-rouge">sizeof</code>
to place the structure, as I’ll demonstrate later.</p>

<p>Now for the cool part: The <code class="language-plaintext highlighter-rouge">...</code> can be anything you want! Use that area
to seed the new stack with whatever thread-local data is necessary. It’s a
neat feature you don’t get from standard thread spawning interfaces. If I
plan to “join” a thread later — wait until it’s done with its work — I’ll
put a join futex in this space:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">join_futex</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>More details on that futex shortly.</p>

<h3 id="the-clone-wrapper">The clone wrapper</h3>

<p>I call the <code class="language-plaintext highlighter-rouge">clone</code> wrapper <code class="language-plaintext highlighter-rouge">newthread</code>. It has the inline assembly for the
system call, and since it includes a <code class="language-plaintext highlighter-rouge">ret</code> to diverge the threads, it’s a
“naked” function <a href="/blog/2023/02/12/">just like with <code class="language-plaintext highlighter-rouge">setjmp</code></a>. The compiler will
generate no prologue or epilogue, and the function body is limited to
inline assembly without input/output operands. It cannot even reliably
reference its parameters by name. Like <code class="language-plaintext highlighter-rouge">clone</code>, it doesn’t accept a thread
entry point. Instead it accepts a <code class="language-plaintext highlighter-rouge">stack_head</code> seeded with the entry
point. The whole wrapper is just six instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="kr">naked</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">long</span> <span class="nf">newthread</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"mov  %%rdi, %%rsi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// arg2 = stack</span>
        <span class="s">"mov  $0x50f00, %%edi</span><span class="se">\n</span><span class="s">"</span>  <span class="c1">// arg1 = clone flags</span>
        <span class="s">"mov  $56, %%eax</span><span class="se">\n</span><span class="s">"</span>       <span class="c1">// SYS_clone</span>
        <span class="s">"syscall</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov  %%rsp, %%rdi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// entry point argument</span>
        <span class="s">"ret</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="o">:</span> <span class="o">:</span> <span class="s">"rax"</span><span class="p">,</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"rsi"</span><span class="p">,</span> <span class="s">"rdi"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On x86-64, both function calls and system calls use <code class="language-plaintext highlighter-rouge">rdi</code> and <code class="language-plaintext highlighter-rouge">rsi</code> for
their first two parameters. Per the reference <code class="language-plaintext highlighter-rouge">clone(2)</code> prototype above:
the first system call argument is <code class="language-plaintext highlighter-rouge">flags</code> and the second argument is the
new <code class="language-plaintext highlighter-rouge">stack</code>, which will point directly at the <code class="language-plaintext highlighter-rouge">stack_head</code>. However, the
stack pointer arrives in <code class="language-plaintext highlighter-rouge">rdi</code>. So I copy <code class="language-plaintext highlighter-rouge">stack</code> into the second argument
register, <code class="language-plaintext highlighter-rouge">rsi</code>, then load the flags (<code class="language-plaintext highlighter-rouge">0x50f00</code>) into the first argument
register, <code class="language-plaintext highlighter-rouge">rdi</code>. The system call number goes in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<p>Where does that <code class="language-plaintext highlighter-rouge">0x50f00</code> come from? That’s the bare minimum thread spawn
flag set in hexadecimal. If any flag is missing then threads will not
spawn reliably — as discovered the hard way by trial and error across
different system configurations, not from documentation. It’s computed
normally like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">long</span> <span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FILES</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FS</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SIGHAND</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SYSVSEM</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_THREAD</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_VM</span><span class="p">;</span>
</code></pre></div></div>
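
<p>As a sanity check on the arithmetic, the sketch below repeats the flag values from the Linux UAPI header <code class="language-plaintext highlighter-rouge">linux/sched.h</code> so that it stands alone, and verifies at compile time that they combine to <code class="language-plaintext highlighter-rouge">0x50f00</code>. The function name <code class="language-plaintext highlighter-rouge">thread_flags</code> is hypothetical:</p>

```c
#include <assert.h>

// Values as defined in the Linux UAPI header <linux/sched.h>, repeated
// here so the sketch does not depend on any system header.
enum {
    CLONE_VM      = 0x00000100,
    CLONE_FS      = 0x00000200,
    CLONE_FILES   = 0x00000400,
    CLONE_SIGHAND = 0x00000800,
    CLONE_THREAD  = 0x00010000,
    CLONE_SYSVSEM = 0x00040000
};

// The hard-coded constant in the assembly must match the named flags.
static_assert(
    (CLONE_FILES | CLONE_FS | CLONE_SIGHAND |
     CLONE_SYSVSEM | CLONE_THREAD | CLONE_VM) == 0x50f00,
    "thread spawn flags must equal 0x50f00"
);

// Compute the thread spawn flag set at run time.
static long thread_flags(void)
{
    long flags = 0;
    flags |= CLONE_FILES;
    flags |= CLONE_FS;
    flags |= CLONE_SIGHAND;
    flags |= CLONE_SYSVSEM;
    flags |= CLONE_THREAD;
    flags |= CLONE_VM;
    return flags;
}
```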

<p>After the system call returns, the wrapper copies the stack pointer into <code class="language-plaintext highlighter-rouge">rdi</code>, the
first argument for the entry point. In the new thread the stack pointer
will be the same value as <code class="language-plaintext highlighter-rouge">stack</code>, of course. In the old thread this is a
harmless no-op because <code class="language-plaintext highlighter-rouge">rdi</code> is a volatile register in this ABI. Finally,
<code class="language-plaintext highlighter-rouge">ret</code> pops the address at the top of the stack and jumps. In the old
thread this returns to the caller with the system call result, either an
error (<a href="/blog/2016/09/23/">negative errno</a>) or the new thread ID. In the new thread
<strong>it pops the first element of <code class="language-plaintext highlighter-rouge">stack_head</code></strong> which, of course, is the
entry point. That’s why it must be first!</p>

<p>The thread has nowhere to return from the entry point, so when it’s done
it must either block indefinitely or use the <code class="language-plaintext highlighter-rouge">exit</code> (<em>not</em> <code class="language-plaintext highlighter-rouge">exit_group</code>)
system call to terminate itself.</p>

<h3 id="caller-point-of-view">Caller point of view</h3>

<p>The caller side looks something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">threadentry</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do work ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
    <span class="n">futex_wake</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">);</span>
    <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span> <span class="o">=</span> <span class="n">newstack</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">16</span><span class="p">);</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">entry</span> <span class="o">=</span> <span class="n">threadentry</span><span class="p">;</span>
    <span class="c1">// ... assign other thread data ...</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">newthread</span><span class="p">(</span><span class="n">stack</span><span class="p">);</span>

    <span class="c1">// ... do work ...</span>

    <span class="n">futex_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">exit_group</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Despite the minimalist, 6-instruction clone wrapper, this is taking the
shape of a conventional threading API. It would only take a bit more to
hide the futex, too. Speaking of which, what’s going on there? The <a href="/blog/2022/10/05/">same
principle as a WaitGroup</a>. The futex, an integer, is zero-initialized,
indicating the thread is running (“not done”). The joiner tells the kernel
to wait until the integer is non-zero, which it may already be since I
don’t bother to check first. When the child thread is done, it atomically
sets the futex to non-zero and wakes all waiters, which might be nobody.</p>
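
<p>Hiding the futex behind a pair of helpers might look like the sketch below. It is not libc-free: for brevity it assumes the libc <code class="language-plaintext highlighter-rouge">syscall(2)</code> wrapper, and the names <code class="language-plaintext highlighter-rouge">wait_done</code> and <code class="language-plaintext highlighter-rouge">mark_done</code> are hypothetical. Unlike the single wait above, this version re-checks the value in a loop, since a futex wait can return before the value has changed:</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Block until *done becomes non-zero. If the value is already
// non-zero, no system call is made. Otherwise the kernel only sleeps
// while *done still equals the expected value, zero.
static void wait_done(int *done)
{
    assert(done);
    while (!__atomic_load_n(done, __ATOMIC_SEQ_CST)) {
        syscall(SYS_futex, done, FUTEX_WAIT, 0, (void *)0, (void *)0, 0);
    }
}

// Mark the work as done and wake every waiter, which might be nobody.
// Returns the system call result: the number of waiters woken, or -1.
static long mark_done(int *done)
{
    assert(done);
    __atomic_store_n(done, 1, __ATOMIC_SEQ_CST);
    return syscall(SYS_futex, done, FUTEX_WAKE, INT_MAX,
                   (void *)0, (void *)0, 0);
}
```

<p>The thread would call <code class="language-plaintext highlighter-rouge">mark_done(&amp;stack-&gt;join_futex)</code> just before exiting, and the joiner would call <code class="language-plaintext highlighter-rouge">wait_done</code> on the same address.</p>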

<p>Caveat: It’s not safe to free/reuse the stack after a successful join. It
only indicates the thread is done with its work, not that it exited. You’d
need to wait for its <code class="language-plaintext highlighter-rouge">SIGCHLD</code> (or use <code class="language-plaintext highlighter-rouge">CLONE_CHILD_CLEARTID</code>). If this
sounds like a problem, consider <a href="https://vimeo.com/644068002">your context</a> more carefully: Why do
you feel the need to free the stack? It will be freed when the process
exits. Worried about leaking stacks? Why are you starting and exiting an
unbounded number of threads? In the worst case park the thread in a thread
pool until you need it again. Only worry about this sort of thing if
you’re building a general purpose threading API like pthreads. I know it’s
tempting, but avoid doing that unless you absolutely must.</p>

<p>What’s with the <code class="language-plaintext highlighter-rouge">force_align_arg_pointer</code>? Linux doesn’t align the stack
for the process entry point like a System V ABI function call. Processes
begin life with an unaligned stack. This attribute tells GCC to fix up the
stack alignment in the entry point prologue, <a href="/blog/2023/02/15/#stack-alignment-on-32-bit-x86">just like on Windows</a>.
If you want to access <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, and <code class="language-plaintext highlighter-rouge">envp</code> you’ll need <a href="/blog/2022/02/18/">more
assembly</a>. (I wish doing <em>really basic things</em> without libc on Linux
didn’t require so much assembly.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__asm</span> <span class="p">(</span>
    <span class="s">".global _start</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"_start:</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  (%rsp), %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsp), %rsi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsi,%rdi,8), %rdx</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   call  main</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  %eax, %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  $60, %eax</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   syscall</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">envp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
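
<p>For readers who find the three loads cryptic, here is roughly the same
arithmetic in C. It is only an illustration of the initial stack layout on
64-bit Linux: <code class="language-plaintext highlighter-rouge">struct entry_args</code> and <code class="language-plaintext highlighter-rouge">parse_entry_stack</code> are hypothetical
names, and a real entry point still needs the assembly to get hold of the
stack pointer in the first place.</p>

```c
struct entry_args {
    int    argc;
    char **argv;
    char **envp;
};

// sp points at the initial stack as it looks on entry to _start:
//   sp[0]          argc
//   sp[1..argc]    argv pointers
//   sp[argc+1]     null terminator for argv
//   sp[argc+2..]   envp pointers, null-terminated
static struct entry_args parse_entry_stack(void **sp)
{
    struct entry_args r;
    r.argc = (int)(long)sp[0];        // movl (%rsp), %edi
    r.argv = (char **)(sp + 1);       // lea  8(%rsp), %rsi
    r.envp = r.argv + r.argc + 1;     // lea  8(%rsi,%rdi,8), %rdx
    return r;
}
```

<p>The <code class="language-plaintext highlighter-rouge">+ 1</code> skips the null pointer that terminates <code class="language-plaintext highlighter-rouge">argv</code>, which is what
the extra 8-byte displacement does in the last <code class="language-plaintext highlighter-rouge">lea</code>.</p>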

<p>Getting back to the example usage, it has some regular-looking system call
wrappers. Where do those come from? Start with this 6-argument generic
system call wrapper.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">syscall6</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="n">b</span><span class="p">,</span> <span class="kt">long</span> <span class="n">c</span><span class="p">,</span> <span class="kt">long</span> <span class="n">d</span><span class="p">,</span> <span class="kt">long</span> <span class="n">e</span><span class="p">,</span> <span class="kt">long</span> <span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r10</span> <span class="n">asm</span><span class="p">(</span><span class="s">"r10"</span><span class="p">)</span> <span class="o">=</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r8</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r8"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">e</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r9</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r9"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">ret</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r10</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r8</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r9</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I could define <code class="language-plaintext highlighter-rouge">syscall5</code>, <code class="language-plaintext highlighter-rouge">syscall4</code>, etc. but instead I’ll just wrap it
in macros. The former would be more efficient since the latter wastes
instructions zeroing registers for no reason, but for now I’m focused on
compacting the implementation source.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))
</span></code></pre></div></div>

<p>Now we can have some exits:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit_group</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit_group</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Simplified futex wrappers:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">expect</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL4</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">expect</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL3</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="mh">0x7fffffff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And so on.</p>

<p>Finally I can talk about that <code class="language-plaintext highlighter-rouge">newstack</code> function. It’s just a wrapper
around an anonymous memory map allocating pages from the kernel. I’ve
hardcoded the constants for the standard mmap allocation since they’re
nothing special or unusual. The return value check is a little tricky
since a large portion of the negative range is valid, so I only want to
check for a small range of negative errnos. (Allocating an arena looks
basically the same.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="nf">newstack</span><span class="p">(</span><span class="kt">long</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">p</span> <span class="o">=</span> <span class="n">SYSCALL6</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x22</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="kt">long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">size</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span><span class="p">);</span>
    <span class="k">return</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span> <span class="o">+</span> <span class="n">count</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
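
<p>The range check can be pulled out into a small helper to make the intent
more visible. This is a sketch: <code class="language-plaintext highlighter-rouge">syscall_failed</code> is a hypothetical name, and
it assumes the usual Linux convention that raw system calls report errors
as a negated errno between -4095 and -1.</p>

```c
// Returns 1 if a raw system call result encodes an error, that is, if it
// lies in [-4095, -1]. Viewed as unsigned, those are exactly the values
// greater than -4096UL, so one comparison covers the whole range. Any
// other value, including large "negative" addresses, counts as success.
static int syscall_failed(long ret)
{
    return (unsigned long)ret > -4096UL;
}
```

<p>With that, the check in <code class="language-plaintext highlighter-rouge">newstack</code> would read as
<code class="language-plaintext highlighter-rouge">if (syscall_failed(p))</code>.</p>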

<p>The <code class="language-plaintext highlighter-rouge">aligned</code> attribute comes into play here: I treat the result like an
array of <code class="language-plaintext highlighter-rouge">stack_head</code> and return the last element. The attribute ensures
each individual element is aligned.</p>
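
<p>To see why the array trick lands on an aligned address at the very top of
the block, here is the same pointer arithmetic applied to an ordinary
buffer. <code class="language-plaintext highlighter-rouge">struct head</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">stack_head</code> with
made-up fields, and <code class="language-plaintext highlighter-rouge">top_of_block</code> is likewise an illustrative name.</p>

```c
// Hypothetical stand-in for stack_head. The aligned attribute rounds the
// struct size up to a multiple of 16, so every element of an array of
// them sits on a 16-byte boundary when the array itself does.
struct __attribute((aligned(16))) head {
    void (*entry)(struct head *);
    int   join_futex;
};

// Treat [base, base+size) as an array of struct head and return a pointer
// to its last element. Assumes base is at least 16-byte aligned, which
// holds for page-aligned memory such as an mmap result.
static struct head *top_of_block(void *base, long size)
{
    long count = size / (long)sizeof(struct head);
    return (struct head *)base + count - 1;
}
```

<p>When <code class="language-plaintext highlighter-rouge">size</code> is a multiple of the struct size, the returned element ends
exactly where the block ends, so nothing at the top is wasted.</p>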

<p>That’s it! There’s not much to it other than a few thoughtful assembly
instructions. It took doing this a few times in a few different programs
before I noticed how simple it can be.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>CRT-free in 2023: tips and tricks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/02/15/"/>
    <id>urn:uuid:025441bf-084e-4c3e-9a37-269e2ac1a4d6</id>
    <updated>2023-02-15T02:12:00Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Seven years ago I wrote about <a href="/blog/2016/01/31/">“freestanding” Windows executables</a>.
After an additional seven years of practical experience both writing and
distributing such programs, half using <a href="https://github.com/skeeto/w64devkit">a custom-built toolchain</a>,
it’s time to revisit these cabalistic incantations and otherwise scant
details. I’ve tweaked my older article over the years as I’ve learned, but
this is a full replacement and does not assume you’ve read it. The <a href="/blog/2023/02/11/">“why”
has been covered</a> and the focus will be on the “how”. Both the GNU
and MSVC toolchains will be considered.</p>

<p>I no longer call these “freestanding” programs since that term is, at
best, <a href="https://github.com/ipxe/ipxe/commit/e8393c372">inaccurate</a>. In fact, we will be actively avoiding GCC
features associated with that label. Instead I call these <em>CRT-free</em>
programs, where CRT stands for the <em>C runtime</em>, the Windows-oriented term
for <em>libc</em>. This term communicates both intent and scope.</p>

<h3 id="entry-point">Entry point</h3>

<p>You should already know that <code class="language-plaintext highlighter-rouge">main</code> is not the program’s entry point, but
a C application’s entry point. The CRT provides the entry point, where it
initializes the CRT, including <a href="/blog/2022/02/18/">parsing command line options</a>, then
calls the application’s <code class="language-plaintext highlighter-rouge">main</code>. The real entry point doesn’t have a name.
It’s just the address of the function to be called by the loader without
arguments.</p>

<p>You might naively assume you could continue using the name <code class="language-plaintext highlighter-rouge">main</code> and tell
the linker to use it as the entry point. You would be wrong. <strong>Avoid the
name <code class="language-plaintext highlighter-rouge">main</code>!</strong> It has a special meaning in C and gets special treatment.
Using it without a conventional CRT will confuse your tools and may cause build
issues.</p>

<p>While you can use almost any other name you like, the conventional names
are <code class="language-plaintext highlighter-rouge">mainCRTStartup</code> (console subsystem) and <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code> (windows
subsystem). It’s easy to remember: Append <code class="language-plaintext highlighter-rouge">CRTStartup</code> to the name you’d
use in a normal CRT-linking application. I strongly recommend using these
names because it reduces friction. Your tools are already familiar with
them, so you won’t need to do anything special.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>     <span class="c1">// console subsystem</span>
<span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>  <span class="c1">// windows subsystem</span>
</code></pre></div></div>

<p>The MSVC linker documentation says the entry point uses the <code class="language-plaintext highlighter-rouge">__stdcall</code>
calling convention. <del>Ignore this and <strong>do not use <code class="language-plaintext highlighter-rouge">__stdcall</code> for your
entry point!</strong></del> Since entry points may take no arguments, there is no
practical difference from the <code class="language-plaintext highlighter-rouge">__cdecl</code> calling convention, so it matters
little. <del>Rather, the goal is to avoid <code class="language-plaintext highlighter-rouge">__stdcall</code> <em>function decorations</em>.
In particular, the GNU linker <code class="language-plaintext highlighter-rouge">--entry</code> option does not understand them,
nor can it find decorated entry points on its own. If you use <code class="language-plaintext highlighter-rouge">__stdcall</code>,
then the 32-bit GNU linker will silently (!) choose the beginning of your
<code class="language-plaintext highlighter-rouge">.text</code> section as the entry point.</del> (This bug was fixed in Binutils
2.42, released January 2024. <code class="language-plaintext highlighter-rouge">__stdcall</code> entry points now link correctly.)</p>

<p>If you’re using C++, then of course you will also need to use <code class="language-plaintext highlighter-rouge">extern "C"</code>
so that it’s not name-mangled. Otherwise the results are similarly bad.</p>

<p>If using <code class="language-plaintext highlighter-rouge">-fwhole-program</code>, you will need to mark your entry point as
externally visible for GCC so that it knows it’s an entry point. While
linkers are familiar with conventional entry point names, GCC the
<em>compiler</em> is not. Normally you do not need to worry about this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">externally_visible</span><span class="p">))</span>  <span class="c1">// for -fwhole-program</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The entry point returns <code class="language-plaintext highlighter-rouge">int</code>. <em>If there are no other threads</em> then the
process will exit with the returned value as its exit status. In practice
this is only useful for console programs. Windows subsystem programs have
threads started automatically, without warning, and it’s almost certain
your main thread is not the last thread. You probably want to use
<code class="language-plaintext highlighter-rouge">ExitProcess</code> or even <code class="language-plaintext highlighter-rouge">TerminateProcess</code> instead of returning. The latter
exits more abruptly and can avoid issues with certain subsystems, like
DirectSound, not shutting down gracefully: It doesn’t even let them try.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">TerminateProcess</span><span class="p">(</span><span class="n">GetCurrentProcess</span><span class="p">(),</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="compilation">Compilation</h3>

<p>Starting with the GNU toolchain, you have two ways to get into “CRT-free
mode”: <code class="language-plaintext highlighter-rouge">-nostartfiles</code> and <code class="language-plaintext highlighter-rouge">-nostdlib</code>. The former is more dummy-proof,
and it’s what I use in build documentation. The latter can be more
complicated, but when it succeeds you get guarantees about the result. I
use it in build scripts I intend to run myself, which I want to fail if
they don’t do exactly what I expect. To illustrate, consider this trivial
program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">ExitProcess</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This program uses <code class="language-plaintext highlighter-rouge">ExitProcess</code> from <code class="language-plaintext highlighter-rouge">kernel32.dll</code>. Compiling is easy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles example.c
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-nostartfiles</code> prevents it from linking the CRT entry point, but it
still implicitly passes other “standard” linker flags, including libraries
<code class="language-plaintext highlighter-rouge">-lmingw32</code> and <code class="language-plaintext highlighter-rouge">-lkernel32</code>. Programs can use <code class="language-plaintext highlighter-rouge">kernel32.dll</code> functions
without explicitly linking that DLL. But, hey, isn’t <code class="language-plaintext highlighter-rouge">-lmingw32</code> the CRT,
the thing we’re avoiding? It is, but it wasn’t actually linked because the
program didn’t reference it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p a.exe | grep -Fi .dll
        DLL Name: KERNEL32.dll
</code></pre></div></div>

<p>However, <code class="language-plaintext highlighter-rouge">-nostdlib</code> does not pass any of these libraries, so you need to
do so explicitly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c -lkernel32
</code></pre></div></div>

<p>The MSVC toolchain behaves a little like <code class="language-plaintext highlighter-rouge">-nostartfiles</code>, not linking a
CRT unless you need it, semi-automatically. However, you’ll need to list
the import library for <code class="language-plaintext highlighter-rouge">kernel32.dll</code> and tell it which subsystem you’re using.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c /link /subsystem:console kernel32.lib
</code></pre></div></div>

<p>However, MSVC has a handy little feature to list these arguments in the
source file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _MSC_VER
</span>  <span class="cp">#pragma comment(linker, "/subsystem:console")
</span>  <span class="cp">#pragma comment(lib, "kernel32.lib")
#endif
</span></code></pre></div></div>

<p>This information must go somewhere, and I prefer the source file rather
than a build script. Then anyone can point MSVC at the source without
worrying about options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c
</code></pre></div></div>

<p>I try to make all my Windows programs so simply built.</p>

<h3 id="stack-probes">Stack probes</h3>

<p>On Windows, it’s expected that stacks will commit dynamically. That is,
the stack is merely <em>reserved</em> address space, and it’s only committed when
the stack actually grows into it. This made sense 30 years ago as a memory
saving technique, but today it no longer makes sense. However, programs
are still built to use this mechanism.</p>

<p>To function properly, programs must touch each stack page for the first
time in order. Normally that’s not an issue, but if your stack frame
exceeds the page size, there’s a chance it might step over a page. When a
function has a large stack frame, GCC inserts a call to a “stack probe” in
<code class="language-plaintext highlighter-rouge">libgcc</code> that touches its pages in the prologue. It’s not unlike <a href="/blog/2017/06/21/">stack
clash protection</a>.</p>
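
<p>Conceptually, a probe just walks the new frame one page at a time, from
the highest address down, so that each guard page is hit in turn. The
following is a sketch of that idea in portable C, not the actual <code class="language-plaintext highlighter-rouge">libgcc</code>
routine. The name <code class="language-plaintext highlighter-rouge">probe_frame</code> and the fixed 4KiB page size are
assumptions made for illustration.</p>

```c
#include <stddef.h>

enum { PROBE_PAGE = 4096 };  // assumed page size

// Touch one byte in each whole page of a frame of `size` bytes that ends
// at `high`, highest address first, which is the order a downward-growing
// stack needs. Returns the number of pages touched.
static size_t probe_frame(volatile char *high, size_t size)
{
    size_t touched = 0;
    for (size_t off = PROBE_PAGE; off <= size; off += PROBE_PAGE) {
        (void)high[-(ptrdiff_t)off];  // volatile read forces the access
        touched++;
    }
    return touched;
}
```

<p>A frame smaller than a page needs no probing at all, which is why small
functions never pay for this.</p>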

<p>For example, if I have a 4KiB local variable:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">12</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When I compile with <code class="language-plaintext highlighter-rouge">-nostdlib</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c
ld: ... undefined reference to `___chkstk_ms'
</code></pre></div></div>

<p>It’s trying to link the CRT stack probe. You can disable this behavior
with <code class="language-plaintext highlighter-rouge">-mno-stack-arg-probe</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -nostdlib example.c
</code></pre></div></div>

<p>Or you can just link <code class="language-plaintext highlighter-rouge">-lgcc</code> to provide a definition:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib example.c -lgcc
</code></pre></div></div>

<p>Had you used <code class="language-plaintext highlighter-rouge">-nostartfiles</code>, you wouldn’t have noticed because it passes
<code class="language-plaintext highlighter-rouge">-lgcc</code> automatically. It’s “dummy-proof” because this sort of issue goes
away before it comes up, though for the same reason it’s harder to tell
exactly what went into a program.</p>

<p>If you disable the probe altogether — my preference — you’ve only solved
the linker problem, but the underlying stack commit problem remains and
your program may crash. You can solve that by telling the linker to ask
the loader to commit a larger stack up front rather than grow it at run
time. Say, 2MiB:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000 example.c
</code></pre></div></div>

<p>Of course, I wish that this was simply the default behavior because it’s
far more sensible! A much better option is to avoid large stack frames in
the first place. Allocate locals larger than, say, 1KiB in a scratch arena
instead of on the stack.</p>
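
<p>As a rough idea of what allocating from a scratch arena might look like,
here is a minimal bump allocator. It is a sketch under assumptions, not a
prescribed design: <code class="language-plaintext highlighter-rouge">struct arena</code> and <code class="language-plaintext highlighter-rouge">arena_alloc</code> are hypothetical names,
and the backing memory is assumed to come from somewhere other than the
stack, such as a static buffer or <code class="language-plaintext highlighter-rouge">VirtualAlloc</code>.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical arena: a region [beg, end) handed out front to back.
struct arena {
    char *beg;
    char *end;
};

// Bump-allocate `size` bytes aligned to `align`, which must be a power of
// two. Returns a null pointer when the arena has too little room left.
static void *arena_alloc(struct arena *a, ptrdiff_t size, ptrdiff_t align)
{
    // Bytes needed to round beg up to the next multiple of align.
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (size < 0 || size > a->end - a->beg - pad) {
        return 0;
    }
    char *p = a->beg + pad;
    a->beg = p + size;
    return p;
}
```

<p>A function that wanted a 4KiB buffer would then take an arena parameter
and call <code class="language-plaintext highlighter-rouge">arena_alloc(&amp;scratch, 1&lt;&lt;12, 1)</code> rather than declaring a large
local array.</p>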

<p>MSVC doesn’t have <code class="language-plaintext highlighter-rouge">libgcc</code> of course, but it still generates stack probes
both for growing the stack and for security checks. The latter requires
<code class="language-plaintext highlighter-rouge">kernel32.dll</code>, so if I compile the same program with MSVC, I get a bunch
of linker failures:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl example.c /link /subsystem:console
... unresolved external symbol __imp_RtlCaptureContext ...
... and 7 more ...
</code></pre></div></div>

<p>Using <code class="language-plaintext highlighter-rouge">/Gs1000000000</code> turns off the stack probes, <code class="language-plaintext highlighter-rouge">/GS-</code> turns off the
checks, <code class="language-plaintext highlighter-rouge">/stack</code> commits a larger stack:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /GS- /Gs1000000000 example.c /link
     /subsystem:console /stack:0x200000,0x200000
</code></pre></div></div>

<p>Though, as before, better to avoid large stack frames in the first place.</p>

<h3 id="built-in-functions-ugh">Built-in functions… ugh</h3>

<p>The three major C and C++ compilers — GCC, MSVC, Clang — share a common,
evil weakness: “built-in” functions. <em>No matter what</em>, they each assume
you will supply definitions for standard string functions at link time,
particularly <code class="language-plaintext highlighter-rouge">memset</code> and <code class="language-plaintext highlighter-rouge">memcpy</code>. They do this no matter how many
“seriously now, do not use standard C functions” options you pass. When
you don’t link a CRT, you may need to define them yourself.</p>

<p>With GCC there’s a catch: it will transform your <code class="language-plaintext highlighter-rouge">memset</code> definition —
that is, <em>in a function named <code class="language-plaintext highlighter-rouge">memset</code></em> — into a call to itself. After
all, it looks an awful lot like <code class="language-plaintext highlighter-rouge">memset</code>! This typically manifests as an
infinite loop. <strong>Use <code class="language-plaintext highlighter-rouge">-fno-builtin</code> to prevent GCC from mis-compiling
built-in functions.</strong></p>

<p>Even with <code class="language-plaintext highlighter-rouge">-fno-builtin</code>, both GCC and Clang will continue inserting calls
to built-in functions elsewhere. For example, making an especially large
local variable (and using <code class="language-plaintext highlighter-rouge">volatile</code> to prevent it from being optimized
out):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As of this writing, the latest GCC and Clang will generate a <code class="language-plaintext highlighter-rouge">memset</code> call
despite <code class="language-plaintext highlighter-rouge">-fno-builtin</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -mno-stack-arg-probe -fno-builtin -nostdlib example.c
ld: ... undefined reference to `memset' ...
</code></pre></div></div>

<p>To be absolutely pure, you will need to address this in just about any
non-trivial program. On the other hand, <code class="language-plaintext highlighter-rouge">-nostartfiles</code> will grab a
definition from <code class="language-plaintext highlighter-rouge">msvcrt.dll</code> for you:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostartfiles example.c
$ objdump -p a.exe | grep -Fi .dll
        DLL Name: msvcrt.dll
</code></pre></div></div>

<p>To be clear, <em>this is a completely legitimate and pragmatic route!</em> You
get the benefits of both worlds: the CRT is still out of the way, but
there’s also no hassle from misbehaving compilers. If this sounds like a
good deal, then do it! (For onlookers feeling smug: there is no such
easy, general solution for this problem on Linux.)</p>

<p>When you write your own definitions, I suggest putting each definition in
its own section so that they can be discarded via <code class="language-plaintext highlighter-rouge">-Wl,--gc-sections</code> when
unused:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">section</span><span class="p">(</span><span class="s">".text.memset"</span><span class="p">)))</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">memset</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far, for all three compilers, I’ve only needed to provide definitions
for <code class="language-plaintext highlighter-rouge">memset</code> and <code class="language-plaintext highlighter-rouge">memcpy</code>.</p>

<h3 id="stack-alignment-on-32-bit-x86">Stack alignment on 32-bit x86</h3>

<p>GCC expects a 16-byte aligned stack and generates code accordingly. Such
is dictated by the x64 ABI, so that’s a given on 64-bit Windows. However,
the x86 ABIs only guarantee 4-byte alignment. If no care is taken to deal
with it, there will likely be unaligned loads. Some may not be valid (e.g.
SIMD) leading to a crash. UBSan disapproves, too. Fortunately there’s a
function attribute for this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>GCC will now align the stack in this function’s prologue. Adjustment is
only necessary at entry points, as GCC will maintain alignment through its
own frames. This includes <em>all</em> entry points, not just the program entry
point, particularly thread start functions. Rule of thumb for i686 GCC:
<strong>If <code class="language-plaintext highlighter-rouge">WINAPI</code> or <code class="language-plaintext highlighter-rouge">__stdcall</code> appears in a definition, the stack likely
requires alignment</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="n">DWORD</span> <span class="n">WINAPI</span> <span class="nf">mythread</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s harmless to use this attribute on x64. The prologue will just be a
smidge larger. If you’re worried about it, use <code class="language-plaintext highlighter-rouge">#ifdef __i686__</code> to limit
it to 32-bit builds.</p>

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>If I’ve written a graphical application with <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code>, used
large stack frames, marked my entry point as externally visible, plan to
support 32-bit builds, and defined a couple of needed string functions, my
optimal entry point may look something like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef __GNUC__
</span><span class="n">__attribute</span><span class="p">((</span><span class="n">externally_visible</span><span class="p">))</span>
<span class="cp">#endif
#ifdef __i686__
</span><span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="cp">#endif
</span><span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then my “optimize all the things” release build may look something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -O3 -fno-builtin -Wl,--gc-sections -s -nostdlib -mwindows \
     -fno-asynchronous-unwind-tables -o app.exe app.c -lkernel32
</code></pre></div></div>

<p>Or with MSVC:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /O2 /GS- app.c /link kernel32.lib /subsystem:windows
</code></pre></div></div>

<p>Or if I’m taking it easy maybe just:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -O3 -fno-builtin -s -nostartfiles -mwindows -o app.exe app.c
</code></pre></div></div>

<p>Or with MSVC (linker flags in source):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /O2 app.c
</code></pre></div></div>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Let's implement buffered, formatted output</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/02/13/"/>
    <id>urn:uuid:4a4af83f-4fd8-4b3b-99aa-089d01f90fad</id>
    <updated>2023-02-13T00:00:00Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://old.reddit.com/r/C_Programming/comments/111238u/lets_implement_buffered_formatted_output/">on reddit</a>.</em></p>

<p>When <a href="/blog/2023/02/11/">not using the C standard library</a>, how does one deal with
formatted output? Re-implementing the entirety of <code class="language-plaintext highlighter-rouge">printf</code> from scratch
seems like a lot of work, and indeed it would be. Fortunately it’s rarely
necessary. With the right mindset, and considering your program’s <em>actual</em>
formatting needs, it’s not as difficult as it might appear. Since it goes
hand-in-hand with buffering, I’ll cover both topics at once, including
<code class="language-plaintext highlighter-rouge">sprintf</code>-like capabilities, which is where we’ll start.</p>

<!--more-->

<h3 id="the-print-is-append-mindset">The print-is-append mindset</h3>

<p>Buffering amortizes the costs of write (and read) system calls. Many small
writes are queued via the buffer into a few large writes. This isn’t just
an implementation detail. It’s key in the mindset to tackle formatted
output: <strong>Printing is appending.</strong></p>

<p>The mindset includes the reverse: <em>Appending is like printing</em>. Consider
this next time you reach for <code class="language-plaintext highlighter-rouge">strcat</code> or similar. Is this the appropriate
destination for this data, or am I just going to print it — i.e. append it
to another, different buffer — afterward?</p>

<p>This concept may sound obvious, but consider that there are major, popular
programming paradigms where the norm is otherwise. I’ll pick on Python to
illustrate, but it’s not alone.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"found </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s"> items"</span><span class="p">)</span>
</code></pre></div></div>

<p>This line of code allocates a buffer; formats the value of the variable
<code class="language-plaintext highlighter-rouge">count</code> into it; allocates a second buffer; copies into it the prefix
(<code class="language-plaintext highlighter-rouge">"found "</code>), the first buffer, and the suffix (<code class="language-plaintext highlighter-rouge">" items"</code>); copies the
contents of this second buffer into the standard output buffer; then
discards the two temporary buffers. To see for yourself, use the <a href="https://docs.python.org/3/library/dis.html">CPython
bytecode disassembler</a> on it. (It <em>is</em> pretty neat that string
formatting is partially implemented in the compiler and partially parsed
at compile time.)</p>

<p>With the print-is-append mindset, you know it’s ultimately being copied
into the standard output buffer, and that you can skip the intermediate
appending and copying. Avoiding that pessimization isn’t just about the
computer’s time, it’s even more about saving your own time implementing
formatted output.</p>

<p>In C that line looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">printf</span><span class="p">(</span><span class="s">"found %d items</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
</code></pre></div></div>

<p>The format string is a domain-specific language (DSL) that is (usually)
parsed and evaluated at run time. In essence it’s a little program that
says:</p>

<ol>
  <li>Append <code class="language-plaintext highlighter-rouge">"found "</code> to the output buffer</li>
  <li>Format the given integer into the output buffer</li>
  <li>Append <code class="language-plaintext highlighter-rouge">" items\n"</code> to the output buffer</li>
</ol>

<p>For <code class="language-plaintext highlighter-rouge">sprintf</code> the output buffer is caller-supplied instead of a buffered
stream.</p>

<p>In this implementation we’re going to skip the DSL and express such
“format programs” in C itself. It’s more verbose at the call site, but it
simplifies the implementation. As a bonus, it’s also faster since the
format program is itself compiled by the C compiler. In your own formatted
output implementation you could write a <code class="language-plaintext highlighter-rouge">printf</code> that, following the
format string, calls the append primitives we’ll build below.</p>

<h3 id="buffer-implementation">Buffer implementation</h3>

<p>Let’s begin by defining an output buffer. An output buffer tracks the
total capacity and how much has been written. I’ll include a sticky error
flag to simplify error checks. For a first pass we’ll start with a
<code class="language-plaintext highlighter-rouge">sprintf</code> rather than full-blown <code class="language-plaintext highlighter-rouge">printf</code> because there’s nowhere yet for
the data to go.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define MEMBUF(buf, cap) {buf, cap, 0, 0}
</span><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">cap</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">error</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>I’m using <code class="language-plaintext highlighter-rouge">unsigned char</code> since these are <em>bytes</em>, best understood as
unsigned (0–255), particularly important when dealing with encodings. I
also wrote a “constructor” macro, <code class="language-plaintext highlighter-rouge">MEMBUF</code>, to help with initialization.
Next we need a function to append bytes — the core operation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">avail</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">cap</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">amount</span> <span class="o">=</span> <span class="n">avail</span><span class="o">&lt;</span><span class="n">len</span> <span class="o">?</span> <span class="n">avail</span> <span class="o">:</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">amount</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">[</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">src</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">amount</span><span class="p">;</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">|=</span> <span class="n">amount</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If there isn’t room, it copies as much as possible and sets the error
flag to indicate truncation. It doesn’t return the error. Rather than
check after each append, the caller will check after multiple appends,
effectively batching the checks into one check. The typical, expected case
is that there is no error, so make that path fast.</p>

<p>Since it’s an easy point to miss: <code class="language-plaintext highlighter-rouge">append</code> is the only place in the entire
implementation where bounds checking comes into play. Everything else can
confidently throw bytes at the buffer without worrying if it fits. If
it doesn’t, the sticky error flag will indicate such at a more appropriate
time.</p>

<p>I could have used <code class="language-plaintext highlighter-rouge">memcpy</code> for the loop, but the goal is not to use libc.
Besides, not using <code class="language-plaintext highlighter-rouge">memcpy</code> means we can pass a null pointer without
making it a special exception.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">append</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// append nothing (no-op)</span>
</code></pre></div></div>

<p>I expect that static strings are common sources for append, so I’ll add a
helper macro which gets the length as a compile-time constant. The null
terminator will not be used.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define APPEND_STR(b, s) append(b, s, sizeof(s)-1)
</span></code></pre></div></div>

<p>If that’s not clear yet, it will be once you see an example. It’s also
useful to append single bytes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append_byte</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">append</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">c</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With primitive appends done, we can build ever “higher-level” appends. For
example, to append a formatted <code class="language-plaintext highlighter-rouge">long</code> to the buffer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append_long</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">tmp</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">end</span> <span class="o">=</span> <span class="n">tmp</span> <span class="o">+</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">tmp</span><span class="p">);</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">beg</span> <span class="o">=</span> <span class="n">end</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">t</span> <span class="o">=</span> <span class="n">x</span><span class="o">&gt;</span><span class="mi">0</span> <span class="o">?</span> <span class="o">-</span><span class="n">x</span> <span class="o">:</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="o">*--</span><span class="n">beg</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">-</span> <span class="n">t</span><span class="o">%</span><span class="mi">10</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">t</span> <span class="o">/=</span> <span class="mi">10</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*--</span><span class="n">beg</span> <span class="o">=</span> <span class="sc">'-'</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">append</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">beg</span><span class="p">,</span> <span class="n">end</span><span class="o">-</span><span class="n">beg</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>By working from the negative end — recall that the negative range is
larger than the positive — it supports the full range of signed <code class="language-plaintext highlighter-rouge">long</code>,
whatever it happens to be on this host. With fewer than 50 lines of code we
now have enough to format the example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">message</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>
<span class="k">struct</span> <span class="n">buf</span> <span class="n">b</span> <span class="o">=</span> <span class="n">MEMBUF</span><span class="p">(</span><span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">message</span><span class="p">));</span>

<span class="n">APPEND_STR</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">"found "</span><span class="p">);</span>
<span class="n">append_long</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="n">APPEND_STR</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">" items</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">error</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// truncated</span>
<span class="p">}</span>
</code></pre></div></div>
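
<p>To see the negative-end trick at work, here is a minimal hosted sketch
that repeats the definitions above and compares <code class="language-plaintext highlighter-rouge">append_long</code> against
libc’s <code class="language-plaintext highlighter-rouge">snprintf</code>. The <code class="language-plaintext highlighter-rouge">matches_libc</code> and <code class="language-plaintext highlighter-rouge">check_long_range</code> helpers
are hypothetical names of my own, and libc appears purely as a test
oracle, which a libc-free program would not have.</p>

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

static void append(struct buf *b, const unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

static void append_long(struct buf *b, long x)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;  // stay in the negative range: -LONG_MIN overflows
    do {
        *--beg = (unsigned char)('0' - t%10);
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    append(b, beg, (int)(end - beg));
}

// Hypothetical helper: format x both ways and report whether they agree.
static int matches_libc(long x)
{
    unsigned char out[64];
    struct buf b = {out, sizeof(out), 0, 0};
    append_long(&b, x);

    char want[64];
    int n = snprintf(want, sizeof(want), "%ld", x);
    return !b.error && n == b.len && !memcmp(out, want, (size_t)n);
}

// Hypothetical check of the extremes, including LONG_MIN.
static void check_long_range(void)
{
    assert(matches_libc(0));
    assert(matches_libc(-1));
    assert(matches_libc(LONG_MAX));
    assert(matches_libc(LONG_MIN));
}
```

<p>A version that negated <code class="language-plaintext highlighter-rouge">x</code> into the positive range first would
overflow on <code class="language-plaintext highlighter-rouge">LONG_MIN</code>, which is the case this check is meant to
cover.</p>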

<p>We can continue defining append functions for whatever types we need.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append_ptr</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">APPEND_STR</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">"0x"</span><span class="p">);</span>
    <span class="kt">uintptr_t</span> <span class="n">u</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">p</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">2</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="n">u</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">"0123456789abcdef"</span><span class="p">[(</span><span class="n">u</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="p">))</span><span class="o">&amp;</span><span class="mi">15</span><span class="p">]);</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">vec2</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">;</span> <span class="p">};</span>

<span class="kt">void</span> <span class="nf">append_vec2</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="k">struct</span> <span class="n">vec2</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">APPEND_STR</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">"vec2{"</span><span class="p">);</span>
    <span class="n">append_long</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">v</span><span class="p">.</span><span class="n">x</span><span class="p">);</span>
    <span class="n">APPEND_STR</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">", "</span><span class="p">);</span>
    <span class="n">append_long</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">v</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
    <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="sc">'}'</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
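
<p>With a few typed appends in place, the earlier remark about layering a
<code class="language-plaintext highlighter-rouge">printf</code> on top of them can be sketched. This is only an illustration
under my own assumptions: <code class="language-plaintext highlighter-rouge">bufprintf</code>, <code class="language-plaintext highlighter-rouge">holds</code>, and <code class="language-plaintext highlighter-rouge">demo</code> are
hypothetical names, and the interpreter understands just <code class="language-plaintext highlighter-rouge">%d</code> (an
<code class="language-plaintext highlighter-rouge">int</code>), <code class="language-plaintext highlighter-rouge">%s</code>, and <code class="language-plaintext highlighter-rouge">%%</code>. The primitives are repeated so the block
stands alone, and <code class="language-plaintext highlighter-rouge">assert</code> is there only for the hosted demo.</p>

```c
#include <assert.h>
#include <stdarg.h>

struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

static void append(struct buf *b, const unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

static void append_byte(struct buf *b, unsigned char c)
{
    append(b, &c, 1);
}

static void append_long(struct buf *b, long x)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;
    do {
        *--beg = (unsigned char)('0' - t%10);
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    append(b, beg, (int)(end - beg));
}

// Hypothetical front end: walk the format string and dispatch each
// directive to an append primitive. Unknown directives set the error flag.
static void bufprintf(struct buf *b, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    for (; *fmt; fmt++) {
        if (*fmt != '%') {
            append_byte(b, (unsigned char)*fmt);
            continue;
        }
        switch (*++fmt) {
        case 'd':
            append_long(b, va_arg(ap, int));
            break;
        case 's': {
            const char *s = va_arg(ap, const char *);
            int len = 0;
            while (s[len]) len++;
            append(b, (const unsigned char *)s, len);
        } break;
        case '%':
            append_byte(b, '%');
            break;
        default:  // unknown directive, or a trailing '%'
            b->error = 1;
            va_end(ap);
            return;
        }
    }
    va_end(ap);
}

// Hypothetical helper for checks: does the buffer hold exactly s?
static int holds(const struct buf *b, const char *s)
{
    int i = 0;
    for (; i < b->len && s[i]; i++) {
        if (b->buf[i] != (unsigned char)s[i]) return 0;
    }
    return i == b->len && !s[i];
}

static void demo(void)
{
    unsigned char mem[64];
    struct buf b = {mem, sizeof(mem), 0, 0};
    bufprintf(&b, "found %d %s\n", 42, "items");
    assert(!b.error);
    assert(holds(&b, "found 42 items\n"));
}
```

<p>Bounds checking still lives only in <code class="language-plaintext highlighter-rouge">append</code>; the interpreter just
decides which primitive to call.</p>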

<p>Perhaps you want features like field width? Add a parameter for it… but
only if you need it!</p>
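
<p>As a rough sketch of what that parameter might look like, here is a
variant of <code class="language-plaintext highlighter-rouge">append_long</code> that right-aligns within a minimum width by
padding with spaces. The name <code class="language-plaintext highlighter-rouge">append_long_width</code> and the padding policy
are my own assumptions, as are the <code class="language-plaintext highlighter-rouge">holds</code> and <code class="language-plaintext highlighter-rouge">demo</code> helpers. The
primitives are repeated so the block stands alone.</p>

```c
#include <assert.h>

struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

static void append(struct buf *b, const unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

static void append_byte(struct buf *b, unsigned char c)
{
    append(b, &c, 1);
}

// Hypothetical variant of append_long: pads on the left with spaces
// until the output is at least `width` bytes. Never truncates digits.
static void append_long_width(struct buf *b, long x, int width)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;
    do {
        *--beg = (unsigned char)('0' - t%10);
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    int len = (int)(end - beg);
    for (int pad = width - len; pad > 0; pad--) {
        append_byte(b, ' ');
    }
    append(b, beg, len);
}

// Hypothetical helper for checks: does the buffer hold exactly s?
static int holds(const struct buf *b, const char *s)
{
    int i = 0;
    for (; i < b->len && s[i]; i++) {
        if (b->buf[i] != (unsigned char)s[i]) return 0;
    }
    return i == b->len && !s[i];
}

static void demo(void)
{
    unsigned char mem[32];
    struct buf b = {mem, sizeof(mem), 0, 0};
    append_long_width(&b, 42, 5);
    append_byte(&b, '|');
    append_long_width(&b, -7, 3);
    assert(!b.error);
    assert(holds(&b, "   42| -7"));
}
```

<p>A width smaller than the number simply has no effect, so existing call
sites could pass zero.</p>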

<h3 id="float-formatting">Float formatting</h3>

<p>As mentioned before, <a href="https://netlib.org/fp/dtoa.c">precise float formatting is challenging</a>
because it’s full of edge cases. However, if you only need to output a
simple format at reduced precision, it’s not difficult. To illustrate,
this nearly matches <code class="language-plaintext highlighter-rouge">%f</code>, built atop <code class="language-plaintext highlighter-rouge">append_long</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append_double</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">double</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">prec</span> <span class="o">=</span> <span class="mi">1000000</span><span class="p">;</span>  <span class="c1">// i.e. 6 decimals</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="sc">'-'</span><span class="p">);</span>
        <span class="n">x</span> <span class="o">=</span> <span class="o">-</span><span class="n">x</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">x</span> <span class="o">+=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span> <span class="o">/</span> <span class="n">prec</span><span class="p">;</span>  <span class="c1">// round last decimal</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)(</span><span class="o">-</span><span class="mi">1UL</span><span class="o">&gt;&gt;</span><span class="mi">1</span><span class="p">))</span> <span class="p">{</span>  <span class="c1">// out of long range?</span>
        <span class="n">APPEND_STR</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">"inf"</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="kt">long</span> <span class="n">integral</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
        <span class="kt">long</span> <span class="n">fractional</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">integral</span><span class="p">)</span><span class="o">*</span><span class="n">prec</span><span class="p">;</span>
        <span class="n">append_long</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">integral</span><span class="p">);</span>
        <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="sc">'.'</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="n">prec</span><span class="o">/</span><span class="mi">10</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">/=</span> <span class="mi">10</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;</span> <span class="n">fractional</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">append_byte</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="sc">'0'</span><span class="p">);</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">append_long</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">fractional</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
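
<p>To get a feel for its behavior, here is a small hosted sketch that runs
<code class="language-plaintext highlighter-rouge">append_double</code> on a few inputs. The inputs are exact binary fractions
so the expected strings don’t hinge on rounding noise. The <code class="language-plaintext highlighter-rouge">holds</code> and
<code class="language-plaintext highlighter-rouge">demo</code> helpers are hypothetical, and the definitions are repeated so the
block stands alone.</p>

```c
#include <assert.h>

struct buf {
    unsigned char *buf;
    int cap;
    int len;
    _Bool error;
};

static void append(struct buf *b, const unsigned char *src, int len)
{
    int avail = b->cap - b->len;
    int amount = avail<len ? avail : len;
    for (int i = 0; i < amount; i++) {
        b->buf[b->len+i] = src[i];
    }
    b->len += amount;
    b->error |= amount < len;
}

#define APPEND_STR(b, s) append(b, (const unsigned char *)(s), sizeof(s)-1)

static void append_byte(struct buf *b, unsigned char c)
{
    append(b, &c, 1);
}

static void append_long(struct buf *b, long x)
{
    unsigned char tmp[64];
    unsigned char *end = tmp + sizeof(tmp);
    unsigned char *beg = end;
    long t = x>0 ? -x : x;
    do {
        *--beg = (unsigned char)('0' - t%10);
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    append(b, beg, (int)(end - beg));
}

static void append_double(struct buf *b, double x)
{
    long prec = 1000000;  // i.e. 6 decimals

    if (x < 0) {
        append_byte(b, '-');
        x = -x;
    }

    x += 0.5 / prec;  // round last decimal
    if (x >= (double)(-1UL>>1)) {  // out of long range?
        APPEND_STR(b, "inf");
    } else {
        long integral = x;
        long fractional = (x - integral)*prec;
        append_long(b, integral);
        append_byte(b, '.');
        for (long i = prec/10; i > 1; i /= 10) {
            if (i > fractional) {
                append_byte(b, '0');  // leading zeros of the fraction
            }
        }
        append_long(b, fractional);
    }
}

// Hypothetical helper for checks: does the buffer hold exactly s?
static int holds(const struct buf *b, const char *s)
{
    int i = 0;
    for (; i < b->len && s[i]; i++) {
        if (b->buf[i] != (unsigned char)s[i]) return 0;
    }
    return i == b->len && !s[i];
}

static void demo(void)
{
    unsigned char mem[32];
    struct buf b = {mem, sizeof(mem), 0, 0};
    append_double(&b, 0.5);
    assert(holds(&b, "0.500000"));

    b.len = 0;
    append_double(&b, -1.25);
    assert(holds(&b, "-1.250000"));

    b.len = 0;
    append_double(&b, 1e300);  // beyond the range of long
    assert(holds(&b, "inf"));
}
```

<p>Note that anything too large for <code class="language-plaintext highlighter-rouge">long</code> prints as <code class="language-plaintext highlighter-rouge">inf</code>, and NaN is
not handled at all, since converting it to <code class="language-plaintext highlighter-rouge">long</code> is undefined. That is
part of what “nearly” means here.</p>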

<h3 id="output-to-a-handle">Output to a handle</h3>

<p>So far this writes output to a buffer and truncates when it runs out of
space. Usually we want the output going to a sink: a kernel object (file,
pipe, socket, etc.) to which we have a handle, such as a file descriptor.
Instead of truncating, we <em>flush</em> the buffer to this sink, at
which point there’s room for more output. The error flag is set if the
flush fails, but this is essentially the same concept as before.</p>

<p>In these examples I will use a file descriptor <code class="language-plaintext highlighter-rouge">int</code>, but you can use
whatever sort of handle is appropriate. I’ll add an <code class="language-plaintext highlighter-rouge">fd</code> field to the
buffer and a new constructor macro:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define MEMBUF(buf, cap) {buf, cap, 0, -1, 0}
#define FDBUF(fd, buf, cap) {buf, cap, 0, fd, 0}
</span>
<span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">cap</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">error</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The buffered stream will be polymorphic: Output can go to a memory buffer
or to an operating system handle using the same append interface. This is
a handy feature standard C doesn’t even have, though POSIX does in the
form of <a href="https://man7.org/linux/man-pages/man3/fmemopen.3.html"><code class="language-plaintext highlighter-rouge">fmemopen</code></a>. Nothing else changes except <code class="language-plaintext highlighter-rouge">append</code>,
which, if given a valid handle, will flush when full. Attempting to flush
a memory buffer sets the error flag.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">os_write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>

<span class="kt">void</span> <span class="nf">flush</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">|=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">fd</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">&amp;&amp;</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">|=</span> <span class="o">!</span><span class="n">os_write</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">fd</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">);</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’ve arranged it so that output stops when there’s an error. Also I’m using a
hypothetical <code class="language-plaintext highlighter-rouge">os_write</code> in the platform layer as a full, unbuffered write.
Note that unix <code class="language-plaintext highlighter-rouge">write(2)</code> may perform partial writes and so must be used
in a loop. Win32 <code class="language-plaintext highlighter-rouge">WriteFile</code> doesn’t have partial writes, so on Windows an
<code class="language-plaintext highlighter-rouge">os_write</code> could pass its arguments directly to the operating system.</p>
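<p>A minimal sketch of such a loop, assuming a POSIX target. This is one
plausible shape for the hypothetical <code class="language-plaintext highlighter-rouge">os_write</code>, not a definitive one. The
choices to retry on <code class="language-plaintext highlighter-rouge">EINTR</code> and to treat a zero-byte write as failure
(so the loop cannot spin forever) are assumptions of this sketch.</p>

```c
#include <errno.h>
#include <unistd.h>

// Hypothetical POSIX os_write: keep calling write(2) until every byte
// has been accepted, or report failure.
_Bool os_write(int fd, void *buf, int len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t r = write(fd, p, (size_t)len);
        if (r > 0) {
            p   += r;       // partial write: advance past what was taken
            len -= (int)r;
        } else if (r < 0 && errno == EINTR) {
            continue;       // interrupted before writing anything, retry
        } else {
            return 0;       // real error, or no progress
        }
    }
    return 1;
}
```

<p>The return value follows the same convention <code class="language-plaintext highlighter-rouge">flush</code> expects: true
on success, false on failure.</p>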

<p>The program will need to call <code class="language-plaintext highlighter-rouge">flush</code> directly when it’s done writing
output, or to display output early, e.g. line buffering. In <code class="language-plaintext highlighter-rouge">append</code> we’ll
use a loop to continue appending and flushing until the input is consumed
or an error occurs.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">append</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">end</span> <span class="o">=</span> <span class="n">src</span> <span class="o">+</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">error</span> <span class="o">&amp;&amp;</span> <span class="n">src</span><span class="o">&lt;</span><span class="n">end</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">left</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">src</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">avail</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">cap</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">amount</span> <span class="o">=</span> <span class="n">avail</span><span class="o">&lt;</span><span class="n">left</span> <span class="o">?</span> <span class="n">avail</span> <span class="o">:</span> <span class="n">left</span><span class="p">;</span>

        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">amount</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">[</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">src</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">amount</span><span class="p">;</span>
        <span class="n">src</span> <span class="o">+=</span> <span class="n">amount</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">amount</span> <span class="o">&lt;</span> <span class="n">left</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">flush</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That completes formatted output! We can now do stuff like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">mem</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">10</span><span class="p">];</span>  <span class="c1">// arbitrarily-chosen 1kB buffer</span>
    <span class="k">struct</span> <span class="n">buf</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">FDBUF</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">mem</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">mem</span><span class="p">));</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">1000000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">APPEND_STR</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stdout</span><span class="p">,</span> <span class="s">"iteration "</span><span class="p">);</span>
        <span class="n">append_long</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stdout</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="n">append_byte</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stdout</span><span class="p">,</span> <span class="sc">'\n'</span><span class="p">);</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
    <span class="n">flush</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stdout</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">stdout</span><span class="p">.</span><span class="n">error</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Except for the lack of a format DSL, this should feel familiar.</p>
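<p>The line buffering mentioned earlier could be layered on top as a thin
wrapper. The following is only a sketch: <code class="language-plaintext highlighter-rouge">append_line_buffered</code> is a
made-up name, <code class="language-plaintext highlighter-rouge">os_write</code> is replaced by a counting stand-in so the
example runs anywhere, and <code class="language-plaintext highlighter-rouge">flush</code> and <code class="language-plaintext highlighter-rouge">append</code> are repeated from
above to keep it self-contained.</p>

```c
struct buf {
    unsigned char *buf;
    int cap;
    int len;
    int fd;
    _Bool error;
};

// Stand-in for the platform layer: counts calls and bytes instead of
// performing a real write.
int os_write_calls, os_write_bytes;
_Bool os_write(int fd, void *p, int len)
{
    (void)fd; (void)p;
    os_write_calls++;
    os_write_bytes += len;
    return 1;
}

void flush(struct buf *b)
{
    b->error |= b->fd < 0;
    if (!b->error && b->len) {
        b->error |= !os_write(b->fd, b->buf, b->len);
        b->len = 0;
    }
}

void append(struct buf *b, unsigned char *src, int len)
{
    unsigned char *end = src + len;
    while (!b->error && src < end) {
        int left = (int)(end - src);
        int avail = b->cap - b->len;
        int amount = avail < left ? avail : left;
        for (int i = 0; i < amount; i++) {
            b->buf[b->len+i] = src[i];
        }
        b->len += amount;
        src += amount;
        if (amount < left) {
            flush(b);
        }
    }
}

// Hypothetical helper: append as usual, then flush if the appended
// data contained a newline anywhere.
void append_line_buffered(struct buf *b, unsigned char *src, int len)
{
    append(b, src, len);
    for (int i = 0; i < len; i++) {
        if (src[i] == '\n') {
            flush(b);
            break;
        }
    }
}
```

<p>Output without a newline stays in the buffer, and the first newline
pushes everything accumulated so far to the handle in one write.</p>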

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Let's write a setjmp</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/02/12/"/>
    <id>urn:uuid:ab83cc5d-7877-4cba-98e4-d36059297ead</id>
    <updated>2023-02-12T02:23:11Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=34760828">on Hacker News</a>.</em></p>

<p>Yesterday I wrote that <a href="/blog/2023/02/11/"><code class="language-plaintext highlighter-rouge">setjmp</code> is handy</a> and that it would be nice
to have without linking the C standard library. It’s conceptually simple,
after all. Today let’s explore some differently-portable implementation
possibilities with distinct trade-offs. At the very least it should
illuminate why <code class="language-plaintext highlighter-rouge">setjmp</code> sometimes requires the use of <code class="language-plaintext highlighter-rouge">volatile</code>.</p>

<!--more-->

<p>First, a quick review: <code class="language-plaintext highlighter-rouge">setjmp</code> and <code class="language-plaintext highlighter-rouge">longjmp</code> are a form of <em>non-local
goto</em>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="kt">int</span> <span class="nf">setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">longjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">setjmp</code> saves the execution context in a <code class="language-plaintext highlighter-rouge">jmp_buf</code>, and <code class="language-plaintext highlighter-rouge">longjmp</code>
restores this context, returning the thread to this previous point of
execution. This means <code class="language-plaintext highlighter-rouge">setjmp</code> returns twice: (1) after saving the
context, and (2) from <code class="language-plaintext highlighter-rouge">longjmp</code>. To distinguish these cases, the first
time it returns zero and the second time it returns the value passed to
<code class="language-plaintext highlighter-rouge">longjmp</code>.</p>
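<p>To make the two returns concrete, here’s a small sketch using the
standard <code class="language-plaintext highlighter-rouge">&lt;setjmp.h&gt;</code>. The function names and error codes are invented
for illustration.</p>

```c
#include <setjmp.h>

enum { ERR_PARSE = 1, ERR_IO = 2 };  // hypothetical error codes

static jmp_buf env;

// Somewhere deep in a call chain: abandon the work and jump back.
static void work(int fault)
{
    if (fault) {
        longjmp(env, fault);  // does not return
    }
}

const char *attempt(int fault)
{
    switch (setjmp(env)) {
    case 0:          // first return: context saved, do the work
        work(fault);
        return "ok";
    case ERR_PARSE:  // second return: value came from longjmp
        return "parse error";
    case ERR_IO:
        return "I/O error";
    default:
        return "unknown error";
    }
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">attempt(0)</code> takes only the first return, while
<code class="language-plaintext highlighter-rouge">attempt(ERR_IO)</code> passes through <code class="language-plaintext highlighter-rouge">setjmp</code> twice.</p>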

<p><code class="language-plaintext highlighter-rouge">jmp_buf</code> is an array of some platform-specific type and length. I’ll be
using void pointers in this article because it’s a register-sized type
that isn’t behind a typedef. Plus they print nicely in GDB as hexadecimal
addresses, which helped while working all this out.</p>

<h3 id="using-gcc-intrinsics">Using GCC intrinsics</h3>

<p>Let’s start with the easiest option. <a href="https://gcc.gnu.org/onlinedocs/gcc/Nonlocal-Gotos.html">GCC has two intrinsics</a> doing
all the hard work for us: <code class="language-plaintext highlighter-rouge">__builtin_setjmp</code> and <code class="language-plaintext highlighter-rouge">__builtin_longjmp</code>. Its
worst case <code class="language-plaintext highlighter-rouge">jmp_buf</code> is length 5, but the most popular architectures only
use the first 3 elements. Clang supports these intrinsics as well for GCC
compatibility.</p>

<p>Be mindful that the semantics are slightly different from the standard C
definition, namely that you cannot use <code class="language-plaintext highlighter-rouge">longjmp</code> from the same function as
<code class="language-plaintext highlighter-rouge">setjmp</code>. It also doesn’t touch the signal mask. However, it’s easier to
use and you don’t need to worry about <code class="language-plaintext highlighter-rouge">volatile</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NOTE to copy-pasters: semantics differ slightly from standard C</span>
<span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">5</span><span class="p">];</span>
<span class="cp">#define setjmp __builtin_setjmp
#define longjmp __builtin_longjmp
</span></code></pre></div></div>

<p>If you only care about GCC and/or Clang, then that’s it! It works as-is on
every supported target and nothing more is needed. As a bonus, it will be
more efficient than the libc version, though I should hope that won’t
matter in practice. These are so awesome and convenient that I’m already
second-guessing myself: “Do I <em>really</em> need to support other compilers…?”</p>
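<p>A short usage sketch, assuming GCC or Clang. It calls the intrinsics
directly rather than through the macros, and the names are hypothetical.
Two constraints show up here: the jump happens in a different function
than the <code class="language-plaintext highlighter-rouge">__builtin_setjmp</code>, and the value passed to
<code class="language-plaintext highlighter-rouge">__builtin_longjmp</code> is the constant 1, which is the only value these
compilers accept.</p>

```c
static void *env[5];  // worst-case size for the intrinsics

// Kept out of line so the jump never ends up in the same function
// as the __builtin_setjmp.
__attribute__((noinline))
static void bail(void)
{
    __builtin_longjmp(env, 1);
}

int guarded(int fail)
{
    if (__builtin_setjmp(env)) {
        return -1;  // second return, via bail()
    }
    if (fail) {
        bail();
    }
    return 0;       // normal completion
}
```

<p>Since the only possible second return value is 1, any error code would
need to travel some other way, such as a field next to the buffer.</p>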

<h3 id="using-assembly">Using assembly</h3>

<p>If I want to support more compilers I’ll need to write it myself. It’s
also an excuse to dig into the details. The execution context is no more
than an array of saved registers, and <code class="language-plaintext highlighter-rouge">longjmp</code> is merely restoring those
registers. One of the registers is the instruction pointer, and setting
the instruction pointer is called a jump.</p>

<p>Since we’re talking about registers, that means assembly. We’ll also need
to know the target’s calling convention, so this really narrows things
down. This implementation will target x86-64 (a.k.a. x64) Windows, <em>but</em> it
will support MSVC as an additional compiler. So it’s a different kind of
portability. I’ll start with GCC via <a href="https://github.com/skeeto/w64devkit">w64devkit</a> then massage it into
something MSVC can use.</p>

<p>I mentioned before that <code class="language-plaintext highlighter-rouge">setjmp</code> returns twice. So to return a second time
we just need to <em>simulate</em> a normal function return. Obviously that
includes restoring the stack pointer like the <code class="language-plaintext highlighter-rouge">ret</code> instruction, but it
also means restoring all the non-volatile registers a callee is supposed to
preserve. These will all go in the execution context.</p>

<p>The <a href="https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention">x64 calling convention</a> specifies 9 non-volatile registers: <code class="language-plaintext highlighter-rouge">rsp</code>, <code class="language-plaintext highlighter-rouge">rbp</code>,
<code class="language-plaintext highlighter-rouge">rbx</code>, <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">r12</code>, <code class="language-plaintext highlighter-rouge">r13</code>, <code class="language-plaintext highlighter-rouge">r14</code>, and <code class="language-plaintext highlighter-rouge">r15</code>. We’ll also need the
instruction pointer, <code class="language-plaintext highlighter-rouge">rip</code>, making it 10 total.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span>
</code></pre></div></div>

<h4 id="setjmp-assembly">setjmp assembly</h4>

<p>The tricky issue is that we need to save the registers immediately inside
<code class="language-plaintext highlighter-rouge">setjmp</code> before the compiler has manipulated them in a function prologue.
That will take more than mere inline assembly. We’ll start with a <em>naked</em>
function, which means that GCC will not create a prologue or epilogue.
However, that means no local variables, and the function body will be
limited to inline assembly, including a <code class="language-plaintext highlighter-rouge">ret</code> instruction for the
epilogue.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="kr">naked</span><span class="p">))</span>
<span class="kt">int</span> <span class="nf">setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span> <span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span><span class="p">(</span>
        <span class="c1">// ...</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The x64 calling convention uses <code class="language-plaintext highlighter-rouge">rcx</code> for the first pointer argument, so
that’s where we’ll find <code class="language-plaintext highlighter-rouge">buf</code>. I’ve arbitrarily decided to store <code class="language-plaintext highlighter-rouge">rip</code>
first, then the other registers in order. However, the current value of
<code class="language-plaintext highlighter-rouge">rip</code> isn’t the one we need. The <code class="language-plaintext highlighter-rouge">rip</code> we need was just pushed on top of
the stack by the caller. I’ll read that off the stack into a scratch
register, <code class="language-plaintext highlighter-rouge">rax</code>, and then store it in the first element of <code class="language-plaintext highlighter-rouge">buf</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rax</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rax</span><span class="p">,</span>  <span class="mi">0</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>The stack pointer, <code class="language-plaintext highlighter-rouge">rsp</code>, is also indirect since I want the pointer just
before <code class="language-plaintext highlighter-rouge">rip</code> was pushed, as it would be just after a <code class="language-plaintext highlighter-rouge">ret</code>. I use a <code class="language-plaintext highlighter-rouge">lea</code>,
<em>load effective address</em>, to add 8 bytes (recall: stack grows down),
placing the result in a scratch register, then write it into the second
element of <code class="language-plaintext highlighter-rouge">buf</code> (i.e. 8 bytes into <code class="language-plaintext highlighter-rouge">%rcx</code>).</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">lea</span> <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rax</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rax</span><span class="p">,</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>Everything else is a matter of elbow grease.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rbp</span><span class="p">,</span> <span class="mi">16</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rbx</span><span class="p">,</span> <span class="mi">24</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rdi</span><span class="p">,</span> <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nb">rsi</span><span class="p">,</span> <span class="mi">40</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nv">r12</span><span class="p">,</span> <span class="mi">48</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nv">r13</span><span class="p">,</span> <span class="mi">56</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nv">r14</span><span class="p">,</span> <span class="mi">64</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
    <span class="nf">mov</span> <span class="o">%</span><span class="nv">r15</span><span class="p">,</span> <span class="mi">72</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>With all work complete, return zero to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">xor</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Putting it all together, and avoiding a <code class="language-plaintext highlighter-rouge">-Wunused-variable</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="kr">naked</span><span class="p">,</span><span class="n">returns_twice</span><span class="p">))</span>
<span class="kt">int</span> <span class="nf">setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span> <span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
    <span class="kr">__asm</span><span class="p">(</span>
        <span class="s">"mov (%rsp), %rax</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rax,  0(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"lea 8(%rsp), %rax</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rax,  8(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rbp, 16(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rbx, 24(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rdi, 32(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %rsi, 40(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %r12, 48(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %r13, 56(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %r14, 64(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %r15, 72(%rcx)</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"xor %eax, %eax</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"ret</span><span class="se">\n</span><span class="s">"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Also take note of the <code class="language-plaintext highlighter-rouge">returns_twice</code> attribute. It informs GCC of this
function’s unusual nature, saying the function <em>doesn’t</em> preserve most
non-volatile registers, and induces <code class="language-plaintext highlighter-rouge">-Wclobbered</code> diagnostics. Technically
this means we could get away with saving only <code class="language-plaintext highlighter-rouge">rip</code>, <code class="language-plaintext highlighter-rouge">rsp</code>, and <code class="language-plaintext highlighter-rouge">rbp</code> —
exactly as <code class="language-plaintext highlighter-rouge">__builtin_setjmp</code> does — but we’ll need the others for MSVC
anyway.</p>
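<p>With ten slots and hand-written offsets, it may help to write the layout
down once. The slot names below are hypothetical and not used by the
assembly. They only document the order chosen above, where slot N sits at
byte offset 8*N.</p>

```c
typedef void *jmp_buf[10];

// Hypothetical names for the slots, in the order the assembly stores
// them: rip first, then rsp, then the remaining non-volatile registers.
enum {
    JB_RIP, JB_RSP, JB_RBP, JB_RBX, JB_RDI,
    JB_RSI, JB_R12, JB_R13, JB_R14, JB_R15,
    JB_COUNT
};

// Catch a mismatch between the typedef and the list at compile time.
_Static_assert(JB_COUNT == sizeof(jmp_buf)/sizeof(void *),
               "one slot per saved register");
```

<p>In GDB, printing <code class="language-plaintext highlighter-rouge">buf[JB_RSP]</code> is then a little friendlier than
remembering that the stack pointer lives at offset 8.</p>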

<h4 id="longjmp-assembly">longjmp assembly</h4>

<p>In <code class="language-plaintext highlighter-rouge">longjmp</code> we need to restore all those registers. For purely aesthetic
reasons I’ve decided to do it in reverse order. Everything but <code class="language-plaintext highlighter-rouge">rip</code> is
easy.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="mi">72</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r15</span>
    <span class="nf">mov</span> <span class="mi">64</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r14</span>
    <span class="nf">mov</span> <span class="mi">56</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r13</span>
    <span class="nf">mov</span> <span class="mi">48</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r12</span>
    <span class="nf">mov</span> <span class="mi">40</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rsi</span>
    <span class="nf">mov</span> <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rdi</span>
    <span class="nf">mov</span> <span class="mi">24</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rbx</span>
    <span class="nf">mov</span> <span class="mi">16</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rbp</span>
    <span class="nf">mov</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rsp</span>
</code></pre></div></div>

<p>The instruction set doesn’t have direct access to <code class="language-plaintext highlighter-rouge">rip</code>. It will be a
<code class="language-plaintext highlighter-rouge">jmp</code> instead of <code class="language-plaintext highlighter-rouge">mov</code>, but before jumping we’ll need to prepare the
return value. The x64 calling convention says the second argument is
passed in <code class="language-plaintext highlighter-rouge">rdx</code>, so move that to <code class="language-plaintext highlighter-rouge">rax</code>, then <code class="language-plaintext highlighter-rouge">jmp</code> to the caller. It’s
only a 32-bit operand, C <code class="language-plaintext highlighter-rouge">int</code>, so <code class="language-plaintext highlighter-rouge">edx</code> instead of <code class="language-plaintext highlighter-rouge">rdx</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="o">%</span><span class="nb">edx</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
    <span class="nf">jmp</span> <span class="o">*</span><span class="mi">0</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>Putting it all together, and adding the <code class="language-plaintext highlighter-rouge">noreturn</code> attribute:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="kr">naked</span><span class="p">,</span><span class="n">noreturn</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">longjmp</span><span class="p">(</span><span class="kt">jmp_buf</span> <span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">ret</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ret</span><span class="p">;</span>
    <span class="kr">__asm</span><span class="p">(</span>
        <span class="s">"mov 72(%rcx), %r15</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 64(%rcx), %r14</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 56(%rcx), %r13</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 48(%rcx), %r12</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 40(%rcx), %rsi</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 32(%rcx), %rdi</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 24(%rcx), %rbx</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov 16(%rcx), %rbp</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov  8(%rcx), %rsp</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov %edx, %eax</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"jmp *0(%rcx)</span><span class="se">\n</span><span class="s">"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The C standard says that if <code class="language-plaintext highlighter-rouge">ret</code> is zero then <code class="language-plaintext highlighter-rouge">longjmp</code> will return 1
from <code class="language-plaintext highlighter-rouge">setjmp</code> instead. I leave that detail as a reader exercise. Otherwise
this is a complete, working <code class="language-plaintext highlighter-rouge">setjmp</code>. It works perfectly when I swap it in
for <code class="language-plaintext highlighter-rouge">setjmp.h</code> in <a href="https://github.com/skeeto/u-config/blob/master/test_main.c">my u-config test suite</a>.</p>
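
<p>As a hint for that exercise, the fix amounts to substituting 1 whenever the caller passes zero. Below is a minimal sketch in C. The names <code class="language-plaintext highlighter-rouge">fix_ret</code> and <code class="language-plaintext highlighter-rouge">libc_zero_becomes_one</code> are hypothetical and not part of the implementation above, and the assembly in the comment is an untested suggestion for folding the same logic into the hand-written <code class="language-plaintext highlighter-rouge">longjmp</code>. The second function only checks the behavior of the standard library version.</p>

```c
#include <setjmp.h>

// Hypothetical helper: the value setjmp should yield for a given longjmp
// argument. Zero is reserved for the first return, so it becomes 1.
int fix_ret(int ret)
{
    return ret ? ret : 1;
}

// In the hand-written longjmp, something like this could stand in for
// "mov %edx, %eax" (untested sketch):
//     mov    $1, %eax
//     test   %edx, %edx
//     cmovnz %edx, %eax

// Check the standard library: does longjmp(buf, 0) make setjmp return 1?
int libc_zero_becomes_one(void)
{
    jmp_buf buf;
    switch (setjmp(buf)) {
    case 0:  longjmp(buf, 0);  // first return: jump straight back with zero
    case 1:  return 1;         // second return: zero was replaced by 1
    default: return 0;         // any other value would be a surprise
    }
}
```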

<h3 id="considering-volatile">Considering volatile</h3>

<p>Now that you’ve seen the guts, let’s talk about <code class="language-plaintext highlighter-rouge">volatile</code> and why it’s
necessary. Consider this function, <code class="language-plaintext highlighter-rouge">example</code>, which calls a <code class="language-plaintext highlighter-rouge">work</code>
function that may return through <code class="language-plaintext highlighter-rouge">setjmp</code> (e.g. on failure).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">work</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">example</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">jmp_buf</span> <span class="n">buf</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">setjmp</span><span class="p">(</span><span class="n">buf</span><span class="p">))</span> <span class="p">{</span>
        <span class="c1">// first return</span>
        <span class="n">r</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">work</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="c1">// second return</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It stores to <code class="language-plaintext highlighter-rouge">r</code> after the first <code class="language-plaintext highlighter-rouge">setjmp</code> return, then loads <code class="language-plaintext highlighter-rouge">r</code> after the
second <code class="language-plaintext highlighter-rouge">setjmp</code> return. However, <code class="language-plaintext highlighter-rouge">r</code> may have been stored in the execution
context. Since it’s used across function calls, it would be reasonable to
store this variable in a non-volatile register like <code class="language-plaintext highlighter-rouge">ebx</code>. If so, it will be
restored to its value at the moment of the first call to <code class="language-plaintext highlighter-rouge">setjmp</code>, in
which case the <em>old</em> <code class="language-plaintext highlighter-rouge">r</code> would be read after restoration by <code class="language-plaintext highlighter-rouge">longjmp</code>. If
it’s not stored in a register, but on the stack, then on the second return
the function will read the latest value out of the stack. In practice, if
<code class="language-plaintext highlighter-rouge">work</code> returns through <code class="language-plaintext highlighter-rouge">longjmp</code>, this function may return either 0 or 1,
probably determined by the optimization level.</p>

<p>The solution is to qualify <code class="language-plaintext highlighter-rouge">r</code> with <code class="language-plaintext highlighter-rouge">volatile</code>, which forces the compiler
to store the variable on the stack and never cache it in a register.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">volatile</span> <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>Though since our <code class="language-plaintext highlighter-rouge">setjmp</code> is marked <code class="language-plaintext highlighter-rouge">returns_twice</code>, GCC will never store
<code class="language-plaintext highlighter-rouge">r</code> in a register across <code class="language-plaintext highlighter-rouge">setjmp</code> calls. This potentially hides a bug in
the program that would occur under some other compilers, but GCC will
(usually) warn about it.</p>
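
<p>For reference, here is the whole example with the qualifier applied, written against the standard <code class="language-plaintext highlighter-rouge">setjmp.h</code> rather than the implementation above. The body of <code class="language-plaintext highlighter-rouge">work</code> is an assumption for illustration: it always fails and jumps back.</p>

```c
#include <setjmp.h>

// Stand-in for real work: always "fails" and returns through longjmp.
void work(jmp_buf buf)
{
    longjmp(buf, 1);
}

int example(void)
{
    volatile int r = 0;  // volatile: modified between setjmp and longjmp
    jmp_buf buf;
    if (!setjmp(buf)) {
        // first return
        r = 1;
        work(buf);
    } else {
        // second return: r must be re-read from memory here
    }
    return r;
}
```

<p>With the qualifier in place, <code class="language-plaintext highlighter-rouge">example</code> returns 1 regardless of optimization level, since the store before the jump must be visible to the load after it.</p>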

<h3 id="pure-assembly-and-msvc">Pure assembly and MSVC</h3>

<p>MSVC doesn’t understand <code class="language-plaintext highlighter-rouge">__attribute__</code> nor the inline assembly, so it
cannot compile these functions. I could compile my <code class="language-plaintext highlighter-rouge">setjmp</code> with GCC and
the rest of the program with MSVC, which means I need two compilers.
Instead, I’ll move to pure assembly and assemble with GNU <code class="language-plaintext highlighter-rouge">as</code> (TODO: port
to MASM?), so I’ll only need a tiny piece of the GNU toolchain.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	<span class="nf">.global</span> <span class="nv">setjmp</span>
<span class="nl">setjmp:</span>
	<span class="nf">mov</span> <span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rax</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rax</span><span class="p">,</span>  <span class="mi">0</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">lea</span> <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rsp</span><span class="p">),</span> <span class="o">%</span><span class="nb">rax</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rax</span><span class="p">,</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rbp</span><span class="p">,</span> <span class="mi">16</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rbx</span><span class="p">,</span> <span class="mi">24</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rdi</span><span class="p">,</span> <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">rsi</span><span class="p">,</span> <span class="mi">40</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nv">r12</span><span class="p">,</span> <span class="mi">48</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nv">r13</span><span class="p">,</span> <span class="mi">56</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nv">r14</span><span class="p">,</span> <span class="mi">64</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nv">r15</span><span class="p">,</span> <span class="mi">72</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
	<span class="nf">xor</span> <span class="o">%</span><span class="nb">eax</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
	<span class="nf">ret</span>

	<span class="nf">.global</span> <span class="nv">longjmp</span>
<span class="nl">longjmp:</span>
	<span class="nf">mov</span> <span class="mi">72</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r15</span>
	<span class="nf">mov</span> <span class="mi">64</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r14</span>
	<span class="nf">mov</span> <span class="mi">56</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r13</span>
	<span class="nf">mov</span> <span class="mi">48</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nv">r12</span>
	<span class="nf">mov</span> <span class="mi">40</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rsi</span>
	<span class="nf">mov</span> <span class="mi">32</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rdi</span>
	<span class="nf">mov</span> <span class="mi">24</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rbx</span>
	<span class="nf">mov</span> <span class="mi">16</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rbp</span>
	<span class="nf">mov</span>  <span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">),</span> <span class="o">%</span><span class="nb">rsp</span>
	<span class="nf">mov</span> <span class="o">%</span><span class="nb">edx</span><span class="p">,</span> <span class="o">%</span><span class="nb">eax</span>
	<span class="nf">jmp</span> <span class="o">*</span><span class="mi">0</span><span class="p">(</span><span class="o">%</span><span class="nb">rcx</span><span class="p">)</span>
</code></pre></div></div>

<p>Then some declarations in C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span>
<span class="kt">int</span> <span class="nf">setjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">);</span>
<span class="k">_Noreturn</span> <span class="kt">void</span> <span class="nf">longjmp</span><span class="p">(</span><span class="kt">jmp_buf</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
</code></pre></div></div>

<p>I’ll need to enable C11 for that <code class="language-plaintext highlighter-rouge">_Noreturn</code> in MSVC. Assemble, compile,
and link:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ as -o setjmp.obj setjmp.s
$ cl /std:c11 program.c setjmp.obj
</code></pre></div></div>

<p>That generally works! If I rename to <code class="language-plaintext highlighter-rouge">xsetjmp</code> and <code class="language-plaintext highlighter-rouge">xlongjmp</code> to avoid
conflicting with the CRT definitions, drop them into the u-config test
suite in place of <code class="language-plaintext highlighter-rouge">setjmp.h</code>, then compile with MSVC, it passes all tests
using my alternate implementation in MSVC as well as GCC. Pretty cool!</p>

<h3 id="takeaway">Takeaway</h3>

<p>I’m not sure if I’ll ever use the assembly, but writing this article led
me to try the GCC intrinsics, and I’m so impressed I’m still thinking
about ways I can use them. My main thought is out-of-memory situations in
arena allocators, using a non-local exit to roll back to a savepoint, even
if just to return an error. This is nicer than either terminating the
program or handling OOM errors on every allocation. Very roughly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">cap</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">off</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="kt">jmp_buf</span><span class="p">[</span><span class="mi">5</span><span class="p">];</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="c1">// Place an arena and savepoint an out-of-memory jump.</span>
<span class="cp">#define OOM(a, m, n) __builtin_setjmp((a = place(m, n))-&gt;jmp_buf)
</span>
<span class="c1">// Place a new arena at the front of the buffer.</span>
<span class="n">Arena</span> <span class="o">*</span><span class="nf">place</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">mem</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">size</span> <span class="o">&gt;=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">Arena</span><span class="p">));</span>
    <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">mem</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">cap</span> <span class="o">=</span> <span class="n">size</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">Arena</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">avail</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">cap</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">avail</span> <span class="o">&lt;</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">__builtin_longjmp</span><span class="p">(</span><span class="n">a</span><span class="o">-&gt;</span><span class="kt">jmp_buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span> <span class="o">+</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Usage would look like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">compute</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">workmem</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">memsize</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Arena</span> <span class="o">*</span><span class="n">arena</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">OOM</span><span class="p">(</span><span class="n">arena</span><span class="p">,</span> <span class="n">workmem</span><span class="p">,</span> <span class="n">memsize</span><span class="p">))</span> <span class="p">{</span>
        <span class="c1">// jumps here when out of memory</span>
        <span class="k">return</span> <span class="n">COMPUTE_OOM</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">Thing</span> <span class="o">*</span><span class="n">t</span> <span class="o">=</span> <span class="n">PUSHSTRUCT</span><span class="p">(</span><span class="n">arena</span><span class="p">,</span> <span class="n">Thing</span><span class="p">);</span>
    <span class="c1">// ...</span>

    <span class="k">return</span> <span class="n">COMPUTE_OK</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
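
<p>To make the rough outline above something that compiles and runs, here is one possible completion. It is a sketch under several assumptions: it swaps the GCC intrinsics for the standard <code class="language-plaintext highlighter-rouge">setjmp</code>/<code class="language-plaintext highlighter-rouge">longjmp</code> so that it builds with any hosted compiler, and <code class="language-plaintext highlighter-rouge">PUSHSTRUCT</code>, <code class="language-plaintext highlighter-rouge">Thing</code>, the <code class="language-plaintext highlighter-rouge">COMPUTE_*</code> codes, and <code class="language-plaintext highlighter-rouge">run_with</code> are hypothetical stand-ins for names the outline leaves undefined. Like the outline, it ignores alignment of individual allocations.</p>

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

enum { COMPUTE_OK, COMPUTE_OOM };  // hypothetical result codes

typedef struct {
    size_t  cap;
    size_t  off;
    jmp_buf oom;  // standard jmp_buf in place of the 5-pointer GCC buffer
} Arena;

// Place an arena and savepoint an out-of-memory jump.
#define OOM(a, m, n) setjmp(((a) = place(m, n))->oom)

// Hypothetical helper: allocate one T out of the arena.
#define PUSHSTRUCT(a, T) ((T *)alloc(a, sizeof(T)))

typedef struct { double x, y, z; } Thing;  // stand-in payload

// Place a new arena at the front of the buffer.
Arena *place(void *mem, size_t size)
{
    assert(size >= sizeof(Arena));
    Arena *a = mem;
    a->cap = size;
    a->off = sizeof(Arena);
    return a;
}

void *alloc(Arena *a, size_t size)
{
    size_t avail = a->cap - a->off;
    if (avail < size) {
        longjmp(a->oom, 1);  // unwind to the savepoint
    }
    void *p = (char *)a + a->off;
    a->off += size;
    return p;
}

int compute(void *workmem, size_t memsize)
{
    Arena *arena;
    if (OOM(arena, workmem, memsize)) {
        return COMPUTE_OOM;  // jumps here when out of memory
    }
    Thing *t = PUSHSTRUCT(arena, Thing);
    t->x = t->y = t->z = 0;
    return COMPUTE_OK;
}

// Hypothetical driver: run compute() on a malloc'd buffer of memsize bytes.
int run_with(size_t memsize)
{
    void *mem = malloc(memsize);
    if (!mem) return COMPUTE_OOM;
    int r = compute(mem, memsize);
    free(mem);
    return r;
}
```

<p>A buffer that only fits the <code class="language-plaintext highlighter-rouge">Arena</code> header makes <code class="language-plaintext highlighter-rouge">compute</code> take the out-of-memory path, while one with room for a <code class="language-plaintext highlighter-rouge">Thing</code> succeeds.</p>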

<p>More granular snapshots can be made further down the stack by allocating
subarenas out of the main arena. I have yet to try this out in a practical
program.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>My review of the C standard library in practice</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/02/11/"/>
    <id>urn:uuid:31a77d1d-219c-4677-995a-8e869f9ab610</id>
    <updated>2023-02-11T03:04:11Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=34752400">on Hacker News</a> and critiqued <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/CaseForAtomicTypes">on
Wandering Thoughts</a>.</em></p>

<p>In general, when working in C I avoid the standard library, libc, as much
as possible. If possible I won’t even link it. For people not used to
working and thinking this way, the typical response is confusion. Isn’t
that like re-inventing the wheel? For me, <em>libc is a wheel barely worth
using</em> — too many deficiencies in both interface and implementation.
Fortunately, it’s easy to build a better, simpler wheel when you know the
terrain ahead of time. In this article I’ll review the functions and
function-like macros of the C standard library and discuss practical
issues I’ve faced with them.</p>

<!--more-->

<p>Fortunately the flexibility of C-in-practice makes up for the standard
library. I already have all the tools at hand to do what I need — not
beholden to any runtime.</p>

<p>How does one write portable software while relying little on libc?
Implement the bulk of the program as platform-agnostic, libc-free code
then write platform-specific code per target — a platform layer — each in
its own source file. The platform code is small in comparison: mostly
unportable code, perhaps <a href="/blog/2015/05/15/">raw system calls</a>, graphics functions, or
even assembly. It’s where you get access to all the coolest toys. On some
platforms it will still link libc anyway because it’s got useful
platform-specific features, or because it’s mandatory.</p>

<p>The discussion below is specifically about standard C. Some platforms
provide special workarounds for their standard function shortcomings, but
that’s irrelevant. If I need to use a non-standard function then I’m
already writing platform-specific code and I might as well take full
advantage of that fact, bypassing the original issue entirely by calling
directly into the platform.</p>

<p>The rest of this article goes through the standard library listing in <a href="https://web.archive.org/web/20181230041359if_/http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf">the
C18 draft</a> mostly in order.</p>

<h3 id="assert-and-abort">assert and abort</h3>

<p>I <a href="/blog/2022/06/26/">wrote about the <code class="language-plaintext highlighter-rouge">assert</code> macro last year</a>. While C assertions
are better than the same in any other language I know — a trap <em>without
first unwinding the stack</em> — the typical implementation doesn’t have the
courtesy to trap in the macro itself, creating friction. Or worse, it
doesn’t trap at all and instead exits the process normally with a non-zero
status. It’s not optimized for debuggers.</p>

<p>My non-trivial programs quickly pick up this definition instead, adjusted
later as needed:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define ASSERT(c) if (!(c)) __builtin_trap()
</span></code></pre></div></div>

<p>There’s no diagnostic, but I usually don’t want that anyway. The vast
majority of the time these are caught in a debugger, and I don’t need or
want a diagnostic.</p>
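
<p>One caveat with a macro shaped like a bare <code class="language-plaintext highlighter-rouge">if</code> is that it can capture an <code class="language-plaintext highlighter-rouge">else</code> that follows it. A possible variant, sketched here under the assumption of GCC or Clang with a fallback to the MSVC <code class="language-plaintext highlighter-rouge">__debugbreak</code> intrinsic, wraps the check so that it behaves as a single statement. The names <code class="language-plaintext highlighter-rouge">TRAP</code> and <code class="language-plaintext highlighter-rouge">clamp_index</code> are made up for the illustration.</p>

```c
#if defined(_MSC_VER)
#  include <intrin.h>
#  define TRAP() __debugbreak()
#else
#  define TRAP() __builtin_trap()
#endif

// do/while(0) makes the macro a single statement, so it cannot pair up
// with a following "else" the way a bare "if" would.
#define ASSERT(c) do { if (!(c)) TRAP(); } while (0)

// Hypothetical caller: the else belongs to the outer if, as it reads.
// With a bare-if ASSERT it would attach to the macro's hidden if instead.
int clamp_index(int i, int len)
{
    if (len > 0)
        ASSERT(i >= 0);
    else
        return -1;
    return i < len ? i : len - 1;
}
```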

<p>I have no objections to <code class="language-plaintext highlighter-rouge">static_assert</code>, but it’s also not part of the
runtime.</p>

<h3 id="math-functions">Math functions</h3>

<p>By this I mean all the stuff in <code class="language-plaintext highlighter-rouge">math.h</code>, <code class="language-plaintext highlighter-rouge">complex.h</code>, etc. It’s good that
these are, in practice, pseudo-intrinsics. They’re also one of the more
challenging parts of libc to replace. It prioritizes precision more than I
usually need, but that’s a reasonable default.</p>

<h3 id="character-classification-and-mapping">Character classification and mapping</h3>

<p>Includes <code class="language-plaintext highlighter-rouge">isalnum</code>, <code class="language-plaintext highlighter-rouge">isalpha</code>, <code class="language-plaintext highlighter-rouge">isascii</code>, <code class="language-plaintext highlighter-rouge">isblank</code>, <code class="language-plaintext highlighter-rouge">iscntrl</code>, <code class="language-plaintext highlighter-rouge">isdigit</code>,
<code class="language-plaintext highlighter-rouge">isgraph</code>, <code class="language-plaintext highlighter-rouge">islower</code>, <code class="language-plaintext highlighter-rouge">isprint</code>, <code class="language-plaintext highlighter-rouge">ispunct</code>, <code class="language-plaintext highlighter-rouge">isspace</code>, <code class="language-plaintext highlighter-rouge">isupper</code>,
<code class="language-plaintext highlighter-rouge">isxdigit</code>, <code class="language-plaintext highlighter-rouge">tolower</code>, and <code class="language-plaintext highlighter-rouge">toupper</code>. The interface is misleading, almost
maliciously so, and these functions are misused in every case I’ve seen in
the wild. If you see <code class="language-plaintext highlighter-rouge">#include &lt;ctype.h&gt;</code> in a source file then it’s
probably defective. I’ve been guilty of it myself. When it’s up to me,
these functions are banned without exception.</p>

<p>Their prototypes are all shaped like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">isXXXXX</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>
</code></pre></div></div>

<p>However, the domain of the input is <code class="language-plaintext highlighter-rouge">unsigned char</code> plus <code class="language-plaintext highlighter-rouge">EOF</code>. Negative
arguments, aside from <code class="language-plaintext highlighter-rouge">EOF</code>, are <em>undefined behavior</em>, despite the obvious
use case being strings. So this is incorrect:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="p">...;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">isdigit</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span> <span class="p">{</span>   <span class="c1">// WRONG!</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">char</code> is signed, as it is on x86, then it’s undefined for arbitrary
strings, <code class="language-plaintext highlighter-rouge">s</code>. Some implementations even crash on such inputs.</p>

<p>If the argument was <code class="language-plaintext highlighter-rouge">unsigned char</code>, then it would at least truncate into
range, <em>usually</em> leading to the desired result. (Though not so if <a href="https://drewdevault.com/2020/09/25/A-story-of-two-libcs.html">passing
Unicode code points</a>, which is an odd mistake to make.) Except that
it has to accommodate <code class="language-plaintext highlighter-rouge">EOF</code>. Why that? <strong>These functions are defined for
use with <code class="language-plaintext highlighter-rouge">fgetc</code>, not strings!</strong></p>

<p>You could patch over it with truncation by masking:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">isdigit</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mi">255</span><span class="p">))</span> <span class="p">{</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, you’re still left with <em>locales</em>. This is a bit of <em>global state</em>
that changes <a href="https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f027338b0fab0f5078971fbe">how a number of libc functions behave</a>, including
character classification. While locales have some niche uses, most of the
time <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/BashLocaleScriptDestruction?showcomments">the behavior is surprising and undesirable</a>. It’s also bad for
performance. I’ve developed a habit of using <code class="language-plaintext highlighter-rouge">LC_ALL=C</code> before some GNU
programs so that they behave themselves. If you’re parsing a fixed format
that doesn’t adapt to locale — virtually everything — you definitely do
not want locale-based character classification of input.</p>

<p>Since the interface and behavior are both unsuited for most uses, you’re
better off making your own range checks or lookup tables for your use
case. When you name it, probably avoid starting the function with <code class="language-plaintext highlighter-rouge">is</code>
since <a href="https://devblogs.microsoft.com/oldnewthing/20230109-00/?p=107685">it’s reserved</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">xisdigit</span><span class="p">(</span><span class="kt">char</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">c</span><span class="o">&gt;=</span><span class="sc">'0'</span> <span class="o">&amp;&amp;</span> <span class="n">c</span><span class="o">&lt;=</span><span class="sc">'9'</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I used <code class="language-plaintext highlighter-rouge">char</code>, but this still works fine for naive UTF-8 parsing.</p>
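
<p>The lookup table approach can be sketched the same way. This example, with the hypothetical names <code class="language-plaintext highlighter-rouge">hextable</code>, <code class="language-plaintext highlighter-rouge">xhexval</code>, and <code class="language-plaintext highlighter-rouge">xisxdigit</code>, classifies and decodes hexadecimal digits in one step. It masks the index so that negative <code class="language-plaintext highlighter-rouge">char</code> values stay in range, and it has no dependence on locale.</p>

```c
// Each entry holds the digit value plus one; zero means "not a hex digit".
static const unsigned char hextable[256] = {
    ['0'] =  1, ['1'] =  2, ['2'] =  3, ['3'] =  4, ['4'] =  5,
    ['5'] =  6, ['6'] =  7, ['7'] =  8, ['8'] =  9, ['9'] = 10,
    ['a'] = 11, ['b'] = 12, ['c'] = 13, ['d'] = 14, ['e'] = 15, ['f'] = 16,
    ['A'] = 11, ['B'] = 12, ['C'] = 13, ['D'] = 14, ['E'] = 15, ['F'] = 16,
};

// Returns 0..15 for a hex digit, or -1 for anything else.
int xhexval(char c)
{
    return hextable[c & 255] - 1;
}

_Bool xisxdigit(char c)
{
    return xhexval(c) >= 0;
}
```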

<h3 id="errno">errno</h3>

<p>Without libc you don’t have to use this global, hopefully thread-local,
pseudo-variable. Good riddance. Return your errors, and use a struct if
necessary.</p>
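
<p>As an illustration of returning errors in a struct, here is a small sketch of a decimal parser that reports failure alongside its result instead of through a global. The names <code class="language-plaintext highlighter-rouge">ParseResult</code> and <code class="language-plaintext highlighter-rouge">parse_i32</code> are hypothetical.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t value;
    bool    ok;
} ParseResult;

// Parse a run of decimal digits into a non-negative int32_t. Fails on
// empty input, on any non-digit, and on overflow.
ParseResult parse_i32(const char *s, size_t len)
{
    ParseResult r = {0, false};
    if (len == 0) {
        return r;
    }
    int32_t v = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < '0' || s[i] > '9') {
            return r;
        }
        int32_t d = s[i] - '0';
        if (v > (INT32_MAX - d) / 10) {
            return r;  // v*10 + d would overflow
        }
        v = v*10 + d;
    }
    r.value = v;
    r.ok = true;
    return r;
}
```

<p>The caller checks <code class="language-plaintext highlighter-rouge">ok</code> on the returned value, so the error cannot be clobbered by an unrelated call in between.</p>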

<h3 id="locales">locales</h3>

<p>As discussed, locales have some niche uses — formatting dates comes to
mind — but what little use they have is trapped behind global state set by
<code class="language-plaintext highlighter-rouge">setlocale</code>, making it sometimes impossible to use correctly.</p>

<p>On Windows I’ve instead used <a href="https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getlocaleinfow">GetLocaleInfoW</a> to get information like,
“What is the local name of the <a href="https://github.com/skeeto/scratch/blob/master/windows/cal.c">current month</a>?”</p>

<h3 id="setjmp-and-longjmp">setjmp and longjmp</h3>

<p>Sometimes tricky to use correctly, particularly with regard to qualifying
local variables as <code class="language-plaintext highlighter-rouge">volatile</code>. It can compose with region-based allocation
to <a href="/blog/2023/01/18/#implementation-highlights">automatically and instantly free</a> all objects created between set
and jump. These macros are fine, but don’t overdo it.</p>

<h3 id="variable-arguments">variable arguments</h3>

<p>Variadic functions are occasionally useful, and the <code class="language-plaintext highlighter-rouge">va_start</code>/<code class="language-plaintext highlighter-rouge">va_end</code>
macros make them possible. These are, unfortunately, notoriously complex
because calling conventions do not go out of their way to make them any
simpler. They <a href="https://blog.nelhage.com/2010/10/amd64-and-va_arg/">require compiler assistance</a>, and in practice they’re
implemented as part of the compiler rather than libc. They’re okay, but I
can live without them.</p>
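
<p>For readers who have not used these macros, a minimal sketch looks like the following. The function <code class="language-plaintext highlighter-rouge">sum_ints</code> is a made-up example. Note that nothing verifies the count against the arguments actually passed.</p>

```c
#include <stdarg.h>

// Sum `count` int arguments. The caller must pass exactly that many ints.
int sum_ints(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += va_arg(ap, int);
    }
    va_end(ap);
    return sum;
}
```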

<h3 id="signals">signals</h3>

<p>While important on unix-like systems, signals as defined in the C standard
library are essentially useless. If you’re dealing with signals, or even
something <em>like</em> signals, it will be in platform-specific code that goes
beyond the C standard library.</p>

<h3 id="atomics">atomics</h3>

<p>I’ve used the <code class="language-plaintext highlighter-rouge">_Atomic</code> qualifier <a href="/blog/2022/05/14/">in examples</a> since it helps with
conciseness, but I hardly use it in practice. In part because it has the
inconvenient effect of bleeding into APIs and ABIs. As with <code class="language-plaintext highlighter-rouge">volatile</code>, C
is using the type system to indirectly achieve a goal. Types are not
atomic, <em>loads</em> and <em>stores</em> are atomic. Predating standardization, C
implementations have been expressing these loads and stores using
intrinsics, functions, or macros rather than through types.</p>

<p>The <code class="language-plaintext highlighter-rouge">_Atomic</code> qualifier provides access to the most basic and most strict
atomic operations without libc. That is, it’s implemented purely in the
compiler. However, everything outside that involves libc, and potentially
even requires linking a special atomics library.</p>

<p>Even more, one major implementation (MSVC) still doesn’t support C11
atomics. Anywhere I care about using C atomics, I can already use the
<a href="https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html">richer set of GCC built-ins</a>, which Clang also supports. If I’m
writing code intended for Windows, I’ll use the <a href="https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access">interlocked macros</a>,
which work <a href="/blog/2022/10/05/#four-elements-windows">across all the compilers</a> for that platform.</p>
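
<p>To illustrate atomicity living in the operation rather than the type, here is a small sketch using the GCC built-ins, which Clang also accepts. The variable is a plain <code class="language-plaintext highlighter-rouge">int</code>, and the names <code class="language-plaintext highlighter-rouge">counter</code>, <code class="language-plaintext highlighter-rouge">next_ticket</code>, and <code class="language-plaintext highlighter-rouge">peek_ticket</code> are hypothetical.</p>

```c
static int counter;  // ordinary int: no _Atomic in the type

// Atomically increment the counter and return its previous value.
int next_ticket(void)
{
    return __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}

// Atomically load the current value with acquire ordering.
int peek_ticket(void)
{
    return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
}
```

<p>Since the type is unchanged, nothing about the atomics leaks into a struct layout or function signature that exposes <code class="language-plaintext highlighter-rouge">counter</code>.</p>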

<h3 id="stdio">stdio</h3>

<p>Standard input and output, stdio, is perhaps the primary driving factor
for my own routing around libc. Nearly every program does some kind of
input or output, but going through stdio makes things harder.</p>

<p>To read or write a file, one must first open it, e.g. <code class="language-plaintext highlighter-rouge">fopen</code>. However,
all the implementations for one platform in particular <a href="/blog/2021/12/30/">do not allow
<code class="language-plaintext highlighter-rouge">fopen</code> to access most of the file system</a>, so using libc immediately
limits the program’s capabilities on that platform.</p>

<p>The standard library distinguishes between “text” and “binary” streams. It
makes no difference on unix-like platforms, but it does on others, where
input and output are translated. Besides destroying your data, text
streams have terrible performance. Opening everything in binary mode is a
simple enough workaround, but standard input, output, and error are
opened as text streams, and there is no standard function for changing
them to binary streams.</p>

<p>When using <code class="language-plaintext highlighter-rouge">fread</code>, some implementations use the entire buffer as a
<a href="https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fread#remarks">temporary work space</a>, even if it returns a length less than the
entire buffer. So the following won’t work reliably:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">N</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="n">fread</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">N</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
<span class="n">puts</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>It may print junk after the expected output because <code class="language-plaintext highlighter-rouge">fread</code> overwrote the
zeroes beyond it.</p>
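
<p>One hedge against this, sketched below with a hypothetical <code class="language-plaintext highlighter-rouge">read_text</code> helper: request single-byte elements so the return value is a byte count, then terminate the string at that count instead of relying on the initial zeroes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;

// Hypothetical helper: read up to cap-1 bytes, then terminate at the
// length fread reports. Assumes cap is at least 1.
static size_t read_text(FILE *f, char *buf, size_t cap)
{
    size_t len = fread(buf, 1, cap - 1, f);  // size is 1, so len is in bytes
    buf[len] = 0;
    return len;
}
</code></pre></div></div>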

<p>Streams are buffered, and there’s no reliable access to unbuffered input
and output, such as when an application is already buffering, perhaps as a
natural consequence of how it works. There’s <code class="language-plaintext highlighter-rouge">setvbuf</code> and <code class="language-plaintext highlighter-rouge">_IONBF</code>
(“unbuffered”), but in <a href="https://github.com/openssl/openssl/issues/3281">at least one case</a> this really just means
“one byte at a time.” It’s common for my libc-using programs to end up
with double buffering since I can’t reliably turn off stdio buffering.</p>

<p>Typical implementations assume streams will be used by multiple threads,
and so every access goes through a mutex. This causes terrible performance
for small reads and writes — exactly the case buffering is supposed to
most <em>help</em>. Not only is sharing a stream between threads unusual, such programs are probably broken
anyway — oblivious to the still-present race conditions — and so <strong>stdio
is optimized for the unusual, broken case at the cost of the most needed
typical case</strong>.</p>

<p>There is no reliable way to interactively input and display Unicode text.
The C standard makes vague concessions for dealing with “wide characters”
but it’s useless in practice. I’ve tried! The most common need for me is
printing a path to standard error such that it displays properly to the
user.</p>

<p>Seek offsets are limited to <code class="language-plaintext highlighter-rouge">long</code>. Some real implementations can’t even
open files larger than 2GiB.</p>

<p>Rather than deal with all this, I add a couple of unbuffered I/O functions
to the platform layer, then put a small buffered stream implementation in
the application which flushes to the platform layer. UTF-8 for text input
and output, and if the platform layer detects it’s connected to a terminal
or console, it does the appropriate translation. It doesn’t take much to
get something more reliable than stdio. The details are <a href="/blog/2023/02/13/">the topic for a
future article</a>, especially since you might be wondering about
formatted output.</p>
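
<p>To give a feel for how little is involved, here’s a minimal sketch of such a buffered output stream. Every name in it (<code class="language-plaintext highlighter-rouge">Out</code>, <code class="language-plaintext highlighter-rouge">WriteFn</code>, <code class="language-plaintext highlighter-rouge">Sink</code>) is hypothetical, and the in-memory sink merely stands in for a real platform layer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// Hypothetical platform hook: writes len bytes, returns nonzero on success.
typedef int (*WriteFn)(void *ctx, const char *buf, ptrdiff_t len);

typedef struct {
    char     *buf;
    ptrdiff_t cap;   // must be positive
    ptrdiff_t len;
    WriteFn   write;
    void     *ctx;
    int       err;   // sticky error flag, checked once at the end
} Out;

static void out_flush(Out *o)
{
    if (!o-&gt;err &amp;&amp; o-&gt;len) {
        o-&gt;err = !o-&gt;write(o-&gt;ctx, o-&gt;buf, o-&gt;len);
    }
    o-&gt;len = 0;
}

static void out_append(Out *o, const char *src, ptrdiff_t len)
{
    while (!o-&gt;err &amp;&amp; len &gt; 0) {
        ptrdiff_t avail = o-&gt;cap - o-&gt;len;
        ptrdiff_t n = len &lt; avail ? len : avail;
        memcpy(o-&gt;buf + o-&gt;len, src, (size_t)n);
        o-&gt;len += n;
        src += n;
        len -= n;
        if (o-&gt;len == o-&gt;cap) {
            out_flush(o);  // buffer full: hand it to the platform layer
        }
    }
}

// Demo sink standing in for the platform layer: appends to a fixed array.
typedef struct {
    char      data[64];
    ptrdiff_t len;
    int       calls;
} Sink;

static int sink_write(void *ctx, const char *buf, ptrdiff_t len)
{
    Sink *s = ctx;
    if (len &gt; (ptrdiff_t)sizeof(s-&gt;data) - s-&gt;len) {
        return 0;
    }
    memcpy(s-&gt;data + s-&gt;len, buf, (size_t)len);
    s-&gt;len += len;
    s-&gt;calls++;
    return 1;
}
</code></pre></div></div>

<p>The sticky <code class="language-plaintext highlighter-rouge">err</code> flag means callers append freely and check for failure once, after the final flush.</p>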

<p>As for formatted input, <a href="https://sekrit.de/webdocs/c/beginners-guide-away-from-scanf.html">don’t ever bother with <code class="language-plaintext highlighter-rouge">scanf</code></a>.</p>

<h3 id="numeric-conversion">Numeric conversion</h3>

<p>Float conversion <a href="https://www.exploringbinary.com/a-better-way-to-convert-integers-in-david-gays-strtod/">is generally a difficult problem</a>, especially if
you care about <a href="https://possiblywrong.wordpress.com/2015/06/21/floating-point-round-trips/">round trips</a>. It’s one of the better and most useful
parts of libc. Though even with libc it’s still difficult to get the
simplest or shortest round-trip representation. Also, this is an area
where changing locales can be disastrous!</p>

<p>The question is then: How much does this matter in your application’s
context? There’s a good chance you only need to display a rounded,
low-precision representation of a float to users — perhaps displaying a
player’s position in a debug window, etc. Or you only need to parse
medium-precision non-integral inputs following a relatively simple format.
These are not so difficult.</p>

<p>Parsing (<code class="language-plaintext highlighter-rouge">atoi</code>, <code class="language-plaintext highlighter-rouge">strtol</code>, <code class="language-plaintext highlighter-rouge">strtod</code>, etc.) requires null-terminated
strings, which is generally inconvenient. These integers likely came from
something <em>not</em> null-terminated like a file, and so I need to first append
a null terminator. I can’t just feed it a token from a memory-mapped file.
Even when using libc, I often write my own integer parser anyway since the
libc parsers lack an appropriate interface.</p>

<p><strong>Update</strong>: NRK points out that <a href="https://lists.sr.ht/~skeeto/public-inbox/%3C20230218081805.z57kyrbc6xzqlnx6%40gen2.localdomain%3E">unsigned integer parsing treats negative
inputs as in range</a>. This is both surprising and rarely useful.
Looking more closely at the specification, I see it is also affected by
locale. Given these revelations, I would ban without exception <code class="language-plaintext highlighter-rouge">atoi</code>,
<code class="language-plaintext highlighter-rouge">atol</code>, <code class="language-plaintext highlighter-rouge">strtoul</code>, and <code class="language-plaintext highlighter-rouge">strtoull</code>, and avoid <code class="language-plaintext highlighter-rouge">strtol</code> and <code class="language-plaintext highlighter-rouge">strtoll</code>.</p>

<p>Formatting integers is easy. Parsing integers within a narrow range (e.g.
up to a million) is easy. Parsing integers to the very limits of the
numeric type <a href="https://github.com/skeeto/scratch/blob/master/parsers/strtonum.c">is tricky</a> because every operation must guard
against overflow regardless of signed or unsigned. Fortunately the first
two are common and the last is rarely necessary!</p>
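
<p>For illustration, here’s a sketch of a full-range unsigned parser over a pointer and a length, so it can be fed a token straight from a buffer. The name <code class="language-plaintext highlighter-rouge">parse_u32</code> and its interface are made up for this example. Note the overflow guard before each accumulation step:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

// Parse an unsigned decimal integer from buf[0..len), no terminator needed.
// Returns 1 and stores the result on success, 0 on bad input or overflow.
static int parse_u32(const char *buf, ptrdiff_t len, uint32_t *out)
{
    if (len &lt;= 0) {
        return 0;
    }
    uint32_t r = 0;
    for (ptrdiff_t i = 0; i &lt; len; i++) {
        uint32_t d = (uint32_t)((unsigned char)buf[i] - '0');
        if (d &gt; 9) {
            return 0;  // not a digit (this rejects '-' and '+' too)
        }
        if (r &gt; (UINT32_MAX - d)/10) {
            return 0;  // r*10 + d would overflow
        }
        r = r*10 + d;
    }
    *out = r;
    return 1;
}
</code></pre></div></div>

<p>Unlike <code class="language-plaintext highlighter-rouge">strtoul</code>, this rejects negative inputs outright and ignores locale.</p>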

<h3 id="random-numbers">Random numbers</h3>

<p>We have <code class="language-plaintext highlighter-rouge">rand</code>, <code class="language-plaintext highlighter-rouge">srand</code>, and <code class="language-plaintext highlighter-rouge">RAND_MAX</code>. As <a href="/blog/2017/09/21/">a PRNG enthusiast</a>, I
could never recommend using this under any circumstances. It’s a PRNG with
mediocre output, poor performance, and global state. <code class="language-plaintext highlighter-rouge">RAND_MAX</code> being
unknown ahead of time makes it even more difficult to make effective use
of <code class="language-plaintext highlighter-rouge">rand</code>. You can do better on all dimensions with just <a href="https://www.pcg-random.org/download.html">a few lines of
code</a>.</p>
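
<p>As one example of “a few lines,” here’s a sketch of a truncated 64-bit LCG with caller-owned state. It’s a generic generator with commonly used LCG constants, not the one from the linked page, and the function names are my own for this example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

// Truncated 64-bit LCG: the caller owns the state, so there are no
// globals and no locks. Returns the high 32 bits of the new state.
static uint32_t rng32(uint64_t *state)
{
    *state = *state*0x5851f42d4c957f2du + 0x14057b7ef767814fu;
    return (uint32_t)(*state &gt;&gt; 32);
}

// Value in [0, n) via multiply-shift. Slightly biased, negligibly so
// when n is small relative to 2^32.
static uint32_t rng_range(uint64_t *state, uint32_t n)
{
    return (uint32_t)(((uint64_t)rng32(state) * n) &gt;&gt; 32);
}
</code></pre></div></div>

<p>The output range is always the full 32 bits, so there’s no <code class="language-plaintext highlighter-rouge">RAND_MAX</code> guesswork, and each thread can simply keep its own state.</p>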

<p>To make matters worse, typical implementations <em>expect</em> it to be accessed
concurrently from multiple threads, so they wrap it in a mutex. Again, it
optimizes for the unusual, broken case — threads fighting each other over
non-deterministic racy results from a deterministic PRNG — at the cost of
the typical, sensible case. Programs relying on that mutex are already
broken.</p>

<h3 id="memory-allocation">Memory allocation</h3>

<p>Includes <code class="language-plaintext highlighter-rouge">malloc</code>, <code class="language-plaintext highlighter-rouge">calloc</code>, <code class="language-plaintext highlighter-rouge">realloc</code>, <code class="language-plaintext highlighter-rouge">free</code>, etc. Okay, but in practice
used too granularly and too much such that many C programs <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">are tangles of
lifetimes</a>. Sometimes I wish there was a standard region allocator
so that independently-written libraries could speak a common, sensible,
<a href="/blog/2018/06/10/">caller-controlled</a> allocation interface.</p>

<p>A major standardization failure here has been not moving size computations
into the allocators themselves. <code class="language-plaintext highlighter-rouge">calloc</code> is a start: You say how big and
how many, and it works out the total allocation, checking for overflow.
There should be more of this, even if just to <a href="https://www.youtube.com/watch?v=f4ioc8-lDc0&amp;t=4407s">discourage individual
allocations and encourage group allocations</a>.</p>
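
<p>To sketch what that might look like, here’s a hypothetical region allocator whose interface takes a size and a count, <code class="language-plaintext highlighter-rouge">calloc</code>-style, and does the overflow-checked arithmetic itself. The <code class="language-plaintext highlighter-rouge">Arena</code> type and <code class="language-plaintext highlighter-rouge">arena_alloc</code> are assumptions for illustration, not a standard interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

typedef struct {
    char *beg;
    char *end;
} Arena;

// Returns zeroed storage for count objects of the given size, or a null
// pointer if the request does not fit. align must be a power of two.
static void *arena_alloc(Arena *a, ptrdiff_t size, ptrdiff_t count, ptrdiff_t align)
{
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a-&gt;beg &amp; (uintptr_t)(align - 1));
    ptrdiff_t avail = a-&gt;end - a-&gt;beg - pad;
    if (size &lt;= 0 || count &lt; 0 || avail &lt; 0 || count &gt; avail/size) {
        return 0;  // size*count would overflow or exceed the arena
    }
    char *p = a-&gt;beg + pad;
    a-&gt;beg = p + size*count;
    return memset(p, 0, (size_t)(size*count));
}
</code></pre></div></div>

<p>The division-based check rejects the request before any multiplication happens, so the caller never computes a total size by hand.</p>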

<p>There are some edge cases around zero sizes, like <code class="language-plaintext highlighter-rouge">malloc(0)</code>, and the
standard leaves the behavior <a href="https://yarchive.net/comp/linux/malloc_0.html">a bit too open ended</a>. However, if your
program is so poorly structured that it may pass zero to
<code class="language-plaintext highlighter-rouge">malloc</code> then you have bigger problems anyway.</p>

<h3 id="communication-with-the-environment">Communication with the environment</h3>

<p><code class="language-plaintext highlighter-rouge">getenv</code> is straightforward, though I’d prefer to just access the
environment block directly, <em>a la</em> the non-standard third argument to
<code class="language-plaintext highlighter-rouge">main</code>.</p>

<p><code class="language-plaintext highlighter-rouge">exit</code> is fine, but <code class="language-plaintext highlighter-rouge">atexit</code> is jank.</p>

<p><code class="language-plaintext highlighter-rouge">system</code> is <a href="/blog/2022/02/18/">essentially useless</a> in practice.</p>

<h3 id="sorting-and-searching">Sorting and searching</h3>

<p><code class="language-plaintext highlighter-rouge">qsort</code> is <del><a href="https://lists.sr.ht/~skeeto/public-inbox/%3C1676216350-sup-9306%40thyestes.tartarus.org%3E">fine</a></del>poor because it <a href="/blog/2017/01/08/">lacks a context argument</a>.
<a href="/blog/2016/09/05/">Quality varies</a>. Not difficult to implement from scratch if
necessary. I rarely need to sort.</p>

<p>Similar story for <code class="language-plaintext highlighter-rouge">bsearch</code>. Though if I need a binary search over an
array, <code class="language-plaintext highlighter-rouge">bsearch</code> probably isn’t sufficient because I usually want to find
lower and upper bounds of a range.</p>
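
<p>For reference, the bounds searches I mean look roughly like this, sketched over a sorted <code class="language-plaintext highlighter-rouge">int</code> array. The function names are borrowed from C++ for familiarity and are not part of libc:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

// Index of the first element &gt;= key in a sorted array, or len if none.
static ptrdiff_t lower_bound(const int *a, ptrdiff_t len, int key)
{
    ptrdiff_t lo = 0, hi = len;
    while (lo &lt; hi) {
        ptrdiff_t mid = lo + (hi - lo)/2;
        if (a[mid] &lt; key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Index of the first element &gt; key in a sorted array, or len if none.
static ptrdiff_t upper_bound(const int *a, ptrdiff_t len, int key)
{
    ptrdiff_t lo = 0, hi = len;
    while (lo &lt; hi) {
        ptrdiff_t mid = lo + (hi - lo)/2;
        if (a[mid] &lt;= key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}
</code></pre></div></div>

<p>The half-open range between the two results covers every element equal to the key, and an absent key still yields its insertion point, neither of which <code class="language-plaintext highlighter-rouge">bsearch</code> reports.</p>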

<h3 id="multi-byte-encodings-and-wide-characters">Multi-byte encodings and wide characters</h3>

<p><code class="language-plaintext highlighter-rouge">mblen</code>, <code class="language-plaintext highlighter-rouge">mbtowc</code>, <code class="language-plaintext highlighter-rouge">wctomb</code>, <code class="language-plaintext highlighter-rouge">mbstowcs</code>, and <code class="language-plaintext highlighter-rouge">wcstombs</code> are
connected to the locale system and don’t necessarily operate on any
particular encoding like UTF-8, which makes them unreliable. This is the
case for all the other wide character functionality, which is quite a few
functions. Fortunately I only ever need wide characters on one platform in
particular, not in portable code.</p>

<p>More recent are <code class="language-plaintext highlighter-rouge">mbrtoc16</code>, <code class="language-plaintext highlighter-rouge">c16rtomb</code>, <code class="language-plaintext highlighter-rouge">mbrtoc32</code>, and <code class="language-plaintext highlighter-rouge">c32rtomb</code> where
the “wide” side is specified (UTF-16, UTF-32) but not the multi-byte side.
Limited support in implementations and not particularly useful.</p>

<h3 id="strings">Strings</h3>

<p>Like <code class="language-plaintext highlighter-rouge">ctype.h</code>, <code class="language-plaintext highlighter-rouge">string.h</code> is another case where everything is terrible,
and some functions are virtually always misused.</p>

<p><code class="language-plaintext highlighter-rouge">memcpy</code>, <code class="language-plaintext highlighter-rouge">memmove</code>, <code class="language-plaintext highlighter-rouge">memset</code>, and <code class="language-plaintext highlighter-rouge">memcmp</code> are fine except for one issue:
it is undefined behavior to pass a null pointer to these functions, <em>even
with a zero size</em>. That’s ridiculous. A null pointer legitimately and
usefully points to a zero-sized object. As mentioned, even <code class="language-plaintext highlighter-rouge">malloc(0)</code> is
permitted to behave this way. These functions would be fine if not for
this one defect.</p>
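
<p>A common way to paper over the defect is a thin wrapper that never lets a zero-length request reach libc. A minimal sketch, where the name <code class="language-plaintext highlighter-rouge">copy</code> is hypothetical:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// Hypothetical wrapper: a zero-length copy never reaches memcpy, so a
// null pointer paired with len == 0 is harmless.
static void *copy(void *dst, const void *src, ptrdiff_t len)
{
    if (len &gt; 0) {
        memcpy(dst, src, (size_t)len);
    }
    return dst;
}
</code></pre></div></div>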

<p><code class="language-plaintext highlighter-rouge">strcpy</code>, <code class="language-plaintext highlighter-rouge">strncpy</code>, <code class="language-plaintext highlighter-rouge">strcat</code>, and <code class="language-plaintext highlighter-rouge">strncat</code> <a href="/blog/2021/07/30/">have no legitimate
uses</a> and their use indicates confusion. As such, any code calling
them is suspect and should receive extra scrutiny. In fact, <strong>I have yet
to see a single correct use of <code class="language-plaintext highlighter-rouge">strncpy</code> in a real program</strong>. (Usage hint:
the length argument should refer to the destination, not the source.) When
it’s up to me, these functions are banned without exception. This applies
equally to non-standard versions of these functions like <code class="language-plaintext highlighter-rouge">strlcpy</code>.</p>

<p><code class="language-plaintext highlighter-rouge">strlen</code> has legitimate uses, but is used too often. It should only appear
at system boundaries when receiving strings of unknown size (e.g. <code class="language-plaintext highlighter-rouge">argv</code>,
<code class="language-plaintext highlighter-rouge">getenv</code>), and should never be applied to a static string. (Hint: you can
use <code class="language-plaintext highlighter-rouge">sizeof</code> on those.)</p>

<p>When I see <code class="language-plaintext highlighter-rouge">strchr</code>, <code class="language-plaintext highlighter-rouge">strcmp</code> or <code class="language-plaintext highlighter-rouge">strncmp</code> I wonder why you don’t know the
lengths of your strings. On the other hand, <code class="language-plaintext highlighter-rouge">strcspn</code>, <code class="language-plaintext highlighter-rouge">strpbrk</code>,
<code class="language-plaintext highlighter-rouge">strrchr</code>, <code class="language-plaintext highlighter-rouge">strspn</code>, and <code class="language-plaintext highlighter-rouge">strstr</code> do not have <code class="language-plaintext highlighter-rouge">mem</code> equivalents, though
the null termination requirement hurts their usefulness.</p>

<p><code class="language-plaintext highlighter-rouge">strcoll</code> and <code class="language-plaintext highlighter-rouge">strxfrm</code> depend on locale and so are at best niche.
Otherwise unpredictable. Avoid.</p>

<p><code class="language-plaintext highlighter-rouge">memchr</code> is fine except for the aforementioned null pointer restriction,
though it comes up less often here.</p>

<p><code class="language-plaintext highlighter-rouge">strtok</code> has hidden global state. Besides that, how long is the returned
token? It knew the length before it returned. You mean I have to call
<code class="language-plaintext highlighter-rouge">strlen</code> to find out? Banned.</p>

<p><code class="language-plaintext highlighter-rouge">strerror</code> has an obvious, simple, robust solution: return a pointer to a
static string in a lookup table corresponding to the error number. No
global state, thread-safe, re-entrant, and the returned string is good
until the program exits. Some implementations do this, but unfortunately
it’s not true for <a href="https://man.freebsd.org/cgi/man.cgi?query=strerror&amp;sektion=0">at least one real world implementation</a>,
which instead writes to a shared, global buffer. Hopefully you were
avoiding <code class="language-plaintext highlighter-rouge">errno</code> anyway.</p>

<h3 id="threads">Threads</h3>

<p>Introduced in C11, but never gained significant traction. Anywhere you can
use C threads you can use pthreads, which are better anyway.</p>

<p>Besides, thread creation probably belongs in the platform layer anyway.</p>

<h3 id="time-functions">Time functions</h3>

<p>Fairly niche, and I can’t remember using any of these except for <code class="language-plaintext highlighter-rouge">time</code>
and <code class="language-plaintext highlighter-rouge">clock</code> <a href="/blog/2019/04/30/">for seeding</a>.</p>

<h3 id="wrap-up">Wrap-up</h3>

<p>I hand-waved away a long list of vestigial wide character functions, but
the above is pretty much all there is to the C standard library. The only
things I miss when avoiding it altogether are the math functions, and
<a href="/blog/2023/02/12/">occasionally <code class="language-plaintext highlighter-rouge">setjmp</code>/<code class="language-plaintext highlighter-rouge">longjmp</code></a>. Everything else I can do better
myself, with little difficulty, starting from the platform layer.</p>

<p>All of the C implementations I had in mind above are very old. They will
rarely, if ever, <em>change</em>, just accrue. There isn’t a lot of innovation
happening in this space, which is fine since I like stable targets. If you
<em>would</em> like to see interesting innovation, check out <a href="https://justine.lol/sizetricks/">what Cosmopolitan
Libc is up to</a>. It’s what I imagine C could be if it continued
evolving along practical dimensions.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>u-config: a new, lean pkg-config clone</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/01/18/"/>
    <id>urn:uuid:c07ce83a-7871-4561-a77f-3b62b7a817bd</id>
    <updated>2023-01-18T06:39:51Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=34426430">on Hacker News</a>.</em></p>

<p>In <a href="/blog/2023/01/08/">my common SDL2 mistakes listing</a>, the first was about winging it
instead of using the <code class="language-plaintext highlighter-rouge">sdl2-config</code> script. It’s just one of three official
options for portably configuring SDL2, but I had dismissed the others from
consideration. One is the <a href="https://www.freedesktop.org/wiki/Software/pkg-config/">pkg-config</a> facility common to unix-like
systems. However, the SDL maintainers recently announced SDL3, which will
not have a <code class="language-plaintext highlighter-rouge">sdl3-config</code>. The concept has been deprecated in favor of the
existing pkg-config option. I’d like to support this on w64devkit, except
that it lacks pkg-config — not the first time this has come up. So last
weekend I wrote a new pkg-config from scratch with first-class Windows
support: <strong><a href="https://github.com/skeeto/u-config">u-config</a></strong> (“<em>micro</em>-config”). It will serve as pkg-config
in w64devkit starting in the next release.</p>

<!--more-->

<p>Ultimately pkg-config’s entire job is to find named <code class="language-plaintext highlighter-rouge">.pc</code> text files in
one of several predetermined locations, read fields from them, then write
those fields to standard output. Additional search directories may be
supplied through the <code class="language-plaintext highlighter-rouge">$PKG_CONFIG_PATH</code> environment variable. At a high
level there’s really not much to it.</p>

<p>As a concrete example, here’s a hypothetical <code class="language-plaintext highlighter-rouge">example.pc</code> which might live
in <code class="language-plaintext highlighter-rouge">/usr/lib/pkgconfig</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prefix = /usr
major = 1
minor = 2
patch = 3
version = ${major}.${minor}.${patch}

Name: Example Library
Description: An example of a .pc file
Version: ${version}
Requires: zlib &gt;= 1.2, sdl2
Libs: -L${prefix}/lib -lexample
Libs.private: -lm
Cflags: -I${prefix}/include
Cflags.private: -DEXAMPLE_STATIC
</code></pre></div></div>

<p>If you invoke pkg-config with <code class="language-plaintext highlighter-rouge">--cflags</code> you get the <code class="language-plaintext highlighter-rouge">Cflags</code> field. With
<code class="language-plaintext highlighter-rouge">--libs</code>, you get the <code class="language-plaintext highlighter-rouge">Libs</code> field. With <code class="language-plaintext highlighter-rouge">--static</code>, you also get the
“private” fields. It will also recursively pull in packages mentioned in
<code class="language-plaintext highlighter-rouge">Requires</code>. The <code class="language-plaintext highlighter-rouge">prefix</code> variable is more than convention and is designed
to be overridden (and u-config does so by default). In theory pkg-config
is supposed to be careful about maintaining argument order and removing
redundant arguments, but in practice… well, pkg-config’s actual behavior
often makes little sense. We’ll get to that.</p>

<p>For SDL2, where you might use:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc app.c $(sdl2-config --cflags --libs)
</code></pre></div></div>

<p>You could instead use:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ eval cc app.c $(pkg-config sdl2 --cflags --libs)
</code></pre></div></div>

<p>Which is still a build command that works uniformly for all supported
platforms, even cross-compiling, given a correctly-configured pkg-config.
For w64devkit, the first command requires placing the directory containing
<code class="language-plaintext highlighter-rouge">sdl2-config</code> on your <code class="language-plaintext highlighter-rouge">$PATH</code>. The second instead requires placing the
directory containing <code class="language-plaintext highlighter-rouge">sdl2.pc</code> in your <code class="language-plaintext highlighter-rouge">$PKG_CONFIG_PATH</code>. To upgrade to
SDL3, replace the <code class="language-plaintext highlighter-rouge">sdl2</code> with <code class="language-plaintext highlighter-rouge">sdl3</code> in the second command.</p>

<h3 id="why-two-when-you-can-have-three">Why two when you can have three?</h3>

<p>There are already two major, mostly-compatible pkg-config implementations:
the original from freedesktop.org (2001), and <a href="http://pkgconf.org/">pkgconf</a> (2011). Both
ostensibly support Windows, but in practice this support is second class,
which is a reason why I hadn’t included one in w64devkit. A lot of hassle
for what is ultimately a relatively simple task.</p>

<p>As for the original pkg-config, I’ve been unable to produce a functioning
Windows build. It’s obvious from the compiler warnings that there are many
problems, and my builds immediately crash on start. I’d try debugging it,
except that I’ve been cross-compiling this whole time. I cannot build it
on Windows because (1) GNU Autotools and (2) pkg-config <del>requires</del>wants
pkg-config as a build dependency. That’s right, <em>you have to bootstrap
pkg-config</em>! Remember, this is a tool whose entire job is to copy some
bits of text from a text file to its output. One could use pkg-config as a
case study of accidental complexity, and this is just the beginning.</p>

<p><em>Update</em>: It was <a href="https://lists.sr.ht/~skeeto/public-inbox/%3C1750680.o7JgDH7DvH%40laptop%3E">pointed out</a> that I wouldn’t need the full,
two-stage bootstrap just for my debugging scenario.</p>

<p>The bootstrap issue is part of pkgconf’s popularity as an alternative.
It’s also a tidier code base, does a <em>far</em> better job of sorting and
arranging its outputs than the original pkg-config, and its overall
behavior makes more sense. However, despite its three independent build
systems, pkgconf is still annoying to build, not to mention its memory
corruption bugs. We’ll get to that, too.</p>

<p>Considering pkg-config’s relatively simple job, obtaining one shouldn’t be
this difficult! I could muddle through until one or the other worked, or I
could just write my own. I’m glad I did, since I’m extremely happy with
the results.</p>

<h3 id="u-config-implementation">u-config implementation</h3>

<p>As of this writing, u-config is about 2,000 lines of C. It doesn’t support
every last pkg-config feature, nor will it ever. The goal is to support
existing pkg-config based builds, not make more of them. So, for
example, features for debugging <code class="language-plaintext highlighter-rouge">.pc</code> files are omitted. Some features are
of dubious usefulness (<code class="language-plaintext highlighter-rouge">--errors-to-stdout</code>) even if they’d be simple to
implement; there are already way too many flags. Other features clearly
don’t work correctly — either not as documented or the results don’t make
sense — so I skipped those as well.</p>

<p>It comes in two flavors: “generic” C and Windows. The former works on any
system with a C99 compiler. In fact, it only uses these 9 standard library
functions:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">exit</code></li>
  <li><code class="language-plaintext highlighter-rouge">fclose</code></li>
  <li><code class="language-plaintext highlighter-rouge">ferror</code></li>
  <li><code class="language-plaintext highlighter-rouge">fflush</code></li>
  <li><code class="language-plaintext highlighter-rouge">fopen</code></li>
  <li><code class="language-plaintext highlighter-rouge">fread</code></li>
  <li><code class="language-plaintext highlighter-rouge">fwrite</code></li>
  <li><code class="language-plaintext highlighter-rouge">getenv</code></li>
  <li><code class="language-plaintext highlighter-rouge">malloc</code></li>
</ul>

<p>That is, it needs to open <code class="language-plaintext highlighter-rouge">.pc</code> files, read from them, close those
handles, write to standard output and standard error, check for I/O
errors, and exactly once call <code class="language-plaintext highlighter-rouge">malloc</code> to allocate a block of memory for
an arena allocator. It’s not even important the streams are buffered
because u-config does its own buffering. Not that it would be useful, but
porting to an unhosted 16-bit microcontroller, with <code class="language-plaintext highlighter-rouge">fopen</code> implemented as
a virtual file system, would be trivial. (You know… it could be dropped
into <a href="https://frippery.org/busybox/">busybox-w32</a> as a new app with little effort…)</p>

<p>It’s also a unity build — compiled as a single translation unit — so
building u-config is as easy as it gets:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -o pkg-config generic_main.c
</code></pre></div></div>

<p>Reminder: the original pkg-config <em>cannot even be built without a
bootstrapping step.</em></p>

<p>Since standard C functions are <a href="/blog/2021/12/30/">implemented poorly on Windows</a>, but
also so that it can do some smarter self-configuration at run-time based
on the <code class="language-plaintext highlighter-rouge">.exe</code> location, the Windows platform layer calls directly into
Win32 and no C runtime (CRT) is used. Input <code class="language-plaintext highlighter-rouge">.pc</code> files are memory mapped.
Internally u-config is all UTF-8, and the platform layer does the Unicode
translations at the Win32 boundaries for paths, <a href="/blog/2022/02/18/">arguments</a>, environment
variables, and console outputs.</p>

<p>Building is <em>slightly</em> more complicated:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -o pkg-config -nostartfiles win32_main.c
</code></pre></div></div>

<h3 id="implementation-highlights">Implementation highlights</h3>

<p>Greenfield projects present a great opportunity for trying new things, and
this is no exception. Contrary to my usual style, I decided I would make
substantial use of <code class="language-plaintext highlighter-rouge">typedef</code> and capitalize all the type names.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">int</span> <span class="n">Bool</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">Byte</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Byte</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span>
    <span class="n">Size</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Str</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Str</span> <span class="n">head</span><span class="p">;</span>
    <span class="n">Str</span> <span class="n">tail</span><span class="p">;</span>
    <span class="n">Bool</span> <span class="n">ok</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Cut</span><span class="p">;</span>
</code></pre></div></div>

<p>I like it! It makes the type names stand apart, avoids conflicts with
variable names, and cuts down the visual noise of <code class="language-plaintext highlighter-rouge">struct</code>. I’ve more
recently realized that <code class="language-plaintext highlighter-rouge">const</code> is doing virtually nothing for me — it has
never prevented me from making a mistake — so I left it out (aside from
static lookup tables). That’s even more visual noise gone, and reduced
cognitive load.</p>

<p>In recent years I’ve been convinced that unsigned sizes were a serious
error, probably even one of the great early computing mistakes, and that
<a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf">sizes and subscripts should be signed</a>. Not only that, pkg-config
has no business dealing with gigantic objects! We’re talking about short
strings and tiny files. If it ends up with a large object, then there’s a
defect somewhere — either in itself or the system — and it should abort.
Therefore sizes and subscripts are a natural <code class="language-plaintext highlighter-rouge">int</code>!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">int</span> <span class="n">Size</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="n">Usize</span><span class="p">;</span>
<span class="cp">#define Size_MAX (Size)((Usize)-1 &gt;&gt; 1)
#define SIZEOF(x) (Size)(sizeof(x))
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Usize</code> is just for the occasional bit-twiddling, like in <code class="language-plaintext highlighter-rouge">Size_MAX</code>,
and not for regular use. However, u-config objects are no smaller by this
decision because the unused space is nearly always padded on 64-bit
machines. Further, the x86-64 code is about 5% larger with 32-bit sizes
compared to 64-bit sizes — opposite my expectation. Curious.</p>
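
<p>One practical payoff of signed sizes, sketched below, is that a countdown loop terminates naturally at zero. The typedefs repeat the ones above; <code class="language-plaintext highlighter-rouge">Byte</code> as <code class="language-plaintext highlighter-rouge">unsigned char</code> and the helper <code class="language-plaintext highlighter-rouge">lastindex</code> are assumptions made for illustration, not code from u-config.</p>

```c
#include <assert.h>

typedef int Size;
typedef unsigned Usize;
typedef unsigned char Byte;  // assumption: Byte is not defined in this excerpt
#define Size_MAX (Size)((Usize)-1 >> 1)
#define SIZEOF(x) (Size)(sizeof(x))

// Hypothetical helper: index of the last occurrence of c in buf, or -1
// if absent. With a signed index the condition i >= 0 eventually fails;
// an unsigned index would wrap around to a huge value instead.
static Size lastindex(Byte *buf, Size len, Byte c)
{
    assert(len >= 0);
    for (Size i = len - 1; i >= 0; i--) {
        if (buf[i] == c) {
            return i;
        }
    }
    return -1;  // a signed return type leaves room for "not found"
}
```

<p>The same signedness lets <code class="language-plaintext highlighter-rouge">-1</code> serve as an in-band “not found” result without a sentinel like <code class="language-plaintext highlighter-rouge">SIZE_MAX</code>.</p>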

<p>You might have noticed that <code class="language-plaintext highlighter-rouge">Str</code> type above. Aside from interfaces with
the host that make it mandatory, u-config makes no use of null-terminated
strings anywhere. Every string is a pointer and a size. There’s even a
macro to do this for string literals:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define S(s) (Str){(Byte *)s, SIZEOF(s)-1}
</span></code></pre></div></div>

<p>Then I can use and pass them casually:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">realname</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"pkg-config"</span><span class="p">)))</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>

    <span class="o">*</span><span class="n">insert</span><span class="p">(</span><span class="n">arena</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">global</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"pc_sysrootdir"</span><span class="p">))</span> <span class="o">=</span> <span class="n">S</span><span class="p">(</span><span class="s">"/"</span><span class="p">);</span>

    <span class="k">return</span> <span class="nf">startswith</span><span class="p">(</span><span class="n">arg</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"-I"</span><span class="p">));</span>
</code></pre></div></div>

<p>Like strings in other languages, I can also slice out the middle of
strings without copying, handy for parsing and constructing paths. It also
works well with memory-mapped <code class="language-plaintext highlighter-rouge">.pc</code> files since I can extract tokens from
them for use directly in data structures without copying.</p>
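
<p>As a rough sketch of slicing without copying, here is how a <code class="language-plaintext highlighter-rouge">Cut</code> might be produced. The function names <code class="language-plaintext highlighter-rouge">cut</code> and <code class="language-plaintext highlighter-rouge">equals</code> below are illustrative implementations written for this example, and the definitions of <code class="language-plaintext highlighter-rouge">Byte</code> and <code class="language-plaintext highlighter-rouge">Bool</code> are assumptions.</p>

```c
#include <assert.h>

typedef unsigned char Byte;  // assumption
typedef int Bool;            // assumption
typedef int Size;

typedef struct {
    Byte *s;
    Size len;
} Str;

typedef struct {
    Str head;
    Str tail;
    Bool ok;
} Cut;

#define SIZEOF(x) (Size)(sizeof(x))
#define S(s) (Str){(Byte *)s, SIZEOF(s)-1}

// Illustrative sketch: split s around the first occurrence of delim.
// Both halves point into the original storage; nothing is copied.
static Cut cut(Str s, Byte delim)
{
    assert(s.len >= 0);
    Cut r = {0};
    for (Size i = 0; i < s.len; i++) {
        if (s.s[i] == delim) {
            r.head = (Str){s.s, i};
            r.tail = (Str){s.s + i + 1, s.len - i - 1};
            r.ok = 1;
            return r;
        }
    }
    r.head = s;  // no delimiter: everything is head, tail stays empty
    return r;
}

// Illustrative byte-wise comparison of two strings.
static Bool equals(Str a, Str b)
{
    if (a.len != b.len) {
        return 0;
    }
    for (Size i = 0; i < a.len; i++) {
        if (a.s[i] != b.s[i]) {
            return 0;
        }
    }
    return 1;
}
```

<p>Cutting <code class="language-plaintext highlighter-rouge">S("prefix=/usr/local")</code> at <code class="language-plaintext highlighter-rouge">'='</code> yields a head and tail that are just a pointer and length into the same literal, so the source could equally be a memory-mapped file.</p>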

<p>That leads into the next item: How does one free or manipulate a data
structure where the different parts are arbitrarily allocated across
static storage, heap storage, and memory mapped files? The hash tables in
u-config are exactly this, the keys themselves allocated in every possible
fashion. Don’t you have to keep track of how each pointed-at part is allocated?
No! The individual objects do not have <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">individual lifetimes</a> due
to the arena allocator. The gist of it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Str</span> <span class="n">mem</span><span class="p">;</span>
    <span class="n">Size</span> <span class="n">off</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Arena</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">alloc</span><span class="p">(</span><span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="n">Size</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">ASSERT</span><span class="p">(</span><span class="n">size</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">Size</span> <span class="n">avail</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">mem</span><span class="p">.</span><span class="n">len</span> <span class="o">-</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">avail</span> <span class="o">&lt;</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">oom</span><span class="p">();</span>
    <span class="p">}</span>
    <span class="n">Byte</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">mem</span><span class="p">.</span><span class="n">s</span> <span class="o">+</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span><span class="p">;</span>
    <span class="n">a</span><span class="o">-&gt;</span><span class="n">off</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since it’s passed often, arena parameters are conventionally named <code class="language-plaintext highlighter-rouge">a</code>
throughout the program and are always the first argument when needed. If
it runs out of memory, it bails. On 32-bit and 64-bit hosts, the default
arena is 256MiB. If pkg-config needs more than that, then something’s
seriously wrong and it should give up.</p>

<p>While u-config <em>could</em> quite reasonably never “free” (read: reuse) memory,
it <em>does</em> do so in practice. In some cases it computes a temporary result,
then resets the arena to an earlier state to discard its allocations. A
simplified, hypothetical:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="p">...)</span> <span class="p">{</span>
        <span class="n">Arena</span> <span class="n">tmparena</span> <span class="o">=</span> <span class="o">*</span><span class="n">a</span><span class="p">;</span>
        <span class="c1">// Use only tmparena in the loop</span>
        <span class="n">Env</span> <span class="n">env</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
        <span class="n">Str</span> <span class="n">value</span> <span class="o">=</span> <span class="n">fmtint</span><span class="p">(</span><span class="o">&amp;</span><span class="n">tmparena</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="o">*</span><span class="n">insert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">tmparena</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">env</span><span class="p">,</span> <span class="n">S</span><span class="p">(</span><span class="s">"i"</span><span class="p">))</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
        <span class="c1">// ...</span>
        <span class="c1">// allocations freed when tmparena goes out of scope</span>
    <span class="p">}</span>
</code></pre></div></div>
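
<p>To make the reset concrete, here is a minimal, self-contained sketch. The <code class="language-plaintext highlighter-rouge">alloc</code> function mirrors the one above with <code class="language-plaintext highlighter-rouge">abort</code> standing in for <code class="language-plaintext highlighter-rouge">oom</code>; <code class="language-plaintext highlighter-rouge">scratchwork</code> is a hypothetical function invented for the example. Copying the <code class="language-plaintext highlighter-rouge">Arena</code> struct copies only the offset, not the memory, so the caller’s arena never sees the temporary allocations.</p>

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned char Byte;  // assumption
typedef int Size;
typedef struct { Byte *s; Size len; } Str;
typedef struct { Str mem; Size off; } Arena;

static void *alloc(Arena *a, Size size)
{
    assert(size >= 0);
    Size avail = a->mem.len - a->off;
    if (avail < size) {
        abort();  // stand-in for the article's oom()
    }
    Byte *p = a->mem.s + a->off;
    a->off += size;
    return p;
}

// Hypothetical example: allocate n scratch bytes from a by-value copy
// of the arena. The copy shares the backing memory, but only its own
// offset advances, so *a is unchanged when this returns.
static Size scratchwork(Arena *a, Size n)
{
    Arena tmparena = *a;
    Byte *p = alloc(&tmparena, n);
    for (Size i = 0; i < n; i++) {
        p[i] = (Byte)i;
    }
    return tmparena.off - a->off;  // bytes that were used temporarily
}
```

<p>After <code class="language-plaintext highlighter-rouge">scratchwork</code> returns, the next allocation from the original arena reuses the same bytes, which is all “freeing” amounts to here.</p>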

<p>I had mentioned that u-config does its own output buffering. It’s an
object I call an <code class="language-plaintext highlighter-rouge">Out</code>, modeled loosely after a <a href="https://9p.io/sys/doc/comp.html">Plan 9 <code class="language-plaintext highlighter-rouge">bio</code></a> or a Go
<code class="language-plaintext highlighter-rouge">bufio.Writer</code>. It has a destination “file descriptor”, a memory buffer,
and an integer to track the fill level of the buffer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Str</span> <span class="n">buf</span><span class="p">;</span>
    <span class="n">Size</span> <span class="n">fill</span><span class="p">;</span>
    <span class="n">Arena</span> <span class="o">*</span><span class="n">a</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
<span class="p">}</span> <span class="n">Out</span><span class="p">;</span>
</code></pre></div></div>

<p>Output bytes are copied into the buffer. When it fills, the buffer is
automatically emptied into the file descriptor. The caller can manually
flush the buffer at any time, and it’s up to the caller to do so before
exiting the program.</p>

<p>But wait, what’s the <code class="language-plaintext highlighter-rouge">Arena</code> pointer doing in there? That’s a little extra
feature of my own invention! I can open a stream on an arena, writes into
the stream go into a growing buffer, and “closing” the stream gives me a
string allocated in the arena with the written content. The arena is held
in order to manage all this. It’s also locked out from other allocations
until the stream is closed. The entire implementation is only about a
dozen lines of code.</p>

<p>What use is this? It’s nice when I might want to output either to standard
output or to a memory buffer for further use. It’s even more useful when I
need to build a string but don’t know its final length ahead of time.</p>

<p>The variable expansion function is both cases. Given a string like
<code class="language-plaintext highlighter-rouge">${version}</code> I want to recursively interpolate until there’s nothing left
to interpolate. The output could go to standard output to print it out, or
into a string for further use. For example, here I have my global variable
environment <code class="language-plaintext highlighter-rouge">global</code>, a package <code class="language-plaintext highlighter-rouge">pkg</code>, its environment (<code class="language-plaintext highlighter-rouge">pkg-&gt;env</code>), and I
want to expand its <code class="language-plaintext highlighter-rouge">Version:</code> field, <code class="language-plaintext highlighter-rouge">pkg-&gt;version</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Out</span> <span class="n">mem</span> <span class="o">=</span> <span class="n">newmembuf</span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
    <span class="n">expand</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mem</span><span class="p">,</span> <span class="n">global</span><span class="p">,</span> <span class="n">pkg</span><span class="p">,</span> <span class="n">pkg</span><span class="o">-&gt;</span><span class="n">version</span><span class="p">);</span>
    <span class="n">Str</span> <span class="n">version</span> <span class="o">=</span> <span class="n">finalize</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mem</span><span class="p">);</span>
</code></pre></div></div>

<p>Or I just print it to standard output, and the value is free to expand
beyond what would fit in memory since it flushes as it goes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">Out</span> <span class="n">out</span> <span class="o">=</span> <span class="n">newoutput</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>  <span class="c1">// 1 == standard output</span>
    <span class="n">expand</span><span class="p">(</span><span class="o">&amp;</span><span class="n">out</span><span class="p">,</span> <span class="n">global</span><span class="p">,</span> <span class="n">pkg</span><span class="p">,</span> <span class="n">pkg</span><span class="o">-&gt;</span><span class="n">version</span><span class="p">);</span>
    <span class="n">flush</span><span class="p">(</span><span class="o">&amp;</span><span class="n">out</span><span class="p">);</span>
</code></pre></div></div>

<p>I’m particularly happy about this, and I’m sure I’ll use such “arena
streams” again in the future.</p>
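
<p>For the curious, an arena stream along these lines might be sketched as follows. This is not u-config’s actual implementation: <code class="language-plaintext highlighter-rouge">newmembuf</code> and <code class="language-plaintext highlighter-rouge">finalize</code> borrow the names used above, while <code class="language-plaintext highlighter-rouge">outstr</code>, the use of <code class="language-plaintext highlighter-rouge">fd == -1</code> to mark a memory stream, and <code class="language-plaintext highlighter-rouge">abort</code> on exhaustion are assumptions. The sketch also does not enforce the lock; allocating from the arena while the stream is open would overlap the buffer.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char Byte;  // assumption
typedef int Size;
typedef struct { Byte *s; Size len; } Str;
typedef struct { Str mem; Size off; } Arena;
typedef struct {
    Str buf;
    Size fill;
    Arena *a;
    int fd;
} Out;

#define SIZEOF(x) (Size)(sizeof(x))
#define S(s) (Str){(Byte *)s, SIZEOF(s)-1}

// Open a stream whose buffer is the arena's entire unused region.
static Out newmembuf(Arena *a)
{
    Out out = {0};
    out.buf.s = a->mem.s + a->off;
    out.buf.len = a->mem.len - a->off;
    out.a = a;
    out.fd = -1;  // assumption: -1 marks a memory-backed stream
    return out;
}

// Hypothetical writer: append s to the stream (memory streams only).
static void outstr(Out *out, Str s)
{
    assert(out->fd == -1);
    if (out->buf.len - out->fill < s.len) {
        abort();  // arena exhausted
    }
    memcpy(out->buf.s + out->fill, s.s, (size_t)s.len);
    out->fill += s.len;
}

// Close the stream: commit the written bytes as an arena allocation
// and return them as a string.
static Str finalize(Out *out)
{
    Str r = {out->buf.s, out->fill};
    out->a->off += out->fill;
    out->buf.len = 0;  // the stream is spent
    out->fill = 0;
    return r;
}
```

<p>Because the buffer is simply the tail of the arena, the string never needs to be copied or resized; closing the stream only bumps the arena offset past whatever was written.</p>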

<h3 id="subtleties">Subtleties</h3>

<p>While pkgconf tries, and succeeds at, being a faithful (if smarter) clone,
in certain ways u-config more closely follows pkg-config’s behavior. For
example, pkg-config behaves as though it concatenates all its positional
arguments with commas in between, then re-tokenizes them like a <code class="language-plaintext highlighter-rouge">Requires</code>
field. For example, these commands are all equivalent:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pkg-config 'sdl2 &gt; 2' --libs
$ pkg-config 'sdl2 &gt;' --libs 2
$ pkg-config sdl2 --libs '&gt; 2'
$ pkg-config --libs 'sdl2 &gt; 2'
</code></pre></div></div>

<p>pkgconf does not copy this behavior, but u-config does. Similarly, the
original <code class="language-plaintext highlighter-rouge">.pc</code> format has undocumented, arcane quoting syntax that sort of
works like shell quotes. I tried to match this closely in u-config, while
pkgconf tries to be more logical. For example, pkg-config allows this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quote = "
Cflags: "-I${prefix}/include${quote}
</code></pre></div></div>

<p>Where the <code class="language-plaintext highlighter-rouge">${quote}</code> will actually close the quote. I retained this but
pkgconf did not.</p>

<p>Does anyone use quoting? On my own system I have one package using quotes,
but it’s probably a mistake since they’re used improperly. In theory,
everyone should be quoting almost everything. For example, this is a very
common <code class="language-plaintext highlighter-rouge">Cflags</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cflags: -I${prefix}/include
</code></pre></div></div>

<p>If a crazy person — or well-known multinational corporation — comes along
and puts a space in their system’s installation “prefix”, this <code class="language-plaintext highlighter-rouge">.pc</code> will
not work. The output would be:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-I/Program Files/include
</code></pre></div></div>

<p>Actually, that’s a lie. I suspect that’s the <em>intended</em> output, and it’s
the output of pkgconf and u-config, but pkg-config instead outputs this
head-scratcher:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Files/include -I/Program
</code></pre></div></div>

<p>Seeing this sort of thing repeatedly is why I have little concern with
matching every last pkg-config nuance. Regardless, this parses as two
arguments, but if written with quotes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cflags: "-I${prefix}/include"
</code></pre></div></div>

<p>Then pkg-config will escape spaces in the expansion:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-I/Program\ Files/include
</code></pre></div></div>

<p>This will actually work correctly in the <code class="language-plaintext highlighter-rouge">eval</code> context where <code class="language-plaintext highlighter-rouge">pkg-config</code>
is intended for use (read: <a href="https://github.com/skeeto/u-config/issues/1#issuecomment-1397700442"><em>not command substitution</em></a>). I’ve made
u-config automatically quote the prefix if it contains spaces, so it will
work correctly despite the lack of <code class="language-plaintext highlighter-rouge">.pc</code> file quotes when the library is
under a path containing a space.</p>
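
<p>The escaping step itself is simple. The following is a hypothetical sketch of backslash-escaping spaces into a caller-supplied buffer; the name <code class="language-plaintext highlighter-rouge">escapespaces</code> and its interface are invented for illustration, and u-config’s real logic may differ.</p>

```c
#include <assert.h>

typedef unsigned char Byte;  // assumption
typedef int Size;

// Hypothetical helper: copy src into dst, placing a backslash before
// each space. Returns the output length, or -1 if dst is too small.
static Size escapespaces(Byte *dst, Size cap, Byte *src, Size len)
{
    assert(cap >= 0 && len >= 0);
    Size n = 0;
    for (Size i = 0; i < len; i++) {
        Size need = src[i] == ' ' ? 2 : 1;
        if (cap - n < need) {
            return -1;
        }
        if (src[i] == ' ') {
            dst[n++] = '\\';
        }
        dst[n++] = src[i];
    }
    return n;
}
```

<p>Given <code class="language-plaintext highlighter-rouge">-I/Program Files/include</code> as input, the output is the escaped form shown above, which survives as a single argument under <code class="language-plaintext highlighter-rouge">eval</code>.</p>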

<p>Here’s a fun input. pkg-config has its own <a href="https://en.wikipedia.org/wiki/Billion_laughs_attack">billion laughs</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>v9=lol
v8=${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}
v7=${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}
v6=${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}
v5=${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}
v4=${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}
v3=${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}
v2=${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}
v1=${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}
v0=${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}
Name: One Billion Laughs
Version: ${v0}
Description: Don't install this!
</code></pre></div></div>

<p>That expands to 1,000,000,001 “lol” (an extra for good luck!) and in
theory <code class="language-plaintext highlighter-rouge">--modversion</code> will print it out:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pkg-config --modversion lol.pc
</code></pre></div></div>

<p>Some different outcomes:</p>

<ul>
  <li>
    <p>pkg-config will expand it in memory and see it to the bitter end, using
however many GiBs are necessary. Add a few more lines and your computer
will thrash. By the way, bash-completion will ask pkg-config to load <code class="language-plaintext highlighter-rouge">.pc</code>
files named in the command when completing further arguments. Ask me how
I know.</p>
  </li>
  <li>
    <p>u-config could fully output it with only a few kB of memory if directed
to a “file descriptor” output, but alas, the <code class="language-plaintext highlighter-rouge">Version</code> field must be
processed in memory for comparison with another version string, so it
doesn’t attempt to do so. It runs out of arena memory and gives up.
That’s a feature, especially if you’re using bash-completion.</p>
  </li>
  <li>
    <p>pkgconf I had built with Address Sanitizer in case it found anything,
and boy did it. This input overflows a stack variable and then ASan
kills it. I’m unsure what’s supposed to happen next, but I suspect
silent truncation.</p>
  </li>
</ul>

<p>But that’s a crazy edge case, right? Well, it also overflows on <em>empty
<code class="language-plaintext highlighter-rouge">.pc</code> files</em>, or for all sorts of inputs. I probed both pkg-config and
pkgconf with weird inputs to learn how it’s supposed to work, and it was
rather irritating having pkgconf crash for so many of them. Someone on the
project ought to do testing with ASan sometime. Important note: <em>This is
not a security vulnerability</em>!</p>

<p>Further, as you might notice when you build it, pkgconf first tries to
link the system <code class="language-plaintext highlighter-rouge">strlcpy</code>, if it exists. Failing that, it uses its own
version. That’s one of the annoying details about building it. However,
<a href="/blog/2021/07/30/">using <code class="language-plaintext highlighter-rouge">strlcpy</code> never, ever makes sense</a>! Now that I think about
it, there’s probably a connection with those buffer overflows.</p>

<p>In general, neither pkg-config nor pkgconf fares well when <a href="/blog/2019/01/25/">fuzz tested
with sanitizers</a>.</p>

<h3 id="conclusions">Conclusions</h3>

<p>I had a lot of fun writing u-config, and I’m excited about this new
addition to w64devkit. Despite my pkg-config grumbling, it <em>is</em> neat that
it’s established this <em>de facto</em> standard and encouraged a distributed
database of <code class="language-plaintext highlighter-rouge">.pc</code> files to exist, at least as documentation if not for a
mechanical process like this.</p>

<p>For u-config, there’s still more testing to do, and I’m still open to
picking up more behaviors from pkg-config or pkgconf where they make
sense. Though given its primary use case — building software on Windows
without a package manager — it will probably never be stressed hard enough
to matter. Further, w64devkit does not include any <code class="language-plaintext highlighter-rouge">.pc</code> files of its own,
and since I do not intend to add libraries — that is, beyond the standard
language libraries and Windows SDK — that probably won’t change.</p>

<p>If you’d like to try it early, build it with w64devkit, toss it on your
<code class="language-plaintext highlighter-rouge">PATH</code>, point <code class="language-plaintext highlighter-rouge">PKG_CONFIG_PATH</code> at a library with <code class="language-plaintext highlighter-rouge">.pc</code> files, and try it
out. It already works flawlessly with at least SDL2.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>SDL2 common mistakes and how to avoid them</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/01/08/"/>
    <id>urn:uuid:5b345c81-80d1-4459-981f-b5826a2bb8e7</id>
    <updated>2023-01-08T02:09:26Z</updated>
    <category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://old.reddit.com/r/C_Programming/comments/106djd0/sdl2_common_mistakes_and_how_to_avoid_them/">on reddit</a>.</em></p>

<p><a href="https://www.libsdl.org/">SDL</a> has grown on me over the past year. I didn’t understand its value
until viewing it in the right lens: as a complete platform and runtime
replacing the host’s runtime, possibly including libc. Ideally an SDL
application links exclusively against SDL and otherwise not directly
against host libraries, though in practice it’s somewhat porous. With care
— particularly in avoiding mistakes covered in this article — that ideal
is quite achievable for C applications that fit within SDL’s feature set.</p>

<!--more-->

<p>SDL applications are always interesting one way or another, so I like to
dig in when I come across them. The items in this article are mistakes
I’ve either made myself or observed across many such passion projects in
the wild.</p>

<h3 id="mistake-1-not-using-sdl2-config">Mistake 1: Not using <code class="language-plaintext highlighter-rouge">sdl2-config</code></h3>

<p>This shell script comes with SDL2 and smooths over differences between
platforms, even when cross compiling. It informs your compiler where to
find and how to link SDL2. The script even works on Windows if you have a
unix shell, such as via <a href="https://github.com/skeeto/w64devkit">w64devkit</a>. Use it as a command substitution at
the end of the build command, particularly when using <code class="language-plaintext highlighter-rouge">--libs</code>. A one-shot
or <a href="https://en.wikipedia.org/wiki/Unity_build">unity build</a> (my preference) looks like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc app.c $(sdl2-config --cflags --libs)
</code></pre></div></div>

<p>Or under separate compilation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -c app.c $(sdl2-config --cflags)
$ cc app.o $(sdl2-config --libs)
</code></pre></div></div>

<p>Alternatively, static link by replacing <code class="language-plaintext highlighter-rouge">--libs</code> with <code class="language-plaintext highlighter-rouge">--static-libs</code>,
though this is discouraged by the SDL project. When dynamically linked,
users can, and do, trivially substitute a different SDL2 binary, such as
one patched for their system. In my experience, static linking works
reliably on Windows but poorly on Linux.</p>

<p>Alternatively, use the general purpose <code class="language-plaintext highlighter-rouge">pkg-config</code>. Don’t forget <code class="language-plaintext highlighter-rouge">eval</code>!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ eval cc app.c $(pkg-config sdl2 --cflags --libs)
</code></pre></div></div>

<p>I wrote <a href="/blog/2023/01/18/">a pkg-config for Windows</a> specifically for this case.</p>

<p>Caveats:</p>

<ul>
  <li>
    <p>Some circumstances require special treatment, and <code class="language-plaintext highlighter-rouge">sdl2-config</code> may be
too blunt a tool. That’s fine, but generally prefer <code class="language-plaintext highlighter-rouge">sdl2-config</code> as the
default approach.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">sdl2-config</code> does not support extensions such as <code class="language-plaintext highlighter-rouge">SDL2_image</code>, so you
will need to use <code class="language-plaintext highlighter-rouge">pkg-config</code>. Personally I don’t think they’re worth
the trouble when there’s <a href="https://github.com/nothings/stb">stb</a>, or <a href="/blog/2022/12/18/">QOI instead of PNG</a>.</p>
  </li>
  <li>
    <p>There’s an alternative build option using CMake, without any use of
<code class="language-plaintext highlighter-rouge">sdl2-config</code>, but I won’t discuss it here.</p>
  </li>
</ul>

<h3 id="mistake-2-including-sdl2sdlh">Mistake 2: Including <code class="language-plaintext highlighter-rouge">SDL2/SDL.h</code></h3>

<p>A lot of examples, including tutorials linked from the official SDL
website, have <code class="language-plaintext highlighter-rouge">SDL2/</code> in their include paths. That’s because they’re
making mistake 1, not using <code class="language-plaintext highlighter-rouge">sdl2-config</code>, and are instead relying on
Linux distributions having installed SDL2 in a place <em>coincidentally</em>
accessible through that include path.</p>

<p>This is annoying when SDL2 is <em>not</em> installed there, or if I don’t want it
using the system’s SDL2. Worse, it can result in subtly broken builds as
it mixes and matches different SDL installations. The correct SDL2 include
is the following:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"SDL.h"</span><span class="cp">
</span></code></pre></div></div>

<p>Note the quotes, which help prevent picking up an arbitrary system header
by accident. When carefully and narrowly targeting SDL-the-platform, this
will be the only “system” include anywhere in your application.</p>

<h3 id="mistake-3-not-surrendering-main">Mistake 3: Not surrendering <code class="language-plaintext highlighter-rouge">main</code></h3>

<p>A conventional SDL application has a <code class="language-plaintext highlighter-rouge">main</code> function defined in its
source, but despite the name, this is distinct from C <code class="language-plaintext highlighter-rouge">main</code>. To smooth
over <a href="/blog/2022/02/18/">platform differences</a>, SDL may rename the application’s <code class="language-plaintext highlighter-rouge">main</code>
to <code class="language-plaintext highlighter-rouge">SDL_main</code> and substitute its own C <code class="language-plaintext highlighter-rouge">main</code>. Because of this, <code class="language-plaintext highlighter-rouge">main</code>
must have the conventional <code class="language-plaintext highlighter-rouge">argc</code>/<code class="language-plaintext highlighter-rouge">argv</code> prototype and must return a
value. (As a special case, C permits <code class="language-plaintext highlighter-rouge">main</code> to implicitly <code class="language-plaintext highlighter-rouge">return 0</code>, so
it’s an easy mistake to make.)</p>

<p>With this in mind, the bare minimum SDL2 application:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"SDL.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Caveat: Like with <code class="language-plaintext highlighter-rouge">sdl2-config</code>, some special circumstances require
control over the application entry point — see <code class="language-plaintext highlighter-rouge">SDL_MAIN_HANDLED</code> and
<code class="language-plaintext highlighter-rouge">SDL_SetMainReady</code> — but that should be reserved until there’s a need.</p>

<p>One such special case is avoiding linking a CRT on Windows. In principle
it’s this simple:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"SDL.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SDL_SetMainReady</span><span class="p">();</span>
    <span class="c1">// ...</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then it’s <a href="/blog/2016/01/31/">the usual compiler and linker flags</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -nostdlib -o app.exe app.c $(sdl2-config --cflags --libs)
</code></pre></div></div>

<p>This will create a tiny <code class="language-plaintext highlighter-rouge">.exe</code> that doesn’t link any system DLL, just
<code class="language-plaintext highlighter-rouge">SDL2.dll</code>. Quite platform agnostic indeed!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -p app.exe | grep -Fi .dll
        DLL Name: SDL2.dll
</code></pre></div></div>

<p>Alas, as of this writing, this does not work reliably. SDL2’s accelerated
renderers on Windows do not clean up properly in <code class="language-plaintext highlighter-rouge">SDL_QuitSubSystem</code> nor
<code class="language-plaintext highlighter-rouge">SDL_Quit</code>, so the process cannot exit without calling ExitProcess in
<code class="language-plaintext highlighter-rouge">kernel32.dll</code> (or similar). This is still an open experiment.</p>

<h3 id="mistake-4-using-the-sdl-wiki-for-api-documentation">Mistake 4: Using the SDL wiki for API documentation</h3>

<p>The <a href="https://wiki.libsdl.org/SDL2/FrontPage">SDL wiki</a> is not authoritative documentation, merely a <em>convenient</em>
web-linkable — and downloadable (see “offline html”) — information source.
However, anyone who’s spent time on it can tell you it’s incomplete. The
authoritative API documentation is <em>the SDL headers</em>, which fortunately
are already on hand for building SDL applications. The SDL maintainers
<a href="https://www.youtube.com/playlist?list=PL6m6sxLnXksbqdsAcpTh4znV9j70WkmqG">themselves use the headers, not the wiki</a>.</p>

<p>If, like me, you’re using <a href="https://github.com/universal-ctags/ctags">ctags</a>, this is actually good news! With a
bit of configuration, you can jump to any bit of SDL documentation at any
time in your editor, treating the SDL headers like a hyperlinked wiki
built into your editor. Just like building, <code class="language-plaintext highlighter-rouge">sdl2-config</code> can tell ctags
where to find those headers:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ctags -a -R --kinds-c=dept $(sdl2-config --prefix)/include/SDL2
</code></pre></div></div>

<p>I’m using <code class="language-plaintext highlighter-rouge">-a</code> (<code class="language-plaintext highlighter-rouge">--append</code>) to append to the tags file I’ve already
generated for my own program, <code class="language-plaintext highlighter-rouge">-R</code> (<code class="language-plaintext highlighter-rouge">--recurse</code>) to automatically find all
the headers, and <code class="language-plaintext highlighter-rouge">--kinds-c=dept</code> to capture exactly the kinds of symbols I
care about (<code class="language-plaintext highlighter-rouge">#define</code>, <code class="language-plaintext highlighter-rouge">enum</code>, prototypes, <code class="language-plaintext highlighter-rouge">typedef</code>), no more, no less.</p>

<p>In Vim I <code class="language-plaintext highlighter-rouge">CTRL-]</code> over any SDL symbol to jump to its documentation, and
then I can use it again within its documentation comment to jump further
still to any symbols it mentions, then finally use the jump or tag stack
to return. As long as I have <code class="language-plaintext highlighter-rouge">t</code> in <a href="https://vimdoc.sourceforge.net/htmldoc/options.html#'cpt'"><code class="language-plaintext highlighter-rouge">'complete'</code></a> (<code class="language-plaintext highlighter-rouge">'cpt'</code>), which
is the default, I can also “tab”-complete any SDL symbol using the tags
table. There are a few rough edges here and there, but overall it’s a
solid editing paradigm.</p>

<p>By the way, with <code class="language-plaintext highlighter-rouge">sdl2-config</code> in your <code class="language-plaintext highlighter-rouge">$PATH</code>, all the above works out of
the box in w64devkit! That’s where I’ve mostly been working with SDL.</p>

<h3 id="mistake-5-using-stdio-streams">Mistake 5: Using stdio streams</h3>

<p>A common bit of code in real SDL programs and virtually every tutorial:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">SDL_Init</span><span class="p">(...))</span> <span class="p">{</span>
    <span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"SDL_Init(): %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">());</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is not ideal:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">fprintf</code> is not part of the SDL platform. This is going behind SDL’s
back, reaching around the abstraction to a different platform. Strictly
speaking, this API may not even be available to an SDL application.</p>
  </li>
  <li>
    <p>SDL applications are graphical, so <code class="language-plaintext highlighter-rouge">stderr</code> is likely disconnected from
anything useful. Few would ever see this message.</p>
  </li>
</ul>

<p>Fortunately SDL provides two alternatives:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">SDL_Log</code>: like C <code class="language-plaintext highlighter-rouge">printf</code>, but SDL will strive to connect it to
somewhere useful. If the application was launched from a terminal or
console, SDL will find it and hook it up to the logger. On Windows, if
there’s a debugger attached, SDL will use <a href="https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw">OutputDebugString</a> to
send logs to the debugger.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">SDL_ShowSimpleMessageBox</code>: using any means possible, attempt to display
a message to the user. Like <code class="language-plaintext highlighter-rouge">SDL_Log</code>, it’s safe to use before/without
initializing SDL subsystems.</p>
  </li>
</ul>

<p>If you’re paranoid, you could even use both:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">SDL_Init</span><span class="p">(...))</span> <span class="p">{</span>
    <span class="n">SDL_ShowSimpleMessageBox</span><span class="p">(</span>
        <span class="n">SDL_MESSAGEBOX_ERROR</span><span class="p">,</span> <span class="s">"SDL_Init()"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">(),</span> <span class="mi">0</span>
    <span class="p">);</span>
    <span class="n">SDL_Log</span><span class="p">(</span><span class="s">"SDL_Init(): %s"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">());</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Though note that <code class="language-plaintext highlighter-rouge">SDL_ShowSimpleMessageBox</code> can fail, which will set a
new, different error message for <code class="language-plaintext highlighter-rouge">SDL_Log</code>!</p>

<p>There’s a similar story again with <code class="language-plaintext highlighter-rouge">fopen</code> and loading assets. SDL has an
I/O API, <code class="language-plaintext highlighter-rouge">SDL_RWops</code>. It’s probably better than the host’s C equivalent,
particularly with regard to paths. If you’re not already embedding your
assets, use the SDL API instead.</p>

<h3 id="mistake-6-using-sdl_renderer_accelerated">Mistake 6: Using <code class="language-plaintext highlighter-rouge">SDL_RENDERER_ACCELERATED</code></h3>

<p>This flag — and its surrounding bit set, <code class="language-plaintext highlighter-rouge">SDL_RendererFlags</code> — are a
subtle design flaw in the SDL2 API. Its existence is misleading, leading
to widespread misuse. It does not help that the documentation, both header
and wiki, is incomplete and unclear. The <code class="language-plaintext highlighter-rouge">SDL_CreateRenderer</code> function
accepts a bit set as its third argument, and it serves two simultaneous
purposes:</p>

<ul>
  <li>
    <p>Indicates <em>mandatory</em> properties of the renderer. Examples: “must use
accelerated rendering,” “must use software rendering,” “must support
vertical synchronization (vsync).” Drivers without the chosen properties
are skipped.</p>
  </li>
  <li>
    <p>If <code class="language-plaintext highlighter-rouge">SDL_RENDERER_PRESENTVSYNC</code> is set, also enables vsync in the created
renderer.</p>
  </li>
</ul>

<p>The common mistake is thinking that this bit indicates preference: “prefer
an accelerated renderer if possible”. But it really means “accelerated
renderer or bust.”</p>

<p>Given a zero for renderer flags, SDL will first attempt to create an
accelerated renderer. Failing that, it will then attempt to create a
software renderer. A software renderer fallback is exactly the behavior
you want! After all, this fallback is one of the primary features of the
SDL renderer API. This is so straightforward there are no caveats.</p>
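
<p>To make the “filter, not preference” reading concrete, here is a toy model of the driver search in plain C. It is a sketch under assumptions, not SDL’s implementation: the <code class="language-plaintext highlighter-rouge">toydriver</code> struct, the <code class="language-plaintext highlighter-rouge">TOY_*</code> bits, and <code class="language-plaintext highlighter-rouge">toyselect</code> are all hypothetical names invented for illustration.</p>

```c
#include <stddef.h>

// Hypothetical property bits for this toy model (not SDL's definitions).
enum {
    TOY_SOFTWARE    = 1 << 0,
    TOY_ACCELERATED = 1 << 1,
    TOY_VSYNC       = 1 << 2,
};

struct toydriver {
    const char *name;
    unsigned    flags;      // properties this driver offers
    int         available;  // nonzero if it works on this host
};

// Return the first available driver offering every requested property,
// or null if none qualifies. Requested bits act as a filter, never as
// a preference.
static const struct toydriver *
toyselect(const struct toydriver *drivers, int len, unsigned request)
{
    for (int i = 0; i < len; i++) {
        if (!drivers[i].available) {
            continue;
        }
        if ((drivers[i].flags & request) == request) {
            return drivers + i;
        }
    }
    return NULL;
}
```

<p>With an unavailable accelerated driver listed ahead of a software driver, a request for <code class="language-plaintext highlighter-rouge">TOY_ACCELERATED</code> finds nothing, while a request of zero lands on the software driver.</p>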

<h3 id="mistake-7-not-accounting-for-vsync">Mistake 7: Not accounting for vsync</h3>

<p>For a game, you probably ought to enable vsync in your renderer. The hint:
You’re using <code class="language-plaintext highlighter-rouge">SDL_PollEvent</code> in your main event loop. Without vsync, such a loop
wastes lots of resources rendering thousands of frames per second. If my
laptop fan spins up running your SDL application, it’s probably because
you didn’t do this. The following should be the most conventional SDL
renderer configuration:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">r</span> <span class="o">=</span> <span class="n">SDL_CreateRenderer</span><span class="p">(</span><span class="n">window</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">SDL_RENDERER_PRESENTVSYNC</span><span class="p">);</span>
</code></pre></div></div>

<p>The software renderer supports vsync, so it will not be excluded from the
driver search when vsync is requested.</p>

<p>That’s only for SDL renderers. If you’re using OpenGL, set a non-zero
<code class="language-plaintext highlighter-rouge">SDL_GL_SetSwapInterval</code> so that <code class="language-plaintext highlighter-rouge">SDL_GL_SwapWindow</code> synchronizes. For the
other rendering APIs, consult their documentation. (I can only speak to
SDL and OpenGL from experience.)</p>

<p>Caveat: Beware accidentally relying on vsync for timing in your game. You
don’t want your game’s physics to depend on the host’s display speed. Even
the pros make this mistake from time to time.</p>
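
<p>One common way to avoid that dependency is a fixed timestep with an accumulator. The following is a minimal sketch, not anything SDL provides: <code class="language-plaintext highlighter-rouge">clockstate</code>, <code class="language-plaintext highlighter-rouge">advance</code>, and the 10 ms <code class="language-plaintext highlighter-rouge">STEP_NS</code> are hypothetical, and the caller is assumed to measure each frame’s duration itself.</p>

```c
#include <stdint.h>

#define STEP_NS 10000000  // fixed 10 ms physics step (arbitrary choice)

struct clockstate {
    int64_t  pending;  // elapsed time not yet simulated, in nanoseconds
    uint64_t steps;    // total physics steps taken so far
};

// Account for one rendered frame that took frame_ns nanoseconds, and
// return how many fixed physics steps to run before drawing the next.
static int advance(struct clockstate *c, int64_t frame_ns)
{
    int n = 0;
    c->pending += frame_ns;
    while (c->pending >= STEP_NS) {
        c->pending -= STEP_NS;
        c->steps++;
        n++;
    }
    return n;
}
```

<p>Fifty 20 ms frames and two hundred 5 ms frames both add up to one second, and both yield the same 100 physics steps, so the simulation no longer follows the display rate.</p>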

<p>However, if you’re <em>not</em> making a game (perhaps instead an <a href="https://caseymuratori.com/blog_0001">IMGUI</a>
application <em>without active animations</em>), there’s a good chance you don’t
need or want vsync. The hint: You’re using <code class="language-plaintext highlighter-rouge">SDL_WaitEvent</code> in your main
event loop.</p>

<p>In summary, graphical SDL applications fall into one of two cases:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">SDL_PollEvent</code> with vsync</li>
  <li><code class="language-plaintext highlighter-rouge">SDL_WaitEvent</code> without vsync</li>
</ul>

<h3 id="mistake-8-using-asserth-instead-of-sdl_assert">Mistake 8: Using <code class="language-plaintext highlighter-rouge">assert.h</code> instead of <code class="language-plaintext highlighter-rouge">SDL_assert</code></h3>

<p>Alright, this one isn’t so common, but I’d like to highlight it. <strong>The
<code class="language-plaintext highlighter-rouge">SDL_assert</code> macro is fantastic</strong>, easily beating <code class="language-plaintext highlighter-rouge">assert.h</code> which
<a href="/blog/2022/06/26/">doesn’t even break in the right place</a>. It uses SDL to present a
user interface to the assertion, with support for retrying and ignoring.
It also works great under debuggers, breaking exactly as it should. I have
nothing but praise for it, so don’t pass up the chance to use it when you
can.</p>

<p>While I’m at it: during development and testing, <em>always always always</em> run
your application under a debugger. Don’t close the debugger, just launch
through it again after rebuilding. Also, enable UBSan and ASan when
available for the extra assertions.</p>

<h3 id="sdl-wishlist">SDL wishlist</h3>

<p>For months I had wondered why SDL provides no memory allocation API. I’m
fine if it doesn’t have a general purpose allocator since I just want to
grab a chunk of host memory <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">for an arena</a>. However, SDL <em>does</em>
have allocation functions: <code class="language-plaintext highlighter-rouge">SDL_malloc</code>, etc. I didn’t know about them
until I stopped making mistake 4.</p>

<p>It was the same story again with math functions: I’d like not to stray
from SDL as a platform, but what if I need transcendental functions? I
could whip up crude implementations myself, but I’d prefer not. SDL has
those too: <code class="language-plaintext highlighter-rouge">SDL_sin</code>, etc. Caveat: The <code class="language-plaintext highlighter-rouge">math.h</code> functions are built-ins,
and compilers use that information to better optimize programs, e.g. cool
stuff like <code class="language-plaintext highlighter-rouge">-mrecip</code>, or SIMD vectorization. That cannot be done with
SDL’s equivalents.</p>

<p>I’m surprised SDL has no random number generator considering how important
it is to games. Since I <a href="/blog/2017/09/21/">prefer to handle this myself</a>, I don’t mind
that so much, but it does leave a lot of toy programs out there calling C
<code class="language-plaintext highlighter-rouge">rand</code>. I <em>would</em> like it if SDL provided <a href="/blog/2019/04/30/">a single, good seed early during
startup</a>. There isn’t even a wall clock function for the classic
<code class="language-plaintext highlighter-rouge">srand(time(0))</code> seeding event! My solution has been to mix event
timestamps into the random state:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">Uint32</span> <span class="nf">rand32</span><span class="p">(</span><span class="n">Uint64</span> <span class="o">*</span><span class="p">);</span>

<span class="n">Uint64</span> <span class="n">rng</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">SDL_Event</span> <span class="n">e</span><span class="p">;</span> <span class="n">SDL_PollEvent</span><span class="p">(</span><span class="o">&amp;</span><span class="n">e</span><span class="p">);)</span> <span class="p">{</span>
    <span class="n">rng</span> <span class="o">^=</span> <span class="n">e</span><span class="p">.</span><span class="n">common</span><span class="p">.</span><span class="n">timestamp</span><span class="p">;</span>
    <span class="n">rand32</span><span class="p">(</span><span class="o">&amp;</span><span class="n">rng</span><span class="p">);</span>  <span class="c1">// stir</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
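
<p>The <code class="language-plaintext highlighter-rouge">rand32</code> prototype above is left undefined. As a sketch of what could stand behind it, assume a plain 64-bit LCG that returns its high bits. That generator is chosen here purely for illustration, with <code class="language-plaintext highlighter-rouge">stdint.h</code> types in place of SDL’s so it builds on its own, and <code class="language-plaintext highlighter-rouge">mix</code> is a hypothetical helper wrapping the two statements from the loop:</p>

```c
#include <stdint.h>

// Hypothetical rand32: advance a 64-bit LCG and return the high 32 bits.
// The multiplier is a well-known LCG constant, not taken from the article.
static uint32_t rand32(uint64_t *s)
{
    *s = *s*0x5851f42d4c957f2du + 1;
    return (uint32_t)(*s >> 32);
}

// Fold one event timestamp into the state, then stir.
static void mix(uint64_t *s, uint32_t timestamp)
{
    *s ^= timestamp;
    rand32(s);
}
```

<p>Because the multiplier is odd, the stir step never collapses two distinct states into one, so a timestamp that differs by even a millisecond leaves a different state behind.</p>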

<p>As I learn more in the future, I may come back and add to this list. At
the very least I expect to use SDL increasingly in my own projects.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>QOI is now my favorite asset format</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/12/18/"/>
    <id>urn:uuid:184bb5f6-3c31-4faf-9a15-3a693b8f4c7d</id>
    <updated>2022-12-18T03:45:44Z</updated>
    <category term="c"/><category term="compression"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=34035024">on Hacker News</a>.</em></p>

<p>The <a href="https://qoiformat.org/">Quite OK Image (QOI) format</a> was announced late last year and
finalized into a specification a month later. Initially dismissive, a
revisit has shifted my opinion to impressed. The format hits a sweet spot
in the trade-off space between complexity, speed, and compression ratio.
Also considering its alpha channel support, QOI has become my default
choice for embedded image assets. It’s not perfect, but at the very least
it’s a solid foundation.</p>

<!--more-->

<p>Since I’m now working with QOI images, I need a good QOI viewer, and so I
added support to my ill-named <a href="https://github.com/skeeto/scratch/tree/master/pbmview">pbmview</a> tool, which I wrote to
serve the same purpose for <a href="https://netpbm.sourceforge.net/doc/ppm.html">Netpbm</a>. I will <a href="/blog/2020/06/29/">continue to use Netpbm
as an output format</a>, especially for raw video output, but no
longer will I use it for an embedded asset (nor re-invent yet another
<a href="https://en.wikipedia.org/wiki/Run-length_encoding">RLE</a> over Netpbm).</p>

<p>I was dismissive because the website claimed, and still claims today, QOI
images are “a similar size” to PNG. However, for the typical images where
I would use PNG, QOI is around 3x larger, and some outliers are far worse.
The 745 PNGs on my blog — a perfect test corpus for my own needs — convert
to QOIs 2.8x larger on average. The official QOI benchmark has much better
results, 1.3x larger, but that’s because it includes a lot of photography
where PNG and QOI both do poorly, making QOI seem more comparable.</p>

<p>However, as I said, QOI’s strength is its trade-off sweet spot. The
<a href="https://qoiformat.org/qoi-specification.pdf">specification is one page</a>, and an experienced developer can write
a complete implementation from scratch in a single sitting. <a href="https://github.com/skeeto/scratch/blob/master/parsers/qoi.c">My own
implementation is about 100 lines of libc-free C</a> for each of the
encoder and decoder. With error checking removed, my decoder is ~600 bytes
of x86 object code — a great story for embedding alongside assets. It’s
more complex than Netpbm or <a href="https://tools.suckless.org/farbfeld/">farbfeld</a>, but it’s far simpler than BMP.
I’ve already begun <a href="https://github.com/skeeto/chess/commit/5c123b3">experimenting with converting assets to QOI</a>,
and the results have so far exceeded my expectations.</p>

<p>To my surprise, the encoder was easier to write than the decoder. The
format is so straightforward that two different encoders will produce
identical files. There’s little room for specialized optimization, and
no meaningful “compression level” knob.</p>

<h3 id="criticism">Criticism</h3>

<p>There are a <a href="https://github.com/nigeltao/qoi2-bikeshed">lot of dimensions</a> on which QOI could be improved,
but most cases involve trade-offs, e.g. more complexity for better
compression. The areas where QOI could have been strictly better, the
dimensions on which it is not on the Pareto frontier, are more meaningful
criticisms — missed opportunities. My criticisms of this kind:</p>

<ul>
  <li>
    <p>Big endian fields are an odd choice for a 2020s file format. Little
endian dominates the industry, and it would have made for a slightly
smaller decoder footprint on typical machines today if QOI used little
endian.</p>
  </li>
  <li>
    <p>The header has two flags and spends an entire byte on each. It should
have instead had a flag byte, with two bits assigned to these flags. One
flag indicates if the alpha channel is important, and the other selects
between two color spaces (sRGB, linear). Both flags are only advisory.</p>
  </li>
  <li>
    <p>The 4-channel encoded pixel format is ABGR (or RGBA), placing the alpha
channel next to the blue channel. This is somewhat unconventional. A
decoder is likely to use a single load into a 32-bit integer, and ideally
it’s already in the desired format or close to it. A few times already
I’ve had to shuffle the RGB bytes within the 32-bit sample to be
compatible with some other format. QOI channel ordering is arbitrary,
and I would have chosen ARGB (when viewed as little endian).</p>
  </li>
  <li>
    <p>The QOI hash function operates on channels individually, with individual
overflow, making it slower and larger than necessary. The hash function
should have been <a href="/blog/2018/07/31/">over a packed 32-bit sample</a>. I would have used
<a href="/blog/2022/08/08/#hash-functions">a multiplication</a> by a carefully-chosen 32-bit integer, then a
right shift using the highest 6 bits of the result for the index.</p>
  </li>
</ul>
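
<p>For comparison, here is the per-channel index function as given in the QOI specification next to a multiply-and-shift over the packed sample. This is only a sketch: the byte order (red in the low byte) and the multiplier <code class="language-plaintext highlighter-rouge">0x9e3779b1</code> are assumptions made for illustration, not tuned constants, and <code class="language-plaintext highlighter-rouge">packedhash</code> is a hypothetical name.</p>

```c
#include <stdint.h>

// Index function from the QOI specification, applied to a packed sample
// with red in the low byte and alpha in the high byte (an assumption).
static int qoihash(uint32_t abgr)
{
    unsigned r = abgr >>  0 & 0xff;
    unsigned g = abgr >>  8 & 0xff;
    unsigned b = abgr >> 16 & 0xff;
    unsigned a = abgr >> 24 & 0xff;
    return (r*3 + g*5 + b*7 + a*11) % 64;
}

// Alternative over the packed sample: one multiply, then keep the
// highest 6 bits. The multiplier is an arbitrary odd constant chosen
// for illustration only.
static int packedhash(uint32_t abgr)
{
    return (uint32_t)(abgr * 0x9e3779b1u) >> 26;
}
```

<p>Both return an index from 0 to 63, but they are not interchangeable: files written with one index function cannot be decoded with the other.</p>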

<p>More subjective criticisms that might count as having trade-offs:</p>

<ul>
  <li>
    <p>Given a “flag byte” (mentioned above) it would have been free to assign
another flag bit indicating pre-multiplied alpha, also still advisory.
<a href="https://www.adriancourreges.com/blog/2017/05/09/beware-of-transparent-pixels/">You want to use pre-multiplied alpha</a> for your assets, and the
option to store them this way would help.</p>
  </li>
  <li>
    <p>There’s an 8-byte end-of-stream marker — a bit excessive — deliberately
an invalid encoding so that reads past the end of the image will result
in a decoding error. I probably would have chosen a dead simple 32-bit
checksum of packed 32-bit image samples, even if literally a sum.</p>
  </li>
</ul>
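
<p>A literal sum really is only a few lines. In this sketch the function names and the little endian 4-byte trailer layout are hypothetical, chosen to match the criticisms above rather than anything in the QOI specification:</p>

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical trailer value: the wrapping 32-bit sum of every packed
// sample in the image.
static uint32_t samplesum(const uint32_t *samples, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += samples[i];  // unsigned arithmetic wraps modulo 2^32
    }
    return sum;
}

// Compare the decoded samples against a little endian 4-byte trailer
// read from the end of the stream. Returns nonzero on a match.
static int trailerok(const uint32_t *samples, size_t count,
                     const unsigned char trailer[4])
{
    uint32_t want = (uint32_t)trailer[0] <<  0 |
                    (uint32_t)trailer[1] <<  8 |
                    (uint32_t)trailer[2] << 16 |
                    (uint32_t)trailer[3] << 24;
    return samplesum(samples, count) == want;
}
```

<p>A streaming decoder would not need the whole sample array: it could keep a running sum as each sample is delivered and compare once at the end.</p>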

<p>Of course, you’re not obligated to follow QOI exactly to spec for your own
assets, so you could always use a modified QOI with one or more of these
tweaks. That’s what I meant about it being a solid foundation: You don’t
have to start from scratch with some custom RLE. Since the format is so
simple, you can easily build your own tools — as I’ve already begun doing
myself — so you don’t need to rely on tools supporting your QOI fork.</p>

<h3 id="minimalist-api">Minimalist API</h3>

<p>I’m really happy with my QOI implementation, particularly since it’s
another example of <a href="/blog/2018/06/10/">a minimalist C API</a>: no allocating, no input or
output, and no standard library use. As usual, the expectation is that
it’s in the same translation unit where it’s used, so it’s likely inlined
into callers.</p>

<p>The encoder is streaming — it accepts and returns only a little bit of
input and output at a time. It has three functions and one struct with no
“public” fields:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">qoiencoder</span> <span class="nf">qoiencoder</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">w</span><span class="p">,</span> <span class="kt">int</span> <span class="n">h</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">flags</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">qoiencode</span><span class="p">(</span><span class="k">struct</span> <span class="n">qoiencoder</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">color</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">qoifinish</span><span class="p">(</span><span class="k">struct</span> <span class="n">qoiencoder</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>The first function initializes an encoder and writes a fixed-length header
into the QOI buffer. The <code class="language-plaintext highlighter-rouge">flags</code> field is a mode string, like <code class="language-plaintext highlighter-rouge">fopen</code>. I
would normally use bit flags, but this is <a href="https://flak.tedunangst.com/post/string-interfaces">a little experiment</a>. The
second function encodes a single pixel into the QOI buffer, returning the
number of bytes written (possibly zero). The last flushes any encoding
state and writes the end-of-stream marker. There are no errors. My typical
use so far looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
<span class="k">struct</span> <span class="n">qoiencoder</span> <span class="n">q</span> <span class="o">=</span> <span class="n">qoiencoder</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">,</span> <span class="s">"a"</span><span class="p">);</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">QOIHDRLEN</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ... compute 32-bit ABGR sample at (x, y) ...</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">qoiencode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">abgr</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">qoifinish</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">buf</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
<span class="n">fflush</span><span class="p">(</span><span class="n">file</span><span class="p">);</span>
<span class="k">return</span> <span class="nf">ferror</span><span class="p">(</span><span class="n">file</span><span class="p">);</span>
</code></pre></div></div>

<p>This appends encoder outputs to a buffered stream, but it could just as
well accumulate directly into a larger buffer, advancing the write pointer
a little after each call.</p>

<p>The decoder is two functions, but its struct has some “public” fields.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">qoidecoder</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">srgb</span><span class="p">,</span> <span class="n">error</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="n">qoidecoder</span> <span class="nf">qoidecoder</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">unsigned</span> <span class="nf">qoidecode</span><span class="p">(</span><span class="k">struct</span> <span class="n">qoidecoder</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The input is not streamed and the entire buffer must be loaded into memory
at once — not too bad since it’s compressed, and perhaps even already
loaded as part of the executable image — but the output <em>is</em> streamed,
delivering one packed 32-bit ABGR sample per call. The decoder makes no
assumptions about the output format, and the caller unpacks samples and
stores them in whatever format is appropriate (shader texture, etc.).</p>

<p>To make it easier to use, my decoder range checks to guarantee that width
and height <a href="/blog/2017/07/19/">can be multiplied without overflow</a>. Unlike encoding,
there may be errors due to invalid input, including that failed range
check. The decoder error flag is “sticky” and the decoder returns zero
samples when in an error state, so callers can wait to check for errors
until the end. (Though if you’re only decoding embedded assets, then there
are no practical errors, and checks can be removed/ignored.)</p>
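
<p>A range check of this kind can be a single expression. The sketch below is one hypothetical way to write it, not necessarily the check in the decoder itself; <code class="language-plaintext highlighter-rouge">dimsok</code> is an invented name, and rejecting zero-sized images is an extra assumption:</p>

```c
#include <limits.h>

// Return nonzero if width*height is safe to compute as an int: both
// dimensions are positive and the product cannot exceed INT_MAX.
// Dividing instead of multiplying keeps the check itself overflow-free.
static int dimsok(int width, int height)
{
    return width > 0 && height > 0 && width <= INT_MAX/height;
}
```

<p>Once this passes, a caller can compute <code class="language-plaintext highlighter-rouge">width * height</code> as a plain <code class="language-plaintext highlighter-rouge">int</code> without further checks.</p>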

<p>Example usage, copied almost verbatim from a real program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">loadimage</span><span class="p">(</span><span class="n">Image</span> <span class="o">*</span><span class="n">image</span><span class="p">,</span> <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">qoi</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">qoidecoder</span> <span class="n">q</span> <span class="o">=</span> <span class="n">qoidecoder</span><span class="p">(</span><span class="n">qoi</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="cm">/* image dimensions too large */</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">image</span><span class="o">-&gt;</span><span class="n">width</span>  <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">width</span><span class="p">;</span>
    <span class="n">image</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">height</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">count</span> <span class="o">=</span> <span class="n">q</span><span class="p">.</span><span class="n">width</span> <span class="o">*</span> <span class="n">q</span><span class="p">.</span><span class="n">height</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="n">abgr</span> <span class="o">=</span> <span class="n">qoidecode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">);</span>
        <span class="n">image</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">abgr</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
        <span class="n">image</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">abgr</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
        <span class="n">image</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">abgr</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
        <span class="n">image</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">abgr</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="o">!</span><span class="n">q</span><span class="p">.</span><span class="n">error</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note the aforementioned awkward RGB shuffle.</p>

<p>It’s safe to say that I’m excited about QOI, and that it now has a
permanent slot on my developer toolbelt.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>I solved the Dandelions paper-and-pencil game</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/12/"/>
    <id>urn:uuid:14edf491-dcdd-4c2f-a75f-5e89838e6b40</id>
    <updated>2022-10-12T03:02:27Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I’ve been reading <a href="https://mathwithbaddrawings.com/2022/01/19/math-games-with-bad-drawings-2/"><em>Math Games with Bad Drawings</em></a>, a great book
well-aligned to my interests. It’s given me a lot of new, interesting
programming puzzles to consider. The first to truly nerd snipe me was
<a href="https://mathwithbaddrawings.com/dandelions/">Dandelions</a> (<a href="https://mathwithbaddrawings.com/wp-content/uploads/2020/06/game-5-dandelions-1.pdf">full rules</a>), an asymmetric paper-and-pencil game
invented by the book’s author, Ben Orlin. Just as with <a href="/blog/2020/10/19/">British Square two
years ago</a> — and essentially following the same technique — I wrote a
program that explores the game tree sufficiently to play either side
perfectly, “solving” the game in its standard 5-by-5 configuration.</p>

<p>The source: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/dandelions.c"><code class="language-plaintext highlighter-rouge">dandelions.c</code></a></strong></p>

<p>The game is played on a 5-by-5 grid where one player plays the dandelions,
the other plays the wind. Players alternate, dandelions placing flowers
and wind blowing in one of the eight directions, spreading seeds from all
flowers along the direction of the wind. Each side gets seven moves, and
the wind cannot blow in the same direction twice. The dandelions’ goal is
to fill the grid with seeds, and the wind’s goal is to prevent this.</p>

<p>Try playing a few rounds with a friend, and you will probably find that
playing the dandelions is difficult, at least in your first games, as though
that side cannot win. However, my engine proves the opposite: <strong>The dandelions always
win with perfect play.</strong> In fact, it’s so lopsided that the dandelions’
first move is irrelevant. Every first move is winnable. If the dandelions
blunder, typically wind has one narrow chance to seize control, after
which wind probably wins with any (or almost any) move.</p>

<p>For reasons I’ll discuss later, I only solved the 5-by-5 game, and the
situation may be different for the 6-by-6 variant. Also, unlike British
Square, my engine does not exhaustively explore the entire game tree
because it’s far too large. Instead it does a minimax search to the bottom
of the tree and stops when it finds a branch where all leaves are wins for
the current player. Because of this, it cannot maximize the outcome —
winning as early as possible as dandelions or maximizing the number of
empty grid spaces as wind. I also can’t quantify the exact size of the tree.</p>
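
<p>The early-exit idea is easier to see on a much smaller game. The following
is only an illustration, not code from <code class="language-plaintext highlighter-rouge">dandelions.c</code>: it assumes a toy
subtraction game (take 1 to 3 stones, and whoever takes the last stone wins),
and the name <code class="language-plaintext highlighter-rouge">can_win</code> is made up for the sketch. It returns as soon as one
winning move turns up instead of scoring every branch, so it can say who wins
but not by how much.</p>

```c
#include <stdbool.h>

// Toy game (an assumption for illustration): players alternate removing
// 1 to 3 stones, and whoever removes the last stone wins.
// Returns true if the player to move can force a win with n stones left.
static bool can_win(int n)
{
    for (int take = 1; take <= 3 && take <= n; take++) {
        if (!can_win(n - take)) {
            return true;  // winning branch found: skip the remaining moves
        }
    }
    return false;  // no legal move, or every move lets the opponent win
}
```

<p>In this toy game the player to move loses exactly when the pile is a
multiple of four, which makes the result easy to check by hand.</p>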

<p>Like with British Square, my game engine only has a crude user interface
for interactively exploring the game tree. While you can “play” it in a
sense, it’s not intended to be played. It also takes a few seconds to
initially explore the game tree, so wait for the <code class="language-plaintext highlighter-rouge">&gt;&gt;</code> prompt.</p>

<h3 id="bitboard-seeding">Bitboard seeding</h3>

<p>I used <a href="https://www.chessprogramming.org/Bitboards">bitboards</a> of course: a 25-bit bitboard for flowers, a 25-bit
bitboard for seeds, and an 8-bit set to track which directions the wind
has blown. It’s especially well-suited for this game since seeds can be
spread in parallel using bitwise operations. Shift the flower bitboard in
the direction of the wind four times, ORing it into the seeds bitboard
on each shift:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int wind;
uint32_t seeds, flowers;

flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
</code></pre></div></div>

<p>Of course it’s a little more complicated than this. The flowers must be
masked to keep them from wrapping around the grid, and wind may require
shifting in the other direction. In order to “negative shift” I actually
use a rotation (notated with <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code> below). Consider, to rotate an N-bit
integer <em>left</em> by R, one can <em>right</em>-rotate it by <code class="language-plaintext highlighter-rouge">N-R</code> — ex. on a 32-bit
integer, a left-rotate by 1 is the same as a right-rotate by 31. So for a
negative <code class="language-plaintext highlighter-rouge">wind</code> that goes in the other direction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>flowers &gt;&gt;&gt; (wind &amp; 31);
</code></pre></div></div>

<p>With such a “programmable shift” I can implement the bulk of the game
rules using a couple of tables and no branches:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// clockwise, east is zero
static int8_t rot[] = {-1, -6, -5, -4, +1, +6, +5, +4};
static uint32_t mask[] = {
    0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
    0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};
f &amp;= mask[i];  f &gt;&gt;&gt;= rot[i] &amp; 31;  s |= f;
f &amp;= mask[i];  f &gt;&gt;&gt;= rot[i] &amp; 31;  s |= f;
f &amp;= mask[i];  f &gt;&gt;&gt;= rot[i] &amp; 31;  s |= f;
f &amp;= mask[i];  f &gt;&gt;&gt;= rot[i] &amp; 31;  s |= f;
</code></pre></div></div>

<p>The masks clear out the column/row about to be shifted “out” so that it
doesn’t wrap around. Viewed in base-2, they’re 5-bit patterns repeated 5
times.</p>
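
<p>Written out as compilable C, the seeding step might look like the sketch
below. The tables are the ones from the text; <code class="language-plaintext highlighter-rouge">rotr32</code> and <code class="language-plaintext highlighter-rouge">blow</code> are names
made up for this illustration, and it assumes bit 0 is the north-west corner
with each row occupying 5 consecutive bits, which is the layout the masks
imply.</p>

```c
#include <stdint.h>

// clockwise, east is zero (tables copied from the text)
static const int8_t rot[] = {-1, -6, -5, -4, +1, +6, +5, +4};
static const uint32_t mask[] = {
    0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
    0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};

// Rotate right by r bits, r in 0..31. The (-r & 31) avoids an undefined
// 32-bit shift when r is zero.
static uint32_t rotr32(uint32_t x, int r)
{
    return x>>r | x<<(-r & 31);
}

// Hypothetical helper: blow the wind in direction dir (0..7) and return the
// seed bitboard with the newly seeded cells added.
static uint32_t blow(uint32_t flowers, uint32_t seeds, int dir)
{
    uint32_t f = flowers;
    for (int step = 0; step < 4; step++) {
        f &= mask[dir];                // drop flowers about to leave the grid
        f = rotr32(f, rot[dir] & 31);  // negative entries act as left shifts
        seeds |= f;
    }
    return seeds;
}
```

<p>For example, a single flower in the center (bit 12) blown east seeds
bits 13 and 14, and a flower in the north-west corner blown south-east seeds
the whole diagonal below it.</p>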

<h3 id="bitboard-packing-and-canonicalization">Bitboard packing and canonicalization</h3>

<p>The entire game state is two 25-bit bitboards and an 8-bit set. That’s 58
bits, which fits in a 64-bit integer with bits to spare. How incredibly
convenient! So I represent the game state using a 64-bit integer, using a
packing like I did with British Square. The bottom 25 bits are the seeds,
the next 25 bits are the flowers, and the next 8 is the wind set.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>000000 WWWWWWWW FFFFFFFFFFFFFFFFFFFFFFFFF SSSSSSSSSSSSSSSSSSSSSSSSS
</code></pre></div></div>

<p>Even more convenient, I could reuse my bitboard canonicalization code from
British Square, also a 5-by-5 grid packed in the same way, saving me the
trouble of working out all the bit sieves. I only had to figure out how to
transpose and flip the wind bitset. Turns out that’s pretty easy, too.
Here’s how I represent the 8 wind directions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>567
4 0
321
</code></pre></div></div>

<p>Flipping this vertically I get:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>321
4 0
567
</code></pre></div></div>

<p>Unroll these to show how old maps onto new:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>old: 01234567
new: 07654321
</code></pre></div></div>

<p>The new is just the old rotated and reversed. Transposition is the same
story, just a different rotation. I use a small lookup table to reverse
the bits, and then an 8-bit rotation. (See <code class="language-plaintext highlighter-rouge">revrot</code>.)</p>
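
<p>As a rough sketch of that reverse-then-rotate idea: the version below
reverses with a loop rather than a lookup table, so it is not the actual
<code class="language-plaintext highlighter-rouge">revrot</code>. The function names are hypothetical, and the rotation amounts
(1 for the vertical flip, 3 for the transpose) are worked out from the
direction diagram above, assuming the transpose swaps rows and columns about
the north-west to south-east diagonal.</p>

```c
#include <stdint.h>

// Reverse the order of the 8 bits: bit i moves to bit 7-i.
static uint8_t reverse8(uint8_t x)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)(r<<1 | (x>>i & 1));
    }
    return r;
}

// Rotate an 8-bit value left by n bits.
static uint8_t rotl8(uint8_t x, int n)
{
    n &= 7;
    return (uint8_t)(x<<n | x>>(-n & 7));
}

// Vertical flip: direction i becomes (8 - i) mod 8.
static uint8_t wind_flip(uint8_t w)
{
    return rotl8(reverse8(w), 1);
}

// Transpose (assumed diagonal, see above): direction i becomes (2 - i) mod 8.
static uint8_t wind_transpose(uint8_t w)
{
    return rotl8(reverse8(w), 3);
}
```

<p>Both operations are their own inverse, which makes a handy sanity check:
applying either one twice must give back the original bitset.</p>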

<p>To determine how many moves have been made, popcount the flower bitboard
and wind bitset.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int moves = POPCOUNT64(g &amp; 0x3fffffffe000000);
</code></pre></div></div>

<p>To test if dandelions have won:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int win = (g&amp;0x1ffffff) == 0x1ffffff;
</code></pre></div></div>

<p>Since the plan is to store all the game states in a big hash table — an
<a href="/blog/2022/08/08/">MSI double hash</a> in this case — I’d like to reserve the zero value
as a “null” board state. This lets me zero-initialize the hash table. To
do this, I invert the wind bitset such that a 1 indicates the direction is
still available. So the initial game state looks like this (in the real
program this is accounted for in the previously-discussed turn popcount):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define GAME_INIT ((uint64_t)255 &lt;&lt; 50)
</span></code></pre></div></div>
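
<p>To make the layout concrete, here is a sketch of unpacking and querying a
state that uses the inverted wind bitset. The accessor names are hypothetical,
<code class="language-plaintext highlighter-rouge">popcount64</code> is a plain loop standing in for <code class="language-plaintext highlighter-rouge">POPCOUNT64</code>, and the
adjustment in <code class="language-plaintext highlighter-rouge">game_moves</code> is just one possible way to account for the
inversion, not necessarily what <code class="language-plaintext highlighter-rouge">dandelions.c</code> does.</p>

```c
#include <stdint.h>

#define GAME_INIT ((uint64_t)255 << 50)

// Portable stand-in for POPCOUNT64: clear the lowest set bit until none remain.
static int popcount64(uint64_t x)
{
    int n = 0;
    for (; x; x &= x - 1) {
        n++;
    }
    return n;
}

// Hypothetical accessors for the packed layout described in the text.
static uint32_t game_seeds(uint64_t g)   { return g       & 0x1ffffff; }
static uint32_t game_flowers(uint64_t g) { return g >> 25 & 0x1ffffff; }
static int      game_winds(uint64_t g)   { return g >> 50 & 0xff; }

// Moves made so far: one per flower, plus one per wind direction used up.
// With the inverted bitset a used direction is a 0 bit, hence the 8 - n.
static int game_moves(uint64_t g)
{
    return popcount64(game_flowers(g)) + 8 - popcount64(game_winds(g));
}

// Dandelions win once every bit of the seed bitboard is set.
static int game_won(uint64_t g)
{
    return game_seeds(g) == 0x1ffffff;
}
```

<p>With this encoding <code class="language-plaintext highlighter-rouge">GAME_INIT</code> reports zero moves, and every state
reachable in play is nonzero because at least one wind direction always
remains available.</p>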

<p>The remaining 6 bits can be used to cache information about the rest of
the tree under this game state, namely who wins from this position, and this
serves as the “value” in the hash table. Turns out the bitboards are
already noisy enough that a <a href="/blog/2018/07/31/">single xorshift</a> makes for a great hash
function. The hash table, including hash function, is under a dozen lines
of code.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Find the hash table slot for the given game state.</span>
<span class="kt">uint64_t</span> <span class="o">*</span><span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ht</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">g</span> <span class="o">^</span> <span class="n">g</span><span class="o">&gt;&gt;</span><span class="mi">32</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="mh">0x3ffffffffffffff</span><span class="p">)</span> <span class="o">==</span> <span class="n">g</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">ht</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
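
<p>As a usage sketch, here is a self-contained variant of the lookup with a
pair of hypothetical helpers, <code class="language-plaintext highlighter-rouge">remember</code> and <code class="language-plaintext highlighter-rouge">recall</code>, that keep the
cached result in the top 6 bits. <code class="language-plaintext highlighter-rouge">HASHTAB_EXP</code> is shrunk to keep the
demonstration small, and the encoding of the winner (any nonzero value below
64) is an assumption. Note the parentheses around the masking: in C,
<code class="language-plaintext highlighter-rouge">==</code> binds tighter than the bitwise AND operator.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define HASHTAB_EXP 10                 // tiny table, for illustration only
#define STATE_MASK  0x3ffffffffffffff  // low 58 bits: the game state itself

// Find the hash table slot for the given game state. Like any open
// addressing table, this loops forever if the table is completely full.
static uint64_t *lookup(uint64_t *ht, uint64_t g)
{
    uint64_t hash = g ^ g>>32;
    size_t mask = ((size_t)1 << HASHTAB_EXP) - 1;
    size_t step = (size_t)(hash>>(64 - HASHTAB_EXP)) | 1;
    for (size_t i = (size_t)hash;;) {
        i = (i + step)&mask;
        if (!ht[i] || (ht[i]&STATE_MASK) == g) {
            return ht + i;
        }
    }
}

// Hypothetical helper: store state g along with a small nonzero winner code.
static void remember(uint64_t *ht, uint64_t g, int winner)
{
    *lookup(ht, g) = g | (uint64_t)winner<<58;
}

// Hypothetical helper: return the cached winner code, or 0 if g is unknown.
static int recall(uint64_t *ht, uint64_t g)
{
    return (int)(*lookup(ht, g) >> 58);
}
```

<p>An empty slot is all zeros, so <code class="language-plaintext highlighter-rouge">recall</code> reports 0 for any state that
has not been stored yet, which is why the zero state had to be reserved.</p>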

<p>To explore a 6-by-6 grid I’d need to change my representation, which is
part of why I didn’t do it. I can’t fit two 36-bit bitboards in a 64-bit
integer, so I’d need to double my storage requirements, which are already
strained.</p>

<h3 id="computational-limitations">Computational limitations</h3>

<p>Due to the way seeds spread, game states resulting from different moves
rarely converge back to a common state later in the tree, so the hash
table isn’t doing much deduplication. Exhaustively exploring the entire
game tree, even cutting it down to an 8th using canonicalization, requires
substantial computing resources, more than I personally have available for
this project. So I had to settle for the slightly weaker form: finding a
winning branch rather than maximizing a “score.”</p>

<p>I configure the program to allocate 2GiB for the hash table, but if you
run just a few dozen games off the same table (same program instance),
each exploring different parts of the game tree, you’ll exhaust this
table. A 6-by-6 doubles the memory requirements just to represent the
game, but it also slows the search and substantially increases the width
of the tree, which grows 44% faster. I’m sure it can be done, but it’s
just beyond the resources available to me.</p>

<h3 id="dandelion-puzzles">Dandelion Puzzles</h3>

<p>As a side effect, I wrote a small routine to randomly play out games in
search for “mate-in-two”-style puzzles. The dandelions have two flowers to
place and can force a win with two specific placements — and only those
two placements — regardless of how the wind blows. Here are two of the
better ones, each involving a small trick that I won’t give away here
(note: arrowheads indicate directions wind can still blow):</p>

<p><img src="/img/dandelions/puzzle1.svg" alt="" /></p>

<p><img src="/img/dandelions/puzzle2.svg" alt="" /></p>

<p>There are a variety of potential single-player puzzles of this form.</p>

<ul>
  <li>Cooperative: place a dandelion <em>and</em> pick the wind direction</li>
  <li>Avoidance: <em>don’t</em> seed a particular tile</li>
  <li>Hard ground: certain tiles can’t grow flowers (but still get seeded)</li>
  <li>Weeding: as wind, figure out which flower to remove before blowing</li>
</ul>

<p>There could be a whole “crossword book” of such dandelion puzzles.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to build a WaitGroup from a 32-bit integer</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/05/"/>
    <id>urn:uuid:cc83b101-2d77-42b8-b409-d4ed36831479</id>
    <updated>2022-10-05T03:19:07Z</updated>
    <category term="c"/><category term="linux"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Go has a nifty synchronization utility called a <a href="https://godocs.io/sync#WaitGroup">WaitGroup</a>, on which
one or more goroutines can wait for concurrent task completion. In other
languages, the usual task completion convention is <em>joining</em> threads doing
the work. In Go, goroutines aren’t values and lack handles, so a WaitGroup
replaces joins. Building a WaitGroup using typical, portable primitives is
a messy affair involving constructors and destructors, managing lifetimes.
However, on at least Linux and Windows, we can build a WaitGroup out of a
zero-initialized integer, much like my <a href="/blog/2022/05/14/">32-bit queue</a> and <a href="/blog/2022/03/13/">32-bit
barrier</a>.</p>

<p>In case you’re not familiar with it, a typical WaitGroup use case in Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">var</span> <span class="n">wg</span> <span class="n">sync</span><span class="o">.</span><span class="n">WaitGroup</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">task</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">tasks</span> <span class="p">{</span>
    <span class="n">wg</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
    <span class="k">go</span> <span class="k">func</span><span class="p">(</span><span class="n">t</span> <span class="n">Task</span><span class="p">)</span> <span class="p">{</span>
        <span class="c">// ... do task ...</span>
        <span class="n">wg</span><span class="o">.</span><span class="n">Done</span><span class="p">()</span>
    <span class="p">}(</span><span class="n">task</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">wg</span><span class="o">.</span><span class="n">Wait</span><span class="p">()</span>
</code></pre></div></div>

<p>I zero-initialize the WaitGroup, the main goroutine increments the counter
before starting each task goroutine, each goroutine decrements the counter
when done, and the main goroutine waits until the counter reaches zero. My
goal is to build the same mechanism in C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">workfunc</span><span class="p">(</span><span class="n">task</span> <span class="n">t</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do task ...</span>
    <span class="n">waitgroup_done</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">int</span> <span class="n">wg</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ntasks</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">waitgroup_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">go</span><span class="p">(</span><span class="n">workfunc</span><span class="p">,</span> <span class="n">tasks</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">waitgroup_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When it’s done, the WaitGroup is back to zero, and no cleanup is required.</p>

<p>I’m going to take it a little further than that: Since its meaning and
contents are explicit, you may initialize a WaitGroup to any non-negative
task count! In other words, <code class="language-plaintext highlighter-rouge">waitgroup_add</code> is optional if the total
number of tasks is known up front.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">wg</span> <span class="o">=</span> <span class="n">ntasks</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ntasks</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">go</span><span class="p">(</span><span class="n">workfunc</span><span class="p">,</span> <span class="n">tasks</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">waitgroup_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">wg</span><span class="p">);</span>
</code></pre></div></div>

<p>A sneak peek at the full source: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/waitgroup.c"><code class="language-plaintext highlighter-rouge">waitgroup.c</code></a></strong></p>

<h3 id="the-four-elements-of-synchronization">The four elements (of synchronization)</h3>

<p>To build this WaitGroup, we’re going to need four primitives from the host
platform, each operating on an <code class="language-plaintext highlighter-rouge">int</code>. The first two are atomic operations,
and the second two interact with the system scheduler. To port the
WaitGroup to a platform you need only implement these four functions,
typically as one-liners.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span>  <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>           <span class="c1">// atomic load</span>
<span class="k">static</span> <span class="kt">int</span>  <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>  <span class="c1">// atomic add-then-fetch</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>      <span class="c1">// wait on change at address</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>           <span class="c1">// wake all waiters by address</span>
</code></pre></div></div>

<p>The first two should be self-explanatory. The <code class="language-plaintext highlighter-rouge">wait</code> function waits for
the pointed-at integer to change its value, and the second argument is its
expected current value. The scheduler will double-check the integer before
putting the thread to sleep in case it changes at the last moment — in
other words, an atomic check-then-maybe-sleep. The <code class="language-plaintext highlighter-rouge">wake</code> function is the
other half. After changing the integer, a thread uses it to wake all
threads waiting for the pointed-at integer to change. Together, this
mechanism is known as a <em>futex</em>.</p>

<p>I’m going to simplify the WaitGroup semantics a bit in order to make my
implementation even simpler. Go’s WaitGroup allows adding negatives, and
the <code class="language-plaintext highlighter-rouge">Add</code> method essentially does double-duty. My version forbids adding
negatives. That means the “add” operation is just an atomic increment:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_add</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">,</span> <span class="kt">int</span> <span class="n">delta</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">addfetch</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="n">delta</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since it cannot bring the counter to zero, there’s nothing else to do. The
“done” operation <em>can</em> decrement to zero:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_done</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">addfetch</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">wake</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the atomic decrement brought the count to zero, we finished the last
task, so we need to wake the waiters. We don’t know if anyone is actually
waiting, but that’s fine. Some futex use cases will avoid making the
relatively expensive system call if nobody’s waiting — i.e. don’t waste
time on a system call for each unlock of an uncontended mutex — but in the
typical WaitGroup case we <em>expect</em> a waiter when the count finally goes to
zero. That’s the common case.</p>

<p>The most complicated of the three is waiting:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">waitgroup_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">wg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="n">load</span><span class="p">(</span><span class="n">wg</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">wait</span><span class="p">(</span><span class="n">wg</span><span class="p">,</span> <span class="n">c</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First check if the count is already zero and return if it is. Otherwise
use the futex to <em>wait for it to change</em>. Unfortunately that’s not exactly
the semantics we want, which would be to wait for a certain target. This
doesn’t break the wait, but it’s a potential source of inefficiency. If a
thread finishes a task between our load and wait, we don’t go to sleep,
and instead try again. However, in practice, I ran thousands of threads
through this thing concurrently and I couldn’t observe such a “miss.” As
far as I can tell, it’s so rare it doesn’t matter.</p>

<p>If this was a concern, the WaitGroup could instead be a pair of integers:
the counter and a “latch” that is either 0 or 1. Waiters wait on the
latch, and the latch is modified (atomically) when the counter transitions
to or from zero. That gives waiters a stable value on which to wait,
proxying the counter. However, since this doesn’t seem to matter in
practice, I prefer the elegance and simplicity of the single-integer
WaitGroup.</p>

<h3 id="four-elements-linux">Four elements: Linux</h3>

<p>With the WaitGroup done at a high level, we now need the per-platform
parts. Both GCC and Clang support <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/_005f_005fatomic-Builtins.html">GNU-style atomics</a>, so I’ll just
assume these are available on Linux without worrying about the compiler.
The first two functions wrap these built-ins:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">__atomic_load_n</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">addend</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">__atomic_add_fetch</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">addend</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">wait</code> and <code class="language-plaintext highlighter-rouge">wake</code> we need the <a href="https://man7.org/linux/man-pages/man2/futex.2.html"><code class="language-plaintext highlighter-rouge">futex(2)</code> system call</a>. In an
attempt to discourage its direct use, glibc doesn’t wrap this system call
in a function, so we must make the system call ourselves.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">current</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">current</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="n">INT_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">INT_MAX</code> means “wake as many as possible.” The other common value is
1 for waking a single waiter. Also, these system calls can’t meaningfully
fail, so there’s no need to check the return value. If <code class="language-plaintext highlighter-rouge">wait</code> wakes up
early (e.g. <code class="language-plaintext highlighter-rouge">EINTR</code>), it’s going to check the counter again anyway. In
fact, if your kernel is more than 20 years old, predating futexes, and
returns <code class="language-plaintext highlighter-rouge">ENOSYS</code> (“Function not implemented”), it will <em>still</em> work
correctly, though it will be incredibly inefficient.</p>

<h3 id="four-elements-windows">Four elements: Windows</h3>

<p>Windows didn’t support futexes until Windows 8 in 2012, and Microsoft was
still supporting Windows versions without them into 2020, so they’re still relatively “new”
for this platform. Nonetheless, they’re now mature enough that we can
count on them being available.</p>

<p>I’d like to support both GCC-ish (<a href="https://github.com/skeeto/w64devkit">via Mingw-w64</a>) and MSVC-ish
compilers. Mingw-w64 provides a compatible <code class="language-plaintext highlighter-rouge">intrin.h</code>, so I can stick to
MSVC-style atomics and cover both at once. On the other hand, MSVC doesn’t
define atomics for <code class="language-plaintext highlighter-rouge">int</code> (or even <code class="language-plaintext highlighter-rouge">int32_t</code>), strictly <code class="language-plaintext highlighter-rouge">long</code>, so I have
to sneak in a little cast. (Recall: <code class="language-plaintext highlighter-rouge">sizeof(long) == sizeof(int)</code> on every
version of Windows supporting futexes.) The other option is to <code class="language-plaintext highlighter-rouge">typedef</code>
the WaitGroup so that it’s <code class="language-plaintext highlighter-rouge">int</code> on Linux (for the futex) and <code class="language-plaintext highlighter-rouge">long</code> on
Windows (for atomics).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">load</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">_InterlockedOr</span><span class="p">((</span><span class="kt">long</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">addfetch</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">addend</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">addend</span> <span class="o">+</span> <span class="n">_InterlockedExchangeAdd</span><span class="p">((</span><span class="kt">long</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span><span class="p">,</span> <span class="n">addend</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
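<p>For comparison, here is a minimal sketch of the <code class="language-plaintext highlighter-rouge">typedef</code> alternative mentioned
above. It is an illustration under assumptions rather than the code used in
this article: it assumes the WaitGroup is a bare integer, and the static
assertion is an added safety check, since both the Linux futex word and the
Windows wait in this article operate on a 4-byte object.</p>

```c
// Sketch only: pick the integer type each platform's primitives expect.
#ifdef _WIN32
typedef long WaitGroup;  // matches the _Interlocked* intrinsics, no cast needed
#else
typedef int WaitGroup;   // matches the 32-bit futex word
#endif

// Either way the object must be exactly 4 bytes (assumed requirement).
_Static_assert(sizeof(WaitGroup) == 4, "WaitGroup must be 32 bits");
```

<p>With this in place, <code class="language-plaintext highlighter-rouge">load</code>, <code class="language-plaintext highlighter-rouge">addfetch</code>, <code class="language-plaintext highlighter-rouge">wait</code>, and <code class="language-plaintext highlighter-rouge">wake</code> would
take a <code class="language-plaintext highlighter-rouge">WaitGroup *</code> instead of an <code class="language-plaintext highlighter-rouge">int *</code>, and the casts in the atomics
go away.</p>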

<p>The official, sanctioned futex functions are <a href="https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitonaddress">WaitOnAddress</a> and
<a href="https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-wakebyaddressall">WakeByAddressAll</a>. They <a href="https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/CALK-3m%2B6tX_ubMVGV7NarAm6VH0AoOp5THyXfEUA%3DTjyu5L%3Dxw%40mail.gmail.com/">used to be in <code class="language-plaintext highlighter-rouge">kernel32.dll</code></a>, but as of
this writing they live in <code class="language-plaintext highlighter-rouge">API-MS-Win-Core-Synch-l1-2-0.dll</code>, linked via
<code class="language-plaintext highlighter-rouge">-lsynchronization</code>. Gross. Since I can’t stomach this, I instead call the
low-level RTL functions where they’re actually implemented: RtlWaitOnAddress
and RtlWakeAddressAll. These live in the nice neighborhood of <code class="language-plaintext highlighter-rouge">ntdll.dll</code>.
They’re undocumented as far as I can tell, but thankfully <a href="https://github.com/wine-mirror/wine/blob/master/dlls/ntdll/sync.c">Wine comes to
the rescue</a>, providing both documentation and several different
implementations. Reading through it is educational, and hints at ways to
construct futexes on systems lacking them.</p>

<p>These functions aren’t declared in any headers, so I have to do it myself.
On the plus side, so far I haven’t paid the substantial compile-time costs
of <a href="https://web.archive.org/web/20090912002357/http://www.tilander.org/aurora/2008/01/include-windowsh.html">including <code class="language-plaintext highlighter-rouge">windows.h</code></a>, and so I can continue avoiding it. These
functions <em>are</em> listed in the <code class="language-plaintext highlighter-rouge">ntdll.dll</code> import library, so I don’t need
to <a href="/blog/2021/05/31/">invent the import library entries</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="kr">__stdcall</span> <span class="nf">RtlWaitOnAddress</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="kr">__stdcall</span> <span class="nf">RtlWakeAddressAll</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Rather conveniently, the semantics perfectly line up with Linux futexes!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">current</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlWaitOnAddress</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">current</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">),</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlWakeAddressAll</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like with Linux, there’s no meaningful failure, so the return values don’t
matter.</p>

<p>That’s the whole implementation. Considering just a single platform, a
flexible, lightweight, and easy-to-use synchronization facility in ~50
lines of relatively simple code is a pretty good deal if you ask me!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Illuminating synchronization edges for ThreadSanitizer</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/03/"/>
    <id>urn:uuid:a008900c-cf6a-46e2-8657-21bded194350</id>
    <updated>2022-10-03T03:09:38Z</updated>
    <category term="c"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p><a href="https://github.com/google/sanitizers/wiki">Sanitizers</a> are powerful development tools which complement
<a href="/blog/2022/06/26/">debuggers</a> and <a href="/blog/2019/01/25/">fuzzing</a>. I typically have at least one sanitizer
active during development. They’re particularly useful during code review,
where they can identify issues before I’ve even begun examining the code
carefully — sometimes in mere minutes under fuzzing. Accordingly, it’s a
good idea to have your own code in good agreement with sanitizers before
review. For ThreadSanitizer (TSan), that means dealing with false
positives in programs relying on synchronization invisible to TSan.</p>

<p>This article’s motivation is multi-threaded <a href="https://man7.org/linux/man-pages/man7/epoll.7.html">epoll</a>. I mitigate TSan
false positives each time it comes up, enough to have gotten the hang of
it, so I ought to document it. <a href="https://github.com/skeeto/w64devkit">On Windows</a> I would also run into the
same issue with the Win32 message queue, crossing the synchronization edge
between <a href="https://learn.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-postmessagea">PostMessage</a> (release) and <a href="https://learn.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-getmessage">GetMessage</a> (acquire), <em>except</em>
for the general lack of TSan support in Windows tooling. The same
technique would work there as well.</p>

<p>My typical epoll scenario looks like so:</p>

<ol>
  <li>Create an epoll file descriptor (<code class="language-plaintext highlighter-rouge">epoll_create1</code>).</li>
  <li>Create worker threads, passing the epoll file descriptor.</li>
  <li>Worker threads loop on <code class="language-plaintext highlighter-rouge">epoll_wait</code>.</li>
  <li>Main thread loops on <code class="language-plaintext highlighter-rouge">accept</code>, adding sockets to epoll (<code class="language-plaintext highlighter-rouge">epoll_ctl</code>).</li>
</ol>

<p>Between <code class="language-plaintext highlighter-rouge">accept</code> and <code class="language-plaintext highlighter-rouge">EPOLL_CTL_ADD</code>, the main thread allocates and
initializes the client session state, then attaches it to the epoll event.
The client socket is added with <a href="https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/">the <code class="language-plaintext highlighter-rouge">EPOLLONESHOT</code> flag</a>, and the
session state is not touched after the call to <code class="language-plaintext highlighter-rouge">epoll_ctl</code> (note: sans
error checks):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">accept</span><span class="p">(...);</span>
    <span class="k">struct</span> <span class="n">session</span> <span class="o">*</span><span class="n">session</span> <span class="o">=</span> <span class="p">...;</span>
    <span class="n">session</span><span class="o">-&gt;</span><span class="n">fd</span> <span class="o">=</span> <span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="k">struct</span> <span class="n">epoll_event</span> <span class="n">event</span><span class="p">;</span>
    <span class="n">event</span><span class="p">.</span><span class="n">events</span> <span class="o">=</span> <span class="n">EPOLLET</span> <span class="o">|</span> <span class="n">EPOLLONESHOT</span> <span class="o">|</span> <span class="p">...;</span>
    <span class="n">event</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ptr</span> <span class="o">=</span> <span class="n">session</span><span class="p">;</span>
    <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_ADD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In this example, <code class="language-plaintext highlighter-rouge">struct session</code> is defined by the application to contain
all the state for handling a session (file descriptor, buffers, <a href="/blog/2020/12/31/">state
machine</a>, parser state, <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">allocation arena</a>, etc.). Everything
else is part of the epoll interface.</p>

<p>When a socket is ready, one of the worker threads receives it. Due to
<code class="language-plaintext highlighter-rouge">EPOLLONESHOT</code>, it’s immediately disabled and no other thread can receive
it. The thread does as much work as possible (i.e. read/write until
<code class="language-plaintext highlighter-rouge">EAGAIN</code>), then reactivates it with <code class="language-plaintext highlighter-rouge">epoll_ctl</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">epoll_event</span> <span class="n">event</span><span class="p">;</span>
    <span class="n">epoll_wait</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">session</span> <span class="o">*</span><span class="n">session</span> <span class="o">=</span> <span class="n">event</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ptr</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">session</span><span class="o">-&gt;</span><span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="n">event</span><span class="p">.</span><span class="n">events</span> <span class="o">=</span> <span class="n">EPOLLET</span> <span class="o">|</span> <span class="n">EPOLLONESHOT</span> <span class="o">|</span> <span class="p">...;</span>
    <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_MOD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The shared variables in <code class="language-plaintext highlighter-rouge">session</code> are passed between threads through
<code class="language-plaintext highlighter-rouge">epoll</code> using the event’s <code class="language-plaintext highlighter-rouge">.data.ptr</code>. These variables are potentially
read and mutated by every thread, but it’s all perfectly safe without any
further synchronization — i.e. no need for mutexes, etc. All the necessary
synchronization is implicit in epoll.</p>

<p>In the initial hand-off, that <code class="language-plaintext highlighter-rouge">EPOLL_CTL_ADD</code> must <em>happen before</em> the
corresponding <code class="language-plaintext highlighter-rouge">epoll_wait</code> in a worker thread. This establishes that the
main thread and worker thread do not touch session variables concurrently.
After all, how could the worker see an event on the file descriptor before
it’s been added to epoll? The synchronization in epoll itself will also
ensure all the architecture-level stores are visible to other threads
before the hand-off. We can call the “add” a <em>release</em> and the “wait” an
<em>acquire</em>, forming a synchronization edge.</p>

<p>Similarly, in the hand-off between worker threads, the <code class="language-plaintext highlighter-rouge">EPOLL_CTL_MOD</code>
that reactivates the file descriptor must <em>happen before</em> the wait that
observes the next event because, until reactivation, it’s disabled. The
<code class="language-plaintext highlighter-rouge">EPOLL_CTL_MOD</code> is another <em>release</em> in relation to the <em>acquire</em> wait.</p>

<p>Unfortunately TSan won’t see things this way. It can’t see into the
kernel, and it doesn’t know these subtle epoll semantics, so it can’t see
these synchronization edges. As far <a href="https://www.youtube.com/watch?v=5erqWdlhQLA">as it can tell</a>, threads might be
accessing a session concurrently, and TSan will reliably produce warnings
about it. You could shrug your shoulders and give up on using TSan in this
case, but there’s an easy solution: introduce redundant, semantically
identical synchronization edges, but only when TSan is looking.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WARNING: ThreadSanitizer: data race
</code></pre></div></div>

<h3 id="redundant-synchronization">Redundant synchronization</h3>

<p>I prefer to solve this by introducing the weakest possible synchronization
so that I’m not synchronizing beyond epoll’s semantics. This will help
TSan catch real mistakes that stronger synchronization might hide.</p>

<p>The weakest option is memory fences. These wouldn’t introduce extra loads
or stores; at most each would compile to a fence instruction. I would use <a href="https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/_005f_005fatomic-Builtins.html">GCC’s
built-in <code class="language-plaintext highlighter-rouge">__atomic_thread_fence</code></a> for the job. However, TSan does not
currently understand thread fences, so that defeats the purpose. Instead,
I introduce a new field to <code class="language-plaintext highlighter-rouge">struct session</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">session</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="kt">int</span> <span class="n">_sync</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Then just before <code class="language-plaintext highlighter-rouge">epoll_ctl</code> I’ll do a <em>release</em> store on this field,
“releasing” the session. All session stores are ordered before the
release.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// main thread</span>
    <span class="c1">// ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">session</span><span class="o">-&gt;</span><span class="n">_sync</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">__ATOMIC_RELEASE</span><span class="p">);</span>
    <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_ADD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>

    <span class="c1">// worker thread</span>
    <span class="c1">// ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">session</span><span class="o">-&gt;</span><span class="n">_sync</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">__ATOMIC_RELEASE</span><span class="p">);</span>
    <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_MOD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
</code></pre></div></div>

<p>After <code class="language-plaintext highlighter-rouge">epoll_wait</code> I add an <em>acquire</em> load, “acquiring” the session. All
session loads are ordered after the acquire.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">epoll_wait</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">session</span> <span class="o">*</span><span class="n">session</span> <span class="o">=</span> <span class="n">event</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ptr</span><span class="p">;</span>
    <span class="n">__atomic_load_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">session</span><span class="o">-&gt;</span><span class="n">_sync</span><span class="p">,</span> <span class="n">__ATOMIC_ACQUIRE</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">session</span><span class="o">-&gt;</span><span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
</code></pre></div></div>

<p>For this to work, the thread must not touch session variables in any way
before the acquire or after the release. For example, note how I obtained
the client file descriptor before the release, i.e. no <code class="language-plaintext highlighter-rouge">session-&gt;fd</code>
argument in the <code class="language-plaintext highlighter-rouge">epoll_ctl</code> call.</p>

<p>That’s it! This redundantly establishes the <em>happens before</em> relationship
already implicit in epoll, but now it’s visible to TSan. However, I don’t
want to pay for this unless I’m actually running under TSan, so some
macros are in order. GCC automatically defines <code class="language-plaintext highlighter-rouge">__SANITIZE_THREAD__</code> when
compiling with <code class="language-plaintext highlighter-rouge">-fsanitize=thread</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if __SANITIZE_THREAD__
# define TSAN_SYNCED     int _sync
# define TSAN_ACQUIRE(s) __atomic_load_n(&amp;(s)-&gt;_sync, __ATOMIC_ACQUIRE)
# define TSAN_RELEASE(s) __atomic_store_n(&amp;(s)-&gt;_sync, 0, __ATOMIC_RELEASE)
#else
# define TSAN_SYNCED
# define TSAN_ACQUIRE(s)
# define TSAN_RELEASE(s)
#endif
</span></code></pre></div></div>

<p>This also makes it more readable, and intentions clearer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">session</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="n">TSAN_SYNCED</span><span class="p">;</span>
<span class="p">};</span>

    <span class="c1">// main thread</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
        <span class="n">TSAN_RELEASE</span><span class="p">(</span><span class="n">session</span><span class="p">);</span>
        <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_ADD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="c1">// worker thread</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">epoll_wait</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>
        <span class="k">struct</span> <span class="n">session</span> <span class="o">*</span><span class="n">session</span> <span class="o">=</span> <span class="n">event</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ptr</span><span class="p">;</span>
        <span class="n">TSAN_ACQUIRE</span><span class="p">(</span><span class="n">session</span><span class="p">);</span>
        <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">session</span><span class="o">-&gt;</span><span class="n">fd</span><span class="p">;</span>
        <span class="c1">// ...</span>
        <span class="n">TSAN_RELEASE</span><span class="p">(</span><span class="n">session</span><span class="p">);</span>
        <span class="n">epoll_ctl</span><span class="p">(</span><span class="n">epfd</span><span class="p">,</span> <span class="n">EPOLL_CTL_MOD</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">event</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>
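<p>To see the macros in a complete program, here is a minimal sketch of a
single hand-off. It is not taken from a real server: the pipe stands in for
an accepted socket, and <code class="language-plaintext highlighter-rouge">struct handoff</code>, <code class="language-plaintext highlighter-rouge">worker</code>, <code class="language-plaintext highlighter-rouge">handoff_demo</code>, and
the <code class="language-plaintext highlighter-rouge">value</code> field are hypothetical names invented for the illustration.
Error handling is deliberately thin.</p>

```c
#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

#if __SANITIZE_THREAD__
# define TSAN_SYNCED     int _sync
# define TSAN_ACQUIRE(s) __atomic_load_n(&(s)->_sync, __ATOMIC_ACQUIRE)
# define TSAN_RELEASE(s) __atomic_store_n(&(s)->_sync, 0, __ATOMIC_RELEASE)
#else
# define TSAN_SYNCED
# define TSAN_ACQUIRE(s)
# define TSAN_RELEASE(s)
#endif

struct session {
    int fd;     // read end of a pipe, standing in for a client socket
    int value;  // plain, non-atomic session state (hypothetical field)
    TSAN_SYNCED;
};

struct handoff {  // hypothetical: arguments and result for the worker
    int epfd;
    int result;
};

static void *worker(void *arg)
{
    struct handoff *h = arg;
    struct epoll_event event;
    if (epoll_wait(h->epfd, &event, 1, -1) != 1) {
        return 0;  // result keeps its initial -1
    }
    struct session *session = event.data.ptr;
    TSAN_ACQUIRE(session);
    h->result = session->value;  // ordered after the main thread's stores
    return 0;
}

// Hands one session to a worker through epoll and returns the value the
// worker observed, or -1 on failure. No cleanup on the early error path.
static int handoff_demo(int value)
{
    int pfd[2];
    int epfd = epoll_create1(0);
    if (epfd < 0 || pipe(pfd) < 0 || write(pfd[1], "x", 1) != 1) {
        return -1;
    }

    struct handoff h = {epfd, -1};
    pthread_t thr;
    if (pthread_create(&thr, 0, worker, &h)) {
        return -1;
    }

    struct session session;
    session.fd = pfd[0];
    session.value = value;
    int fd = session.fd;  // read before the release, as in the article
    struct epoll_event event;
    event.events = EPOLLIN | EPOLLONESHOT;
    event.data.ptr = &session;
    TSAN_RELEASE(&session);
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event) < 0) {
        pthread_cancel(thr);  // worker would otherwise wait forever
    }

    pthread_join(thr, 0);
    close(pfd[0]);
    close(pfd[1]);
    close(epfd);
    return h.result;
}
```

<p>The worker starts waiting before the session is added, so the only way it
can observe <code class="language-plaintext highlighter-rouge">value</code> is through the add/wait edge. Building with
<code class="language-plaintext highlighter-rouge">-fsanitize=thread</code> enables the extra release/acquire pair; a normal build
compiles the macros away.</p>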

<p>Now I can use TSan again, and it didn’t cost anything in normal builds.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>The quick and practical "MSI" hash table</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/08/08/"/>
    <id>urn:uuid:4a7d8c3d-3bcf-4b10-b50a-64227c02b254</id>
    <updated>2022-08-08T23:57:08Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Follow-up: <a href="/blog/2023/06/26/">Solving “Two Sum” in C with a tiny hash table</a></em></p>

<p>I <a href="https://skeeto.s3.amazonaws.com/share/onward17-essays2.pdf">generally prefer C</a>, so I’m accustomed to building whatever I need
on the fly, such as heaps, <a href="/blog/2022/05/22/#inverting-the-tree-links">linked lists</a>, and especially hash
tables. Few programs use more than a small subset of a data structure’s
features, making their implementation smaller, simpler, and <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">more
efficient</a> than the general case, which must handle every edge
case. A typical hash table tutorial will describe a relatively lengthy
program, but in practice, bespoke hash tables are <a href="/blog/2020/10/19/#hash-table-memoization">only a few lines of
code</a>. Over the years I’ve worked out some basic principles for hash
table construction that aid in quick and efficient implementation. This
article covers the technique and philosophy behind what I’ve come to call
the “mask-step-index” (MSI) hash table, which is my standard approach.</p>

<!--more-->

<p>MSI hash tables are nothing novel, just a <a href="https://en.wikipedia.org/wiki/Double_hashing">double hashed</a>, <a href="https://en.wikipedia.org/wiki/Open_addressing">open
address</a> hash table layered generically atop an external array. It’s
best regarded as a kind of database index — <em>a lookup index over an
existing array</em>. The array exists independently, and the hash table
provides an efficient lookup into that array over some property of its
entries.</p>

<p>The core of the MSI hash table is this iterator function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Compute the next candidate index. Initialize idx to the hash.</span>
<span class="kt">int32_t</span> <span class="nf">ht_lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">hash</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">idx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="kt">uint32_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">idx</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The name should now make sense. I literally sound it out in my head when I
type it, like a mnemonic. Compute a mask, then a step size, finally an
index. The <code class="language-plaintext highlighter-rouge">exp</code> parameter is a power-of-two exponent for the hash table
size, <a href="/blog/2022/05/14/">which may look familiar</a>. I’ve used <code class="language-plaintext highlighter-rouge">int32_t</code> for the index,
but it’s easy to substitute, say, <code class="language-plaintext highlighter-rouge">size_t</code>. I try to optimize for the
common case, where a 31-bit index is more than sufficient, and a signed
type since <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf">subscripts should be signed</a>. Internally it uses unsigned
types since overflow is both expected and harmless thanks to the
power-of-two hash table size.</p>
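<p>The reason the step is forced odd (<code class="language-plaintext highlighter-rouge">| 1</code>) is that an odd step is coprime
with a power-of-two table size, so the probe sequence visits every slot
before repeating. The sketch below checks that property by brute force.
<code class="language-plaintext highlighter-rouge">probe_coverage</code> is a hypothetical helper written for this illustration,
and it assumes <code class="language-plaintext highlighter-rouge">1 &lt;= exp &lt;= 10</code> so that a small fixed array suffices.</p>

```c
#include <stdint.h>

// Same iterator as above.
static int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (64 - exp)) | 1;
    return (idx + step) & mask;
}

// Hypothetical helper: count the distinct slots visited in 1<<exp probes.
// Assumes 1 <= exp <= 10 to fit the fixed-size "seen" array.
static int32_t probe_coverage(uint64_t hash, int exp)
{
    unsigned char seen[1<<10] = {0};
    int32_t count = 0;
    int32_t i = (int32_t)hash;  // initialize the index to the hash
    for (int32_t n = 0; n < (int32_t)1<<exp; n++) {
        i = ht_lookup(hash, exp, i);
        count += !seen[i];
        seen[i] = 1;
    }
    return count;
}
```

<p>For any hash, <code class="language-plaintext highlighter-rouge">probe_coverage(hash, exp)</code> should equal <code class="language-plaintext highlighter-rouge">1 &lt;&lt; exp</code>, which
means a search always reaches an empty slot as long as at least one exists.</p>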

<p>It’s the caller’s responsibility to compute the hash, and the MSI iterator
tells the caller <em>where to look next</em>. For insertion, the caller (maybe)
looks either for an existing entry to override, or an empty slot. For
lookup, the caller looks for a matching entry, giving up as soon as it
finds an empty slot. An insertion loop looks like this string intern table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EXP 15
</span>
<span class="c1">// Initialize all slots to an "empty" value (null)</span>
<span class="cp">#define HT_INIT { {0}, 0 }
</span><span class="k">struct</span> <span class="n">ht</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">};</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">key</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">EXP</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="c1">// empty, insert here</span>
            <span class="k">if</span> <span class="p">((</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="mi">1</span> <span class="o">==</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
            <span class="p">}</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">key</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="c1">// found, return canonical instance</span>
            <span class="k">return</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The caller initializes the iterator to the hash result. This will probably
be out of range, even negative, but that doesn’t matter. The iterator
function will turn it into a valid index before use. This detail is key to
<em>double hashing</em>: The low bits of the hash tell it where to start, and the
high bits tell it how to step. The hash table size is a power of two, and
the step size is forced to an odd number (via <code class="language-plaintext highlighter-rouge">| 1</code>), so it’s guaranteed
to visit each slot in the table exactly once before restarting. It’s
important that the search halts before looping, such as by guaranteeing
the existence of an empty slot (i.e. the “out of memory” check).</p>
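
<p>As a sanity check on the “exactly once” claim, here is a minimal sketch
that walks a full cycle and verifies no slot repeats. The body of
<code class="language-plaintext highlighter-rouge">ht_lookup</code> is my assumption of how the iterator described above is
written (mask from the exponent, step from the high hash bits, forced odd),
and <code class="language-plaintext highlighter-rouge">visits_every_slot</code> is a hypothetical helper that exists only for this
illustration.</p>

```c
#include <stdint.h>

// Assumed shape of the MSI iterator: the step comes from the high bits
// of the hash, forced odd, and the result is masked to the table size.
// Requires 1 <= exp <= 31 so the shift by 64-exp is well defined.
static int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (64 - exp)) | 1;
    return (int32_t)(((uint32_t)idx + step) & mask);
}

// Hypothetical helper: returns 1 if probing with hash h touches all
// 1<<exp slots exactly once before repeating. Supports 1 <= exp <= 16.
static int visits_every_slot(uint64_t h, int exp)
{
    static unsigned char seen[1 << 16];
    if (exp < 1 || exp > 16) {
        return 0;
    }
    int32_t n = (int32_t)1 << exp;
    for (int32_t k = 0; k < n; k++) {
        seen[k] = 0;
    }

    int32_t i = (int32_t)(uint32_t)h;  // truncated hash as the start
    for (int32_t k = 0; k < n; k++) {
        i = ht_lookup(h, exp, i);
        if (seen[i]) {
            return 0;  // revisited a slot before completing the cycle
        }
        seen[i] = 1;
    }
    return 1;
}
```

<p>An odd step shares no factor with a power-of-two table size, which is why
the walk cannot return to a slot early.</p>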

<p>Note: The example out of memory check pushes the hash table to the
absolute limit, and in practice you’d want to stop at a smaller load
factor — perhaps even as low as 50% since that’s simple and fast.
Otherwise it degrades into a linear search as the table approaches
capacity.</p>
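
<p>A 50% limit could be sketched as below. The name <code class="language-plaintext highlighter-rouge">ht_over_half</code> is
hypothetical, and the sketch assumes the same <code class="language-plaintext highlighter-rouge">len</code> and exponent
conventions as the example above.</p>

```c
#include <stdint.h>

// Hypothetical helper: nonzero if inserting one more entry would push a
// table of 1<<exp slots past 50% load. Assumes 1 <= exp <= 31, len >= 0.
static int ht_over_half(int32_t len, int exp)
{
    return ((uint32_t)len + 1) > ((uint32_t)1 << (exp - 1));
}
```

<p>In <code class="language-plaintext highlighter-rouge">intern</code> this would stand in for the comparison against
<code class="language-plaintext highlighter-rouge">1&lt;&lt;EXP</code>, reporting “out of memory” once the table is half full.</p>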

<p>Even if two keys start or land at the same place, they’ll quickly diverge
due to differing steps. For a while I used plain linear probing, i.e.
<code class="language-plaintext highlighter-rouge">step=1</code>, but double hashing came out ahead every time I benchmarked,
steering me towards this “MSI” construction. Ideally <code class="language-plaintext highlighter-rouge">ht_lookup</code> would be
placed so that it’s inlined — e.g. in the same translation unit — so that
the mask and step are not actually recomputed each iteration.</p>

<h3 id="deletion">Deletion</h3>

<p>What about deletion? First, consider how infrequently you delete entries
from a hash table. When was the last time you used <code class="language-plaintext highlighter-rouge">del</code> on a dictionary
in Python, or <code class="language-plaintext highlighter-rouge">delete</code> on a <code class="language-plaintext highlighter-rouge">map</code> in Go? This operation is rarely needed.
However, when you <em>do</em> need it, reserve a gravestone value in addition to
the empty value.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">char</span> <span class="n">gravestone</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"(deleted)"</span><span class="p">;</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">**</span><span class="n">dest</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// ...</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="c1">// ...</span>
            <span class="n">dest</span> <span class="o">=</span> <span class="n">dest</span> <span class="o">?</span> <span class="n">dest</span> <span class="o">:</span> <span class="o">&amp;</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="o">*</span><span class="n">dest</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">key</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">gravestone</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">dest</span> <span class="o">=</span> <span class="n">dest</span> <span class="o">?</span> <span class="n">dest</span> <span class="o">:</span> <span class="o">&amp;</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(...))</span> <span class="p">{</span>
            <span class="c1">// ...</span>
        <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">unintern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">gravestone</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// skip over</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(...))</span> <span class="p">{</span>
            <span class="kt">char</span> <span class="o">*</span><span class="n">old</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">gravestone</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">old</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When searching, skip over gravestones. Note that gravestones are compared
with <code class="language-plaintext highlighter-rouge">==</code> (identity), so this does not preclude a string <code class="language-plaintext highlighter-rouge">"(deleted)"</code>.
When inserting, use the first gravestone found if no entry was found.</p>
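
<p>Since the listing above elides most of the loops, here is one way the
pieces might fit together. This is a sketch under assumptions, not the
definitive version: the <code class="language-plaintext highlighter-rouge">hash</code> and <code class="language-plaintext highlighter-rouge">ht_lookup</code> bodies are stand-ins, <code class="language-plaintext highlighter-rouge">EXP</code>
is shrunk for illustration, and I chose to have <code class="language-plaintext highlighter-rouge">len</code> count gravestones
too so that an empty slot always remains and searches terminate.</p>

```c
#include <stdint.h>
#include <string.h>

#define EXP 8  // small table, for illustration only

struct ht {
    char *ht[1<<EXP];
    int32_t len;  // non-empty slots, gravestones included
};

static char gravestone[] = "(deleted)";

// Stand-in FNV-1a-style string hash (an assumption for this sketch).
static uint64_t hash(char *s, int32_t len)
{
    uint64_t h = 0xcbf29ce484222325;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 0x100000001b3;
    }
    return h;
}

// Assumed shape of the MSI iterator (1 <= exp <= 31).
static int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (64 - exp)) | 1;
    return (int32_t)(((uint32_t)idx + step) & mask);
}

char *intern(struct ht *t, char *key)
{
    char **dest = 0;  // first gravestone seen along the way, if any
    uint64_t h = hash(key, (int32_t)strlen(key)+1);
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, EXP, i);
        if (!t->ht[i]) {
            if (!dest) {
                // Taking a fresh slot: keep at least one slot empty.
                if ((uint32_t)t->len+1 == (uint32_t)1<<EXP) {
                    return 0;  // out of memory
                }
                t->len++;
                dest = &t->ht[i];
            }
            *dest = key;
            return key;
        } else if (t->ht[i] == gravestone) {
            dest = dest ? dest : &t->ht[i];
        } else if (!strcmp(t->ht[i], key)) {
            return t->ht[i];  // found, return canonical instance
        }
    }
}

char *unintern(struct ht *t, char *key)
{
    uint64_t h = hash(key, (int32_t)strlen(key)+1);
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, EXP, i);
        if (!t->ht[i]) {
            return 0;  // not present
        } else if (t->ht[i] == gravestone) {
            // skip over
        } else if (!strcmp(t->ht[i], key)) {
            char *old = t->ht[i];
            t->ht[i] = gravestone;  // len unchanged: slot is not empty
            return old;
        }
    }
}
```

<p>Reusing a gravestone does not increment <code class="language-plaintext highlighter-rouge">len</code>, and deleting does not
decrement it, so the “out of memory” check still guarantees an empty slot.</p>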

<h3 id="as-a-database-index">As a database index</h3>

<p>Iterating over the example string intern table is simple: Iterate over the
underlying array, skipping empty slots (and maybe gravestones). Entries
will be in a random order rather than, say, insertion order. This is a
useful introductory example, but this isn’t where MSI most shines. As
mentioned, it’s best when treated like a database index.</p>

<p>Let’s take a step back and consider the caller of <code class="language-plaintext highlighter-rouge">intern</code>. How does it
allocate these strings? Perhaps they’re <a href="/blog/2022/05/22/">appended to a buffer</a>, and
<code class="language-plaintext highlighter-rouge">intern</code> indicates whether or not the string is unique so far.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="c1">// lookup table over the buffer</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ht</span><span class="p">;</span>

    <span class="c1">// a collection of strings</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">BUFLEN</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Strings are only appended to the buffer when unique, and the hash table
can make that determination in constant time.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="nf">buf_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">len</span> <span class="o">&gt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
    <span class="p">}</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">candidate</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">+</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">candidate</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">==</span> <span class="n">candidate</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// string is unique, keep it</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In my first example, <code class="language-plaintext highlighter-rouge">EXP</code> was fixed. This could be converted into a
dynamic allocation and the hash table resized as needed. Here’s a new
constructor, which I’m including since I think it’s instructive:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">ht</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">exp</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">**</span><span class="n">ht</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">ht</span>
<span class="nf">ht_new</span><span class="p">(</span><span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ht</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">exp</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="n">assert</span><span class="p">(</span><span class="n">exp</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">exp</span> <span class="o">&gt;=</span> <span class="mi">32</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">ht</span><span class="p">;</span>  <span class="c1">// request too large</span>
    <span class="p">}</span>

    <span class="n">ht</span><span class="p">.</span><span class="n">ht</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">exp</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="k">return</span> <span class="n">ht</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">intern</code> fails, the hash table can be replaced with a new table twice
as large, and since, like a database index, its contents are entirely
redundant, <em>the hash table can be discarded and rebuilt from scratch</em>. The
new and old table don’t need to exist simultaneously. Here’s a routine to
populate an empty hash table from the buffer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">buf_rehash</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">off</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">off</span> <span class="o">&lt;</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">+</span> <span class="n">off</span><span class="p">;</span>
        <span class="kt">int32_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">off</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
        <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
                <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
                <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note how this iterates in insertion order, which may be useful in other
cases, too. On the rehash it doesn’t need to check for existing entries,
as all entries are already known to be unique. Later when <code class="language-plaintext highlighter-rouge">intern</code> hits
its capacity:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">result</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">);</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span> <span class="o">=</span> <span class="n">ht_new</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">exp</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
        <span class="p">}</span>
        <span class="n">buf_rehash</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>  <span class="c1">// cannot fail</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>I freed and reallocated the table, but it would be trivial to use a
<code class="language-plaintext highlighter-rouge">realloc</code> instead, unlike the case where the old table <em>isn’t</em> redundant.</p>
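
<p>For illustration, a <code class="language-plaintext highlighter-rouge">realloc</code> variant might look like the sketch below.
The name <code class="language-plaintext highlighter-rouge">ht_grow</code> is hypothetical, and the sketch assumes the dynamic
<code class="language-plaintext highlighter-rouge">struct ht</code> layout shown earlier. Because the old contents are redundant,
it simply clears the whole table and leaves repopulating it to a rehash.</p>

```c
#include <stdint.h>
#include <stdlib.h>

struct ht {
    int32_t len;
    int exp;
    char **ht;
};

// Hypothetical helper: double the table in place with realloc and clear
// it, ready for a rehash. Returns 0 on failure, in which case the old
// table is left untouched and still valid.
static int ht_grow(struct ht *t)
{
    int exp = t->exp + 1;
    if (exp >= 32) {
        return 0;  // request too large
    }
    size_t n = (size_t)1 << exp;
    if (n > SIZE_MAX / sizeof(t->ht[0])) {
        return 0;  // byte count would overflow size_t
    }
    char **p = realloc(t->ht, n * sizeof(t->ht[0]));
    if (!p) {
        return 0;  // out of memory
    }
    for (size_t i = 0; i < n; i++) {
        p[i] = 0;  // old entries are redundant, so discard them all
    }
    t->ht  = p;
    t->exp = exp;
    t->len = 0;
    return 1;
}
```

<p>One difference from the free-then-allocate version: when <code class="language-plaintext highlighter-rouge">realloc</code>
fails, the previous table is still intact.</p>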

<h3 id="multimaps">Multimaps</h3>

<p>An MSI hash table is trivially converted into a multimap, a hash table
with multiple values per key. Callers just make one small change: <em>Don’t
stop searching until an empty slot is found</em>. Each match is an additional
multimap value. The “value array” is stored along the hash table itself,
in insertion order, without additional allocations.</p>

<p>For example, imagine the strings in the string buffer have a namespace
prefix, delimited by a colon, like <code class="language-plaintext highlighter-rouge">city:Austin</code> and <code class="language-plaintext highlighter-rouge">state:Texas</code>. We’d
like a fast lookup of all strings under a particular namespace. The
solution is to add another hash table as you would an index to a database
table.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="c1">// ..</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ns</span><span class="p">;</span>
    <span class="c1">// ..</span>
<span class="p">};</span>
</code></pre></div></div>

<p>When a unique string is appended it’s also registered in the namespace
multimap. It doesn’t check for an existing key, only for an empty slot,
since it’s a multimap:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// Check outside the loop since it always inserts.</span>
    <span class="k">if</span> <span class="p">(</span><span class="cm">/* ... ns multimap lacks capacity ... */</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ... grow+rehash ns multimap ...</span>
    <span class="p">}</span>

    <span class="kt">int32_t</span> <span class="n">nslen</span> <span class="o">=</span> <span class="n">strcspn</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">":"</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">nslen</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
            <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The key includes the <code class="language-plaintext highlighter-rouge">:</code> as a terminator, which simplifies lookups. Here’s a
lookup loop that prints all strings under a namespace (the key includes the
terminal <code class="language-plaintext highlighter-rouge">:</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="o">*</span><span class="n">ns</span> <span class="o">=</span> <span class="s">"city:"</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">nslen</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">ns</span><span class="p">);</span>
    <span class="c1">// ...</span>

    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span> <span class="n">nslen</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">ns</span><span class="p">,</span> <span class="n">nslen</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">puts</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">nslen</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>
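
<p>To see both halves working together, here is a self-contained sketch of
the namespace multimap. The helper names <code class="language-plaintext highlighter-rouge">ns_insert</code> and <code class="language-plaintext highlighter-rouge">ns_find</code>, the
fixed <code class="language-plaintext highlighter-rouge">NSEXP</code>, and the <code class="language-plaintext highlighter-rouge">hash</code> and <code class="language-plaintext highlighter-rouge">ht_lookup</code> bodies are assumptions made
for illustration. It collects matches into an array rather than printing.</p>

```c
#include <stdint.h>
#include <string.h>

#define NSEXP 6  // 64 slots, for illustration only

// Stand-in FNV-1a-style string hash (an assumption for this sketch).
static uint64_t hash(char *s, int32_t len)
{
    uint64_t h = 0xcbf29ce484222325;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 0x100000001b3;
    }
    return h;
}

// Assumed shape of the MSI iterator (1 <= exp <= 31).
static int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (64 - exp)) | 1;
    return (int32_t)(((uint32_t)idx + step) & mask);
}

// Register s (e.g. "city:Austin") under its namespace prefix. Always
// inserts, never checks for an existing key. Returns 0 when full.
static int ns_insert(char **ns, int32_t *len, char *s)
{
    if (*len + 1 == 1 << NSEXP) {
        return 0;  // keep one slot empty so lookups terminate
    }
    int32_t nslen = (int32_t)strcspn(s, ":") + 1;
    uint64_t h = hash(s, nslen);
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, NSEXP, i);
        if (!ns[i]) {
            ns[i] = s;
            ++*len;
            return 1;
        }
    }
}

// Store up to cap values found under prefix (which includes the ':')
// into out, and return the total number of matches.
static int32_t ns_find(char **ns, char *prefix, char **out, int32_t cap)
{
    int32_t n = 0;
    int32_t nslen = (int32_t)strlen(prefix);
    uint64_t h = hash(prefix, nslen);
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, NSEXP, i);
        if (!ns[i]) {
            return n;  // empty slot: no more values for this key
        } else if (!strncmp(ns[i], prefix, (size_t)nslen)) {
            if (n < cap) {
                out[n] = ns[i] + nslen;
            }
            n++;
        }
    }
}
```

<p>Values sharing a key follow the same probe sequence, so without deletions
they come back in the order they were inserted.</p>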

<p>An alternative approach to multimaps is to additionally key over a value
subscript. For example, the first city is keyed <code class="language-plaintext highlighter-rouge">{"city", 0}</code>, the next
<code class="language-plaintext highlighter-rouge">{"city", 1}</code>, etc. The value subscript could be mixed into the string
hash with an <a href="/blog/2018/07/31/">integer permutation</a> (more on this below):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">val_idx</span> <span class="o">^</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">nslen</span><span class="p">));</span>
</code></pre></div></div>

<p>The lookup loop would compare both the string and the value subscript, and
stop when it finds a match. The underlying hash table is not truly a
multimap, but rather a plain hash table with a larger key. This requires
extra bookkeeping — tracking individual subscripts and the number of
values per key — but provides constant time random access on the multimap
value array.</p>
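
<p>A rough sketch of the subscripted variant follows. Everything here is an
assumption made for illustration: the <code class="language-plaintext highlighter-rouge">struct slot</code> layout, the helper
names <code class="language-plaintext highlighter-rouge">sub_insert</code> and <code class="language-plaintext highlighter-rouge">sub_get</code>, and the use of the SplitMix64 finalizer
as a stand-in for <code class="language-plaintext highlighter-rouge">hash64</code>. The caller is trusted to hand out each
subscript once per namespace.</p>

```c
#include <stdint.h>
#include <string.h>

#define EXP 6  // 64 slots, for illustration only

struct slot {
    char   *s;    // e.g. "city:Austin", null when the slot is empty
    int32_t idx;  // value subscript within its namespace
};

// Stand-in FNV-1a-style string hash (an assumption for this sketch).
static uint64_t hash(char *s, int32_t len)
{
    uint64_t h = 0xcbf29ce484222325;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 0x100000001b3;
    }
    return h;
}

// Stand-in integer permutation: the SplitMix64 finalizer.
static uint64_t hash64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9;
    x ^= x >> 27; x *= 0x94d049bb133111eb;
    x ^= x >> 31;
    return x;
}

// Assumed shape of the MSI iterator (1 <= exp <= 31).
static int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (64 - exp)) | 1;
    return (int32_t)(((uint32_t)idx + step) & mask);
}

// Insert s as value number idx (>= 0) of its namespace. Tracking the
// next free subscript per namespace is the caller's bookkeeping.
static int sub_insert(struct slot *t, int32_t *len, char *s, int32_t idx)
{
    if (*len + 1 == 1 << EXP) {
        return 0;  // keep one slot empty so lookups terminate
    }
    int32_t nslen = (int32_t)strcspn(s, ":") + 1;
    uint64_t h = hash64((uint64_t)idx ^ hash(s, nslen));
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, EXP, i);
        if (!t[i].s) {
            t[i].s = s;
            t[i].idx = idx;
            ++*len;
            return 1;
        }
    }
}

// Random access: value number idx under prefix (which includes the
// ':'), or null if there is no such value.
static char *sub_get(struct slot *t, char *prefix, int32_t idx)
{
    int32_t nslen = (int32_t)strlen(prefix);
    uint64_t h = hash64((uint64_t)idx ^ hash(prefix, nslen));
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, EXP, i);
        if (!t[i].s) {
            return 0;
        }
        if (t[i].idx == idx && !strncmp(t[i].s, prefix, (size_t)nslen)) {
            return t[i].s + nslen;  // match on both string and subscript
        }
    }
}
```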

<h3 id="hash-functions">Hash functions</h3>

<p>The MSI iterator leaves hashing up to the caller, who has better knowledge
about the input and how to hash it, though this takes a bit of knowledge
of how to build a hash function. The good news is that it’s easy, and less
is more. Better to do too little than too much, and a faster, weaker hash
function is worth a few extra collisions.</p>

<p>The first rule is to never lose sight of the goal: The purpose of the hash
function is to uniformly distribute entries over a table. The better you
know and exploit your input, the less you need to do in the hash function.
Sometimes your keys already contain random data, and so your hash function
can be the identity function! For example, if your keys are <a href="https://www.rfc-editor.org/rfc/rfc4122#section-4.4">“version 4”
UUIDs</a>, don’t waste time hashing them, just load a few bytes from the
end as an integer and you’re done.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// "Hash" a v4 UUID</span>
<span class="kt">uint64_t</span> <span class="nf">uuid4_hash</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">uuid</span><span class="p">[</span><span class="mi">16</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">h</span><span class="p">,</span> <span class="n">uuid</span><span class="o">+</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A reasonable start for strings is <a href="https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function">FNV-1a</a>, such as this possible
implementation for my <code class="language-plaintext highlighter-rouge">hash()</code> function above:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mh">0x100</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mi">255</span><span class="p">;</span>
        <span class="n">h</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">h</span> <span class="o">^</span> <span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">32</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The hash state is initialized to a <em>basis</em>, some arbitrary value. This is a
useful place to introduce a seed or hash key. It’s best that at least one
bit above the low mix-in bits is set so that it’s not trivially stuck at
zero. Above, I’ve chosen the most trivial basis with reasonable results,
though often I’ll use the digits of π.</p>

<p>Next XOR some input into the low bits. This could be a byte, a Unicode
code point, etc. More is better, since otherwise you’re stuck doing more
work per unit, the main weakness of FNV-1a. Carefully note the byte mask,
<code class="language-plaintext highlighter-rouge">&amp; 255</code>, which inhibits sign extension. <strong>Do not mix sign-extended inputs
into FNV-1a</strong> — a widespread implementation mistake.</p>
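<p>A small sketch of what goes wrong without the mask. The function names are hypothetical, and the input is declared <code class="language-plaintext highlighter-rouge">signed char</code> so the effect shows up regardless of whether plain <code class="language-plaintext highlighter-rouge">char</code> is signed on the target. The two functions agree on 7-bit input and disagree as soon as a byte has its high bit set.</p>

```c
#include <stdint.h>

// The byte is masked to 0..255 before mixing, as in the hash() above.
uint64_t hash_masked(const signed char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}

// The mistake: a negative byte is sign-extended to 64 bits, so the XOR
// flips every upper bit of the state instead of touching only the low 8.
uint64_t hash_unmasked(const signed char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= (uint64_t)s[i];
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}
```

<p>Since plain <code class="language-plaintext highlighter-rouge">char</code> is signed on some targets and unsigned on others, an unmasked implementation can produce different hashes for the same non-ASCII input depending on where it was compiled.</p>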

<p>Multiply by a large, odd random-ish integer. A prime is a reasonable
choice, and I usually pick my favorite prime, shown above: 19 ones in base
10.</p>

<p>Finally, my own touch, an xorshift finalizer. The high bits are much
better mixed than the low bits, so this improves the overall quality.
Though if you take time to benchmark, you might find that this finalizer
isn’t necessary. Remember, do <em>just</em> enough work to keep the number of
collisions low — not <em>lowest</em> — and no more.</p>

<p>If your input is made of integers, or is a short, fixed length, use an
<a href="/blog/2018/07/31/">integer permutation</a>, particularly multiply-xorshift. It takes very
little to get a sufficient distribution. Sometimes one multiplication does
the trick. Fixed-sized, integer-permutation hashes tend to be the fastest,
easily beating fancier SIMD-based hashes, <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">including AES-NI</a>. For
example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Hash a timestamp-based, version 1 UUID</span>
<span class="kt">uint64_t</span> <span class="nf">uuid1_hash</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">uuid</span><span class="p">[</span><span class="mi">16</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">uuid</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="mh">0x3243f6a8885a308d</span><span class="p">;</span>  <span class="c1">// digits of pi</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">33</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">33</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I benchmarked this in a real program, I would probably cut it down even
further, deleting hash operations one at a time and measuring the overall
hash table performance. This <code class="language-plaintext highlighter-rouge">memcpy</code> trick works well with floats, too,
especially packing two single precision floats into one 64-bit integer.</p>
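<p>As a sketch of the float case, here is a hypothetical <code class="language-plaintext highlighter-rouge">point_hash</code> for a 2D point. It assumes <code class="language-plaintext highlighter-rouge">float</code> is 32 bits wide, and the single multiply plus xorshift is only a starting guess to be trimmed or extended by benchmarking.</p>

```c
#include <stdint.h>
#include <string.h>

// Hash a 2D point by packing both floats into one 64-bit integer.
// Assumes sizeof(float) == 4.
uint64_t point_hash(float x, float y)
{
    float    f[2] = {x, y};
    uint64_t h;
    memcpy(&h, f, 8);            // reinterpret the bits, no conversion
    h *= 1111111111111111111;
    h ^= h >> 32;
    return h;
}
```

<p>One caveat with hashing raw bits: <code class="language-plaintext highlighter-rouge">0.0f</code> and <code class="language-plaintext highlighter-rouge">-0.0f</code> compare equal but have different bit patterns, so they hash differently unless normalized first.</p>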

<p>If you ever <a href="https://mort.coffee/home/tar/">hesitate to build a hash table</a> when the situation
calls, I hope the MSI technique will make the difference next time. I have
more hash table tricks up my sleeve, but since they’re not specific to MSI
I’ll save them for a future article.</p>

<h3 id="benchmarks">Benchmarks</h3>

<p>There have been objections to my claims about performance, so <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">I’ve
assembled some benchmarks</a>. These demonstrate that:</p>

<ul>
  <li>AES-NI is slower than an integer permutation, at least for short keys.</li>
  <li>A custom, 10-line MSI hash table is easily an order of magnitude faster
than a typical generic hash table from your language’s standard library.
This isn’t because the standard hash table is inferior, but because <a href="https://vimeo.com/644068002">it
wasn’t written for your specific problem</a>.</li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>My new debugbreak command</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/07/31/"/>
    <id>urn:uuid:c333d1ab-86b5-4389-b2b7-325d0eb90987</id>
    <updated>2022-07-31T12:59:59Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>I <a href="/blog/2022/06/26/">previously mentioned</a> the Windows feature where <a href="https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-registerhotkey">pressing
F12</a> in a debuggee window causes it to break in the debugger. It
works with any debugger — GDB, RemedyBG, Visual Studio, etc. — since the
hotkey simply raises a breakpoint <a href="https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp">structured exception</a>. It’s been
surprisingly useful, and I’ve wanted it available in more contexts, such
as console programs or even on Linux. The result is a new <a href="https://github.com/skeeto/w64devkit/blob/4282797/src/debugbreak.c"><code class="language-plaintext highlighter-rouge">debugbreak</code>
command</a>, now included in <a href="/blog/2020/05/15/">w64devkit</a>. Though, of course, you
already have <a href="/blog/2020/09/25/">everything you need</a> to build it and try it out right
now. I’ve also worked out a Linux implementation.</p>

<p>It’s named after an <a href="https://docs.microsoft.com/en-us/visualstudio/debugger/debugbreak-and-debugbreak">MSVC intrinsic and Win32 function</a>. It takes no
arguments, and its operation is indiscriminate: It raises a breakpoint
exception in <em>all</em> debuggee processes system-wide. Reckless? Perhaps, but
certainly convenient. You don’t need to tell it which process you want to
pause. It just works, and a good debugging experience is one of ease and
convenience.</p>

<p>The linchpin is <a href="https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-debugbreakprocess">DebugBreakProcess</a>. The command walks the process
list and fires this function at each process. Nothing happens for programs
without a debugger attached, so it doesn’t even bother checking if it’s a
debuggee. It couldn’t be simpler. I’ve used it on everything from Windows
XP to Windows 11, and it’s worked flawlessly.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span> <span class="n">s</span> <span class="o">=</span> <span class="n">CreateToolhelp32Snapshot</span><span class="p">(</span><span class="n">TH32CS_SNAPPROCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">PROCESSENTRY32W</span> <span class="n">p</span> <span class="o">=</span> <span class="p">{</span><span class="k">sizeof</span><span class="p">(</span><span class="n">p</span><span class="p">)};</span>
<span class="k">for</span> <span class="p">(</span><span class="n">BOOL</span> <span class="n">r</span> <span class="o">=</span> <span class="n">Process32FirstW</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">);</span> <span class="n">r</span><span class="p">;</span> <span class="n">r</span> <span class="o">=</span> <span class="n">Process32NextW</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">))</span> <span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">OpenProcess</span><span class="p">(</span><span class="n">PROCESS_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">th32ProcessID</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">DebugBreakProcess</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
        <span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I use it almost exclusively from Vim, where I’ve given it a <a href="https://learnvimscriptthehardway.stevelosh.com/chapters/06.html">leader
mapping</a>. With the editor focused, I can type backslash then
<kbd>d</kbd> to pause the debuggee.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">map</span> <span class="p">&lt;</span>leader<span class="p">&gt;</span><span class="k">d</span> <span class="p">:</span><span class="k">call</span> <span class="nb">system</span><span class="p">(</span><span class="s2">"debugbreak"</span><span class="p">)&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
</code></pre></div></div>

<p>With the debuggee paused, I’m free to add new breakpoints or watchpoints,
or print the call stack to see what the heck it’s busy doing. The
mechanism behind DebugBreakProcess is to create a new thread in the
target, with that thread raising the breakpoint exception. The debugger
will be stopped in this new thread. In GDB you can use the <code class="language-plaintext highlighter-rouge">thread</code>
command to switch over to the thread that actually matters, usually <code class="language-plaintext highlighter-rouge">thr
1</code>.</p>

<h3 id="debugbreak-on-linux">debugbreak on Linux</h3>

<p>On unix-like systems the equivalent of a breakpoint exception is a
<code class="language-plaintext highlighter-rouge">SIGTRAP</code>. There’s already a standard command for sending signals,
<a href="https://man7.org/linux/man-pages/man1/kill.1.html"><code class="language-plaintext highlighter-rouge">kill</code></a>, so a <code class="language-plaintext highlighter-rouge">debugbreak</code> command can be built using nothing more
than a few lines of shell script. However, unlike DebugBreakProcess,
signaling every process with <code class="language-plaintext highlighter-rouge">SIGTRAP</code> will only end in tears. The script
will need a way to determine which processes are debuggees.</p>

<p>Linux exposes processes in the file system as virtual files under <code class="language-plaintext highlighter-rouge">/proc</code>,
where each process appears as a directory. Its <code class="language-plaintext highlighter-rouge">status</code> file includes a
<code class="language-plaintext highlighter-rouge">TracerPid</code> field, which will be non-zero for debuggees. The script
inspects this field, and if non-zero sends a <code class="language-plaintext highlighter-rouge">SIGTRAP</code>.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="k">for </span>pid <span class="k">in</span> <span class="si">$(</span>find /proc <span class="nt">-maxdepth</span> 1 <span class="nt">-printf</span> <span class="s1">'%f\n'</span> | <span class="nb">grep</span> <span class="s1">'^[0-9]\+$'</span><span class="si">)</span><span class="p">;</span> <span class="k">do
    </span><span class="nb">grep</span> <span class="nt">-q</span> <span class="s1">'^TracerPid:\s[^0]'</span> /proc/<span class="nv">$pid</span>/status 2&gt;/dev/null <span class="o">&amp;&amp;</span>
        <span class="nb">kill</span> <span class="nt">-TRAP</span> <span class="nv">$pid</span>
<span class="k">done</span>
</code></pre></div></div>

<p>This script, now part of <a href="/blog/2012/06/23/">my dotfiles</a>, has worked very well so
far, and effectively smooths over some debugging differences between
Windows and Linux, reducing my context switching mental load. There’s
probably a better way to express this script, but that’s the best I could
do so far. On the BSDs you’d need to parse the output of <code class="language-plaintext highlighter-rouge">ps</code>, though each
system seems to do its own thing for distinguishing debuggees.</p>
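<p>For comparison, the <code class="language-plaintext highlighter-rouge">TracerPid</code> check can also be sketched in C. This is not part of the script above, and the <code class="language-plaintext highlighter-rouge">tracer_pid</code> name is hypothetical. It is Linux-only, reading the same <code class="language-plaintext highlighter-rouge">status</code> file that the script greps.</p>

```c
#include <stdio.h>

// Return the TracerPid of a process as listed in /proc/<pid>/status:
// zero when nothing is tracing it, or -1 if the file or field is missing.
long tracer_pid(long pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/status", pid);

    FILE *f = fopen(path, "r");
    if (!f) {
        return -1;
    }

    long tracer = -1;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        // The field looks like "TracerPid:<tab>1234"
        if (sscanf(line, "TracerPid: %ld", &tracer) == 1) {
            break;
        }
    }
    fclose(f);
    return tracer;
}
```

<p>A process would receive <code class="language-plaintext highlighter-rouge">SIGTRAP</code> only when this returns a positive value.</p>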

<h3 id="a-missing-feature">A missing feature</h3>

<p>I had originally planned for one flag, <code class="language-plaintext highlighter-rouge">-k</code>. Rather than breakpoint
debuggees, it would terminate all debuggee processes. This is especially
important on Windows where debuggee processes block builds due to file
locking shenanigans. I’d just run <code class="language-plaintext highlighter-rouge">debugbreak -k</code> as part of the build.
However, it’s not possible to terminate debuggees paused in the debugger —
the common situation. I’ve given up on this for now.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Assertions should be more debugger-oriented</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/06/26/"/>
    <id>urn:uuid:22ae914c-971b-4cee-ba48-a189db1b6df6</id>
    <updated>2022-06-26T18:51:04Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="python"/><category term="java"/>
    <content type="html">
      <![CDATA[<p>Prompted by <a href="https://www.youtube.com/watch?v=r9eQth4Q5jg">a 20 minute video</a>, over the past month I’ve improved my
debugger skills. I’d shamefully acquired a bad habit: avoiding a debugger
until exhausting dumber, insufficient methods. My <em>first</em> choice should be
a debugger, but I had allowed a bit of friction to dissuade me. With some
thoughtful practice and deliberate effort clearing the path, my bad habit
is finally broken — at least when a good debugger is available. It feels
like I’ve leveled up and, <a href="/blog/2017/04/01/">like touch typing</a>, this was a skill I’d
neglected far too long. One friction point was the less-than-optimal
<code class="language-plaintext highlighter-rouge">assert</code> feature in basically every programming language implementation.
It ought to work better with debuggers.</p>

<p>An assertion verifies a program invariant, and so if one fails then
there’s undoubtedly a defect in the program. In other words, assertions
make programs more sensitive to defects, allowing problems to be caught
more quickly and accurately. Counter-intuitively, crashing early and often
makes for more robust and reliable software in the long run. For exactly
this reason, assertions go especially well with <a href="/blog/2019/01/25/">fuzzing</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">assert</span><span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">);</span>   <span class="c1">// bounds check</span>
<span class="n">assert</span><span class="p">((</span><span class="kt">ssize_t</span><span class="p">)</span><span class="n">size</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// suspicious size_t</span>
<span class="n">assert</span><span class="p">(</span><span class="n">cur</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">!=</span> <span class="n">cur</span><span class="p">);</span>    <span class="c1">// circular reference?</span>
</code></pre></div></div>

<p>They’re sometimes abused for error handling, which is a reason they’ve
also been (wrongfully) discouraged at times. For example, failing to open
a file is an error, not a defect, so an assertion is inappropriate.</p>

<p>Normal programs have implicit assertions all over, even if we don’t
usually think of them as assertions. In some cases they’re checked by the
hardware. Examples of implicit assertion failures:</p>

<ul>
  <li>Out-of-bounds indexing</li>
  <li>Dereferencing null/nil/None</li>
  <li>Dividing by zero</li>
  <li>Certain kinds of integer overflow (e.g. <code class="language-plaintext highlighter-rouge">-ftrapv</code>)</li>
</ul>

<p>Programs are generally not intended to recover from these situations
because, had they been anticipated, the invalid operation wouldn’t have
been attempted in the first place. The program simply crashes because
there’s no better alternative. Sanitizers, including Address Sanitizer
(ASan) and Undefined Behavior Sanitizer (UBSan), are in essence
additional, implicit assertions, checking invariants that aren’t normally
checked.</p>

<p>Ideally a failing assertion should have these two effects:</p>

<ul>
  <li>
    <p>Execution should <em>immediately</em> stop. The program is in an unknown state,
so it’s neither safe to “clean up” nor attempt to recover. Additional
execution will only make debugging more difficult, and may obscure the
defect.</p>
  </li>
  <li>
    <p>When run under a debugger — or visited as a core dump — it should break
exactly at the failed assertion, ready for inspection. I should not need
to dig around the call stack to figure out where the failure occurred. I
certainly shouldn’t need to manually set a breakpoint and restart the
program hoping to fail the assertion a second time. The whole reason for
using a debugger is to save time, so if it’s wasting my time then it’s
failing at its primary job.</p>
  </li>
</ul>

<p>I examined standard <code class="language-plaintext highlighter-rouge">assert</code> features across various language
implementations, and none strictly meet the criteria. Fortunately, in some
cases, it’s trivial to build a better assertion, and you can substitute
your own definition. First, let’s discuss the way assertions disappoint.</p>

<h3 id="a-test-assertion">A test assertion</h3>

<p>My test for C and C++ is minimal but establishes some state and gives me a
variable to inspect:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;assert.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">5</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then I compile and debug in the most straightforward way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -g -o test test.c
$ gdb test
(gdb) r
(gdb) bt
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">r</code> in GDB stands for <code class="language-plaintext highlighter-rouge">run</code>, which immediately breaks because of the
<code class="language-plaintext highlighter-rouge">assert</code>. The <code class="language-plaintext highlighter-rouge">bt</code> prints a backtrace. On a typical Linux distribution
that shows this backtrace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  __GI_raise
#1  __GI_abort
#2  __assert_fail_base
#3  __GI___assert_fail
#4  main
</code></pre></div></div>

<p>Well, actually, it’s much messier than this, but I manually cleaned it up:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linu
x/raise.c:50
#1  0x00007ffff7df4537 in __GI_abort () at abort.c:79
#2  0x00007ffff7df440f in __assert_fail_base (fmt=0x7ffff7f5d
128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x
55555555600b "i &lt; 5", file=0x555555556004 "test.c", line=6, f
unction=&lt;optimized out&gt;) at assert.c:92
#3  0x00007ffff7e03662 in __GI___assert_fail (assertion=0x555
55555600b "i &lt; 5", file=0x555555556004 "test.c", line=6, func
tion=0x555555556011 &lt;__PRETTY_FUNCTION__.0&gt; "main") at assert
.c:101
#4  0x0000555555555178 in main () at test.c:6
</code></pre></div></div>

<p>That’s a lot to take in at a glance, and about 95% of it is noise that
will never contain useful information. Most notably, GDB didn’t stop at
the failing assertion. Instead there are <em>four stack frames</em> of libc junk I
have to navigate before I can even begin debugging.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) up
(gdb) up
(gdb) up
(gdb) up
</code></pre></div></div>

<p>I must wade through this for every assertion failure. This is some of the
friction that made me avoid the debugger in the first place. glibc loves
indirection, so maybe the other libc implementations do better? How about
musl?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  setjmp
#1  raise
#2  ??
#3  ??
#4  ??
#5  ??
#6  ??
#7  ??
#8  ??
#9  ??
#10 ??
#11 ??
</code></pre></div></div>

<p>Oops, without musl debugging symbols I can’t debug assertions at all
because GDB can’t read the stack, so it’s lost. If you’re on Alpine you
can install <code class="language-plaintext highlighter-rouge">musl-dbg</code>, but otherwise you’ll probably need to build your
own from source. With debugging symbols, musl is no better than glibc:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  __restore_sigs
#1  raise
#2  abort
#3  __assert_fail
#4  main
</code></pre></div></div>

<p>Same with FreeBSD:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  thr_kill
#1  raise
#2  abort
#3  __assert
#4  main
</code></pre></div></div>

<p>OpenBSD has one fewer frame:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  thrkill
#1  _libc_abort
#2  _libc___assert2
#3  main
</code></pre></div></div>

<p>How about on Windows with Mingw-w64?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Inferior 1 (process 7864) exited with code 03]
</code></pre></div></div>

<p>Oops, on Windows GDB doesn’t break at all on <code class="language-plaintext highlighter-rouge">assert</code>. You must first set
a breakpoint on <code class="language-plaintext highlighter-rouge">abort</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) b abort
</code></pre></div></div>

<p>Besides that, it’s the most straightforward so far:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0 msvcrt!abort
#1 msvcrt!_assert
#2 main
</code></pre></div></div>

<p>With MSVC (default CRT) I get something slightly different:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0 abort
#1 common_assert_to_stderr
#2 _wassert
#3 main
#4 __scrt_common_main_seh
</code></pre></div></div>

<p>RemedyBG leaves me at the <code class="language-plaintext highlighter-rouge">abort</code> like GDB does elsewhere. Visual Studio
recognizes that I don’t care about its stack frames and instead puts the
focus on the assertion, ready for debugging. The other stack frames are
there, but basically invisible. It’s the only case that practically meets
all my criteria!</p>

<p>I can’t entirely blame these implementations. The C standard requires that
<code class="language-plaintext highlighter-rouge">assert</code> print a diagnostic and call <code class="language-plaintext highlighter-rouge">abort</code>, and that <code class="language-plaintext highlighter-rouge">abort</code> raises
<code class="language-plaintext highlighter-rouge">SIGABRT</code>. There’s not much implementations can do, and it’s up to the
debugger to be smarter about it.</p>

<h3 id="sanitizers">Sanitizers</h3>

<p>ASan doesn’t break GDB on assertion failures, which is yet another source
of friction. You can work around this with an environment variable:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export ASAN_OPTIONS=abort_on_error=1:print_legend=0
</code></pre></div></div>

<p>This works, but it’s the worst case of all: I get 7 junk stack frames on
top of the failed assertion. It’s also very noisy when it traps, so the
<code class="language-plaintext highlighter-rouge">print_legend=0</code> helps to cut it down a bit. I want this variable so often
that I set it in my shell’s <code class="language-plaintext highlighter-rouge">.profile</code> so that it’s always set.</p>

<p>With UBSan you can use <code class="language-plaintext highlighter-rouge">-fsanitize-undefined-trap-on-error</code>, which behaves
like the improved assertion. It traps directly on the defect with no junk
frames, though it prints no diagnostic. As a bonus, it also means you
don’t need to link <code class="language-plaintext highlighter-rouge">libubsan</code>. Thanks to the bonus, it fully supplants
<code class="language-plaintext highlighter-rouge">-ftrapv</code> for me on all platforms.</p>

<p><strong>Update November 2022</strong>: This “stop” hook eliminates ASan friction by
popping runtime frames — functions with the reserved <code class="language-plaintext highlighter-rouge">__</code> prefix — from
the call stack so that they’re not in the way when GDB takes control. It
requires Python support, which is the purpose of the feature-sniff outer
condition.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if !$_isvoid($_any_caller_matches)
    define hook-stop
        while $_thread &amp;&amp; $_any_caller_matches("^__")
            up-silently
        end
    end
end
</code></pre></div></div>

<p>This is now part of my <code class="language-plaintext highlighter-rouge">.gdbinit</code>.</p>

<h3 id="a-better-assertion">A better assertion</h3>

<p>At least when under a debugger, here’s a much better assertion macro for
GCC and Clang:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define assert(c) if (!(c)) __builtin_trap()
</span></code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">__builtin_trap</code> inserts a trap instruction — a built-in breakpoint. By
not calling a function to raise a signal, there are no junk stack frames
and no need to breakpoint on <code class="language-plaintext highlighter-rouge">abort</code>. It stops exactly where it should as
quickly as possible. This definition works reliably with GCC across all
platforms, too. On MSVC the equivalent is <code class="language-plaintext highlighter-rouge">__debugbreak</code>. If you’re really
in a pinch then do whatever it takes to trigger a fault, like
dereferencing a null pointer. A more complete definition might be:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef DEBUG
#  if __GNUC__
#    define assert(c) if (!(c)) __builtin_trap()
#  elif _MSC_VER
#    define assert(c) if (!(c)) __debugbreak()
#  else
#    define assert(c) if (!(c)) *(volatile int *)0 = 0
#  endif
#else
#  define assert(c)
#endif
</span></code></pre></div></div>

<p>None of these print a diagnostic, but that’s unnecessary when a debugger
is involved.</p>

<h3 id="other-languages">Other languages</h3>

<p>Unfortunately the situation <a href="https://github.com/rust-lang/rust/issues/21102">mostly gets worse</a> with other language
implementations, and it’s generally not possible to build a better
assertion. Assertions typically have exception-like semantics, if not
literally just another exception, and so they are far less reliable. If a
failed assertion raises an exception, then the program won’t stop until
it’s unwound the stack — running destructors and such along the way — all
the way to the top level looking for a handler. It only knows there’s a
problem when nobody was there to catch it.</p>

<p><a href="https://go.dev/doc/faq#assertions">Go officially doesn’t have assertions</a>, though panics are a kind of
assertion. However, panics have exception-like semantics, and so suffer
the problems of exceptions. A Go version of my test:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">defer</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"DEFER"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="m">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="m">5</span> <span class="p">{</span>
            <span class="nb">panic</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I run this under Go’s premier debugger, <a href="https://github.com/go-delve/delve">Delve</a>, the unrecovered
panic causes it to break. So far so good. However, I get two junk frames:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0 runtime.fatalpanic
#1 runtime.gopanic
#2 main.main
#3 runtime.main
#4 runtime.goexit
</code></pre></div></div>

<p>It only knows to stop because the Go runtime called <code class="language-plaintext highlighter-rouge">fatalpanic</code>, but the
backtrace is a fiction: The program continued to run after the panic,
enough to run all the registered defers (including printing “DEFER”),
unwinding the stack to the top level, and only then did it <code class="language-plaintext highlighter-rouge">fatalpanic</code>.
Fortunately it’s still possible to inspect all those stack frames even if
some variables may have changed while unwinding, but it’s more like
inspecting a core dump than a paused process.</p>

<p>The situation in Python is similar: <code class="language-plaintext highlighter-rouge">assert</code> raises AssertionError — a
plain old exception — and <code class="language-plaintext highlighter-rouge">pdb</code> won’t break until the stack has unwound,
exiting context managers and such. Only once the exception reaches the top
level does it enter “post mortem debugging,” like a core dump. At least
there are no junk stack frames on top. If you’re using asyncio then your
program may continue running for quite a while before the right tasks are
scheduled and the exception finally propagates to the top level, if ever.</p>

<p>The worst offender of all is Java. First, <code class="language-plaintext highlighter-rouge">jdb</code> never breaks for unhandled
exceptions. It’s up to you to set a breakpoint before the exception is
thrown. But it gets worse: assertions are disabled under <code class="language-plaintext highlighter-rouge">jdb</code>. The Java
<code class="language-plaintext highlighter-rouge">assert</code> statement is worse than useless.</p>

<h3 id="addendum-dont-exit-the-debugger">Addendum: Don’t exit the debugger</h3>

<p>The largest friction-reducing change I made is never exiting the debugger.
Previously I would enter GDB, run my program, exit, edit/rebuild, repeat.
However, there’s no reason to exit GDB! It automatically and reliably
reloads symbols and updates breakpoints on symbols. It remembers your run
configuration, so re-running is just <code class="language-plaintext highlighter-rouge">r</code> rather than interacting with
shell history.</p>

<p>My workflow on all platforms (<a href="/blog/2020/05/15/">including Windows</a>) is a vertically
maximized Vim window and a vertically maximized terminal window. The new
part for me: The terminal runs a long-term GDB session exclusively, with
<code class="language-plaintext highlighter-rouge">file</code> set to the program I’m writing, usually set by the initial command
line.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gdb myprogram
gdb&gt;
</code></pre></div></div>

<p>Alternatively use <code class="language-plaintext highlighter-rouge">file</code> after starting GDB. Occasionally useful if my
project has multiple binaries, and I want to examine a different program.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; file myprogram
</code></pre></div></div>

<p>I use <code class="language-plaintext highlighter-rouge">make</code> and Vim’s <code class="language-plaintext highlighter-rouge">:mak</code> command for building from within the editor,
so I don’t need to change context to build. The quickfix list takes me
straight to warnings/errors. Often I’m writing something that takes input
from standard input. So I use the <code class="language-plaintext highlighter-rouge">run</code> (<code class="language-plaintext highlighter-rouge">r</code>) command to set this up
(along with any command line arguments).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; r &lt;test.txt
</code></pre></div></div>

<p>You can redirect standard output as well. It remembers these settings for
plain <code class="language-plaintext highlighter-rouge">run</code> later, so I can test my program by entering <code class="language-plaintext highlighter-rouge">r</code> and nothing
else.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; r
</code></pre></div></div>

<p>My usual workflow is edit, <code class="language-plaintext highlighter-rouge">:mak</code>, <code class="language-plaintext highlighter-rouge">r</code>, repeat. If I want to test a
different input or use different options, change the run configuration
using <code class="language-plaintext highlighter-rouge">run</code> again:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; r -a -b -c &lt;test2.txt
</code></pre></div></div>

<p>On Windows you cannot recompile while the program is running. If GDB is
sitting on a breakpoint but I want to build, use <code class="language-plaintext highlighter-rouge">kill</code> (<code class="language-plaintext highlighter-rouge">k</code>) to stop it
without exiting GDB.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; k
</code></pre></div></div>

<p>GDB has an annoying, flow-breaking yes/no prompt for this, so I recommend
<code class="language-plaintext highlighter-rouge">set confirm no</code> in your <code class="language-plaintext highlighter-rouge">.gdbinit</code> to disable it.</p>

<p>Sometimes a program is stuck in a loop and I need it to break in the
debugger. I try to avoid CTRL-C in the terminal since it can confuse
GDB. A safer option is to signal the process from Vim with <code class="language-plaintext highlighter-rouge">pkill</code>, which
GDB will catch (except on Windows):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:!pkill myprogram
</code></pre></div></div>

<p>I suspect many people don’t know this, but if you’re on Windows and
<a href="/blog/2021/03/11/">developing a graphical application</a>, you can <a href="https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-registerhotkey">press F12</a> in the
debuggee’s window to immediately break the program in the attached
debugger. This is a general platform feature and works with any native
debugger. I’ve been using it quite a lot.</p>

<p>On that note, you can run commands from GDB with <code class="language-plaintext highlighter-rouge">!</code>, which is another way
to avoid having an extra terminal window around:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; !git diff
</code></pre></div></div>

<p>In any case, GDB will re-read the binary on the next <code class="language-plaintext highlighter-rouge">run</code> and update
breakpoints, so it’s mostly seamless. If there’s a function I want to
debug, I set a breakpoint on it, then run.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; b somefunc
gdb&gt; r
</code></pre></div></div>

<p>Alternatively I’ll use a line number, which I read from Vim. Though GDB,
not being involved in the editing process, cannot track how that line
moves between builds.</p>

<p>An empty command repeats the last command, so once I’m at a breakpoint,
I’ll type <code class="language-plaintext highlighter-rouge">next</code> (<code class="language-plaintext highlighter-rouge">n</code>) — or <code class="language-plaintext highlighter-rouge">step</code> (<code class="language-plaintext highlighter-rouge">s</code>) to enter function calls — then
press enter each time I want to advance a line, often with my eye on the
context in Vim in the other window:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; n
gdb&gt;
gdb&gt;
</code></pre></div></div>

<p>(<del>I wish GDB could print a source listing around the breakpoint as
context, like Delve, but no such feature exists. The woeful <code class="language-plaintext highlighter-rouge">list</code> command
is inadequate.</del> <strong>Update</strong>: GDB’s TUI is a reasonable compromise for GUI
applications or terminal applications running under a separate tty/console
with either <code class="language-plaintext highlighter-rouge">tty</code> or <code class="language-plaintext highlighter-rouge">set new-console</code>. I can access it everywhere since
w64devkit now supports GDB TUI.)</p>

<p>If I want to advance to the next breakpoint, I use <code class="language-plaintext highlighter-rouge">continue</code> (<code class="language-plaintext highlighter-rouge">c</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; c
</code></pre></div></div>

<p>If I’m walking through a loop, I want to see how variables change, but
it’s tedious to keep <code class="language-plaintext highlighter-rouge">print</code>ing (<code class="language-plaintext highlighter-rouge">p</code>) the same variables again and again.
So I use <code class="language-plaintext highlighter-rouge">display</code> (<code class="language-plaintext highlighter-rouge">disp</code>) to display an expression with each prompt,
much like the “watch” window in Visual Studio. For example, if my loop
variable is <code class="language-plaintext highlighter-rouge">i</code> over some string <code class="language-plaintext highlighter-rouge">str</code>, this will show me the current
character in character format (<code class="language-plaintext highlighter-rouge">/c</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; disp/c str[i]
</code></pre></div></div>

<p>You can accumulate multiple expressions. Use <code class="language-plaintext highlighter-rouge">undisplay</code> to remove them.</p>

<p>Too many breakpoints? Use <code class="language-plaintext highlighter-rouge">info breakpoints</code> (<code class="language-plaintext highlighter-rouge">i b</code>) to list them, then
<code class="language-plaintext highlighter-rouge">delete</code> (<code class="language-plaintext highlighter-rouge">d</code>) the unwanted ones by ID.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; i b
gdb&gt; d 3 5 8
</code></pre></div></div>

<p>GDB has many more features than this, but 10 commands cover 99% of use
cases: <code class="language-plaintext highlighter-rouge">r</code>, <code class="language-plaintext highlighter-rouge">c</code>, <code class="language-plaintext highlighter-rouge">n</code>, <code class="language-plaintext highlighter-rouge">s</code>, <code class="language-plaintext highlighter-rouge">disp</code>, <code class="language-plaintext highlighter-rouge">k</code>, <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">i</code>, <code class="language-plaintext highlighter-rouge">d</code>, <code class="language-plaintext highlighter-rouge">p</code>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>My take on "where's all the code"</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/05/22/"/>
    <id>urn:uuid:2eb07dcf-0d4c-44e7-9133-fd9cf8e83227</id>
    <updated>2022-05-22T23:59:59Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://lobste.rs/s/ny4ymx">on Lobsters</a>.</em></p>

<p>Earlier this month Ted Unangst researched <a href="https://flak.tedunangst.com/post/compiling-an-openbsd-kernel-50-faster">compiling the OpenBSD kernel
50% faster</a>, which involved stubbing out the largest, extraneous
branches of the source tree. To find the lowest-hanging fruit, he <a href="https://flak.tedunangst.com/post/watc">wrote a
tool</a> called <a href="https://humungus.tedunangst.com/r/watc">watc</a> — <em>where’s all the code</em> — that displays an
interactive “usage” summary of a source tree oriented around line count. A
followup post <a href="https://flak.tedunangst.com/post/parallel-tree-running">about exploring the tree in parallel</a> got me thinking
about the problem, especially since <a href="/blog/2022/05/14/">I had just written about a concurrent
queue</a>. Turning it over in my mind, I saw opportunities for interesting
data structures and memory management, and so I wanted to write my own
version of the tool, <a href="https://github.com/skeeto/scratch/blob/master/windows/watc.c"><strong><code class="language-plaintext highlighter-rouge">watc.c</code></strong></a>, which is the subject of this
article.</p>

<!--more-->

<p>The original <code class="language-plaintext highlighter-rouge">watc</code> is interactive and written in idiomatic Go. My version
is non-interactive, written in C, and currently only supports Windows. Not
only do I prefer batch programs generally, building an interactive user
interface would be complicated and distract from the actual problem I
wanted to tackle. As for the platform restriction, it has some convenient
constraints (for implementers), and my projects are often about shooting
multiple birds with one stone:</p>

<ul>
  <li>
    <p>The longest path, <code class="language-plaintext highlighter-rouge">MAX_PATH</code>, a meager 260 pseudo-UTF-16 code points,
is nice and short. Technically users can now opt in to a <a href="https://docs.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation">maximum path
length of 32,767</a>, but so little software supports it, including
much of Windows itself, that it’s not worth considering. Even with the
upper limit, each path component is still restricted by <code class="language-plaintext highlighter-rouge">MAX_PATH</code>. I
can rely on this platform restriction in my design.</p>
  </li>
  <li>
    <p>Symbolic links, an annoying edge case, are outside of consideration.
Technically Windows has them, but they’re sufficiently locked away that
they don’t come up in practice.</p>
  </li>
  <li>
    <p>After years of deliberating, I <a href="https://www.youtube.com/watch?v=r9eQth4Q5jg">was finally convinced</a> to buy and
try <a href="https://remedybg.handmade.network/">RemedyBG</a>, a super slick Windows debugger. I especially wanted
to try out its multi-threading support, and I knew I’d be using multiple
threads in this project. Since it’s incompatible with <a href="/blog/2020/05/15/">my development
kit</a>, my program also supports the MSVC compiler.</p>
  </li>
  <li>
    <p>The very same day I <a href="https://github.com/skeeto/w64devkit/commit/1513aa7">improved GDB support</a> in my development kit,
and this was a great opportunity to dogfood the changes. I’ve used my
kit <em>so much</em> these past two years, especially since both it and I have
matured enough that I’m nearly as productive in it as I am on Linux.</p>
  </li>
  <li>
    <p>It’s practice and experience with <a href="/blog/2021/12/30/">the wide API</a>, and the tool
fully supports Unicode paths. Perhaps a bit unnecessary considering how
few source trees stray beyond ASCII, even just in source text — just too
many ways things go wrong otherwise.</p>
  </li>
</ul>

<p>Running my tool on nearly the same source tree as the original example
yields:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:\openbsd&gt;watc sys
. 6.89MLOC 364.58MiB
├─dev 5.69MLOC 332.75MiB
│ ├─pci 4.46MLOC 293.80MiB
│ │ ├─drm 3.99MLOC 280.25MiB
│ │ │ ├─amd 3.33MLOC 261.24MiB
│ │ │ │ ├─include 2.61MLOC 238.48MiB
│ │ │ │ │ ├─asic_reg 2.53MLOC 235.07MiB
│ │ │ │ │ │ ├─nbio 689.56kLOC 69.33MiB
│ │ │ │ │ │ ├─dcn 583.67kLOC 58.60MiB
│ │ │ │ │ │ ├─gc 290.26kLOC 28.90MiB
│ │ │ │ │ │ ├─dce 210.16kLOC 16.81MiB
│ │ │ │ │ │ ├─mmhub 155.60kLOC 16.03MiB
│ │ │ │ │ │ ├─dpcs 123.90kLOC 12.97MiB
│ │ │ │ │ │ ├─gca 105.91kLOC 5.87MiB
│ │ │ │ │ │ ├─bif 71.45kLOC 4.41MiB
│ │ │ │ │ │ ├─gmc 64.24kLOC 3.41MiB
│ │ │ │ │ │ └─(other) 230.99kLOC 18.73MiB
│ │ │ │ │ └─(other) 2.10kLOC 139.29kiB
│ │ │ │ └─(other) 718.93kLOC 22.76MiB
│ │ │ └─(other) 583.63kLOC 16.86MiB
│ │ └─(other) 8.53kLOC 259.07kiB
│ └─(other) 1.20MLOC 38.34MiB
└─(other) 1.20MLOC 31.83MiB
</code></pre></div></div>

<p>In place of interactivity it has <code class="language-plaintext highlighter-rouge">-n</code> (lines) and <code class="language-plaintext highlighter-rouge">-d</code> (depth) switches to
control tree pruning, where branches are summarized as <code class="language-plaintext highlighter-rouge">(other)</code> entries.
My idea is for users to run the tool repeatedly with different cutoffs and
filters to get a feel for <em>where’s all the code</em>. (It could really use
more such knobs.) Repeated counting makes performance all the more
important. On my machine, with a hot cache, the above takes ~180ms to count
those 6.89 million lines of code across 8,607 source files.</p>

<p>Each directory is treated like one big source file of its recursively
concatenated contents, so the tool only needs to track directories. Each
directory entry comprises a variable-length string name, line and byte
totals, and tree linkage such that it can be later navigated for sorting
and printing. That linkage has a clever solution, which I’ll get to later.
First, let’s deal with strings.</p>

<h3 id="string-management">String management</h3>

<p>It’s important to get out of the null-terminated string business early,
only reverting to their use at system boundaries, such as constructing
paths for the operating system. Better to handle strings as offset/length
pairs into a buffer. Definitely avoid silly things like <a href="https://www.youtube.com/watch?v=f4ioc8-lDc0&amp;t=4407s">allocating many
individual strings</a>, as encouraged by <code class="language-plaintext highlighter-rouge">strdup</code> — and most other
programming language idioms — and certainly avoid <a href="/blog/2021/07/30/">useless functions like
<code class="language-plaintext highlighter-rouge">strcpy</code></a>.</p>

<p>When the operating system provides a path component that I need to track
for later, I intern it into a single, large buffer. That buffer looks like
so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define BUF_MAX  (1 &lt;&lt; 22)
</span><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">wchar_t</span> <span class="n">buf</span><span class="p">[</span><span class="n">BUF_MAX</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Empirically I determined that even large source trees cumulatively total
on the order of 10,000 characters of directory names. The OpenBSD kernel
source tree is only 2,992 characters of names.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find sys -type d -printf %f | wc -c
2992
</code></pre></div></div>

<p>The biggest I found was the LLVM source tree at 121,720 characters, not
only because of its sheer volume but also because it generally has
relatively long names. So for my maximum buffer size I just maxed it out
(explained in a moment) and called it good. Even with UTF-16, that’s only
8MiB which is perfectly reasonable to allocate all at once up front. Since
my <a href="https://floooh.github.io/2018/06/17/handles-vs-pointers.html">string handles</a> don’t contain pointers, this buffer could be freely
relocated in the case of <code class="language-plaintext highlighter-rouge">realloc</code>.</p>

<p>The operating system provides a null-terminated string. The buffer makes a
copy and returns a handle. A handle is a 32-bit integer encoding offset
and length.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">buf_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">off</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">wcslen</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">len</span> <span class="o">&gt;</span> <span class="n">BUF_MAX</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>  <span class="c1">// out of memory</span>
    <span class="p">}</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="o">+</span><span class="n">off</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">s</span><span class="p">));</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">len</span><span class="o">&lt;&lt;</span><span class="mi">22</span> <span class="o">|</span> <span class="n">off</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The negative range is reserved for errors, leaving 31 bits. I allocate 9
to the length — enough for <code class="language-plaintext highlighter-rouge">MAX_PATH</code> of 260 — and the remaining 22 bits
for the buffer offset, exactly matching the range of my <code class="language-plaintext highlighter-rouge">BUF_MAX</code>.
Splitting on a nibble boundary would have displayed more nicely in
hexadecimal during debugging, but oh well.</p>

<p>A couple of helper functions are in order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>     <span class="nf">str_len</span><span class="p">(</span><span class="kt">int32_t</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">22</span><span class="p">;</span>      <span class="p">}</span>
<span class="kt">int32_t</span> <span class="nf">str_off</span><span class="p">(</span><span class="kt">int32_t</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0x3fffff</span><span class="p">;</span> <span class="p">}</span>
</code></pre></div></div>
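<p>To see the handle encoding round-trip, here is a minimal, self-contained
sketch of the pieces so far. It makes a few assumptions of its own: the extra
guard in <code class="language-plaintext highlighter-rouge">buf_push</code> (rejecting names longer than 511 characters, and an
offset that would spill into the length bits) is not in the listing above, and
<code class="language-plaintext highlighter-rouge">str_get</code> is a hypothetical helper, named here only for illustration, that
copies a handle’s text back out as a null-terminated string for a system
boundary.</p>

```c
#include <stdint.h>
#include <string.h>
#include <wchar.h>

#define BUF_MAX  (1 << 22)
struct buf {
    int32_t len;
    wchar_t buf[BUF_MAX];
};

// Intern s and return a handle: 9-bit length above a 22-bit offset.
// The length/offset guard is this sketch's addition, since it accepts
// arbitrary input rather than names already bounded by MAX_PATH.
int32_t buf_push(struct buf *b, const wchar_t *s)
{
    size_t n = wcslen(s);
    if (n > 0x1ff || b->len+(int32_t)n >= BUF_MAX) {
        return -1;  // too long, or out of memory
    }
    int32_t off = b->len;
    int32_t len = (int32_t)n;
    memcpy(b->buf+off, s, len*sizeof(*s));
    b->len += len;
    return len<<22 | off;
}

int     str_len(int32_t s) { return s >> 22;      }
int32_t str_off(int32_t s) { return s & 0x3fffff; }

// Hypothetical helper: copy the handle's text into out and append the
// null terminator the operating system expects. Returns the length.
// out must have room for at least str_len(s)+1 elements.
int str_get(const struct buf *b, int32_t s, wchar_t *out)
{
    int len = str_len(s);
    memcpy(out, b->buf+str_off(s), len*sizeof(*out));
    out[len] = 0;
    return len;
}
```

<p>Pushing <code class="language-plaintext highlighter-rouge">L"sys"</code> and then <code class="language-plaintext highlighter-rouge">L"dev"</code> into an empty buffer yields two handles
of length 3 at offsets 0 and 3. Neither depends on where the buffer lives in
memory.</p>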

<p>Rather than allocate the string buffer on the heap, it’s a <code class="language-plaintext highlighter-rouge">static</code> (read:
too big for the stack) scoped to <code class="language-plaintext highlighter-rouge">main</code>. I consistently call it <code class="language-plaintext highlighter-rouge">b</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">buf</span> <span class="n">b</span><span class="p">;</span>
</code></pre></div></div>

<p>That’s string management solved efficiently in a dozen lines of code. I
briefly considered a hash table to de-duplicate strings in the buffer, but
real source trees aren’t redundant enough to make up for the hash table
itself, plus there’s no reason here to make that sort of time/memory
trade-off.</p>

<h3 id="directory-entries">Directory entries</h3>

<p>I settled on 24-byte directory entries:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">dir</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">nbytes</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">nlines</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">name</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">link</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">nsubdirs</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">nbytes</code> I teetered between 32 bits and 64 bits for the byte count. No
source tree I found overflows an unsigned 32-bit integer, but LLVM comes
close, just barely overflowing a signed 31-bit integer as of this year.
Since I wanted 10x over the worst case I could find, that left me with a
64-bit integer for bytes.</p>

<p>For <code class="language-plaintext highlighter-rouge">nlines</code>, 32 bits has plenty of headroom. More importantly, this field
is updated concurrently and atomically by multiple threads — line counting
is parallelized — and I want this program to work on 32-bit hosts limited
to 32-bit atomics.</p>
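<p>As a rough illustration of that concurrent update, here is a sketch using
C11 <code class="language-plaintext highlighter-rouge">&lt;stdatomic.h&gt;</code>. This is an assumption for the sake of a portable example:
the actual tool may well use compiler intrinsics or platform calls instead.
The <code class="language-plaintext highlighter-rouge">count_lines</code> helper is hypothetical, and relaxed ordering assumes the
totals are only read after the worker threads have been joined.</p>

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: count the newlines in one chunk of file data,
// then add them to a directory's shared total with a single 32-bit
// atomic add. Returns the number of lines found in this chunk.
uint32_t count_lines(_Atomic uint32_t *nlines,
                     const unsigned char *buf, size_t len)
{
    uint32_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += buf[i] == '\n';
    }
    // Counting locally first means one atomic operation per chunk
    // rather than one per line.
    atomic_fetch_add_explicit(nlines, n, memory_order_relaxed);
    return n;
}
```

<p>Any number of threads can call this on the same counter without a lock,
and a 32-bit counter keeps it within reach of hosts that lack 64-bit
atomics.</p>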

<p>The <code class="language-plaintext highlighter-rouge">name</code> is the string handle for that directory’s name.</p>

<p>The <code class="language-plaintext highlighter-rouge">link</code> and <code class="language-plaintext highlighter-rouge">nsubdirs</code> are the tree linkage. The <code class="language-plaintext highlighter-rouge">link</code> field is an
index, and serves two different purposes at different times. Initially it
will identify the directory’s parent directory, and I had originally named
it <code class="language-plaintext highlighter-rouge">parent</code>. <code class="language-plaintext highlighter-rouge">nsubdirs</code> is the number of subdirectories, but there is
initially no link to a directory’s children.</p>

<p>Like with the buffer, I pre-allocate all the directory entries I’ll need:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define DIRS_MAX  (1 &lt;&lt; 17)
</span><span class="kt">int32_t</span> <span class="n">ndirs</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">dir</span> <span class="n">dirs</span><span class="p">[</span><span class="n">DIRS_MAX</span><span class="p">];</span>
</code></pre></div></div>

<p>A directory handle is just an index into <code class="language-plaintext highlighter-rouge">dirs</code>. The <code class="language-plaintext highlighter-rouge">link</code> field is one
such handle. Like string handles, directory entries contain no pointers,
and so this <code class="language-plaintext highlighter-rouge">dirs</code> buffer could be freely relocated, <em>a la</em> <code class="language-plaintext highlighter-rouge">realloc</code>, if
the context called for such flexibility. In my program, rather than
allocate this on the heap, it’s just a <code class="language-plaintext highlighter-rouge">static</code> (read: too big for the
stack) scoped to <code class="language-plaintext highlighter-rouge">main</code>.</p>

<p>For <code class="language-plaintext highlighter-rouge">DIRS_MAX</code>, I again looked at the worst case I could find, LLVM, which
requires 12,163 entries. I had hoped for 16-bit directory handles, but
that would limit source trees to 32,768 directories — not quite 10x over
the worst case. I settled on 131,072 entries: 3MiB. At only 11MiB total so
far, in the very worst case, it hardly matters that I couldn’t shave off
these extra few bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find llvm-project -type d | wc -l
12163
</code></pre></div></div>
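<p>The memory budget is easy to double check. This sketch assumes 2-byte
UTF-16 code units, as on Windows (elsewhere <code class="language-plaintext highlighter-rouge">wchar_t</code> is commonly 4 bytes, so
it uses <code class="language-plaintext highlighter-rouge">uint16_t</code> as a stand-in), and the two function names are made up for
illustration.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_MAX   (1 << 22)
#define DIRS_MAX  (1 << 17)

struct dir {
    uint64_t nbytes;
    uint32_t nlines;
    int32_t  name;
    int32_t  link;
    int32_t  nsubdirs;
};

// 8 + 4 + 4 + 4 + 4 bytes with no padding needed.
_Static_assert(sizeof(struct dir) == 24, "directory entries are 24 bytes");

// Bytes for the string buffer, assuming 2-byte UTF-16 code units.
size_t names_bytes(void)
{
    return (size_t)BUF_MAX * sizeof(uint16_t);
}

// Bytes for the pre-allocated directory entries.
size_t dirs_bytes(void)
{
    return (size_t)DIRS_MAX * sizeof(struct dir);
}
```

<p>That works out to 8MiB of names plus 3MiB of entries, matching the 11MiB
total.</p>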

<p>Allocating a directory entry is just a matter of bumping the <code class="language-plaintext highlighter-rouge">ndirs</code>
counter. Reading a directory into <code class="language-plaintext highlighter-rouge">dirs</code> looks roughly like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="n">glob</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">L"*"</span><span class="p">);</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">dir</span> <span class="n">dirs</span><span class="p">[</span><span class="n">DIRS_MAX</span><span class="p">];</span>

<span class="kt">int32_t</span> <span class="n">parent</span> <span class="o">=</span> <span class="p">...;</span>  <span class="c1">// an existing directory handle</span>
<span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
<span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">glob</span><span class="p">);</span>

<span class="n">WIN32_FIND_DATAW</span> <span class="n">fd</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">FindFirstFileW</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fd</span><span class="p">);</span>

<span class="k">do</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">FILE_ATTRIBUTE_DIRECTORY</span> <span class="o">&amp;</span> <span class="n">fd</span><span class="p">.</span><span class="n">dwFileAttributes</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">fd</span><span class="p">.</span><span class="n">cFileName</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">name</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">ndirs</span> <span class="o">==</span> <span class="n">DIRS_MAX</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// out of memory</span>
        <span class="p">}</span>
        <span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">ndirs</span><span class="o">++</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="n">parent</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nsubdirs</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="c1">// ... process file ...</span>
    <span class="p">}</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">FindNextFileW</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fd</span><span class="p">));</span>

<span class="n">FindClose</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>  <span class="c1">// find handles are closed with FindClose, not CloseHandle</span>
</code></pre></div></div>

<p>Mentally bookmark that “process file” part. It will be addressed later.</p>

<p>The <code class="language-plaintext highlighter-rouge">buildpath</code> function walks the <code class="language-plaintext highlighter-rouge">link</code> fields, copying (<code class="language-plaintext highlighter-rouge">memcpy</code>) path
components from the string buffer into the <code class="language-plaintext highlighter-rouge">path</code>, separated by
backslashes.</p>
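
<p>To make that concrete, here’s a minimal sketch of how such a function
might look. The <code class="language-plaintext highlighter-rouge">struct buf</code> and <code class="language-plaintext highlighter-rouge">struct dir</code> layouts, the <code class="language-plaintext highlighter-rouge">PATH_CAP</code>
constant, and this particular <code class="language-plaintext highlighter-rouge">buf_push</code> are assumptions for illustration,
not the program’s actual definitions. Since links run from leaf to root,
the sketch fills a temporary buffer from the end backwards.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

#define PATH_CAP 260  // stand-in for MAX_PATH (assumption)

// Hypothetical layouts holding only the fields this sketch needs
struct buf { wchar_t buf[1<<12]; int32_t len; };
struct dir { int32_t name; int32_t link; };

// Append a null-terminated string, returning its handle (an offset
// into the buffer), or -1 if it does not fit.
static int32_t buf_push(struct buf *b, const wchar_t *s)
{
    int32_t n = (int32_t)wcslen(s) + 1;
    int32_t cap = (int32_t)(sizeof(b->buf) / sizeof(wchar_t));
    if (n > cap - b->len) {
        return -1;
    }
    memcpy(b->buf + b->len, s, n*sizeof(wchar_t));
    int32_t handle = b->len;
    b->len += n;
    return handle;
}

// Write "root\...\parent\name" into path, which must hold PATH_CAP
// elements. Returns 0 if the result does not fit, otherwise 1.
static int buildpath(wchar_t *path, struct buf *b, struct dir *dirs,
                     int32_t parent, int32_t name)
{
    wchar_t tmp[PATH_CAP];
    int32_t i = PATH_CAP;
    tmp[--i] = 0;  // null terminator goes in last

    int32_t handle = name;
    for (int32_t d = parent;; d = dirs[d].link) {
        const wchar_t *s = b->buf + handle;
        int32_t n = (int32_t)wcslen(s);
        if (n + 1 > i) {
            return 0;  // no room for component plus delimiter
        }
        i -= n;
        memcpy(tmp + i, s, n*sizeof(wchar_t));
        if (d < 0) {
            break;  // just copied the root's name
        }
        tmp[--i] = L'\\';
        handle = dirs[d].name;
    }

    memcpy(path, tmp + i, (PATH_CAP - i)*sizeof(wchar_t));
    return 1;
}
```

<p>With a root named <code class="language-plaintext highlighter-rouge">.</code>, a child named <code class="language-plaintext highlighter-rouge">src</code>, and the <code class="language-plaintext highlighter-rouge">*</code> glob as the
final component, this sketch produces <code class="language-plaintext highlighter-rouge">.\src\*</code>.</p>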

<h3 id="breadth-first-tree-traversal">Breadth-first tree traversal</h3>

<p>At the top-level the program must first traverse a tree. There are two
strategies for traversing a tree (or any graph):</p>

<ul>
  <li>Depth-first: stack-oriented (lends to recursion)</li>
  <li>Breadth-first: queue-oriented</li>
</ul>

<p>Recursion makes me nervous, but besides this, a queue is already a natural
fit for this problem. The tree I build in <code class="language-plaintext highlighter-rouge">dirs</code> is also the breadth-first
processing queue. (Note: This is entirely distinct from the <em>message</em>
queue that I’ll introduce later, and is not a concurrent queue.) Further,
building the tree in <code class="language-plaintext highlighter-rouge">dirs</code> via breadth-first traversal will have useful
properties later.</p>

<p>The queue is initialized with the root directory, then iterated over until
the iterator reaches the end. Additional directories may be added during
iteration, per the last section.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="n">root</span> <span class="o">=</span> <span class="n">ndirs</span><span class="o">++</span><span class="p">;</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">root</span><span class="p">].</span><span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">L"."</span><span class="p">);</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">root</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>  <span class="c1">// terminator</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">parent</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">parent</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">parent</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ... FindFirstFileW / FindNextFileW ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When the loop exits, the program has traversed the full tree. Counts are
now propagated up the tree using the <code class="language-plaintext highlighter-rouge">link</code> field, pointing from leaves to
root. In this direction it’s just a linked list. Propagation starts at the
root and works towards leaves to avoid multiple-counting, and the
breadth-first <code class="language-plaintext highlighter-rouge">dirs</code> is already ordered for this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span><span class="p">;</span> <span class="n">j</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">link</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nbytes</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">nlines</span> <span class="o">+=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nlines</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since this is really another traversal, this could be done during the
first traversal. However, line counting will be done concurrently, and
it’s easier, and probably more efficient, to propagate concurrent results
after the concurrent part of the code is complete.</p>
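
<p>To see why the ordering matters, here’s the same loop wrapped in a
function and applied to a tiny, made-up tree. The reduced <code class="language-plaintext highlighter-rouge">struct dir</code> and
the name <code class="language-plaintext highlighter-rouge">propagate</code> are assumptions for the sake of a self-contained
sketch.</p>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical subset of the directory entry
struct dir {
    int32_t link;    // parent handle, -1 at the root
    int64_t nbytes;
    int64_t nlines;
};

// Add each entry's own counts to every one of its ancestors. Entries
// must be in breadth-first order so that parents precede children.
static void propagate(struct dir *dirs, int32_t ndirs)
{
    for (int32_t i = 1; i < ndirs; i++) {
        for (int32_t j = dirs[i].link; j >= 0; j = dirs[j].link) {
            dirs[j].nbytes += dirs[i].nbytes;
            dirs[j].nlines += dirs[i].nlines;
        }
    }
}
```

<p>Take a root with 10 lines of its own, two children with 20 and 30 lines,
and a grandchild with 40 lines under the first child. When entry <code class="language-plaintext highlighter-rouge">i</code> is
visited, nothing has been added to it yet, because only later entries can
be its descendants. So each entry contributes exactly its own counts to
each ancestor, and the root ends at 100. Iterating in the reverse order
would add the grandchild’s 40 lines to the root twice, once directly and
once more through its parent, for a total of 140.</p>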

<h3 id="inverting-the-tree-links">Inverting the tree links</h3>

<p>Printing the graph will require a depth-first traversal. Given an entry,
the program will iterate over its children. However, the tree links are
currently backwards, pointing from child to parent:</p>

<p><a href="/img/diagram/bfs0.dot"><img src="/img/diagram/bfs0.png" alt="" /></a></p>

<p>To traverse from root to leaves, those links will need to be inverted:</p>

<p><a href="/img/diagram/bfs1.dot"><img src="/img/diagram/bfs1.png" alt="" /></a></p>

<p>However, there’s only one <code class="language-plaintext highlighter-rouge">link</code> on each node, but potentially multiple
children. The breadth-first traversal comes to the rescue: All child nodes
for a given directory are adjacent in <code class="language-plaintext highlighter-rouge">dirs</code>. If <code class="language-plaintext highlighter-rouge">link</code> points to the
first child, finding the rest is trivial. There’s an implicit link between
siblings by virtue of position:</p>

<p><a href="/img/diagram/bfs2.dot"><img src="/img/diagram/bfs2.png" alt="" /></a></p>

<p>An entry’s first child immediately follows the previous entry’s last
child. So to flip the links around, manually establish the root’s <code class="language-plaintext highlighter-rouge">link</code>
field, then walk the tree breadth-first and hook <code class="language-plaintext highlighter-rouge">link</code> up to each entry’s
children based on the previous entry’s <code class="language-plaintext highlighter-rouge">link</code> and <code class="language-plaintext highlighter-rouge">nsubdirs</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dirs</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">link</span> <span class="o">+</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The tree is now restructured for sorting and depth-first traversal.</p>
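
<p>As a worked example, here’s that loop as a function, applied to a
six-entry tree. The reduced <code class="language-plaintext highlighter-rouge">struct dir</code> and the name <code class="language-plaintext highlighter-rouge">invert</code> are
assumptions for illustration only.</p>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical subset of the directory entry
struct dir {
    int32_t link;      // in: parent handle, out: first child handle
    int32_t nsubdirs;  // number of children
};

// Replace child-to-parent links with parent-to-first-child links.
// Requires breadth-first order, where siblings are adjacent.
static void invert(struct dir *dirs, int32_t ndirs)
{
    dirs[0].link = 1;  // the root's first child always follows it
    for (int32_t i = 1; i < ndirs; i++) {
        dirs[i].link = dirs[i-1].link + dirs[i-1].nsubdirs;
    }
}
```

<p>Suppose the root (0) has children 1 and 2, entry 1 has child 3, and entry
2 has children 4 and 5, so the <code class="language-plaintext highlighter-rouge">nsubdirs</code> values are 2, 1, 2, 0, 0, 0.
The resulting <code class="language-plaintext highlighter-rouge">link</code> values are 1, 3, 4, 6, 6, 6. Leaves end up pointing
one past the end of the tree, which is harmless since their <code class="language-plaintext highlighter-rouge">nsubdirs</code> is
zero and so the link is never followed.</p>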

<h3 id="sort-by-line-count">Sort by line count</h3>

<p>I won’t include it here, but I have a <code class="language-plaintext highlighter-rouge">qsort</code>-compatible comparison
function, <code class="language-plaintext highlighter-rouge">dircmp</code>, that compares by line count descending, then by name
ascending. As a file system tree, siblings cannot have equal names.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">dircmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>
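
<p>For readers who want something concrete anyway, here’s one way such a
comparator might be written. This is a sketch, not the program’s actual
function: the reduced <code class="language-plaintext highlighter-rouge">struct dir</code> and the file-scope <code class="language-plaintext highlighter-rouge">names</code> pointer are
assumptions. The latter exists because <code class="language-plaintext highlighter-rouge">qsort</code> has no context parameter
through which to pass the string buffer.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

// Hypothetical subset of the directory entry
struct dir {
    int32_t name;    // handle (offset) into the string buffer
    int64_t nlines;
};

// Base of the string buffer, set before sorting (an assumption)
static const wchar_t *names;

static int dircmp(const void *pa, const void *pb)
{
    const struct dir *a = pa;
    const struct dir *b = pb;
    if (a->nlines != b->nlines) {
        // Descending, and compare rather than subtract so that the
        // 64-bit difference is never truncated to int
        return a->nlines < b->nlines ? +1 : -1;
    }
    // Ascending by code unit, not by locale collation
    return wcscmp(names + a->name, names + b->name);
}
```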

<p>Since child entries are adjacent, it’s trivial to <code class="language-plaintext highlighter-rouge">qsort</code> each entry’s
children. A loop sorts the whole tree:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">dir</span> <span class="o">*</span><span class="n">beg</span> <span class="o">=</span> <span class="n">dirs</span> <span class="o">+</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span><span class="p">;</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">beg</span><span class="p">,</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">dirs</span><span class="p">),</span> <span class="n">dircmp</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We’re almost to the finish line.</p>

<h3 id="depth-first-traversal">Depth-first traversal</h3>

<p>As I said, recursion makes me nervous, so I took the slightly more
complicated route of an explicit stack. Path components must be separated
by a backslash delimiter, so the deepest possible stack is <code class="language-plaintext highlighter-rouge">MAX_PATH/2</code>.
Each stack element tracks a directory handle (<code class="language-plaintext highlighter-rouge">d</code>) and a subdirectory
index (<code class="language-plaintext highlighter-rouge">i</code>).</p>

<p>I have a <code class="language-plaintext highlighter-rouge">printstat</code> to output an entry. It takes an entry, the string
buffer, and a depth for indentation level.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">printstat</span><span class="p">(</span><span class="k">struct</span> <span class="n">dir</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">int</span> <span class="n">depth</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s a simplified depth-first traversal calling <code class="language-plaintext highlighter-rouge">printstat</code>. (The real
one has to make decisions about when to stop and summarize, and it’s
dominated by edge cases.) I initialize the stack with the root directory,
then loop until it’s empty.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// top of stack</span>
<span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">d</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span> <span class="n">stack</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="o">/</span><span class="mi">2</span><span class="p">];</span>

<span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">printstat</span><span class="p">(</span><span class="n">dirs</span><span class="o">+</span><span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>

<span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">d</span> <span class="o">=</span> <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span><span class="o">++</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">d</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">n</span><span class="o">--</span><span class="p">;</span>  <span class="c1">// pop</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">d</span><span class="p">].</span><span class="n">link</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">printstat</span><span class="p">(</span><span class="n">dirs</span><span class="o">+</span><span class="n">cur</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>  <span class="c1">// one level below its parent</span>
        <span class="n">n</span><span class="o">++</span><span class="p">;</span>  <span class="c1">// push</span>
        <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
        <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="concurrency">Concurrency</h3>

<p>At this point the “process file” part of traversal was a straightforward
<code class="language-plaintext highlighter-rouge">CreateFile</code>, <code class="language-plaintext highlighter-rouge">ReadFile</code> loop, <code class="language-plaintext highlighter-rouge">CloseHandle</code>. I suspected it spent most of
its time in the loop counting newlines since I didn’t do anything special,
<a href="/blog/2021/12/04/">like SIMD</a>, aside from <a href="/blog/2019/12/09/">not over-constraining code
generation</a>.</p>
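
<p>For a sense of how little is going on in that loop, here’s a portable
sketch of the same shape. It swaps the Win32 calls for standard <code class="language-plaintext highlighter-rouge">stdio</code>
and leans on <code class="language-plaintext highlighter-rouge">memchr</code> to find newlines. The function names and chunk size
are my own choices for illustration, not what the program uses.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Count newline bytes in a buffer.
static long long countnewlines(const char *buf, size_t len)
{
    long long n = 0;
    const char *p = buf;
    const char *end = buf + len;
    while ((p = memchr(p, '\n', (size_t)(end - p)))) {
        n++;
        p++;  // resume just past the newline
    }
    return n;
}

// Read a stream in fixed-size chunks, summing the newline counts.
static long long countlines(FILE *f)
{
    assert(f);
    char chunk[1<<14];
    long long n = 0;
    for (;;) {
        size_t len = fread(chunk, 1, sizeof(chunk), f);
        if (len == 0) {
            return n;  // end of file or read error
        }
        n += countnewlines(chunk, len);
    }
}
```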

<p>However, after taking some measurements, I found the program was spending
99.9% of its time waiting on Win32 functions. <code class="language-plaintext highlighter-rouge">CreateFile</code> was the most
expensive at nearly 50% of the total run time, and even <code class="language-plaintext highlighter-rouge">CloseHandle</code> was
a substantial blocker. These two alone meant overlapped I/O wouldn’t help
much, and threads were necessary to run these Win32 blockers concurrently.
Counting newlines, even over gigabytes of data, was practically free, and
so required no further attention.</p>

<p>So I set up <a href="/blog/2022/05/14/">my lock-free work queue</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define QUEUE_LEN (1&lt;&lt;15)
</span><span class="k">struct</span> <span class="n">queue</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">q</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">d</span><span class="p">[</span><span class="n">QUEUE_LEN</span><span class="p">];</span>
    <span class="kt">int32_t</span> <span class="n">f</span><span class="p">[</span><span class="n">QUEUE_LEN</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>As before, <code class="language-plaintext highlighter-rouge">q</code> here is the atomic. A max-size queue for <code class="language-plaintext highlighter-rouge">QUEUE_LEN</code> worked
best in my tests. Larger queues were rarely full. Or empty, except at
startup and shutdown. Queue elements are a pair of directory handle (<code class="language-plaintext highlighter-rouge">d</code>)
and file string handle (<code class="language-plaintext highlighter-rouge">f</code>), stored in separate arrays.</p>

<p>I didn’t need to push the file name strings into the string buffer before,
but now it’s a great way to supply strings to other threads. I push the
string into the buffer, then send the handle through the queue. The
recipient re-constructs the path on its end using the directory tree and
this file name. Unfortunately this puts more stress on the string buffer,
which is why I had to max out the size, but it’s worth it.</p>

<p>The “process files” part now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="n">fd</span><span class="p">.</span><span class="n">nFileSizeLow</span><span class="p">;</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">fd</span><span class="p">.</span><span class="n">nFileSizeHigh</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span><span class="p">;</span>

<span class="kt">int32_t</span> <span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">fd</span><span class="p">.</span><span class="n">cFileName</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">queue</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">name</span><span class="p">))</span> <span class="p">{</span>
    <span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
    <span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="n">processfile</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">queue_send()</code> returns false then the queue is full, so it processes
the job itself. There might be room later for the next file.</p>

<p>Worker threads look similar, spinning until an item arrives in the queue:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">d</span><span class="p">;</span>
        <span class="kt">int32_t</span> <span class="n">name</span><span class="p">;</span>
        <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_recv</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">d</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">name</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">d</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
        <span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
        <span class="n">processfile</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">d</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>A special directory entry handle of -1 tells the worker to exit. When
traversal completes, the main thread becomes a worker until the queue
empties, pushes one termination handle for each worker thread, then joins
the worker threads — a synchronization point that indicates all work is
complete, and the main thread can move on to propagation and sorting.</p>

<p>This was a substantial performance boost. At least on my system, running
just 4 threads total is enough to saturate the Win32 interface, and
additional threads do not make the program faster despite more available
cores.</p>

<p>Aside from overall portability, I’m quite happy with the results.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A lock-free, concurrent, generic queue in 32 bits</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/05/14/"/>
    <id>urn:uuid:b5a6b85a-19af-4f2f-8a32-0098f6e87edb</id>
    <updated>2022-05-14T04:22:24Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=31384602">on Hacker News</a>.</em></p>

<p>While considering concurrent queue design I came up with a generic,
lock-free queue that fits in a 32-bit integer. The queue is “generic” in
that a single implementation supports elements of any arbitrary type,
despite an implementation in C. It’s lock-free in that there is guaranteed
system-wide progress. It can store up to 32,767 elements at a time — more
than enough for message queues, which <a href="/blog/2020/05/24/">must always be bounded</a>. I
will first present a single-consumer, single-producer queue, then expand
support to multiple consumers at a cost. Like <a href="/blog/2022/03/13/">my lightweight barrier</a>,
I’m not presenting this as a packaged solution, but rather as a technique
you can apply when circumstances call.</p>

<!--more-->

<p>How can the queue store so many elements when it’s just 32 bits? It only
handles the indexes of a circular buffer. The <a href="/blog/2018/06/10/">caller is responsible</a>
for allocating and manipulating the queue’s storage, which, in the
single-consumer case, doesn’t require anything fancy. Synchronization is
managed by the queue.</p>

<p>Like a typical circular buffer, it has a head index and a tail index. The
head is the next element to be pushed, and the tail is the next element to
be popped. The queue storage must have a power-of-two length, but the
capacity is one less than the length. If the head and tail are equal then
the queue is empty. This “wastes” one element, which is why the capacity
is one less than the length of the storage. So already there are some
notable constraints imposed by this design, but I believe the main use
case for such a queue — a job queue for CPU-bound jobs — has no problem
with these constraints.</p>
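
<p>The index arithmetic behind those rules can be sketched in isolation.
What follows is plain, single-threaded code with the head and tail held in
separate variables. It deliberately leaves out the atomics and the 32-bit
packing, and the function names are made up for this illustration.</p>

```c
#include <assert.h>
#include <stdint.h>

// The storage length is 1<<exp, and both indexes stay in [0, 1<<exp).

static int ring_empty(uint32_t head, uint32_t tail)
{
    return head == tail;
}

// Full when advancing the head would make it equal to the tail, which
// would otherwise be indistinguishable from empty.
static int ring_full(uint32_t head, uint32_t tail, int exp)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    return ((head + 1) & mask) == tail;
}

// Number of elements currently stored. Unsigned wraparound plus the
// mask handles the case where the head has wrapped past the end.
static uint32_t ring_count(uint32_t head, uint32_t tail, int exp)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    return (head - tail) & mask;
}
```

<p>With an exponent of 6 the storage has 64 slots, and a head of 63 with a
tail of 0 is already full at 63 elements: the one “wasted” slot.</p>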

<p>Since this is a concurrent queue it’s worth noting “ownership” of storage
elements. The consumer owns elements from the tail up to, but excluding,
the head. The producer owns everything else. Both pushing and popping
involve a “commit” step that transfers ownership of an element to the
other thread. No elements are accessed concurrently, which makes things
easy for either caller.</p>

<h3 id="queue-usage">Queue usage</h3>

<p>Pushing (to the front) and popping (from the back) are each a three-step
process:</p>

<ol>
  <li>Obtain the element index</li>
  <li>Access that element</li>
  <li>Commit the operation</li>
</ol>

<p>I’ll be using C11 atomics for my implementation, but it should be easy to
translate these into something else no matter the programming language. As
I mentioned, the queue fits in a 32-bit integer, and so it’s represented
by an <code class="language-plaintext highlighter-rouge">_Atomic uint32_t</code>. Here’s the entire interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">queue_pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">queue_pop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">);</span>

<span class="kt">int</span>  <span class="nf">queue_push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">queue_push_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">);</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">queue_pop</code> returns -1 if the queue is empty, and <code class="language-plaintext highlighter-rouge">queue_push</code> returns -1 if it is full.</p>

<p>To create a queue, initialize an atomic 32-bit integer to zero. Also
choose a size exponent and allocate some storage. Here’s a 63-element
queue of jobs:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EXP 6  // note: 2**6 == 64
</span><span class="k">struct</span> <span class="n">job</span> <span class="n">slots</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
<span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="n">q</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>Rather than a length, the queue functions accept a base-2 exponent, which
is why I’ve defined <code class="language-plaintext highlighter-rouge">EXP</code>. If you don’t like this, you can just accept a
length in your own implementation, though remember it’s constrained to
powers of two. The producer might look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">queue_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while full</span>
    <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">job_create</span><span class="p">();</span>
    <span class="n">queue_push_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is a busy-wait loop, which makes for a simple illustration but isn’t
ideal. In a <a href="/blog/2022/05/22/">real program</a> I’d have the producer run a job while it
waits for a queue slot, or just have it turn into a consumer (if this
wasn’t a single-consumer queue). Similarly, if the queue is empty, then
maybe a consumer turns into the producer. It all depends on the context.</p>

<p>The consumer might look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">queue_pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while empty</span>
    <span class="k">struct</span> <span class="n">job</span> <span class="n">job</span> <span class="o">=</span> <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="n">queue_pop_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">);</span>
    <span class="n">job_run</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In either case it’s important that neither touches the element after
committing since that transfers ownership away.</p>

<h3 id="pop-operation">Pop operation</h3>

<p>The queue is actually a pair of 16-bit integers, head and tail, stored
in the low and high halves of the 32-bit integer, respectively. So the first
thing to do is atomically load the integer, then extract these “fields.”</p>

<p>If for some reason a capacity of 32,767 is insufficient, you can trivially
upgrade your queue to an Enterprise Queue: a 64-bit integer with a
capacity of over 2 billion elements. I’m going to stick with the 32-bit
queue.</p>

<p>Starting with the pop operation since it’s simpler:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>  <span class="c1">// consider "acquire"</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the indexes are equal, the queue is empty. Otherwise return the tail
field. The <code class="language-plaintext highlighter-rouge">*q</code> is an atomic load since it’s qualified <code class="language-plaintext highlighter-rouge">_Atomic</code>. The load
might be more efficient if this were an explicit “acquire” operation,
which is what I used in some of my tests.</p>

<p>To complete the pop, atomically increment the tail index so that the
element falls out of the range of elements owned by the consumer. The tail
is the high half of the integer so add <code class="language-plaintext highlighter-rouge">0x10000</code> rather than just 1.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">queue_pop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">q</span> <span class="o">+=</span> <span class="mh">0x10000</span><span class="p">;</span>  <span class="c1">// consider "release"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s harmless if this overflows since it’s congruent with the power-of-two
storage length, and an overflow won’t affect the head index. The increment
might be more efficient if this were an explicit “release” operation,
which, again, is what I used in some of my tests.</p>
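<p>To see why the overflow is harmless, here is a small single-threaded check. It is only a sketch: <code class="language-plaintext highlighter-rouge">after_pop_commits</code> is a hypothetical helper, and a plain <code class="language-plaintext highlighter-rouge">uint32_t</code> stands in for the atomic since nothing here is concurrent.</p>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: start from queue snapshot r, apply n pop commits,
// and return the resulting state. A plain integer stands in for the atomic.
static uint32_t after_pop_commits(uint32_t r, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        r += 0x10000;  // same arithmetic as queue_pop_commit
    }
    return r;
}
```

<p>With the tail counter at its maximum, <code class="language-plaintext highlighter-rouge">after_pop_commits(0xffff0005, 1)</code> yields <code class="language-plaintext highlighter-rouge">0x00000005</code>: the carry out of the top bit is discarded, the head field is untouched, and with an exponent of 6 the tail index steps from 63 to 0 as it would anywhere else.</p>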

<h3 id="push-operation">Push operation</h3>

<p>Pushing is a little more complex. As is typical with circular buffers,
before doing anything it must ensure the result won’t ambiguously create
an empty queue.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>  <span class="c1">// consider "acquire"</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">next</span> <span class="o">=</span> <span class="p">(</span><span class="n">head</span> <span class="o">+</span> <span class="mi">1u</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&amp;</span> <span class="mh">0x8000</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// avoid overflow on commit</span>
        <span class="o">*</span><span class="n">q</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="mh">0x8000</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">next</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s important that incrementing the head field won’t overflow into the
tail field, so the push atomically clears the head’s high bit if set, giving
the increment headroom into which it can overflow.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">queue_push_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">q</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// consider "release"</span>
<span class="p">}</span>
</code></pre></div></div>
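<p>Putting the four functions together, the following single-threaded sketch fills and drains a small queue many times over. The <code class="language-plaintext highlighter-rouge">queue_selftest</code> function, the <code class="language-plaintext highlighter-rouge">int</code> slots, and the exponent of 3 are assumptions made for illustration. It exercises neither threads nor memory ordering, only the index arithmetic, including the head’s high-bit clear and the tail’s 16-bit wraparound.</p>

```c
#include <assert.h>
#include <stdint.h>

#define EXP 3  // 2**3 == 8 slots, so the capacity is 7

static int queue_pop(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    return head == tail ? -1 : tail;
}

static void queue_pop_commit(_Atomic uint32_t *q)
{
    *q += 0x10000;
}

static int queue_push(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    int next = (head + 1u) & mask;
    if (r & 0x8000) {  // avoid overflow on commit
        *q &= ~0x8000;
    }
    return next == tail ? -1 : head;
}

static void queue_push_commit(_Atomic uint32_t *q)
{
    *q += 1;
}

// Hypothetical self-test: repeatedly fill the queue until push reports
// full, then drain it until pop reports empty, checking FIFO order.
// Returns the observed capacity, or -1 on any mismatch.
static int queue_selftest(int rounds)
{
    assert(rounds > 0);
    _Atomic uint32_t q = 0;
    int slots[1<<EXP];
    int capacity = -1;
    int next = 0;  // next value to push
    int want = 0;  // next value expected from pop

    for (int n = 0; n < rounds; n++) {
        int pushed = 0;
        for (int i; (i = queue_push(&q, EXP)) >= 0; pushed++) {
            slots[i] = next++;
            queue_push_commit(&q);
        }
        for (int i; (i = queue_pop(&q, EXP)) >= 0;) {
            if (slots[i] != want++) {
                return -1;  // out of order
            }
            queue_pop_commit(&q);
        }
        if (want != next || (n > 0 && pushed != capacity)) {
            return -1;  // lost elements or inconsistent capacity
        }
        capacity = pushed;
    }
    return capacity;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">queue_selftest(10000)</code> pushes 70,000 elements in total, enough for the head counter to reach <code class="language-plaintext highlighter-rouge">0x8000</code> and be cleared, and for the tail counter to wrap past 16 bits, and it still reports a capacity of 7.</p>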

<h3 id="multiple-consumers">Multiple-consumers</h3>

<p>The single producer and single consumer required neither locks nor atomic
accesses to the storage array since the queue guaranteed that accesses at
the specified index were not concurrent. However, this is not the case
with multiple-consumers. Consumers race when popping. The loser’s access
might occur after the winner’s commit, making its access concurrent with
the producer. Both producer and consumers must account for this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">job</span> <span class="n">slots</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
</code></pre></div></div>

<p>To prepare for multiple consumers, the array now has an atomic qualifier:
one of the costs of multiple consumers. Fortunately these new atomic
accesses can use a “relaxed” ordering since there are no required ordering
constraints. Even if it wasn’t atomic, and <a href="https://lwn.net/Articles/793253/">the load was torn</a>, we’d
detect it when attempting to commit. Still, it’s simply against the rules to
have a data race, and I don’t know how else to avoid it other than dropping
into assembly.</p>

<p>The next cost is that committing can fail. Another consumer might have won
the race, which means you must start over. Here’s my multiple-consumer
interface, which I’ve uncreatively called <code class="language-plaintext highlighter-rouge">mpop</code> (“multiple-consumer
pop”). Besides a <code class="language-plaintext highlighter-rouge">_Bool</code> for indicating failure, the main change is a new
<code class="language-plaintext highlighter-rouge">save</code> parameter:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>   <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">save</span><span class="p">);</span>
<span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">);</span>
</code></pre></div></div>

<p>The caller must carry some temporary state (<code class="language-plaintext highlighter-rouge">save</code>), which is how failures
are detected, ultimately communicated by that <code class="language-plaintext highlighter-rouge">_Bool</code> return.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">job</span> <span class="n">job</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="k">do</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="n">queue_mpop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while empty</span>
        <span class="n">job</span> <span class="o">=</span> <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_mpop_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">save</span><span class="p">));</span>
    <span class="n">job_run</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s important that the consumer doesn’t attempt to use <code class="language-plaintext highlighter-rouge">job</code> until a
successful commit, since it might not be valid. As noted, that load could
be relaxed (what a mouthful):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">job</span> <span class="o">=</span> <span class="n">atomic_load_explicit</span><span class="p">(</span><span class="n">slots</span><span class="o">+</span><span class="n">i</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s the pop implementation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">save</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far it’s exactly the same, except it stores a full snapshot of the
queue state in <code class="language-plaintext highlighter-rouge">*save</code>. This is needed for a compare-and-swap (CAS) in the
commit, which checks that the queue hasn’t been modified concurrently
(i.e. by another consumer):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_compare_exchange_strong</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">,</span> <span class="n">save</span><span class="o">+</span><span class="mh">0x10000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As always with CAS, we must be wary of <a href="/blog/2014/09/02/">the ABA problem</a>. Imagine
that, between starting to pop and this CAS, the producer and another
consumer looped over the entire queue and ended up back at exactly the
same spot where we started. The queue would look like we expect, and
the commit would “succeed” despite reading a garbage value.</p>

<p>Fortunately this matches the entire 32-bit state, and so a small queue
capacity is not at a greater risk. The tail counter is always 16 bits, and
the head counter is 15 bits (due to keeping the 16th clear for overflow).
The chance of them landing at exactly the same count is low. Though if
those odds aren’t low enough, as mentioned you can always upgrade to the
64-bit Enterprise Queue with larger counters.</p>

<p>There’s a notable performance defect with this particular design. If the
producer concurrently pushes a new value, the commit will fail even if
there was no real race since only the head field changed. It would be
better if the head field was isolated from the tail field…</p>
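<p>Both kinds of failure can be reproduced without threads. In this sketch, <code class="language-plaintext highlighter-rouge">commit_survives</code> is a hypothetical helper that adds an “interference” value to the queue state between the pop and its commit, standing in for whatever another thread might have done in the meantime.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static int queue_mpop(_Atomic uint32_t *q, int exp, uint32_t *save)
{
    uint32_t r = *save = *q;
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    return head == tail ? -1 : tail;
}

static _Bool queue_mpop_commit(_Atomic uint32_t *q, uint32_t save)
{
    return atomic_compare_exchange_strong(q, &save, save+0x10000);
}

// Hypothetical single-threaded simulation. The interference is applied
// between the pop and its commit: 0x10000 mimics a rival consumer's
// commit, and 1 mimics a producer's push commit.
// Returns 1 if our commit succeeded, 0 if it failed.
static int commit_survives(uint32_t interference)
{
    _Atomic uint32_t q = 2;  // two elements pushed, none popped
    uint32_t save;
    int i = queue_mpop(&q, 6, &save);
    assert(i == 0);          // the oldest element is in slot 0
    q += interference;
    return queue_mpop_commit(&q, save);
}
```

<p>With no interference the commit succeeds. It fails for <code class="language-plaintext highlighter-rouge">0x10000</code>, which is the real race, and it also fails for <code class="language-plaintext highlighter-rouge">1</code>, which is the spurious failure caused by a push.</p>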

<h3 id="a-less-cheeky-design">A less cheeky design</h3>

<p>You might have noticed that there’s little reason to pack two 16-bit
counters into a 32-bit integer. These could just be fields in a structure:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">queue</span> <span class="p">{</span>
    <span class="k">_Atomic</span> <span class="kt">uint16_t</span> <span class="n">head</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="kt">uint16_t</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>While this entire structure can be atomically loaded just like the 32-bit
integer, C11 (and later) do not permit non-atomic accesses to these atomic
fields in an unshared copy loaded from an atomic. So I’d either use
compiler-specific built-ins for atomics — much more flexible, and what I
prefer anyway — or just load them individually:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">struct</span> <span class="n">queue</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="o">*</span><span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">q</span><span class="o">-&gt;</span><span class="n">head</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">save</span> <span class="o">=</span> <span class="n">q</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Technically with two loads this could extract a <code class="language-plaintext highlighter-rouge">head</code>/<code class="language-plaintext highlighter-rouge">tail</code> pair that
were never contemporaneous. The worst case is the queue appears empty even
if it was never actually empty.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">struct</span> <span class="n">queue</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_compare_exchange_strong</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">,</span> <span class="n">save</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the head index isn’t part of the CAS, the producer can’t interfere
with the commit. (Though there’s still certainly false sharing happening.)</p>
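<p>The same kind of single-threaded simulation shows the difference. The push functions below are my guess at how the structure-based producer side might look (they are not shown above), and <code class="language-plaintext highlighter-rouge">commit_survives_push</code> is a hypothetical helper. With separate fields there is nothing for the head to carry into, so this sketch drops the high-bit clear.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct queue {
    _Atomic uint16_t head;
    _Atomic uint16_t tail;
};

// Assumed producer side for the structure-based design.
static int queue_push(struct queue *q, int exp)
{
    int mask = (1u << exp) - 1;
    int head = q->head & mask;
    int tail = q->tail & mask;
    int next = (head + 1) & mask;
    return next == tail ? -1 : head;
}

static void queue_push_commit(struct queue *q)
{
    q->head++;  // wraps within its own 16 bits
}

static int queue_mpop(struct queue *q, int exp, uint16_t *save)
{
    int mask = (1u << exp) - 1;
    int head = q->head & mask;
    int tail = (*save = q->tail) & mask;
    return head == tail ? -1 : tail;
}

static _Bool queue_mpop_commit(struct queue *q, uint16_t save)
{
    return atomic_compare_exchange_strong(&q->tail, &save, save+1);
}

// Hypothetical simulation: a push commit lands between a consumer's mpop
// and its commit. Returns 1 if the consumer's commit still succeeds.
static int commit_survives_push(void)
{
    struct queue q = {0};
    int exp = 6;
    uint16_t save;

    int i = queue_push(&q, exp);     // first element, so the queue is non-empty
    assert(i == 0);
    queue_push_commit(&q);

    i = queue_mpop(&q, exp, &save);  // consumer begins a pop
    assert(i == 0);

    i = queue_push(&q, exp);         // producer pushes in the meantime
    assert(i == 1);
    queue_push_commit(&q);

    return queue_mpop_commit(&q, save);
}
```

<p>Here the commit succeeds despite the intervening push, since the CAS only compares the tail.</p>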

<h3 id="real-implementation-and-tests">Real implementation and tests</h3>

<p>If you want to try it out, especially with my tests: <a href="https://github.com/skeeto/scratch/blob/master/misc/queue.c"><strong>queue.c</strong></a>.
It has both single-consumer and multiple-consumer queues, and supports at
least:</p>

<ul>
  <li>atomics: C11, GNU, MSC</li>
  <li>threads: pthreads, win32</li>
  <li>compilers: GCC, Clang, MSC</li>
  <li>hosts: Linux, Windows, BSD</li>
</ul>

<p>I wanted to test across a variety of implementations, especially under
Thread Sanitizer (TSan), hence the broad support. On a similar note, I also implemented a
concurrent queue shared between C and Go: <a href="https://github.com/skeeto/scratch/blob/master/misc/queue.go"><strong>queue.go</strong></a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Luhn algorithm using SWAR and SIMD</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/04/30/"/>
    <id>urn:uuid:2bb8fbd6-4197-4799-8258-861d316a7086</id>
    <updated>2022-04-30T17:53:05Z</updated>
    <category term="c"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Ever been so successful that credit card processing was your bottleneck?
Perhaps you’ve wondered, “If only I could compute check digits three times
faster using the same hardware!” Me neither. But if that ever happens
someday, then this article is for you. I will show how to compute the
<a href="https://en.wikipedia.org/wiki/Luhn_algorithm">Luhn algorithm</a> in parallel using <em>SIMD within a register</em>, or
SWAR.</p>

<p>If you want to skip ahead, here’s the full source, tests, and benchmark:
<a href="https://github.com/skeeto/scratch/blob/master/misc/luhn.c"><code class="language-plaintext highlighter-rouge">luhn.c</code></a></p>

<p>The Luhn algorithm isn’t just for credit card numbers, but they do make a
nice target for a SWAR approach. The major payment processors use <a href="https://www.paypalobjects.com/en_GB/vhelp/paypalmanager_help/credit_card_numbers.htm">16
digit numbers</a> — i.e. 16 ASCII bytes — and typical machines today have
8-byte registers, so the input fits into two machine registers. In this
context, the algorithm works like so:</p>

<ol>
  <li>
    <p>Consider the number’s digits as an array, and double every other digit
starting with the first. For example, 6543 becomes 12, 5, 8, 3.</p>
  </li>
  <li>
    <p>Sum individual digits in each element. The example becomes 3 (i.e.
1+2), 5, 8, 3.</p>
  </li>
  <li>
    <p>Sum the array mod 10. Valid inputs sum to zero. The example sums to 9.</p>
  </li>
</ol>
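<p>As a baseline, the three steps can be written as a plain scalar loop. This is only a reference sketch: <code class="language-plaintext highlighter-rouge">luhn_ref</code> is a hypothetical name, and it assumes the input is all ASCII digits with an even length, so that doubling starts with the first digit as described.</p>

```c
#include <assert.h>

// Scalar reference for the three steps. Assumes s holds len ASCII digits
// and that len is even, so doubling starts with the first digit.
static int luhn_ref(const char *s, int len)
{
    assert(len >= 0 && len % 2 == 0);
    int sum = 0;
    for (int i = 0; i < len; i++) {
        int d = s[i] - '0';
        if (i % 2 == 0) {
            d *= 2;           // step 1: double every other digit
            d = d/10 + d%10;  // step 2: e.g. 12 becomes 1+2
        }
        sum += d;
    }
    return sum % 10;          // step 3: valid inputs give zero
}
```

<p>For the example, <code class="language-plaintext highlighter-rouge">luhn_ref("6543", 4)</code> returns 9, and the common test number <code class="language-plaintext highlighter-rouge">4111111111111111</code> returns 0.</p>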

<p>I will implement this algorithm in C with this prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>It assumes the input is 16 bytes and only contains digits, and it will
return the Luhn sum. Callers either validate a number by comparing the
result to zero, or use it to compute a check digit when generating a
number. (Read: You could use SWAR to rapidly generate valid numbers.)</p>

<p>The plan is to process the 16-digit number in two halves, and so first
load the halves into 64-bit registers, which I’m calling <code class="language-plaintext highlighter-rouge">hi</code> and <code class="language-plaintext highlighter-rouge">lo</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">hi</span> <span class="o">=</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">0</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">1</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">2</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">3</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">4</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">5</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">6</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">7</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">lo</span> <span class="o">=</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">8</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">9</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">10</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">11</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">13</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">14</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">15</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">;</span>
</code></pre></div></div>

<p>This looks complicated and possibly expensive, but it’s really just an
idiom for loading a little endian 64-bit integer from a buffer. Breaking
it down:</p>

<ul>
  <li>
    <p>The input, <code class="language-plaintext highlighter-rouge">*s</code>, is <code class="language-plaintext highlighter-rouge">char</code>, which may be signed on some architectures. I
chose this type since it’s the natural type for strings. However, I do
not want sign extension, so I mask the low byte of the possibly-signed
result by ANDing with 255. It’s as though <code class="language-plaintext highlighter-rouge">*s</code> was <code class="language-plaintext highlighter-rouge">unsigned char</code>.</p>
  </li>
  <li>
    <p>The shifts assemble the 64-bit result in little endian byte order
<a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">regardless of the host machine byte order</a>. In other words, this
will produce correct results even on big endian hosts.</p>
  </li>
  <li>
    <p>I chose little endian since it’s the natural byte order for all the
architectures I care about. Big endian hosts may pay a cost on this load
(byte swap instruction, etc.). The rest of the function could just as
easily be computed over a big endian load if I was primarily targeting a
big endian machine instead.</p>
  </li>
  <li>
    <p>I could have used <code class="language-plaintext highlighter-rouge">unsigned long long</code> (i.e. <em>at least</em> 64 bits) since
no part of this function requires <em>exactly</em> 64 bits. I chose <code class="language-plaintext highlighter-rouge">uint64_t</code>
since it’s succinct, and in practice, every implementation supporting
<code class="language-plaintext highlighter-rouge">long long</code> also defines <code class="language-plaintext highlighter-rouge">uint64_t</code>.</p>
  </li>
</ul>
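
<p>The points above can be condensed into a small helper. This is only a
sketch: <code class="language-plaintext highlighter-rouge">load_le64</code> is a hypothetical name, not part of the original function,
and whether a compiler collapses the loop into a single load the way it does
for the unrolled expression is something to verify on your own toolchain.</p>

```c
#include <stdint.h>

// Hypothetical helper (not from the original code): assemble a little
// endian 64-bit integer from 8 bytes, whatever the host byte order.
uint64_t load_le64(const char *s)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        // Mask to 0..255 first so a signed char is not sign-extended.
        r |= (uint64_t)(s[i] & 255) << (8 * i);
    }
    return r;
}
```

<p>With it, the two loads would read <code class="language-plaintext highlighter-rouge">hi = load_le64(s)</code> and
<code class="language-plaintext highlighter-rouge">lo = load_le64(s + 8)</code>.</p>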

<p>Both GCC and Clang figure this all out and produce perfect code. On
x86-64, just one instruction for each statement:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">0</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">rdx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Or, more impressively, loading both using a <em>single instruction</em> on ARM64:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">ldp</span>  <span class="nv">x0</span><span class="p">,</span> <span class="nv">x1</span><span class="p">,</span> <span class="p">[</span><span class="nv">x0</span><span class="p">]</span>
</code></pre></div></div>

<p>The next step is to decode ASCII into numeric values. This is <a href="https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/">trivial and
common</a> in SWAR, and only requires subtracting <code class="language-plaintext highlighter-rouge">'0'</code> (<code class="language-plaintext highlighter-rouge">0x30</code>). So long
as there is no overflow, this can be done lane-wise.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">-=</span> <span class="mh">0x3030303030303030</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">-=</span> <span class="mh">0x3030303030303030</span><span class="p">;</span>
</code></pre></div></div>

<p>Each byte of the register now contains values in 0–9. Next, double every
other digit. Multiplication in SWAR is not easy, but doubling just means
adding the odd lanes to themselves. I can mask out the lanes that are not
doubled. Regarding the mask, recall that the least significant byte is the
first byte (little endian).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&amp;</span> <span class="mh">0x00ff00ff00ff00ff</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">+=</span> <span class="n">lo</span> <span class="o">&amp;</span> <span class="mh">0x00ff00ff00ff00ff</span><span class="p">;</span>
</code></pre></div></div>
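
<p>To make the lane orientation concrete, here is the same step as a
standalone sketch. The helper name <code class="language-plaintext highlighter-rouge">double_masked_lanes</code> is made up for
illustration.</p>

```c
#include <stdint.h>

// Hypothetical helper for illustration: double the lanes selected by the
// mask (bytes 0, 2, 4, 6, counting from the least significant byte).
uint64_t double_masked_lanes(uint64_t x)
{
    return x + (x & 0x00ff00ff00ff00ff);
}
```

<p>For the decoded digits 1 through 8, <code class="language-plaintext highlighter-rouge">0x0807060504030201</code>, this yields
<code class="language-plaintext highlighter-rouge">0x080e060a04060202</code>: the lanes holding 1, 3, 5, and 7 become 2, 6, 10,
and 14, and the rest are untouched.</p>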

<p>Each byte of the register now contains values in 0–18. Now for the tricky
problem of folding the tens place into the ones place. Unlike 8 or 16, 10
is not a particularly convenient base for computers, especially since SWAR
lacks lane-wise division or modulo. Perhaps a lane-wise <a href="https://en.wikipedia.org/wiki/Binary-coded_decimal">binary-coded
decimal</a> could solve this. However, I have a better trick up my
sleeve.</p>

<p>Consider that the tens place is either 0 or 1. In other words, we really
only care if the value in the lane is greater than 9. If I add 6 to each
lane, the 5th bit (value 16) will definitely be set in any lanes that were
previously at least 10. I can use that bit as the tens place.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="p">(</span><span class="n">hi</span> <span class="o">+</span> <span class="mh">0x0006000600060006</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="mi">4</span> <span class="o">&amp;</span> <span class="mh">0x0001000100010001</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">+=</span> <span class="p">(</span><span class="n">lo</span> <span class="o">+</span> <span class="mh">0x0006000600060006</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="mi">4</span> <span class="o">&amp;</span> <span class="mh">0x0001000100010001</span><span class="p">;</span>
</code></pre></div></div>

<p>This code adds 6 to the doubled lanes, shifts the 5th bit to the least
significant position in the lane, masks for just that bit, and adds it
lane-wise to the total. Only applying this to doubled lanes is a style
decision, and I could have applied it to all lanes for free.</p>

<p>The astute might notice I’ve strayed from the stated algorithm. A lane
that was holding, say, 12 now holds 13 rather than 3. Since the final
result of the algorithm is modulo 10, leaving the tens place alone is
harmless, so this is fine.</p>
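
<p>That claim is easy to check exhaustively, since a doubled digit has only
ten possible values. A minimal sketch, with hypothetical function names, that
models a single lane using plain integers:</p>

```c
// Digit sum of a doubled digit, as the textbook Luhn algorithm computes it.
int digit_sum(int v)       // v in 0..18
{
    return v / 10 + v % 10;
}

// The SWAR shortcut applied to a single lane: add the carry bit produced
// by v + 6 instead of reducing v to a single digit.
int swar_fold(int v)       // v in 0..18
{
    return v + ((v + 6) >> 4 & 1);
}

// Returns 1 if both agree modulo 10 for every possible doubled digit.
int tens_trick_holds(void)
{
    for (int d = 0; d <= 9; d++) {
        int v = 2 * d;
        if (swar_fold(v) % 10 != digit_sum(v) % 10) {
            return 0;
        }
    }
    return 1;
}
```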

<p>At this point each lane contains values in 0–19. Now that the tens
processing is done, I can combine the halves into one register with a
lane-wise sum:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">lo</span><span class="p">;</span>
</code></pre></div></div>

<p>Each lane contains values in 0–38. I would have preferred to do this
sooner, but that would have complicated tens place handling. Even if I had
rotated the doubled lanes in one register to even out the sums, some lanes
may still have had a 2 in the tens place.</p>

<p>The final step is a horizontal sum reduction using the typical SWAR
approach. Add the top half of the register to the bottom half, then the
top half of what’s left to the bottom half, etc.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
</code></pre></div></div>

<p>Before the sum I said each lane was 0–38, so couldn’t this sum be as high
as 304 (8x38)? It would overflow the lane, giving an incorrect result.
Fortunately the actual range is 0–18 for normal lanes and 0–38 for doubled
lanes. That’s a maximum of 224 (4x18 + 4x38), which fits in the result lane without
overflow. Whew! I’ve been tracking the range all along to guard against
overflow like this.</p>
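
<p>The bound can be exercised directly. In this sketch, <code class="language-plaintext highlighter-rouge">hsum_lanes</code> is a
hypothetical wrapper around the three shift-and-add steps plus the final mask,
fed with hand-built worst-case registers.</p>

```c
#include <stdint.h>

// Hypothetical helper: sum the eight byte lanes of x into the low lane.
// The result is only meaningful while the total is at most 255.
int hsum_lanes(uint64_t x)
{
    x += x >> 32;
    x += x >> 16;
    x += x >>  8;
    return (int)(x & 255);
}
```

<p>The real worst case, <code class="language-plaintext highlighter-rouge">0x1226122612261226</code> (18 in normal lanes, 38 in
doubled lanes), sums to 224 as expected. The hypothetical all-38 register
<code class="language-plaintext highlighter-rouge">0x2626262626262626</code> wraps around and returns 48 instead of 304.</p>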

<p>Finally mask the result lane and return it modulo 10:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="p">(</span><span class="n">hi</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<p>On my machine, SWAR is around 3x faster than a straightforward
digit-by-digit implementation.</p>
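
<p>For reference, a digit-by-digit version might look like the sketch below.
It is not necessarily the exact baseline I benchmarked, and the name
<code class="language-plaintext highlighter-rouge">luhn_scalar</code> is made up here. It assumes exactly 16 ASCII digits and doubles
the same positions as the SWAR version.</p>

```c
// Baseline sketch: returns the Luhn sum modulo 10 of exactly 16 ASCII
// digits, doubling positions 0, 2, 4, ... like the SWAR version.
int luhn_scalar(const char *s)
{
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        int d = s[i] - '0';
        if (i % 2 == 0) {
            d *= 2;
            if (d > 9) {
                d -= 9;   // same as adding the two digits of d
            }
        }
        sum += d;
    }
    return sum % 10;
}
```

<p>A result of zero means the number passes the check, as with the common
test number <code class="language-plaintext highlighter-rouge">4111111111111111</code>.</p>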

<h3 id="usage-examples">Usage examples</h3>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">is_valid</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">luhn</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">random_credit_card</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"%015llu0"</span><span class="p">,</span> <span class="n">rand64</span><span class="p">()</span><span class="o">%</span><span class="mi">1000000000000000</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">15</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">+</span> <span class="p">(</span><span class="mi">10</span> <span class="o">-</span> <span class="n">luhn</span><span class="p">(</span><span class="n">s</span><span class="p">))</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="simd">SIMD</h3>

<p>Conveniently, all the SWAR operations translate directly into SSE2
instructions. If you understand the SWAR version, then this is easy to
follow:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__m128i</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_loadu_si128</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">);</span>

    <span class="c1">// decode ASCII</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sub_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mh">0x30</span><span class="p">));</span>

    <span class="c1">// double every other digit</span>
    <span class="n">__m128i</span> <span class="n">m</span> <span class="o">=</span> <span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="mh">0x00ff</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">m</span><span class="p">));</span>

    <span class="c1">// extract and add tens digit</span>
    <span class="n">__m128i</span> <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="mh">0x0006</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_srai_epi32</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>

    <span class="c1">// horizontal sum</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sad_epu8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_set1_epi32</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi32</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_shuffle_epi32</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">_mm_cvtsi128_si32</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On my machine, the SIMD version is around another 3x increase over SWAR,
and so nearly an order of magnitude faster than a digit-by-digit
implementation.</p>

<p><em>Update</em>: Const-me on Hacker News <a href="https://news.ycombinator.com/item?id=31320853">suggests a better option</a> for
handling the tens digit in the function above, shaving off 7% of the
function’s run time on my machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// if (digit &gt; 9) digit -= 9</span>
    <span class="n">__m128i</span> <span class="n">nine</span> <span class="o">=</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mi">9</span><span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">gt</span> <span class="o">=</span> <span class="n">_mm_cmpgt_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">nine</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sub_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">gt</span><span class="p">,</span> <span class="n">nine</span><span class="p">));</span>
</code></pre></div></div>

<p><em>Update</em>: u/aqrit on reddit has come up with a more optimized SSE2
solution, 12% faster than mine on my machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__m128i</span> <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_loadu_si128</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">m</span> <span class="o">=</span> <span class="n">_mm_cmpgt_epi8</span><span class="p">(</span><span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="sc">'5'</span><span class="p">),</span> <span class="n">v</span><span class="p">);</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_slli_epi16</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">8</span><span class="p">));</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span>  <span class="c1">// subtract 1 if less than 5</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_sad_epu8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_setzero_si128</span><span class="p">());</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi32</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_shuffle_epi32</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">_mm_cvtsi128_si32</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">-</span> <span class="mi">4</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
    <span class="c1">// (('0' * 24) - 8) % 10 == 4</span>
<span class="p">}</span>
</code></pre></div></div>
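
<p>The final correction in that version is terse, so here is my reading of
it as a scalar sketch with hypothetical names. Each doubled lane ends up
holding its raw ASCII code twice, minus 1 for digits below 5, and that always
differs from the textbook value by <code class="language-plaintext highlighter-rouge">2*'0' - 1</code> modulo 10.</p>

```c
// Scalar model of a doubled digit in the function above: its ASCII code
// is counted twice, minus 1 if the character is below '5'.
int doubled_lane_model(int d)      // d in 0..9
{
    int c = '0' + d;
    return c + c - (c < '5');
}

// Textbook Luhn contribution of a doubled digit.
int doubled_digit(int d)           // d in 0..9
{
    int v = 2 * d;
    return v > 9 ? v - 9 : v;
}

// Returns 1 if every doubled lane exceeds the textbook value by
// 2*'0' - 1 modulo 10, and the total excess is 4 modulo 10.
int correction_holds(void)
{
    for (int d = 0; d <= 9; d++) {
        int excess = doubled_lane_model(d) - doubled_digit(d) - (2*'0' - 1);
        if (excess % 10 != 0) {
            return 0;
        }
    }
    // 8 doubled lanes plus 8 plain ASCII digits: ('0'*24 - 8) % 10
    return (8*(2*'0' - 1) + 8*'0') % 10 == 4;
}
```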

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A flexible, lightweight, spin-lock barrier</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/03/13/"/>
    <id>urn:uuid:5a72d27a-60f4-4b52-a4c2-f1c3b72e6c85</id>
    <updated>2022-03-13T23:55:08Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=30671979">on Hacker News</a>.</em></p>

<p>The other day I wanted to try the famous <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">memory reordering experiment</a>
for myself. It’s the double-slit experiment of concurrency, where a
program can observe an <a href="https://research.swtch.com/hwmm">“impossible” result</a> on common hardware, as
though a thread had time-traveled. While getting thread timing as tight as
possible, I designed a possibly-novel thread barrier. It’s purely
spin-locked, the entire footprint is a zero-initialized integer, it
automatically resets, it can be used across processes, and the entire
implementation is just three to four lines of code.</p>

<!--more-->

<p>Here’s the entire barrier implementation for two threads in C11.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Spin-lock barrier for two threads. Initialize *barrier to zero.</span>
<span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">v</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="mi">2</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or in Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">BarrierWait</span><span class="p">(</span><span class="n">barrier</span> <span class="o">*</span><span class="kt">uint32</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">v</span> <span class="o">:=</span> <span class="n">atomic</span><span class="o">.</span><span class="n">AddUint32</span><span class="p">(</span><span class="n">barrier</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">v</span><span class="o">&amp;</span><span class="m">1</span> <span class="o">==</span> <span class="m">1</span> <span class="p">{</span>
        <span class="n">v</span> <span class="o">&amp;=</span> <span class="m">2</span>
        <span class="k">for</span> <span class="n">atomic</span><span class="o">.</span><span class="n">LoadUint32</span><span class="p">(</span><span class="n">barrier</span><span class="p">)</span><span class="o">&amp;</span><span class="m">2</span> <span class="o">==</span> <span class="n">v</span> <span class="p">{</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Even more, these two implementations are compatible with each other. C
threads and Go goroutines can synchronize on a common barrier using these
functions. Also note how it only uses two bits.</p>

<p>When I was done with my experiment, I did a quick search online for other
spin-lock barriers to see if anyone came up with the same idea. I found a
couple of <a href="https://web.archive.org/web/20151109230817/https://stackoverflow.com/questions/33598686/spinning-thread-barrier-using-atomic-builtins">subtly-incorrect</a> spin-lock barriers, and some
straightforward barrier constructions using a mutex spin-lock.</p>

<p>Before diving into how this works, and how to generalize it, let’s discuss
the circumstances that led to its design.</p>

<h3 id="experiment">Experiment</h3>

<p>Here’s the setup for the memory reordering experiment, where <code class="language-plaintext highlighter-rouge">w0</code> and <code class="language-plaintext highlighter-rouge">w1</code>
are initialized to zero.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>thread#1    thread#2
w0 = 1      w1 = 1
r1 = w1     r0 = w0
</code></pre></div></div>

<p>Considering all the possible orderings, it would seem that at least one of
<code class="language-plaintext highlighter-rouge">r0</code> or <code class="language-plaintext highlighter-rouge">r1</code> is 1. There seems to be no ordering where <code class="language-plaintext highlighter-rouge">r0</code> and <code class="language-plaintext highlighter-rouge">r1</code> could
both be 0. However, if raced precisely, this is a frequent or possibly
even majority occurrence on common hardware, including x86 and ARM.</p>

<p>How to go about running this experiment? These are concurrent loads and
stores, so it’s tempting to use <code class="language-plaintext highlighter-rouge">volatile</code> for <code class="language-plaintext highlighter-rouge">w0</code> and <code class="language-plaintext highlighter-rouge">w1</code>. However,
this would constitute a data race — undefined behavior in at least C and
C++ — and so we couldn’t really reason much about the results, at least
not without first verifying the compiler’s assembly. These are variables
in a high-level language, not architecture-level stores/loads, even with
<code class="language-plaintext highlighter-rouge">volatile</code>.</p>

<p>So my first idea was to use a bit of inline assembly for all accesses that
would otherwise be data races. x86-64:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r1</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"movl  $1, %1</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"movl  %2, %0</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="s">"=m"</span><span class="p">(</span><span class="o">*</span><span class="n">w0</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"m"</span><span class="p">(</span><span class="o">*</span><span class="n">w1</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>ARM64 (to try on my Raspberry Pi):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"str  %w0, %1</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"ldr  %w0, %2</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="s">"+r"</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="s">"=m"</span><span class="p">(</span><span class="o">*</span><span class="n">w0</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"m"</span><span class="p">(</span><span class="o">*</span><span class="n">w1</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is from the point-of-view of thread#1, but I can swap the arguments
for thread#2. I’m expecting this to be inlined, and encouraging it with
<code class="language-plaintext highlighter-rouge">static</code>.</p>

<p>Alternatively, I could use C11 atomics with a relaxed memory order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="k">_Atomic</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">atomic_store_explicit</span><span class="p">(</span><span class="n">w0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">atomic_load_explicit</span><span class="p">(</span><span class="n">w1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since this is a <em>race</em> and I want both threads to run their two experiment
instructions as simultaneously as possible, it would be wise to use some
sort of <em>starting barrier</em>… exactly the purpose of a thread barrier! It
will hold the threads back until they’re both ready.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">w0</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">r0</span><span class="p">,</span> <span class="n">r1</span><span class="p">;</span>

<span class="c1">// thread#1                   // thread#2</span>
<span class="n">w0</span> <span class="o">=</span> <span class="n">w1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">BARRIER</span><span class="p">;</span>                      <span class="n">BARRIER</span><span class="p">;</span>
<span class="n">r1</span> <span class="o">=</span> <span class="n">experiment</span><span class="p">(</span><span class="o">&amp;</span><span class="n">w0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">w1</span><span class="p">);</span>    <span class="n">r0</span> <span class="o">=</span> <span class="n">experiment</span><span class="p">(</span><span class="o">&amp;</span><span class="n">w1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">w0</span><span class="p">);</span>
<span class="n">BARRIER</span><span class="p">;</span>                      <span class="n">BARRIER</span><span class="p">;</span>

<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r0</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">r1</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"impossible!"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The second thread goes straight into the barrier, but the first thread
does a little more work to initialize the experiment and a little more at
the end to check the result. The second barrier ensures they’re both done
before checking.</p>

<p>Running this only once isn’t so useful, so each thread loops a few million
times, hence the re-initialization in thread#1. The barriers keep them in
lockstep.</p>

<h3 id="barrier-selection">Barrier selection</h3>

<p>On my first attempt, I made the obvious decision for the barrier: I used
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_barrier_wait.html"><code class="language-plaintext highlighter-rouge">pthread_barrier_t</code></a>. I was already using pthreads for spawning the
extra thread, including <a href="/blog/2020/05/15/">on Windows</a>, so this was convenient.</p>

<p>However, my initial results were disappointing. I only observed an
“impossible” result around one in a million trials. With some debugging I
determined that the pthreads barrier was just too damn slow, throwing off
the timing. This was especially true with winpthreads, bundled with
Mingw-w64, which in addition to the per-barrier mutex, grabs a <em>global</em>
lock <em>twice</em> per wait to manage the barrier’s reference counter.</p>

<p>All pthreads implementations I used were quick to yield to the system
scheduler. The first thread to arrive at the barrier would go to sleep,
the second thread would wake it up, and it was rare they’d actually race
on the experiment. This is perfectly reasonable for a pthreads barrier
designed for the general case, but I really needed a <em>spin-lock barrier</em>.
That is, the first thread to arrive spins in a loop until the second
thread arrives, and it never interacts with the scheduler. This happens so
frequently and quickly that it should only spin for a few iterations.</p>

<h3 id="barrier-design">Barrier design</h3>

<p>Spin locking means atomics. By default, atomics have sequentially
consistent ordering and will provide the necessary synchronization for the
non-atomic experiment variables. Stores (e.g. to <code class="language-plaintext highlighter-rouge">w0</code>, <code class="language-plaintext highlighter-rouge">w1</code>) made before
the barrier will be visible to all other threads upon passing through the
barrier. In other words, the initialization will propagate before either
thread exits the first barrier, and results propagate before either thread
exits the second barrier.</p>

<p>I know statically that there are only two threads, simplifying the
implementation. The plan: When threads arrive, they atomically increment a
shared variable to indicate such. The first to arrive will see an odd
number, telling it to atomically read the variable in a loop until the
other thread changes it to an even number.</p>

<p>At first, with just two threads, it might seem like a single bit would
suffice. If the bit is set, the other thread hasn’t arrived. If clear,
both threads have arrived.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">broken_wait1</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Or to avoid an extra load, use the result directly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">broken_wait2</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">++*</span><span class="n">barrier</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Neither of these works correctly, and the other mutex-free barriers I found
all have the same defect. Consider the broader picture: Between atomic
loads in the first thread’s spin-lock loop, suppose the second thread
arrives, passes through the barrier, does its work, hits the next barrier,
and increments the counter. Both threads see an odd counter simultaneously
and deadlock. No good.</p>
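<p>That interleaving can be replayed deterministically. The following is only
a rough single-threaded Python model of <code class="language-plaintext highlighter-rouge">broken_wait2</code>, not the C code
itself: the <code class="language-plaintext highlighter-rouge">Counter</code> class is a hypothetical stand-in for the shared
atomic, and the step order is one unlucky schedule written out by hand.</p>

```python
class Counter:
    """Hypothetical stand-in for the shared _Atomic unsigned."""

    def __init__(self):
        self.value = 0

    def increment(self):
        # Models ++*barrier: one atomic step returning the new value.
        self.value += 1
        return self.value

    def load(self):
        # Models one atomic load inside the spin loop.
        return self.value


barrier = Counter()

# Thread A arrives first at barrier #1, sees an odd value, starts spinning.
a_spins = bool(barrier.increment() & 1)    # counter is 1

# Thread B arrives at barrier #1, sees an even value, passes through.
b_spins = bool(barrier.increment() & 1)    # counter is 2

# Before A's next load, B finishes its work and arrives at barrier #2.
b_spins = bool(barrier.increment() & 1)    # counter is 3

# A finally loads the counter again. It is odd, so A keeps spinning in
# barrier #1 while B spins in barrier #2, and nobody is left to increment.
a_spins = a_spins and bool(barrier.load() & 1)

print(barrier.load(), a_spins, b_spins)    # 3 True True
```

<p>Both flags end up true with the counter stuck at 3, which is the deadlock
described above. A never got a chance to observe the even value 2.</p>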

<p>To fix this, the wait function must also track the <em>phase</em>. The first
barrier is the first phase, the second barrier is the second phase, etc.
Conveniently <strong>the rest of the integer acts like a phase counter</strong>!
Writing this out more explicitly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">observed</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="n">thread_count</span> <span class="o">=</span> <span class="n">observed</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">thread_count</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// not last arrival, watch for phase change</span>
        <span class="kt">unsigned</span> <span class="n">init_phase</span> <span class="o">=</span> <span class="n">observed</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
            <span class="kt">unsigned</span> <span class="n">current_phase</span> <span class="o">=</span> <span class="o">*</span><span class="n">barrier</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">current_phase</span> <span class="o">!=</span> <span class="n">init_phase</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key: When the last thread arrives, it overflows the thread counter to
zero and increments the phase counter in one operation.</p>
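<p>As a sanity check, the same unlucky schedule can be replayed against the
phase-tracking logic. This is a sketch in Python rather than C, and the
helper names <code class="language-plaintext highlighter-rouge">arrive</code> and <code class="language-plaintext highlighter-rouge">may_leave</code> are made up for the
illustration. Each call models one atomic step.</p>

```python
def arrive(counter):
    """Model of ++*barrier: returns (new counter, waits?, phase at arrival)."""
    observed = counter + 1
    return observed, (observed & 1) != 0, observed >> 1


def may_leave(counter, init_phase):
    """Spin-loop test: leave once the phase differs from the one at arrival."""
    return (counter >> 1) != init_phase


counter = 0
counter, a_waits, a_phase = arrive(counter)   # A at barrier #1: counter 1
counter, b_waits, _ = arrive(counter)         # B at barrier #1: counter 2

# B races ahead to barrier #2 before A loads again: counter 3.
counter, b_waits, b_phase = arrive(counter)

a_leaves = may_leave(counter, a_phase)        # phase 1 != 0, so A gets out
b_leaves = may_leave(counter, b_phase)        # phase 1 == 1, so B waits for A

# A arrives at barrier #2 as the last thread: counter 4, which releases B.
counter, a_waits, _ = arrive(counter)
b_leaves_after = may_leave(counter, b_phase)

print(a_leaves, b_leaves, b_leaves_after)     # True False True
```

<p>A escapes the first barrier even though it missed the even value, and B is
held at the second barrier only until A shows up.</p>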

<p>By the way, I’m using <code class="language-plaintext highlighter-rouge">unsigned</code> since it may eventually overflow, and
even <code class="language-plaintext highlighter-rouge">_Atomic int</code> overflow is undefined for the <code class="language-plaintext highlighter-rouge">++</code> operator. However,
if you use <code class="language-plaintext highlighter-rouge">atomic_fetch_add</code> or C++ <code class="language-plaintext highlighter-rouge">std::atomic</code> then overflow is
defined and you can use <code class="language-plaintext highlighter-rouge">int</code>.</p>

<p>Threads can never be more than one phase apart by definition, so only one
bit is needed for the phase counter, making this effectively a two-phase,
two-bit barrier. In my final implementation, rather than shift (<code class="language-plaintext highlighter-rouge">&gt;&gt;</code>), I
mask (<code class="language-plaintext highlighter-rouge">&amp;</code>) the phase bit with 2.</p>
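<p>The claim that one phase bit is enough can be checked mechanically. The
sketch below assumes a 32-bit counter and uses hypothetical helper names. It
compares the shift test with the mask test over every value a waiting thread
could possibly load, including the wraparound cases.</p>

```python
MASK32 = 0xFFFFFFFF   # assumption: the counter is a 32-bit unsigned


def phase_changed_shift(current, observed):
    return (current >> 1) != (observed >> 1)


def phase_changed_mask(current, observed):
    return (current & 2) != (observed & 2)


def agree(observed):
    # A waiting thread can only ever load one of three values: its own
    # increment, the other thread's arrival, or the other thread's arrival
    # at the following barrier.
    return all(
        phase_changed_shift((observed + k) & MASK32, observed)
        == phase_changed_mask((observed + k) & MASK32, observed)
        for k in range(3)
    )


# Waiters always observe an odd value; check a spread including wraparound.
samples = [1, 3, 5, 7, 0x7FFFFFFF, 0xFFFFFFFD, 0xFFFFFFFF]
print(all(agree(v) for v in samples))   # True
```

<p>The two tests only agree because the other thread can be at most one
barrier ahead. Over arbitrary pairs of values the mask would not be a
substitute for the shift.</p>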

<p>With this spin-lock barrier, the experiment observes <code class="language-plaintext highlighter-rouge">r0 = r1 = 0</code> in ~10%
of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.</p>

<h3 id="generalizing-to-more-threads">Generalizing to more threads</h3>

<p>Two threads required two bits. This generalizes to <code class="language-plaintext highlighter-rouge">log2(n)+1</code> bits for
<code class="language-plaintext highlighter-rouge">n</code> threads, where <code class="language-plaintext highlighter-rouge">n</code> is a power of two. You may have already figured out
how to support more threads: spend more bits on the thread counter.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Spin-lock barrier for n threads, where n is a power of two.</span>
<span class="c1">// Initialize *barrier to zero.</span>
<span class="kt">void</span> <span class="nf">barrier_waitn</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">v</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="n">n</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="n">n</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
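<p>To see the bit layout at work, here is a small single-threaded Python
model of the same arithmetic for <code class="language-plaintext highlighter-rouge">n = 4</code>. It is a sketch with made-up
helper names, assuming a 32-bit counter, and it only replays arrivals in
order rather than running real threads.</p>

```python
def arrive(counter, n):
    """Model of ++*barrier for n threads (n a power of two).

    Returns (new counter, must_wait, phase bit seen at arrival).
    """
    v = (counter + 1) & 0xFFFFFFFF
    return v, (v & (n - 1)) != 0, v & n


def released(counter, n, phase_bit):
    """Spin-loop exit test: the phase bit no longer matches."""
    return (counter & n) != phase_bit


n = 4
counter = 0
log = []
for round_number in range(2):
    waiters = []
    for thread in range(n):
        counter, must_wait, phase_bit = arrive(counter, n)
        if must_wait:
            waiters.append(phase_bit)
    # After the last arrival every earlier waiter sees a flipped phase bit.
    log.append((len(waiters), all(released(counter, n, p) for p in waiters)))

print(log)               # [(3, True), (3, True)]
print(n.bit_length())    # 3 bits: log2(4) + 1
```

<p>The low two bits count arrivals and the third bit is the phase. The fourth
arrival carries out of the thread counter and flips it.</p>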

<p>Note: <strong>It never makes sense for <code class="language-plaintext highlighter-rouge">n</code> to exceed the logical core count!</strong>
If it does, then at least one thread must not be actively running. The
spin-lock ensures it does not get scheduled promptly, and the barrier will
waste lots of resources doing nothing in the meantime.</p>

<p>If the barrier is used little enough that you won’t overflow the overall
barrier integer — maybe just use a <code class="language-plaintext highlighter-rouge">uint64_t</code> — an implementation could
support arbitrary thread counts with the same principle using modular
division instead of the <code class="language-plaintext highlighter-rouge">&amp;</code> operator. The denominator is ideally a
compile-time constant in order to avoid paying for division in the
spin-lock loop.</p>
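<p>A sketch of that variant, again as a single-threaded Python model with
hypothetical helper names. The remainder plays the role of the thread counter
and the quotient plays the role of the phase. Python integers never overflow,
so the model quietly relies on the same "used little enough" assumption.</p>

```python
def arrive(counter, n):
    """Model of the increment for any thread count n >= 1.

    Returns (new counter, must_wait, phase at arrival). A C version would
    rely on the counter never wrapping, since 2**64 is not a multiple of
    most n.
    """
    v = counter + 1
    return v, v % n != 0, v // n


def released(counter, n, phase):
    return counter // n != phase


n = 3
counter = 0
counter, wait1, phase1 = arrive(counter, n)   # counter 1, waits in phase 0
counter, wait2, phase2 = arrive(counter, n)   # counter 2, waits in phase 0
early = released(counter, n, phase1)          # False: one thread is missing
counter, wait3, _ = arrive(counter, n)        # counter 3, last arrival
late = released(counter, n, phase1) and released(counter, n, phase2)

print(wait1, wait2, wait3, early, late)       # True True False False True
```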

<p>While C11 <code class="language-plaintext highlighter-rouge">_Atomic</code> seems like it would be useful, unsurprisingly it is
not supported by one major, <a href="/blog/2021/12/30/">stubborn</a> implementation. If you’re
using C++11 or later, then go ahead and use <code class="language-plaintext highlighter-rouge">std::atomic&lt;int&gt;</code> since it’s
well-supported. In real, practical C programs, I will continue using dual
implementations: interlocked functions on MSVC, and GCC built-ins (also
supported by Clang) everywhere else.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if __GNUC__
#  define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
#  define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
#  define BARRIER_INC(x) _InterlockedIncrement(x)
#  define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif
</span>
<span class="c1">// Spin-lock barrier for n threads, where n is a power of two.</span>
<span class="c1">// Initialize *barrier to zero.</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">barrier</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="n">BARRIER_INC</span><span class="p">(</span><span class="n">barrier</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="n">n</span><span class="p">;</span> <span class="p">(</span><span class="n">BARRIER_GET</span><span class="p">(</span><span class="n">barrier</span><span class="p">)</span><span class="o">&amp;</span><span class="n">n</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This has the nice bonus that the interface does not have the <code class="language-plaintext highlighter-rouge">_Atomic</code>
qualifier, nor <code class="language-plaintext highlighter-rouge">std::atomic</code> template. It’s just a plain old <code class="language-plaintext highlighter-rouge">int</code>, making
the interface simpler and easier to use. It’s something I’ve grown to
appreciate from Go.</p>

<p>If you’d like to try the experiment yourself: <a href="https://gist.github.com/skeeto/c63b9ddf2c599eeca86356325b93f3a7"><code class="language-plaintext highlighter-rouge">reorder.c</code></a>. If
you’d like to see a test of Go and C sharing a thread barrier:
<a href="https://gist.github.com/skeeto/bdb5a0d2aa36b68b6f66ca39989e1444"><code class="language-plaintext highlighter-rouge">coop.go</code></a>.</p>

<p>I’m intentionally not providing the spin-lock barrier as a library. First,
it’s too trivial and small for that, and second, I believe <a href="https://vimeo.com/644068002">context is
everything</a>. Now that you understand the principle, you can whip up
your own, custom-tailored implementation when the situation calls for it,
just as the one in my experiment is hard-coded for exactly two threads.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Compressing and embedding a Wordle word list</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/03/07/"/>
    <id>urn:uuid:95e1a2c2-c1b6-4472-9954-7bc76b4bab10</id>
    <updated>2022-03-07T03:22:41Z</updated>
    <category term="c"/><category term="python"/><category term="compression"/>
    <content type="html">
      <![CDATA[<p><a href="https://en.wikipedia.org/wiki/Wordle">Wordle</a> is all the rage, resulting in an explosion of hobbyist clones,
with new ones appearing every day. At the current rate I estimate by the
end of 2022 that 99% of all new software releases will be Wordle clones.
That’s no surprise since the rules are simple, it’s more fun to implement
and study than to actually play, and the hard part is building a decent
user interface. Such implementations go back <a href="https://www.youtube.com/watch?v=Yi2mTMWC4BM&amp;t=1270s">at least 30 years</a>.
Implementers get to decide on a platform, language, and the particular
subject of this article: how to handle the word list. Is it a separate
file/database or <a href="/blog/2016/11/15/">embedded in the program</a>? If embedded, is it
worth compressing? In this article I’ll present a simple, tailored Wordle
list compression strategy that beats general purpose compressors.</p>

<p>Last week one particular <a href="/blog/2020/11/17/">QuickBASIC</a> clone, <a href="http://grahamdowney.com/software/WorDOSle/WorDOSle.htm">WorDOSle</a>, caught my
eye. It embeds its word list despite the dire constraints of its 16-bit
platform. The original Wordle list (<a href="https://gist.github.com/cfreshman/cdcdf777450c5b5301e439061d29694c">1</a>, <a href="https://gist.github.com/cfreshman/a03ef2cba789d8cf00c08f767e0fad7b">2</a>) has 12,972 words which,
naively stored, would consume 77,832 bytes (5 letters, plus newline).
Sadly this exceeds a 16-bit address space. Eliminating the redundant
newline delimiter brings it down to 64,860 bytes — just small enough to
fit in an 8086 segment, but probably still difficult to manage from
QuickBASIC.</p>

<p>The author made a trade-off, reducing the word list to a more manageable,
if meager, 2,318 words, wisely excluding delimiters. Otherwise no further
effort was made towards reducing the size. The list is sorted, and the program
cleverly tests words against the list in place using a binary search.</p>
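<p>To illustrate the idea (this is not WorDOSle’s actual code, just a Python
sketch with a hypothetical <code class="language-plaintext highlighter-rouge">contains</code> function), a sorted list of
fixed-width words needs no delimiters because word <code class="language-plaintext highlighter-rouge">i</code> always starts at
offset <code class="language-plaintext highlighter-rouge">5*i</code>:</p>

```python
def contains(packed, word, width=5):
    """Binary search a sorted, delimiter-free string of fixed-width words."""
    lo, hi = 0, len(packed) // width
    while lo < hi:
        mid = (lo + hi) // 2
        candidate = packed[mid*width:(mid + 1)*width]
        if candidate == word:
            return True
        if candidate < word:
            lo = mid + 1
        else:
            hi = mid
    return False


# Ten arbitrary words, sorted and concatenated without delimiters.
packed = "abbeyabideaboutaboveabuseacidsacornacresactedactor"

print(contains(packed, "about"))   # True
print(contains(packed, "zebra"))   # False
```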

<h3 id="compaction-baseline">Compaction baseline</h3>

<p>Before getting into any real compression technologies, there’s low-hanging
fruit to investigate. Words are exactly five, case-insensitive, English
language letters: a–z. To illustrate, here are the first 100 5-letter
words from a short Wordle word list.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
</code></pre></div></div>

<p>In ASCII/UTF-8 form it’s 8 bits per letter, 5 bytes per word, but I only
need 5 bits per letter, or more specifically, ~4.7 bits (<code class="language-plaintext highlighter-rouge">log2(26)</code>) per
letter. If I instead treat each word as a base-26 number, I can pack each
word into 3 bytes (<code class="language-plaintext highlighter-rouge">26**5</code> is ~23.5 bits). A 40% savings just by using a
smarter representation.</p>
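<p>A minimal sketch of that base-26 packing, with hypothetical <code class="language-plaintext highlighter-rouge">pack</code> and
<code class="language-plaintext highlighter-rouge">unpack</code> helpers. The big-endian byte order is my choice for the example,
not something the scheme requires.</p>

```python
def pack(word):
    """Treat a 5-letter lowercase word as a base-26 number in 3 bytes."""
    n = 0
    for c in word:
        n = n*26 + (ord(c) - ord("a"))
    return n.to_bytes(3, "big")


def unpack(packed):
    """Inverse of pack(): peel off base-26 digits, last letter first."""
    n = int.from_bytes(packed, "big")
    letters = []
    for _ in range(5):
        n, r = divmod(n, 26)
        letters.append(chr(ord("a") + r))
    return "".join(reversed(letters))


print(26**5 <= 2**24)            # True, so 3 bytes always suffice
print(unpack(pack("abbey")))     # abbey
```

<p>With big-endian bytes the packed values sort in the same order as the
words, so a sorted list stays binary-searchable after packing.</p>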

<p>With 12,972 words, that’s <strong>38,916 bytes</strong> for the whole list. Any
compression I apply must at least beat this size in order to be worth
using.</p>

<h3 id="letter-frequency">Letter frequency</h3>

<p>Not all letters occur at the same frequency. Here’s the letter frequency
for the original Wordle word list:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a:5990  e:6662  i:3759  m:1976  q: 112  u:2511  y:2074
b:1627  f:1115  j: 291  n:2952  r:4158  v: 694  z: 434
c:2028  g:1644  k:1505  o:4438  s:6665  w:1039
d:2453  h:1760  l:3371  p:2019  t:3295  x: 288
</code></pre></div></div>

<p>When encoding a word, I can save space by spending fewer bits on frequent
letters like <code class="language-plaintext highlighter-rouge">e</code> at the cost of spending more bits on infrequent letters
like <code class="language-plaintext highlighter-rouge">q</code>. There are multiple approaches, but the simplest is <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman
coding</a>. It’s not the most efficient, but it’s so easy I can
almost code it in my sleep.</p>
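<p>For a rough idea of how much is on the table, the per-letter entropy of
the histogram gives a lower bound for any code that, like Huffman coding,
treats each letter independently. This is a side calculation, not part of the
compressor. The counts are copied from the table above and the variable names
are arbitrary.</p>

```python
import math

# Letter counts copied from the histogram above.
HIST = {
    "a": 5990, "b": 1627, "c": 2028, "d": 2453, "e": 6662, "f": 1115,
    "g": 1644, "h": 1760, "i": 3759, "j": 291, "k": 1505, "l": 3371,
    "m": 1976, "n": 2952, "o": 4438, "p": 2019, "q": 112, "r": 4158,
    "s": 6665, "t": 3295, "u": 2511, "v": 694, "w": 1039, "x": 288,
    "y": 2074, "z": 434,
}

total = sum(HIST.values())            # 64,860 letters = 12,972 words * 5
entropy = -sum(n/total * math.log2(n/total) for n in HIST.values())
flat = math.log2(len(HIST))           # cost per letter if all were equal

print(f"{entropy:.3f} bits/letter vs {flat:.3f} flat")
print(f"lower bound: {math.ceil(entropy * total / 8)} bytes")
```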

<p>While my ultimate target is C, I did the frequency analysis, explored the
problem space, and implemented my compressors in Python. I don’t normally
like to use Python, but it <em>is</em> good for one-shot, disposable data
science-y stuff like this. The decompressor will be implemented in C,
partially via meta-programming: Python code generating my C code. Here’s
my letter histogram code:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">collections</span>
<span class="kn">import</span> <span class="nn">itertools</span>
<span class="kn">import</span> <span class="nn">sys</span>

<span class="n">words</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">sys</span><span class="p">.</span><span class="n">stdin</span><span class="p">]</span>
<span class="n">hist</span> <span class="o">=</span> <span class="n">collections</span><span class="p">.</span><span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">itertools</span><span class="p">.</span><span class="n">chain</span><span class="p">(</span><span class="o">*</span><span class="n">words</span><span class="p">):</span>
    <span class="n">hist</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>

<p>To build a Huffman coding tree, I’ll need a min-heap (priority queue)
initially filled with nodes representing each letter and its frequency.
While the heap has more than one element, I pop off the two lowest
frequency nodes, create a new parent node with the sum of their
frequencies, and push it into the heap. When the heap has one element, the
remaining element is the root of the Huffman coding tree.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">huffman</span><span class="p">(</span><span class="n">hist</span><span class="p">):</span>
    <span class="n">heap</span> <span class="o">=</span> <span class="p">[(</span><span class="n">n</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">hist</span><span class="p">.</span><span class="n">items</span><span class="p">()]</span>
    <span class="n">heapq</span><span class="p">.</span><span class="n">heapify</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span>
    <span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappop</span><span class="p">(</span><span class="n">heap</span><span class="p">),</span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappop</span><span class="p">(</span><span class="n">heap</span><span class="p">)</span>
        <span class="n">node</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
        <span class="n">heapq</span><span class="p">.</span><span class="n">heappush</span><span class="p">(</span><span class="n">heap</span><span class="p">,</span> <span class="n">node</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">heap</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>

<span class="n">tree</span> <span class="o">=</span> <span class="n">huffman</span><span class="p">(</span><span class="n">hist</span><span class="p">)</span>
</code></pre></div></div>

<p>(By the way, I love that <code class="language-plaintext highlighter-rouge">heapq</code> operates directly on a plain <code class="language-plaintext highlighter-rouge">list</code>
rather than being its own data structure.) This produces the following
Huffman coding tree (via <code class="language-plaintext highlighter-rouge">pprint</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((('e', 's'),
  (('t', 'l'), (('g', ('v', 'w')), ('h', 'm')))),
 ((('i', ('p', 'c')),
   ('r', ('y', ('f', ('z', ('j', ('q', 'x'))))))),
  (('o', ('d', 'u')), ('a', ('n', ('k', 'b'))))))
</code></pre></div></div>

<p>It would be more useful to actually see the encodings.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="s">""</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">prefix</span><span class="o">+</span><span class="s">"0"</span><span class="p">)</span> <span class="o">+</span> \
               <span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">prefix</span><span class="o">+</span><span class="s">"1"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">[(</span><span class="n">tree</span><span class="p">,</span> <span class="n">prefix</span><span class="p">)]</span>
</code></pre></div></div>

<p>I used <code class="language-plaintext highlighter-rouge">isinstance</code> to distinguish leaves (<code class="language-plaintext highlighter-rouge">str</code>) from internal nodes
(<code class="language-plaintext highlighter-rouge">tuple</code>). With <code class="language-plaintext highlighter-rouge">sorted(flatten(tree))</code>, I get something like Morse Code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[('a', '1110'),       ('j', '10111110'),   ('s', '001'),
 ('b', '111111'),     ('k', '111110'),     ('t', '0100'),
 ('c', '10011'),      ('l', '0101'),       ('u', '11011'),
 ('d', '11010'),      ('m', '01111'),      ('v', '011010'),
 ('e', '000'),        ('n', '11110'),      ('w', '011011'),
 ('f', '101110'),     ('o', '1100'),       ('x', '101111111'),
 ('g', '01100'),      ('p', '10010'),      ('y', '10110'),
 ('h', '01110'),      ('q', '101111110'),  ('z', '1011110')]
 ('i', '1000'),       ('r', '1010'),
</code></pre></div></div>

<p>In terms of encoded bit length, what is the shortest and longest?</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">codes</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">))</span>
<span class="n">lengths</span> <span class="o">=</span> <span class="p">[(</span><span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">c</span><span class="p">])</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">w</span><span class="p">),</span> <span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">]</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">min(lengths)</code> is “esses” at 15 bits, and <code class="language-plaintext highlighter-rouge">max(lengths)</code> is “qajaq” at 34
bits. In other words, the worst case is worse than the compact, 24-bit
representation! However, the total is better: <code class="language-plaintext highlighter-rouge">sum(w[0] for w in lengths)</code>
reports 281,956 bits, or 35,245 bytes. Packed appropriately, that shaves
off ~3.5kB, though it comes at the cost of losing random access, and
therefore binary search.</p>
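<p>These figures can be cross-checked without the word list, since the total
only depends on the letter histogram and the code lengths. The sketch below
copies both tables from above. <code class="language-plaintext highlighter-rouge">word_bits</code> is a hypothetical helper name.</p>

```python
# Letter counts and Huffman codes copied from the tables above.
HIST = {
    "a": 5990, "b": 1627, "c": 2028, "d": 2453, "e": 6662, "f": 1115,
    "g": 1644, "h": 1760, "i": 3759, "j": 291, "k": 1505, "l": 3371,
    "m": 1976, "n": 2952, "o": 4438, "p": 2019, "q": 112, "r": 4158,
    "s": 6665, "t": 3295, "u": 2511, "v": 694, "w": 1039, "x": 288,
    "y": 2074, "z": 434,
}
CODES = {
    "a": "1110", "b": "111111", "c": "10011", "d": "11010", "e": "000",
    "f": "101110", "g": "01100", "h": "01110", "i": "1000",
    "j": "10111110", "k": "111110", "l": "0101", "m": "01111",
    "n": "11110", "o": "1100", "p": "10010", "q": "101111110",
    "r": "1010", "s": "001", "t": "0100", "u": "11011", "v": "011010",
    "w": "011011", "x": "101111111", "y": "10110", "z": "1011110",
}


def word_bits(word):
    """Encoded length of one word in bits."""
    return sum(len(CODES[c]) for c in word)


total_bits = sum(HIST[c] * len(CODES[c]) for c in HIST)

print(total_bits)                                # 281956
print(-(-total_bits // 8))                       # 35245 bytes, rounded up
print(word_bits("esses"), word_bits("qajaq"))    # 15 34
```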

<p>Speaking of bit packing, I’m ready to compress the entire word list into a
bit stream:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bits</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">)</span>
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">bits</code> begins with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11101110011100001101011101110010110001000111011101...
</code></pre></div></div>

<p>On the C side I’ll pack these into 32-bit integers, least significant bit
first. I abused <code class="language-plaintext highlighter-rouge">textwrap</code> to dice it up, and I also need to reverse each
set of bits before converting to an integer.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u32</span> <span class="o">=</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">b</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mi">2</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">textwrap</span><span class="p">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">bits</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">32</span><span class="p">)]</span>
</code></pre></div></div>
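<p>The reversal is what makes bit <code class="language-plaintext highlighter-rouge">i</code> of the stream land at bit <code class="language-plaintext highlighter-rouge">i % 32</code> of
integer <code class="language-plaintext highlighter-rouge">i / 32</code>. As a sketch of how a reader would consume it (in Python
here rather than the eventual C, with made-up helper names), the following
packs the 50-bit prefix shown above and decodes it again. It slices the string
directly instead of going through <code class="language-plaintext highlighter-rouge">textwrap</code>.</p>

```python
# Huffman codes copied from the table above.
CODES = {
    "a": "1110", "b": "111111", "c": "10011", "d": "11010", "e": "000",
    "f": "101110", "g": "01100", "h": "01110", "i": "1000",
    "j": "10111110", "k": "111110", "l": "0101", "m": "01111",
    "n": "11110", "o": "1100", "p": "10010", "q": "101111110",
    "r": "1010", "s": "001", "t": "0100", "u": "11011", "v": "011010",
    "w": "011011", "x": "101111111", "y": "10110", "z": "1011110",
}
DECODE = {code: letter for letter, code in CODES.items()}


def pack_u32(bits):
    """Pack a bit string into 32-bit integers, least significant bit first."""
    return [int(bits[i:i + 32][::-1], 2) for i in range(0, len(bits), 32)]


def read_bit(u32, i):
    """Fetch stream bit i, mirroring what a C reader would do."""
    return (u32[i // 32] >> (i % 32)) & 1


def decode(u32, nwords):
    """Decode nwords 5-letter words by extending a prefix until it matches."""
    words, i = [], 0
    for _ in range(nwords):
        word = ""
        while len(word) < 5:
            code = ""
            while code not in DECODE:
                code += str(read_bit(u32, i))
                i += 1
            word += DECODE[code]
        words.append(word)
    return words


bits = "11101110011100001101011101110010110001000111011101"
u32 = pack_u32(bits)
print(hex(u32[0]))       # 0x4eeb0e77
print(decode(u32, 2))    # ['aahed', 'aalii']
```

<p>Note the decoder has to be told how many words to read, since the final
integer is only partly filled and the padding bits carry no meaning.</p>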

<p>I now have my compressed data as a sequence of 32-bit integers. Next, some
meta-programming:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"static const uint32_t words[</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">u32</span><span class="p">)</span><span class="si">}</span><span class="s">] ="</span><span class="p">,</span> <span class="s">"{"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">u</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">u32</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">i</span><span class="o">%</span><span class="mi">6</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">    "</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"0x</span><span class="si">{</span><span class="n">u</span><span class="si">:</span><span class="mi">08</span><span class="n">x</span><span class="si">}</span><span class="s">,"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">};"</span><span class="p">)</span>
</code></pre></div></div>

<p>That produces a C table, the beginnings of my decompressor. The array
length isn’t necessary since the C compiler can figure it out, but being
explicit allows human readers to know the size at a glance, too. Observe
how the final 32-bit integer isn’t entirely filled.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">words</span><span class="p">[</span><span class="mi">8812</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mh">0x4eeb0e77</span><span class="p">,</span><span class="mh">0xb8caee23</span><span class="p">,</span><span class="mh">0xffb892bb</span><span class="p">,</span><span class="mh">0x397fddf2</span><span class="p">,</span><span class="mh">0xddfcbfee</span><span class="p">,</span><span class="mh">0x5ff7997f</span><span class="p">,</span>
    <span class="c1">// ...</span>
    <span class="mh">0x7b4e66bd</span><span class="p">,</span><span class="mh">0x35ebcccd</span><span class="p">,</span><span class="mh">0x8f9af60f</span><span class="p">,</span><span class="mh">0x0000000c</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Now, how to go about building the rest of the decompressor? I have a
Huffman coding tree, which is <em>an awful lot</em> <a href="/blog/2020/12/31/">like a state machine</a>,
eh? I can even have Python generate a state transition table from the
Huffman tree:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">states</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
    <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">):</span>
        <span class="n">child</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
        <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="n">child</span>
        <span class="n">states</span><span class="p">.</span><span class="n">extend</span><span class="p">((</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">))</span>
        <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">states</span><span class="p">,</span> <span class="n">child</span><span class="o">+</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">states</span><span class="p">,</span> <span class="n">child</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="nb">ord</span><span class="p">(</span><span class="n">tree</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">states</span>

<span class="n">states</span> <span class="o">=</span> <span class="n">transitions</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="p">[</span><span class="bp">None</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>The central idea: positive entries are leaves, and negative entries are
internal nodes. The negated value is the index of the left child, with the
right child immediately following. In <code class="language-plaintext highlighter-rouge">transitions</code>, the caller reserves
space in the state table for callees, hence starting with <code class="language-plaintext highlighter-rouge">[None]</code>. I’ll
show the actual table in C form after some more meta-programming:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"static const int8_t states[</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">states</span><span class="p">)</span><span class="si">}</span><span class="s">] ="</span><span class="p">,</span> <span class="s">"{"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">states</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">i</span><span class="o">%</span><span class="mi">12</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">    "</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">s</span><span class="si">:</span><span class="mi">4</span><span class="si">}</span><span class="s">,"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">};"</span><span class="p">)</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">int8_t</code> since I know these values will all fit in an octet, and
it must be signed because of the negatives. The result:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int8_t</span> <span class="n">states</span><span class="p">[</span><span class="mi">51</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
      <span class="o">-</span><span class="mi">1</span><span class="p">,</span>  <span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">19</span><span class="p">,</span>  <span class="o">-</span><span class="mi">5</span><span class="p">,</span>  <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="mi">101</span><span class="p">,</span> <span class="mi">115</span><span class="p">,</span>  <span class="o">-</span><span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">11</span><span class="p">,</span> <span class="mi">116</span><span class="p">,</span> <span class="mi">108</span><span class="p">,</span> <span class="o">-</span><span class="mi">13</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">17</span><span class="p">,</span> <span class="mi">103</span><span class="p">,</span> <span class="o">-</span><span class="mi">15</span><span class="p">,</span> <span class="mi">118</span><span class="p">,</span> <span class="mi">119</span><span class="p">,</span> <span class="mi">104</span><span class="p">,</span> <span class="mi">109</span><span class="p">,</span> <span class="o">-</span><span class="mi">21</span><span class="p">,</span> <span class="o">-</span><span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">27</span><span class="p">,</span> <span class="mi">105</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">25</span><span class="p">,</span> <span class="mi">112</span><span class="p">,</span>  <span class="mi">99</span><span class="p">,</span> <span class="mi">114</span><span class="p">,</span> <span class="o">-</span><span class="mi">29</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="o">-</span><span class="mi">31</span><span class="p">,</span> <span class="mi">102</span><span class="p">,</span> <span class="o">-</span><span class="mi">33</span><span class="p">,</span> <span class="mi">122</span><span class="p">,</span> <span class="o">-</span><span class="mi">35</span><span class="p">,</span> <span class="mi">106</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">37</span><span class="p">,</span> <span class="mi">113</span><span class="p">,</span> <span class="mi">120</span><span class="p">,</span> <span class="o">-</span><span class="mi">41</span><span class="p">,</span> <span class="o">-</span><span class="mi">45</span><span class="p">,</span> <span class="mi">111</span><span class="p">,</span> <span class="o">-</span><span class="mi">43</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">117</span><span class="p">,</span>  <span class="mi">97</span><span class="p">,</span> <span class="o">-</span><span class="mi">47</span><span class="p">,</span> <span class="mi">110</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">49</span><span class="p">,</span> <span class="mi">107</span><span class="p">,</span>  <span class="mi">98</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
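<p>The layout is easier to follow on a toy tree. This sketch repeats the
<code class="language-plaintext highlighter-rouge">transitions</code> function with comments and runs it on a hypothetical
three-letter tree (not part of the Wordle data), then checks that every entry
fits in a signed octet.</p>

```python
def transitions(tree, states, state):
    """Fill states[state] for this subtree, appending slots for children."""
    if isinstance(tree, tuple):
        child = len(states)
        states[state] = -child        # negative: index of the left child
        states.extend((None, None))   # reserve both child slots up front
        transitions(tree[0], states, child + 0)
        transitions(tree[1], states, child + 1)
    else:
        states[state] = ord(tree)     # positive: ASCII code of the leaf
    return states


# Hypothetical tree with codes a=00, b=01, c=1.
toy_tree = (("a", "b"), "c")
toy_states = transitions(toy_tree, [None], 0)
fits_int8 = all(-128 <= s <= 127 for s in toy_states)
print(toy_states, fits_int8)
```

<p>The toy table is <code class="language-plaintext highlighter-rouge">[-1, -3, 99, 97, 98]</code>: the root points at slots 1 and 2,
slot 1 points at slots 3 and 4, and the remaining entries are the letters.</p>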

<p>The first node is -1, meaning if you read a 0 bit then transition to state
1, else state 2 (i.e. the slot immediately following state 1). The decompressor reads one
bit at a time, walking the state table until it hits a positive value,
which is an ASCII code. I’ve decided on this function prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">next</span><span class="p">(</span><span class="kt">char</span> <span class="n">word</span><span class="p">[</span><span class="mi">5</span><span class="p">],</span> <span class="kt">int32_t</span> <span class="n">n</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">n</code> is the bit index, which starts at zero. The function decodes the
word at the given index, then returns the bit index for the next word.
Callers can iterate the entire word list without decompressing the whole
list at once. Finally the decompressor code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">next</span><span class="p">(</span><span class="kt">char</span> <span class="n">word</span><span class="p">[</span><span class="mi">5</span><span class="p">],</span> <span class="kt">int32_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">5</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">n</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">n</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// next bit</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="n">word</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">n</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When compiled, this is about 80 bytes of instructions on both x86-64 and
ARM64. This, along with the 51 bytes for the state table, should be
counted against the compression size. With 8,812 four-byte words (35,248
bytes), that’s roughly 35,379 bytes total.</p>

<p>Trying it out, this program indeed reproduces the original word list:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">word</span><span class="p">[]</span> <span class="o">=</span> <span class="s">".....</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">12972</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span> <span class="o">=</span> <span class="n">next</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">state</span><span class="p">);</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
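<p>The decoding loop can also be exercised without a C compiler. The following
is a rough Python port of <code class="language-plaintext highlighter-rouge">next</code>, run against a hypothetical toy state table
(codes a=00, b=01, c=1) rather than the real tables. The names <code class="language-plaintext highlighter-rouge">STATES</code>,
<code class="language-plaintext highlighter-rouge">pack</code>, and <code class="language-plaintext highlighter-rouge">next_word</code> are assumptions made for this sketch.</p>

```python
STATES = [-1, -3, 99, 97, 98]   # toy table: a=00, b=01, c=1 (hypothetical)


def pack(bits):
    """Pack a '0'/'1' string into 32-bit integers, least significant bit first."""
    return [int(bits[i:i + 32][::-1], 2) for i in range(0, len(bits), 32)]


def next_word(words, n):
    """Decode 5 letters starting at bit index n; return (word, next index)."""
    letters = []
    for _ in range(5):
        state = 0
        while STATES[state] < 0:
            b = words[n >> 5] >> (n & 31) & 1   # next bit
            state = b - STATES[state]           # left child for 0, right for 1
            n += 1
        letters.append(chr(STATES[state]))
    return "".join(letters), n


codes = {"a": "00", "b": "01", "c": "1"}
text = ["abcab", "cabba", "bacca", "ccabb"]
bits = "".join(codes[c] for w in text for c in w)
words = pack(bits)   # 34 bits, so the data spills into a second integer

n = 0
decoded = []
for _ in text:
    word, n = next_word(words, n)
    decoded.append(word)
print(decoded, n)
```

<p>As in the C version, the returned index is all the state a caller has to
carry from one word to the next.</p>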

<p>Searching 12,972 words linearly isn’t too bad, even for an old 16-bit
machine. However, if you really need to speed it up, you could build a
little run time index to track various bit positions in the list. For
example, the first word starting with <code class="language-plaintext highlighter-rouge">b</code> is at bit offset 15,743. If the
word I’m looking up begins with <code class="language-plaintext highlighter-rouge">b</code> then I can start there and stop at the
first <code class="language-plaintext highlighter-rouge">c</code>, decompressing just 909 words.</p>
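<p>Here is one way such an index might look, sketched in Python on a toy
sorted list. To stay short it works on the bit string directly and uses a
stand-in prefix code (unary by frequency rank) instead of the Huffman tree, so
the offsets it records are unrelated to the real 15,743. All names are
hypothetical.</p>

```python
from collections import Counter

words = ["abbey", "about", "acids", "babes", "backs", "bacon", "cabal", "cabin"]

# Stand-in prefix code: the r-th most common letter is r ones then a zero.
ranked = [c for c, _ in Counter("".join(words)).most_common()]
codes = {c: "1" * r + "0" for r, c in enumerate(ranked)}
bits = "".join(codes[c] for w in words for c in w)


def next_word(n):
    """Decode one 5-letter word at bit offset n; return (word, next offset)."""
    out = []
    for _ in range(5):
        r = 0
        while bits[n] == "1":
            r += 1
            n += 1
        n += 1   # skip the terminating zero
        out.append(ranked[r])
    return "".join(out), n


def build_index():
    """Scan the list once, recording the bit offset of each first letter."""
    index, n = {}, 0
    while n < len(bits):
        word, end = next_word(n)
        index.setdefault(word[0], n)
        n = end
    return index


def contains(index, target):
    """Start at the target's first letter and stop when that letter changes."""
    n = index.get(target[0])
    if n is None:
        return False
    while n < len(bits):
        word, n = next_word(n)
        if word[0] != target[0]:
            return False
        if word == target:
            return True
    return False


index = build_index()
print(index, contains(index, "bacon"), contains(index, "baker"))
```

<p>The index costs one offset per starting letter, so at most 26 entries for
the full list.</p>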

<h3 id="taking-it-to-the-next-level-run-length-encoding">Taking it to the next level: run-length encoding</h3>

<p>Here’s the 100-word word list sample again. The sorting is deliberate:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
</code></pre></div></div>

<p>If I look at words column-wise, I see a long run of <code class="language-plaintext highlighter-rouge">a</code>, then a long run
of <code class="language-plaintext highlighter-rouge">b</code>, etc. Even the second column has long runs. I should really exploit
this somehow. The first scheme would have worked equally well on a
shuffled list as a sorted list, which is an indication that it’s storing
unnecessary information, namely the word list order. (Rule of thumb:
Compression should work better on sorted inputs.)</p>

<p>For this second scheme, I’ll pivot the whole list so that I can encode it
in column-order. (This is roughly how one part of bzip2 works, by the
way.) I’ll use run-length encoding (RLE) to communicate “91 ‘a’, 135 ‘b’,
etc.”, then I’ll encode these RLE tokens using Huffman coding, per the
first scheme, since there will be lots of repeated tokens.</p>

<p>First, pivot the word list:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pivot</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">))</span>
</code></pre></div></div>
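<p>On a three-word toy list the effect of the pivot is easy to see. This
sketch also shows an equivalent <code class="language-plaintext highlighter-rouge">zip</code> formulation and a hypothetical
<code class="language-plaintext highlighter-rouge">unpivot</code> helper that a decoder would need in some form.</p>

```python
words = ["abbey", "abide", "about"]   # toy sorted list

# Column-major string: every first letter, then every second letter, etc.
pivot = "".join("".join(w[i] for w in words) for i in range(5))

# Same result, using zip to transpose rows into columns.
pivot_zip = "".join("".join(col) for col in zip(*words))


def unpivot(pivot, width=5):
    """Invert the pivot: word j is every count-th character starting at j."""
    count = len(pivot) // width
    return [pivot[j::count] for j in range(count)]


print(pivot)
print(unpivot(pivot))
```

<p>The pivoted string is <code class="language-plaintext highlighter-rouge">aaabbbbioeduyet</code>. Note how the run of
<code class="language-plaintext highlighter-rouge">b</code> crosses from the second column into the third.</p>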

<p>Next compute the RLE token stream. The stream works in pairs, first
indicating a letter (1–26), then the run length.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokens</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">offset</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot</span><span class="p">):</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">pivot</span><span class="p">[</span><span class="n">offset</span><span class="p">]</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">offset</span>
    <span class="k">while</span> <span class="n">offset</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot</span><span class="p">)</span> <span class="ow">and</span> <span class="n">pivot</span><span class="p">[</span><span class="n">offset</span><span class="p">]</span> <span class="o">==</span> <span class="n">c</span><span class="p">:</span>
        <span class="n">offset</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">-</span> <span class="nb">ord</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">offset</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
</code></pre></div></div>

<p>I’ve biased the letter representation by 1 — i.e. 1–26 instead of 0–25 —
since I’m going to encode all the tokens using the same Huffman tree.
(Exercise for the reader: Does compression improve with two distinct
Huffman trees, one for letters and the other for runs?) There are no
zero-length runs, and I want there to be as few unique tokens as possible.</p>

<p><code class="language-plaintext highlighter-rouge">tokens</code> looks like so (i.e. 737 ‘a’, 909 ‘b’, …):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1, 737, 2, 909, 3, 922, 4, 685, 5, 303, 6, 598, ...]
</code></pre></div></div>
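<p>The same token stream can be produced with <code class="language-plaintext highlighter-rouge">itertools.groupby</code>. This is an
alternative sketch, not the loop above, run on the pivot of a toy list (abbey,
abide, about) together with a hypothetical <code class="language-plaintext highlighter-rouge">rle_expand</code> to confirm the
tokens round-trip.</p>

```python
from itertools import groupby


def rle_tokens(pivot):
    """Emit (letter, run length) pairs as one flat token list."""
    tokens = []
    for letter, run in groupby(pivot):
        tokens.append(ord(letter) - ord("a") + 1)   # letter, biased to 1-26
        tokens.append(len(list(run)))               # run length, always >= 1
    return tokens


def rle_expand(tokens):
    """Rebuild the pivoted string from alternating letter/run tokens."""
    pairs = zip(tokens[0::2], tokens[1::2])
    return "".join(chr(letter - 1 + ord("a")) * run for letter, run in pairs)


pivot = "aaabbbbioeduyet"   # pivot of the toy list abbey, abide, about
tokens = rle_tokens(pivot)
print(tokens)
```

<p>Because letters and runs share one token alphabet, the value 4 appears here
both as a run length (four of ‘b’) and as a letter (‘d’). Only its position in
the stream says which it is.</p>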

<p>The original Wordle list results in 139 unique tokens. A few tokens appear
many times, but most appear only once. Reusing my Huffman coding tree
builder from before:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tree</span> <span class="o">=</span> <span class="n">huffman</span><span class="p">(</span><span class="n">collections</span><span class="p">.</span><span class="n">Counter</span><span class="p">(</span><span class="n">tokens</span><span class="p">))</span>
</code></pre></div></div>

<p>This makes for a more complex and interesting tree:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1,
 ((((18, 20), (25, (((10, 24), (26, 22)), 8))),
   (5,
    ((11,
      ((23,
        ((17,
          (((35, (46, 76)), ((82, 93), (104, 111))),
           (((165, 168), 27), (28, (((30, 39), 31), 38))))),
         ((((((40, 41), ((44, 48), 45)),
             ((53, (54, 56)), 55)),
            ((((57, 59), 58), ((60, 61), (62, 63))),
             ((64, (65, 66)), ((67, 70), 68)))),
           (((((71, 75), 74), (77, (78, 79))),
             (((80, 85), 87), 81)),
            ((((90, 91), (92, 97)), (96, (99, 100))),
             (((101, 103), 102),
              ((105, 106), (109, 110)))))),
          ((((((113, 114), 117), ((120, 121), (125, 129))),
             (((130, 133), (137, 139)), (138, (140, 142)))),
            ((((144, 145), (147, 153)), (148, (166, 175))),
             (((181, 183), (187, 189)),
              ((193, 202), (220, 242))))),
           (((((262, 303), (325, 376)),
              ((413, 489), (577, 598))),
             (((628, 638), (685, 693)),
              ((737, 815), (859, 909)))),
            ((((922, 1565), 29), 32), (34, (33, 43)))))))),
       6)),
     3))),
  ((19, 2),
   ((4, (15, (21, 16))), ((14, 9), (12, (13, 7)))))))
</code></pre></div></div>

<p>Peeking at the first 21 elements of <code class="language-plaintext highlighter-rouge">sorted(flatten(tree))</code>, which chops
off the long tail of large-valued, single-occurrence tokens:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(1, '0'),            (8, '100111'),       (15, '111010'),
 (2, '1101'),         (9, '111101'),       (16, '1110111'),
 (3, '10111'),        (10, '10011000'),    (17, '1011010100'),
 (4, '11100'),        (11, '101100'),      (18, '10000'),
 (5, '1010'),         (12, '111110'),      (19, '1100'),
 (6, '1011011'),      (13, '1111110'),     (20, '10001'),
 (7, '1111111'),      (14, '111100'),      (21, '1110110')]
</code></pre></div></div>

<p>Huffman-encoding the RLE stream is more straightforward:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">codes</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">tree</span><span class="p">))</span>
<span class="n">bits</span> <span class="o">=</span> <span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">codes</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">tokens</span><span class="p">)</span>
</code></pre></div></div>

<p>This time <code class="language-plaintext highlighter-rouge">len(bits)</code> is 164,958, or 20,620 bytes! A huge difference,
around 40% additional savings!</p>
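<p>To try the token-coding step in isolation, here is a self-contained
sketch on a toy token stream. The <code class="language-plaintext highlighter-rouge">huffman</code> and <code class="language-plaintext highlighter-rouge">flatten</code> functions below are
stand-ins written for this example and may differ from the helpers used
earlier. <code class="language-plaintext highlighter-rouge">decode</code> is likewise hypothetical.</p>

```python
import collections
import heapq
import itertools


def huffman(counts):
    """Build a nested-tuple Huffman tree from a symbol -> count mapping."""
    tiebreak = itertools.count()   # keeps heap entries comparable on ties
    heap = [(n, next(tiebreak), sym) for sym, n in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        na, _, a = heapq.heappop(heap)
        nb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (na + nb, next(tiebreak), (a, b)))
    return heap[0][2]


def flatten(tree, prefix=""):
    """Yield (symbol, code) pairs; left edges are 0, right edges are 1."""
    if isinstance(tree, tuple):
        yield from flatten(tree[0], prefix + "0")
        yield from flatten(tree[1], prefix + "1")
    else:
        yield tree, prefix


def decode(tree, bits):
    """Walk the tree one bit at a time, restarting at the root after a leaf."""
    out, node = [], tree
    for b in bits:
        node = node[int(b)]
        if not isinstance(node, tuple):
            out.append(node)
            node = tree
    return out


# RLE tokens for the pivot of a toy list (abbey, abide, about).
tokens = [1, 3, 2, 4, 9, 1, 15, 1, 5, 1, 4, 1, 21, 1, 25, 1, 5, 1, 20, 1]
tree = huffman(collections.Counter(tokens))
codes = dict(flatten(tree))
bits = "".join(codes[t] for t in tokens)
print(len(bits), "bits for", len(tokens), "tokens")
```

<p>Token 1 dominates this toy stream, so it receives the shortest code, much
as it does in the real tree.</p>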

<p>Slicing and dicing 32-bit integers and printing the table works the same
as before. However, this time the state table has larger values (e.g. that
run of 909), and so the state table will be <code class="language-plaintext highlighter-rouge">int16_t</code>. I copy-pasted the
original meta-programming code and made the appropriate adjustments:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int16_t</span> <span class="n">states</span><span class="p">[</span><span class="mi">277</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
      <span class="o">-</span><span class="mi">1</span><span class="p">,</span>   <span class="mi">1</span><span class="p">,</span>  <span class="o">-</span><span class="mi">3</span><span class="p">,</span>  <span class="o">-</span><span class="mi">5</span><span class="p">,</span><span class="o">-</span><span class="mi">257</span><span class="p">,</span>  <span class="o">-</span><span class="mi">7</span><span class="p">,</span> <span class="o">-</span><span class="mi">21</span><span class="p">,</span>  <span class="o">-</span><span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">11</span><span class="p">,</span>  <span class="mi">18</span><span class="p">,</span>  <span class="mi">20</span><span class="p">,</span>  <span class="mi">25</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">13</span><span class="p">,</span> <span class="o">-</span><span class="mi">15</span><span class="p">,</span>   <span class="mi">8</span><span class="p">,</span> <span class="o">-</span><span class="mi">17</span><span class="p">,</span> <span class="o">-</span><span class="mi">19</span><span class="p">,</span>  <span class="mi">10</span><span class="p">,</span>  <span class="mi">24</span><span class="p">,</span>  <span class="mi">26</span><span class="p">,</span>  <span class="mi">22</span><span class="p">,</span>   <span class="mi">5</span><span class="p">,</span> <span class="o">-</span><span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">25</span><span class="p">,</span>
       <span class="mi">3</span><span class="p">,</span>  <span class="mi">11</span><span class="p">,</span> <span class="o">-</span><span class="mi">27</span><span class="p">,</span> <span class="o">-</span><span class="mi">29</span><span class="p">,</span>   <span class="mi">6</span><span class="p">,</span>  <span class="mi">23</span><span class="p">,</span> <span class="o">-</span><span class="mi">31</span><span class="p">,</span> <span class="o">-</span><span class="mi">33</span><span class="p">,</span> <span class="o">-</span><span class="mi">63</span><span class="p">,</span>  <span class="mi">17</span><span class="p">,</span> <span class="o">-</span><span class="mi">35</span><span class="p">,</span> <span class="o">-</span><span class="mi">37</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">49</span><span class="p">,</span> <span class="o">-</span><span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">43</span><span class="p">,</span>  <span class="mi">35</span><span class="p">,</span> <span class="o">-</span><span class="mi">41</span><span class="p">,</span>  <span class="mi">46</span><span class="p">,</span>  <span class="mi">76</span><span class="p">,</span> <span class="o">-</span><span class="mi">45</span><span class="p">,</span> <span class="o">-</span><span class="mi">47</span><span class="p">,</span>  <span class="mi">82</span><span class="p">,</span>  <span class="mi">93</span><span class="p">,</span> <span class="mi">104</span><span class="p">,</span>
     <span class="mi">111</span><span class="p">,</span> <span class="o">-</span><span class="mi">51</span><span class="p">,</span> <span class="o">-</span><span class="mi">55</span><span class="p">,</span> <span class="o">-</span><span class="mi">53</span><span class="p">,</span>  <span class="mi">27</span><span class="p">,</span> <span class="mi">165</span><span class="p">,</span> <span class="mi">168</span><span class="p">,</span>  <span class="mi">28</span><span class="p">,</span> <span class="o">-</span><span class="mi">57</span><span class="p">,</span> <span class="o">-</span><span class="mi">59</span><span class="p">,</span>  <span class="mi">38</span><span class="p">,</span> <span class="o">-</span><span class="mi">61</span><span class="p">,</span>
      <span class="mi">31</span><span class="p">,</span>  <span class="mi">30</span><span class="p">,</span>  <span class="mi">39</span><span class="p">,</span> <span class="o">-</span><span class="mi">65</span><span class="p">,</span><span class="o">-</span><span class="mi">155</span><span class="p">,</span> <span class="o">-</span><span class="mi">67</span><span class="p">,</span><span class="o">-</span><span class="mi">109</span><span class="p">,</span> <span class="o">-</span><span class="mi">69</span><span class="p">,</span> <span class="o">-</span><span class="mi">85</span><span class="p">,</span> <span class="o">-</span><span class="mi">71</span><span class="p">,</span> <span class="o">-</span><span class="mi">79</span><span class="p">,</span> <span class="o">-</span><span class="mi">73</span><span class="p">,</span>
     <span class="o">-</span><span class="mi">75</span><span class="p">,</span>  <span class="mi">40</span><span class="p">,</span>  <span class="mi">41</span><span class="p">,</span> <span class="o">-</span><span class="mi">77</span><span class="p">,</span>  <span class="mi">45</span><span class="p">,</span>  <span class="mi">44</span><span class="p">,</span>  <span class="mi">48</span><span class="p">,</span> <span class="o">-</span><span class="mi">81</span><span class="p">,</span>  <span class="mi">55</span><span class="p">,</span>  <span class="mi">53</span><span class="p">,</span> <span class="o">-</span><span class="mi">83</span><span class="p">,</span>  <span class="mi">54</span><span class="p">,</span>
      <span class="mi">56</span><span class="p">,</span> <span class="o">-</span><span class="mi">87</span><span class="p">,</span> <span class="o">-</span><span class="mi">99</span><span class="p">,</span> <span class="o">-</span><span class="mi">89</span><span class="p">,</span> <span class="o">-</span><span class="mi">93</span><span class="p">,</span> <span class="o">-</span><span class="mi">91</span><span class="p">,</span>  <span class="mi">58</span><span class="p">,</span>  <span class="mi">57</span><span class="p">,</span>  <span class="mi">59</span><span class="p">,</span> <span class="o">-</span><span class="mi">95</span><span class="p">,</span> <span class="o">-</span><span class="mi">97</span><span class="p">,</span>  <span class="mi">60</span><span class="p">,</span>
      <span class="mi">61</span><span class="p">,</span>  <span class="mi">62</span><span class="p">,</span>  <span class="mi">63</span><span class="p">,</span><span class="o">-</span><span class="mi">101</span><span class="p">,</span><span class="o">-</span><span class="mi">105</span><span class="p">,</span>  <span class="mi">64</span><span class="p">,</span><span class="o">-</span><span class="mi">103</span><span class="p">,</span>  <span class="mi">65</span><span class="p">,</span>  <span class="mi">66</span><span class="p">,</span><span class="o">-</span><span class="mi">107</span><span class="p">,</span>  <span class="mi">68</span><span class="p">,</span>  <span class="mi">67</span><span class="p">,</span>
      <span class="mi">70</span><span class="p">,</span><span class="o">-</span><span class="mi">111</span><span class="p">,</span><span class="o">-</span><span class="mi">129</span><span class="p">,</span><span class="o">-</span><span class="mi">113</span><span class="p">,</span><span class="o">-</span><span class="mi">123</span><span class="p">,</span><span class="o">-</span><span class="mi">115</span><span class="p">,</span><span class="o">-</span><span class="mi">119</span><span class="p">,</span><span class="o">-</span><span class="mi">117</span><span class="p">,</span>  <span class="mi">74</span><span class="p">,</span>  <span class="mi">71</span><span class="p">,</span>  <span class="mi">75</span><span class="p">,</span>  <span class="mi">77</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">121</span><span class="p">,</span>  <span class="mi">78</span><span class="p">,</span>  <span class="mi">79</span><span class="p">,</span><span class="o">-</span><span class="mi">125</span><span class="p">,</span>  <span class="mi">81</span><span class="p">,</span><span class="o">-</span><span class="mi">127</span><span class="p">,</span>  <span class="mi">87</span><span class="p">,</span>  <span class="mi">80</span><span class="p">,</span>  <span class="mi">85</span><span class="p">,</span><span class="o">-</span><span class="mi">131</span><span class="p">,</span><span class="o">-</span><span class="mi">143</span><span class="p">,</span><span class="o">-</span><span class="mi">133</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">139</span><span class="p">,</span><span class="o">-</span><span class="mi">135</span><span class="p">,</span><span class="o">-</span><span class="mi">137</span><span class="p">,</span>  <span class="mi">90</span><span class="p">,</span>  <span class="mi">91</span><span class="p">,</span>  <span class="mi">92</span><span class="p">,</span>  <span class="mi">97</span><span class="p">,</span>  <span class="mi">96</span><span class="p">,</span><span class="o">-</span><span class="mi">141</span><span class="p">,</span>  <span class="mi">99</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span><span class="o">-</span><span class="mi">145</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">149</span><span class="p">,</span><span class="o">-</span><span class="mi">147</span><span class="p">,</span> <span class="mi">102</span><span class="p">,</span> <span class="mi">101</span><span class="p">,</span> <span class="mi">103</span><span class="p">,</span><span class="o">-</span><span class="mi">151</span><span class="p">,</span><span class="o">-</span><span class="mi">153</span><span class="p">,</span> <span class="mi">105</span><span class="p">,</span> <span class="mi">106</span><span class="p">,</span> <span class="mi">109</span><span class="p">,</span> <span class="mi">110</span><span class="p">,</span><span class="o">-</span><span class="mi">157</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">213</span><span class="p">,</span><span class="o">-</span><span class="mi">159</span><span class="p">,</span><span class="o">-</span><span class="mi">185</span><span class="p">,</span><span class="o">-</span><span class="mi">161</span><span class="p">,</span><span class="o">-</span><span class="mi">173</span><span class="p">,</span><span class="o">-</span><span class="mi">163</span><span class="p">,</span><span class="o">-</span><span class="mi">167</span><span class="p">,</span><span class="o">-</span><span class="mi">165</span><span class="p">,</span> <span class="mi">117</span><span class="p">,</span> <span class="mi">113</span><span class="p">,</span> <span class="mi">114</span><span class="p">,</span><span class="o">-</span><span class="mi">169</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">171</span><span class="p">,</span> <span class="mi">120</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="mi">125</span><span class="p">,</span> <span class="mi">129</span><span class="p">,</span><span class="o">-</span><span class="mi">175</span><span class="p">,</span><span class="o">-</span><span class="mi">181</span><span class="p">,</span><span class="o">-</span><span class="mi">177</span><span class="p">,</span><span class="o">-</span><span class="mi">179</span><span class="p">,</span> <span class="mi">130</span><span class="p">,</span> <span class="mi">133</span><span class="p">,</span> <span class="mi">137</span><span class="p">,</span>
     <span class="mi">139</span><span class="p">,</span> <span class="mi">138</span><span class="p">,</span><span class="o">-</span><span class="mi">183</span><span class="p">,</span> <span class="mi">140</span><span class="p">,</span> <span class="mi">142</span><span class="p">,</span><span class="o">-</span><span class="mi">187</span><span class="p">,</span><span class="o">-</span><span class="mi">199</span><span class="p">,</span><span class="o">-</span><span class="mi">189</span><span class="p">,</span><span class="o">-</span><span class="mi">195</span><span class="p">,</span><span class="o">-</span><span class="mi">191</span><span class="p">,</span><span class="o">-</span><span class="mi">193</span><span class="p">,</span> <span class="mi">144</span><span class="p">,</span>
     <span class="mi">145</span><span class="p">,</span> <span class="mi">147</span><span class="p">,</span> <span class="mi">153</span><span class="p">,</span> <span class="mi">148</span><span class="p">,</span><span class="o">-</span><span class="mi">197</span><span class="p">,</span> <span class="mi">166</span><span class="p">,</span> <span class="mi">175</span><span class="p">,</span><span class="o">-</span><span class="mi">201</span><span class="p">,</span><span class="o">-</span><span class="mi">207</span><span class="p">,</span><span class="o">-</span><span class="mi">203</span><span class="p">,</span><span class="o">-</span><span class="mi">205</span><span class="p">,</span> <span class="mi">181</span><span class="p">,</span>
     <span class="mi">183</span><span class="p">,</span> <span class="mi">187</span><span class="p">,</span> <span class="mi">189</span><span class="p">,</span><span class="o">-</span><span class="mi">209</span><span class="p">,</span><span class="o">-</span><span class="mi">211</span><span class="p">,</span> <span class="mi">193</span><span class="p">,</span> <span class="mi">202</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="mi">242</span><span class="p">,</span><span class="o">-</span><span class="mi">215</span><span class="p">,</span><span class="o">-</span><span class="mi">245</span><span class="p">,</span><span class="o">-</span><span class="mi">217</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">231</span><span class="p">,</span><span class="o">-</span><span class="mi">219</span><span class="p">,</span><span class="o">-</span><span class="mi">225</span><span class="p">,</span><span class="o">-</span><span class="mi">221</span><span class="p">,</span><span class="o">-</span><span class="mi">223</span><span class="p">,</span> <span class="mi">262</span><span class="p">,</span> <span class="mi">303</span><span class="p">,</span> <span class="mi">325</span><span class="p">,</span> <span class="mi">376</span><span class="p">,</span><span class="o">-</span><span class="mi">227</span><span class="p">,</span><span class="o">-</span><span class="mi">229</span><span class="p">,</span> <span class="mi">413</span><span class="p">,</span>
     <span class="mi">489</span><span class="p">,</span> <span class="mi">577</span><span class="p">,</span> <span class="mi">598</span><span class="p">,</span><span class="o">-</span><span class="mi">233</span><span class="p">,</span><span class="o">-</span><span class="mi">239</span><span class="p">,</span><span class="o">-</span><span class="mi">235</span><span class="p">,</span><span class="o">-</span><span class="mi">237</span><span class="p">,</span> <span class="mi">628</span><span class="p">,</span> <span class="mi">638</span><span class="p">,</span> <span class="mi">685</span><span class="p">,</span> <span class="mi">693</span><span class="p">,</span><span class="o">-</span><span class="mi">241</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">243</span><span class="p">,</span> <span class="mi">737</span><span class="p">,</span> <span class="mi">815</span><span class="p">,</span> <span class="mi">859</span><span class="p">,</span> <span class="mi">909</span><span class="p">,</span><span class="o">-</span><span class="mi">247</span><span class="p">,</span><span class="o">-</span><span class="mi">253</span><span class="p">,</span><span class="o">-</span><span class="mi">249</span><span class="p">,</span>  <span class="mi">32</span><span class="p">,</span><span class="o">-</span><span class="mi">251</span><span class="p">,</span>  <span class="mi">29</span><span class="p">,</span> <span class="mi">922</span><span class="p">,</span>
    <span class="mi">1565</span><span class="p">,</span>  <span class="mi">34</span><span class="p">,</span><span class="o">-</span><span class="mi">255</span><span class="p">,</span>  <span class="mi">33</span><span class="p">,</span>  <span class="mi">43</span><span class="p">,</span><span class="o">-</span><span class="mi">259</span><span class="p">,</span><span class="o">-</span><span class="mi">261</span><span class="p">,</span>  <span class="mi">19</span><span class="p">,</span>   <span class="mi">2</span><span class="p">,</span><span class="o">-</span><span class="mi">263</span><span class="p">,</span><span class="o">-</span><span class="mi">269</span><span class="p">,</span>   <span class="mi">4</span><span class="p">,</span>
    <span class="o">-</span><span class="mi">265</span><span class="p">,</span>  <span class="mi">15</span><span class="p">,</span><span class="o">-</span><span class="mi">267</span><span class="p">,</span>  <span class="mi">21</span><span class="p">,</span>  <span class="mi">16</span><span class="p">,</span><span class="o">-</span><span class="mi">271</span><span class="p">,</span><span class="o">-</span><span class="mi">273</span><span class="p">,</span>  <span class="mi">14</span><span class="p">,</span>   <span class="mi">9</span><span class="p">,</span>  <span class="mi">12</span><span class="p">,</span><span class="o">-</span><span class="mi">275</span><span class="p">,</span>  <span class="mi">13</span><span class="p">,</span>
       <span class="mi">7</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>(Since 277 is prime it will never wrap to a nice rectangle no matter what
width I plug in. Ugh.)</p>

<p>With column-wise compression it’s not possible to iterate a word at a
time. The entire list must be decompressed at once. The interface now
looks like so, where the caller supplies a <code class="language-plaintext highlighter-rouge">12972*5</code>-byte buffer to be
filled:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">decompress</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Exercise for the reader: Modify this to decompress into the 24-bit compact
form, so the caller only needs a <code class="language-plaintext highlighter-rouge">12972*3</code>-byte buffer.</p>

<p>Here’s my decoder, much like before:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">decompress</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">164958</span><span class="p">;)</span> <span class="p">{</span>
        <span class="c1">// Decode letter</span>
        <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">i</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">+</span> <span class="mi">96</span><span class="p">;</span>

        <span class="c1">// Decode run-length</span>
        <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="o">&gt;&gt;</span><span class="mi">5</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">i</span><span class="o">&amp;</span><span class="mi">31</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">b</span> <span class="o">-</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">state</span><span class="p">];</span>

        <span class="c1">// Fill columns</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">,</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">buf</span><span class="p">[</span><span class="n">y</span><span class="o">*</span><span class="mi">5</span><span class="o">+</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">y</span> <span class="o">==</span> <span class="mi">12972</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
            <span class="n">x</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And my new test exactly reproduces the original list:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">12972</span><span class="o">*</span><span class="mi">5L</span><span class="p">];</span>
    <span class="n">decompress</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>

    <span class="kt">char</span> <span class="n">word</span><span class="p">[]</span> <span class="o">=</span> <span class="s">".....</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">12972</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">buf</span><span class="o">+</span><span class="n">i</span><span class="o">*</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">);</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Totalling it up:</p>

<ul>
  <li>Compressed data is 20,620 bytes</li>
  <li>State table is 554 bytes</li>
  <li>Decompressor is about 200 bytes</li>
</ul>

<p>That’s a total of 21,374 bytes. Surprisingly, this beats most general
purpose compressors!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PROGRAM     VERSION   SIZE
bzip2 -9    1.0.8     33,752
gzip -9     1.10      30,338
zstd -19    1.4.8     27,098
brotli -9   1.0.9     26,031
xz -9e      5.2.5     16,656
lzip -9     1.22      16,608
</code></pre></div></div>

<p>Only <code class="language-plaintext highlighter-rouge">xz</code> and <code class="language-plaintext highlighter-rouge">lzip</code> come out ahead on the raw compressed data, but lose
if accounting for an embedded decompressor (on the order of 10kB). Clearly
there’s an advantage to customizing compression to a particular dataset.</p>

<p><em>Update</em>: <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCAKF7Hnc4nVKS%3D2adUjyiRb5yBZUdw5z0K_Fb9kFbaW5S6i7POw%40mail.gmail.com%3E">Johannes Rudolph has pointed out</a> a compression scheme for
a Game Boy Wordle clone last month that gets it <a href="http://alexanderpruss.blogspot.com/2022/02/game-boy-wordle-how-to-compress-12972.html">down to 17,871 bytes,
<em>and</em> supports iteration</a>. I improved on this scheme to <a href="https://github.com/skeeto/scratch/blob/master/misc/wordle.c">further
reduce it to 16,659 bytes</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>The wild west of Windows command line parsing</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/02/18/"/>
    <id>urn:uuid:04c886e0-3434-4292-b7de-e8213461838c</id>
    <updated>2022-02-18T03:52:12Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>I’ve been experimenting again lately with <a href="/blog/2016/01/31/">writing software without a
runtime</a> aside from the operating system itself, both on Linux and
Windows. Another way to look at it: I write and embed a bespoke, minimal
runtime within the application. One of the runtime’s core jobs is
retrieving command line arguments from the operating system. On Windows
this is a deeper rabbit hole than I expected, and far more complex than I
realized. There is no standard, and every runtime does it a little
differently. Five different applications may see five different sets of
arguments — even different argument counts — from the same input, and this
is <em>before</em> any sort of option parsing. It’s truly a modern day Tower of
Babel: “Confound their command line parsing, that they may not understand
one another’s arguments.”</p>

<p>Unix-like systems pass the <code class="language-plaintext highlighter-rouge">argv</code> array directly from parent to child. On
Linux it’s literally copied onto the child’s stack just above the stack
pointer on entry. The runtime just bumps the stack pointer address a few
bytes and calls it <code class="language-plaintext highlighter-rouge">argv</code>. Here’s a minimalist x86-64 Linux runtime in
just 6 instructions (22 bytes):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">_start:</span> <span class="nf">mov</span>   <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="p">]</span>     <span class="c1">; argc</span>
        <span class="nf">lea</span>   <span class="nb">rsi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>   <span class="c1">; argv</span>
        <span class="nf">call</span>  <span class="nv">main</span>
        <span class="nf">mov</span>   <span class="nb">edi</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="mi">60</span>        <span class="c1">; SYS_exit</span>
        <span class="nf">syscall</span>
</code></pre></div></div>

<p>It’s 5 instructions (20 bytes) on ARM64:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">_start:</span> <span class="nf">ldr</span>  <span class="nv">w0</span><span class="p">,</span> <span class="p">[</span><span class="nb">sp</span><span class="p">]</span>        <span class="c1">; argc</span>
        <span class="nf">add</span>  <span class="nv">x1</span><span class="p">,</span> <span class="nb">sp</span><span class="p">,</span> <span class="mi">8</span>       <span class="c1">; argv</span>
        <span class="nf">bl</span>   <span class="nv">main</span>
        <span class="nf">mov</span>  <span class="nv">w8</span><span class="p">,</span> <span class="mi">93</span>          <span class="c1">; SYS_exit</span>
        <span class="nf">svc</span>  <span class="mi">0</span>
</code></pre></div></div>

<p>On Windows, <code class="language-plaintext highlighter-rouge">argv</code> is passed in serialized form as a string. That’s how
MS-DOS did it (via the <a href="https://en.wikipedia.org/wiki/Program_Segment_Prefix">Program Segment Prefix</a>), because <a href="http://www.gaby.de/cpm/manuals/archive/cpm22htm/ch5.htm">that’s how
CP/M did it</a>. It made more sense when processes were mostly launched
directly by humans: The string was literally typed by a human operator,
and <em>somebody</em> has to parse it after all. Today, processes are nearly
always launched by other programs, but despite this, must still serialize
the argument array into a string as though a human had typed it out.</p>

<p>Windows itself provides an operating system routine for parsing command
line strings: <a href="https://docs.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-commandlinetoargvw">CommandLineToArgvW</a>. Fetch the command line string
with <a href="https://docs.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getcommandlinew">GetCommandLineW</a>, pass it to this function, and you have your
<code class="language-plaintext highlighter-rouge">argc</code> and <code class="language-plaintext highlighter-rouge">argv</code>. Plus maybe LocalFree to clean up. It’s only available
in “wide” form, so <a href="/blog/2021/12/30/">if you want to work in UTF-8</a> you’ll also need
<code class="language-plaintext highlighter-rouge">WideCharToMultiByte</code>. It’s around 20 lines of C rather than 6 lines of
assembly, but it’s not too bad.</p>

<h3 id="my-getcommandlinew">My GetCommandLineW</h3>

<p>GetCommandLineW returns a pointer into static storage, which is why it
doesn’t need to be freed. More specifically, it comes from the <a href="https://docs.microsoft.com/en-us/windows/win32/api/winternl/ns-winternl-peb">Process
Environment Block</a>. This got me thinking: Could I locate this address
myself without the API call? First I needed to find the PEB. After some
research I found a PEB pointer in the <a href="https://en.wikipedia.org/wiki/Win32_Thread_Information_Block">Thread Information Block</a>,
itself found via the <code class="language-plaintext highlighter-rouge">gs</code> register (x64, <code class="language-plaintext highlighter-rouge">fs</code> on x86), an <a href="https://en.wikipedia.org/wiki/X86_memory_segmentation">old 386 segment
register</a>. Buried in the PEB is a <a href="https://docs.microsoft.com/en-us/windows/win32/api/subauth/ns-subauth-unicode_string"><code class="language-plaintext highlighter-rouge">UNICODE_STRING</code></a>, with the
command line string address. I worked out all the offsets for both x86 and
x64, and the whole thing is just three instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">wchar_t</span> <span class="o">*</span><span class="nf">cmdline_fetch</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="cp">#if __amd64
</span>    <span class="kr">__asm</span> <span class="p">(</span><span class="s">"mov %%gs:(0x60), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x20(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x78(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">cmd</span><span class="p">));</span>
    <span class="cp">#elif __i386
</span>    <span class="kr">__asm</span> <span class="p">(</span><span class="s">"mov %%fs:(0x30), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x10(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="s">"mov 0x44(%0), %0</span><span class="se">\n</span><span class="s">"</span>
           <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">cmd</span><span class="p">));</span>
    <span class="cp">#endif
</span>    <span class="k">return</span> <span class="n">cmd</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>From Windows XP through Windows 11, this returns exactly the same address
as GetCommandLineW. There’s little reason to do it this way other than to
annoy Raymond Chen, but it’s still neat and maybe has some super niche
use. Technically some of these offsets are undocumented and/or subject to
change, except Microsoft’s own static link CRT also hardcodes all these
offsets. It’s easy to find: disassemble any statically linked program,
look for the <code class="language-plaintext highlighter-rouge">gs</code> register, and you’ll find it using these offsets, too.</p>

<p>If you look carefully at the <code class="language-plaintext highlighter-rouge">UNICODE_STRING</code> you’ll see the length is
given by a <code class="language-plaintext highlighter-rouge">USHORT</code> in units of bytes, despite being a 16-bit <code class="language-plaintext highlighter-rouge">wchar_t</code>
string. This is <a href="https://devblogs.microsoft.com/oldnewthing/20031210-00/?p=41553">the source</a> of Windows’ maximum command line length
of <a href="https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessw">32,767 characters</a> (including terminator).</p>

<p>GetCommandLineW is from <code class="language-plaintext highlighter-rouge">kernel32.dll</code>, but CommandLineToArgvW is a bit
more off the beaten path in <code class="language-plaintext highlighter-rouge">shell32.dll</code>. If you wanted to avoid linking
to <code class="language-plaintext highlighter-rouge">shell32.dll</code> for <a href="https://randomascii.wordpress.com/2018/12/03/a-not-called-function-can-cause-a-5x-slowdown/">important reasons</a>, you’d need to do the
command line parsing yourself. Many runtimes, including Microsoft’s own
CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s
messier than I expected, and when I started digging I didn’t anticipate
that it would take a few days of research.</p>

<p>The CommandLineToArgvW documentation has a rough explanation: split
arguments on whitespace (not defined), quoting is involved, and there’s
something about counting backslashes, but only if they stop on a quote.
It’s not quite enough to implement your own, and if you test against it,
it’s quickly apparent that this documentation is at best incomplete. It
links to a deprecated page about <a href="https://docs.microsoft.com/en-us/previous-versions/17w5ykft(v=vs.85)">parsing C++ command line arguments</a> with a few more details.
Unfortunately the algorithm described on this page is not the algorithm
used by CommandLineToArgvW, nor is it used by any runtime I could find. It
even varies between Microsoft’s own CRTs. There is no canonical command
line parsing result, not even a <em>de facto</em> standard.</p>
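<p>To make the documented rules concrete, here is a minimal sketch that
implements only what the documentation describes, assuming “whitespace”
means space or tab. It is not the actual CommandLineToArgvW algorithm, it
ignores the special treatment of <code class="language-plaintext highlighter-rouge">argv[0]</code>, and the name
<code class="language-plaintext highlighter-rouge">split_documented</code> is hypothetical.</p>

```c
#include <stddef.h>

// Split s into arguments using only the documented rules. Arguments are
// written NUL-terminated into buf, which needs strlen(s)+1 bytes, and
// up to max pointers are stored in argv. Returns the argument count.
static int split_documented(const char *s, char *buf, char **argv, int max)
{
    int argc = 0;
    while (*s) {
        while (*s == ' ' || *s == '\t') {
            s++;  // skip separators between arguments
        }
        if (!*s || argc == max) {
            break;
        }
        argv[argc++] = buf;
        int quoted = 0;
        while (*s && (quoted || (*s != ' ' && *s != '\t'))) {
            size_t n = 0;
            while (s[n] == '\\') {
                n++;  // count the run of backslashes
            }
            if (s[n] == '"') {
                // Run stops on a quote: emit half of the backslashes
                for (size_t i = 0; i < n/2; i++) {
                    *buf++ = '\\';
                }
                if (n % 2) {
                    *buf++ = '"';      // odd run: literal quote
                } else {
                    quoted = !quoted;  // even run: begin/end quoting
                }
                s += n + 1;
            } else if (n) {
                // Run does not stop on a quote: all literal
                for (size_t i = 0; i < n; i++) {
                    *buf++ = '\\';
                }
                s += n;
            } else {
                *buf++ = *s++;
            }
        }
        *buf++ = 0;
    }
    return argc;
}
```

<p>For instance, three backslashes followed by a quote come out as one
backslash and a literal quote, while backslashes elsewhere pass through
untouched.</p>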

<p>I eventually came across David Deley’s <a href="https://daviddeley.com/autohotkey/parameters/parameters.htm">How Command Line Parameters Are
Parsed</a>, which is the closest there is to an authoritative document on
the matter (<a href="https://web.archive.org/web/20210615061518/http://www.windowsinspired.com/how-a-windows-programs-splits-its-command-line-into-individual-arguments/">also</a>). Unfortunately it focuses on runtimes rather
than CommandLineToArgvW, and so some of those details aren’t captured. In
particular, the first argument (i.e. <code class="language-plaintext highlighter-rouge">argv[0]</code>) follows entirely different
rules, which really confused me for a while. The <a href="https://source.winehq.org/git/wine.git/blob/5a66eab72:/dlls/shcore/main.c#l264">Wine documentation</a>
was helpful particularly for CommandLineToArgvW. As far as I can tell,
they’ve re-implemented it perfectly, matching it bug-for-bug as they do.</p>

<h3 id="my-commandlinetoargvw">My CommandLineToArgvW</h3>

<p>Before finding any of this, I started building my own implementation,
which I now believe matches CommandLineToArgvW. These other documents
helped me figure out what I was missing. In my usual fashion, it’s <a href="/blog/2020/12/31/">a
little state machine</a>: <strong><a href="https://github.com/skeeto/scratch/blob/master/parsers/cmdline.c#L27"><code class="language-plaintext highlighter-rouge">cmdline.c</code></a></strong>. The interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">cmdline_to_argv8</span><span class="p">(</span><span class="k">const</span> <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">cmdline</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike the others, mine encodes straight into <a href="https://simonsapin.github.io/wtf-8/">WTF-8</a>, a superset of
UTF-8 that can round-trip ill-formed UTF-16. The WTF-8 part is negative
lines of code: invisible since it involves <em>not</em> reacting to ill-formed
input. If you use the new-ish UTF-8 manifest Win32 feature then your
program cannot handle command line strings with ill-formed UTF-16, a
problem solved by WTF-8.</p>
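<p>The “not reacting” part is easiest to see in an encoder. The following is
a sketch of a UTF-16 to WTF-8 conversion, not the code from <code class="language-plaintext highlighter-rouge">cmdline.c</code>;
the function name and signature are invented for illustration. The only
difference from a strict UTF-8 encoder is that an unpaired surrogate is
neither rejected nor replaced, and simply takes the ordinary 3-byte path.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Encode n UTF-16 code units as WTF-8. dst needs room for 3*n bytes.
// Returns the number of bytes written.
static size_t wtf8_encode(unsigned char *dst, const uint16_t *src, size_t n)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        if (c >= 0xd800 && c <= 0xdbff && i + 1 < n &&
            src[i+1] >= 0xdc00 && src[i+1] <= 0xdfff) {
            // Well-formed surrogate pair: combine into one code point
            uint32_t lo = src[++i];
            c = 0x10000 + ((c - 0xd800) << 10) + (lo - 0xdc00);
        }
        // An unpaired surrogate falls through to the 3-byte case below
        if (c < 0x80) {
            dst[len++] = (unsigned char)c;
        } else if (c < 0x800) {
            dst[len++] = (unsigned char)(0xc0 | (c >> 6));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        } else if (c < 0x10000) {
            dst[len++] = (unsigned char)(0xe0 | (c >> 12));
            dst[len++] = (unsigned char)(0x80 | (c >> 6 & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        } else {
            dst[len++] = (unsigned char)(0xf0 | (c >> 18));
            dst[len++] = (unsigned char)(0x80 | (c >> 12 & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c >>  6 & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        }
    }
    return len;
}
```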

<p>As documented, that <code class="language-plaintext highlighter-rouge">argv</code> must be a particular size — a pointer-aligned,
224kB (x64) or 160kB (x86) buffer — which covers the absolute worst case.
That’s not too bad when the command line is limited to 32,766 UTF-16
characters. The worst case argument is a single long sequence of 3-byte
UTF-8. 4-byte UTF-8 requires 2 UTF-16 code units, so there would only be
half as many. The worst case <code class="language-plaintext highlighter-rouge">argc</code> is 16,383 (plus one more <code class="language-plaintext highlighter-rouge">argv</code> slot
for the null pointer terminator), which is one argument for each pair of
command line characters. The second half (roughly) of the <code class="language-plaintext highlighter-rouge">argv</code> is
actually used as a <code class="language-plaintext highlighter-rouge">char</code> buffer for the arguments, so it’s all a single,
fixed allocation. There is no error case since it cannot fail.</p>
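<p>As a rough check of those figures, the sketch below adds the two worst
cases together: a full pointer array plus a full 3-byte expansion. This is
an assumption about how the bound could be derived, not necessarily how
<code class="language-plaintext highlighter-rouge">CMDLINE_ARGV_MAX</code> is computed, and it ignores per-argument terminators
since the two worst cases cannot occur at once. The function name is
hypothetical.</p>

```c
#include <stddef.h>

// Conservative buffer bound in bytes for a given pointer size.
static size_t argv_bound(size_t ptrsize)
{
    size_t units    = 32766;        // UTF-16 code units, sans terminator
    size_t pointers = units/2 + 1;  // worst-case argc plus the null slot
    size_t bytes    = units * 3;    // every unit expands to 3 bytes
    return pointers*ptrsize + bytes;
}
```

<p>With 8-byte pointers this comes to 229,370 bytes, just under 224kB, and
with 4-byte pointers 163,834 bytes, just under 160kB.</p>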

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[</span><span class="n">CMDLINE_ARGV_MAX</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">argc</span> <span class="o">=</span> <span class="n">cmdline_to_argv8</span><span class="p">(</span><span class="n">cmdline_fetch</span><span class="p">(),</span> <span class="n">argv</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">main</span><span class="p">(</span><span class="n">argc</span><span class="p">,</span> <span class="n">argv</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Also: Note the <code class="language-plaintext highlighter-rouge">FUZZ</code> option in my source. It has been pretty thoroughly
<a href="/blog/2019/01/25/">fuzz tested</a>. It didn’t find anything, but it does make me more
confident in the result.</p>

<p>I also peeked at some language runtimes to see how others handle it. Just
as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft
CRT. Also expected, CPython implicitly does whatever the underlying C
runtime does, so its exact command line behavior depends on which version
of Visual Studio was used to build the Python binary. OpenJDK
<a href="https://github.com/openjdk/jdk/blob/jdk-17+35/src/jdk.jpackage/windows/native/common/WinSysInfo.cpp#L141">pragmatically calls CommandLineToArgvW</a>. Go (gc) <a href="https://go.googlesource.com/go/+/refs/tags/go1.17.7/src/os/exec_windows.go#115">does its own
parsing</a>, with behavior mixed between CommandLineToArgvW and some of
Microsoft’s CRTs, but not quite matching either.</p>

<h3 id="building-a-command-line-string">Building a command line string</h3>

<p>I’ve always been boggled as to why there’s no complementary inverse to
CommandLineToArgvW. When spawning processes with arbitrary arguments,
everyone is left to implement the inverse of this under-specified and
non-trivial command line format to serialize an <code class="language-plaintext highlighter-rouge">argv</code>. Hopefully the
receiver parses it compatibly! There’s no falling back on a system routine
to help out. This has led to a lot of repeated effort: it’s not limited
to high level runtimes, but almost any extensible application (itself a
kind of runtime). Fortunately serializing is not quite as complex as
parsing since many of the edge cases simply don’t come up if done in a
straightforward way.</p>

<p>Naturally, I also wrote my own implementation (same source):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">cmdline_from_argv8</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*</span><span class="n">cmdline</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">);</span>
</code></pre></div></div>

<p>Like before, it accepts a WTF-8 <code class="language-plaintext highlighter-rouge">argv</code>, meaning it can correctly pass
through ill-formed UTF-16 arguments. It returns the actual command line
length. Since this one <em>can</em> fail when <code class="language-plaintext highlighter-rouge">argv</code> is too large, it returns
zero for an error.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="s">"python.exe"</span><span class="p">,</span> <span class="s">"-c"</span><span class="p">,</span> <span class="n">code</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>
<span class="kt">wchar_t</span> <span class="n">cmd</span><span class="p">[</span><span class="n">CMDLINE_CMD_MAX</span><span class="p">];</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cmdline_from_argv8</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">CMDLINE_CMD_MAX</span><span class="p">,</span> <span class="n">argv</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="s">"argv too large"</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">CreateProcessW</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="cm">/*...*/</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="s">"CreateProcessW failed"</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
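<p>For a sense of what serializing involves, below is a sketch of the
commonly described quoting approach for a single argument: quote only when
necessary, double any backslash run that precedes a quote, and escape the
quote itself. It is not the implementation in <code class="language-plaintext highlighter-rouge">cmdline.c</code>, it works on plain
<code class="language-plaintext highlighter-rouge">char</code> strings rather than WTF-8, and <code class="language-plaintext highlighter-rouge">quote_arg</code> is a hypothetical name.</p>

```c
#include <stddef.h>
#include <string.h>

// Write arg to dst, quoted so that the usual backslash and quote rules
// recover it. dst needs room for 2*strlen(arg) + 3 bytes. The result
// is NUL-terminated; returns a pointer to that terminator.
static char *quote_arg(char *dst, const char *arg)
{
    if (*arg && !strpbrk(arg, " \t\n\v\"")) {
        size_t len = strlen(arg);
        memcpy(dst, arg, len + 1);  // nothing special: copy verbatim
        return dst + len;
    }
    *dst++ = '"';
    for (;;) {
        size_t n = 0;
        while (*arg == '\\') {
            arg++;
            n++;
        }
        if (!*arg) {
            // Trailing run precedes the closing quote: double it
            for (size_t i = 0; i < 2*n; i++) {
                *dst++ = '\\';
            }
            break;
        } else if (*arg == '"') {
            // Double the run, plus one backslash to escape the quote
            for (size_t i = 0; i < 2*n + 1; i++) {
                *dst++ = '\\';
            }
        } else {
            // Backslashes not followed by a quote stay as they are
            for (size_t i = 0; i < n; i++) {
                *dst++ = '\\';
            }
        }
        *dst++ = *arg++;
    }
    *dst++ = '"';
    *dst = 0;
    return dst;
}
```

<p>A full command line would join the quoted arguments with single spaces
and check the total against the length limit.</p>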

<p>How do others handle this?</p>

<ul>
  <li>
    <p>The <a href="https://git.savannah.gnu.org/cgit/emacs.git/tree/src/w32proc.c?h=emacs-27.2#n2009">aged Emacs implementation</a> is written in C rather than Lisp,
steeped in history with vestigial wrong turns. Emacs still only calls
the “narrow” CreateProcessA despite having every affordance to do
otherwise, and <a href="https://github.com/skeeto/emacsql/issues/77#issuecomment-887125675">uses the wrong encoding at that</a>. A personal
source of headaches.</p>
  </li>
  <li>
    <p>CPython uses Python rather than C via <a href="https://github.com/python/cpython/blob/3.10/Lib/subprocess.py#L529"><code class="language-plaintext highlighter-rouge">subprocess.list2cmdline</code></a>.
While <a href="https://bugs.python.org/issue10838">undocumented</a>, it’s accessible on any platform and easy to
test against various inputs. Try it out!</p>
  </li>
  <li>
    <p>Go (gc) is <a href="https://go.googlesource.com/go/+/refs/tags/go1.17.7/src/syscall/exec_windows.go#101">just as delightfully boring as I’d expect</a>.</p>
  </li>
  <li>
    <p>OpenJDK <a href="https://github.com/openjdk/jdk/blob/jdk-17%2B35/src/java.base/windows/classes/java/lang/ProcessImpl.java#L229">optimistically optimizes</a> for command line strings under
80 bytes, and like Emacs, displays the weathering of long use.</p>
  </li>
</ul>

<p>I don’t plan to write a language implementation anytime soon, where this
might be needed, but it’s nice to know I’ve already solved this problem
for myself!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A new protocol and tool for PNG file attachments</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/12/31/"/>
    <id>urn:uuid:30ae498b-881d-428e-97e5-7ea3cc332973</id>
    <updated>2021-12-31T22:17:26Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>When my articles include diagrams to illustrate a concept, such as a
<a href="/blog/2020/12/31/">state machine</a>, I will check the <a href="https://graphviz.org/">Graphviz</a>, <a href="http://www.gnuplot.info/">gnuplot</a>, or SVG
source into source control alongside the image in case I need to make
changes in the future. Sometimes I even make the image itself a link to
its source file. I’ve thought it would be convenient if the raster image
somehow contained its own source as metadata so that they don’t get
separated. I looked around and wasn’t satisfied with the solutions I
found, so I wrote one: <strong><a href="https://github.com/skeeto/scratch/tree/master/pngattach">pngattach</a></strong>.</p>

<p>My approach introduces a new private chunk type <code class="language-plaintext highlighter-rouge">atCh</code> (“attachment”)
which contains a file name, a flag to indicate if the attachment is
compressed, and an optionally DEFLATE-compressed blob of file contents. I
tried to follow the spirit of PNG chunk formatting, but without the
constraints I hoped to avoid. A single PNG can contain multiple
attachments, e.g. source file, Makefile, README, license file, etc. The
protocol places constraints on the file names to keep it simple and to
avoid shenanigans: no control bytes (anything below ASCII space), no
directories, and cannot start with a period (no special hidden files). If
that’s too constraining, you could attach a ZIP or TAR.</p>

<h3 id="png-chunk-format">PNG chunk format</h3>

<p>PNG files begin with a fixed 8-byte header followed by a series of
chunks. Each chunk has an 8-byte header and 4-byte footer. The chunk
header is a 32-bit big endian chunk length (not counting header or footer)
and a 4-byte tag identifying its type. The length allows implementations
to skip chunks they don’t recognize.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LLLL TTTT ...chunk... CCCC
</code></pre></div></div>

<p>The footer is a big endian CRC-32 checksum of the 4-byte type tag and the
chunk body itself.</p>
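<p>As an illustration of that layout, here is a small sketch that validates
a single chunk in memory. The function names are hypothetical, and the
CRC-32 is the slow bit-at-a-time form rather than a table-driven one.</p>

```c
#include <stddef.h>
#include <stdint.h>

// CRC-32 as used by PNG, computed one bit at a time (slow but short).
static uint32_t crc32_png(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xffffffff;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
        }
    }
    return crc ^ 0xffffffff;
}

// Validate one chunk at the start of buf. Returns the total chunk size
// (header, body, and footer), or 0 if it is truncated or corrupt.
static size_t chunk_check(const unsigned char *buf, size_t len)
{
    if (len < 12) {
        return 0;  // too short for even an empty chunk
    }
    uint32_t n = (uint32_t)buf[0] << 24 | (uint32_t)buf[1] << 16 |
                 (uint32_t)buf[2] <<  8 | (uint32_t)buf[3];
    if (n > len - 12) {
        return 0;  // body runs past the end of the buffer
    }
    const unsigned char *f = buf + 8 + n;  // footer follows the body
    uint32_t want = (uint32_t)f[0] << 24 | (uint32_t)f[1] << 16 |
                    (uint32_t)f[2] <<  8 | (uint32_t)f[3];
    // The checksum covers the 4-byte tag plus the body, not the length
    uint32_t got = crc32_png(buf + 4, 4 + (size_t)n);
    return want == got ? 12 + (size_t)n : 0;
}
```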

<p>Chunk tags are interpreted as 4 ASCII characters, where the capitalization
of each letter encodes 4 additional boolean flags. The flags in my tag,
<code class="language-plaintext highlighter-rouge">atCh</code>, indicate it’s a non-critical private chunk which doesn’t depend on
the image data.</p>
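<p>Decoding those flags amounts to testing bit 5 of each tag byte, which is
what distinguishes lowercase from uppercase in ASCII. A sketch, with
made-up struct and function names:</p>

```c
// Property flags carried by the letter case of a 4-byte chunk tag.
struct tagflags {
    int ancillary;     // 1st letter lowercase: not critical
    int is_private;    // 2nd letter lowercase: private, not public
    int reserved;      // 3rd letter: uppercase in conforming files
    int safe_to_copy;  // 4th letter lowercase: independent of image data
};

static struct tagflags tag_flags(const char *tag)
{
    struct tagflags f;
    f.ancillary    = !!(tag[0] & 0x20);
    f.is_private   = !!(tag[1] & 0x20);
    f.reserved     = !!(tag[2] & 0x20);
    f.safe_to_copy = !!(tag[3] & 0x20);
    return f;
}
```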

<p>PNG always ends with a zero-length <code class="language-plaintext highlighter-rouge">IEND</code> chunk, which works out to a kind
of 12-byte constant footer.</p>

<h3 id="existing-chunk-types">Existing chunk types</h3>

<p>The PNG standard currently defines three kinds of chunks for storing text
metadata: <code class="language-plaintext highlighter-rouge">tEXt</code>, <code class="language-plaintext highlighter-rouge">zTXt</code>, <code class="language-plaintext highlighter-rouge">iTXt</code>. The first is limited to Latin-1 with LF
newlines, and so cannot store UTF-8 source text. The second is its
DEFLATE-compressed counterpart with the same Latin-1 limitation. The third
was introduced in the PNG 1.2 specification (November 1999), and allows
(only) UTF-8 content with LF newlines. All three have a 1 to 79-byte
Latin-1 “key” field, and <code class="language-plaintext highlighter-rouge">iTXt</code> has some additional fields describing the
language of the text.</p>

<p>The key field is null-terminated, making it 80 bytes maximum when treated
as a null-terminated string. I believe this constraint exists to aid
implementations, which can rely on this hard upper limit for the key
lengths they’re expected to handle. Otherwise a key could have been up to
4GiB in length.</p>

<p>I had considered using part of the key as a file name, prefixed with a
custom namespace (ex. <code class="language-plaintext highlighter-rouge">attachment:FILENAME</code>) to distinguish it from other
text chunks. However, I didn’t like the constraints this placed on the
file name, plus I wanted to support arbitrary file content, not limited to
a particular subformat.</p>

<p>As prior art, there’s a draw.io/diagrams.net format which embeds a source
string without file name. The source string is encoded in base64 (i.e.
unconstrained by PNG), wrapped in XML, then incorrectly encoded as an
<code class="language-plaintext highlighter-rouge">iTXt</code> chunk. The XML alone was enough to keep me away from using this
format.</p>

<h3 id="pngattach-details">pngattach details</h3>

<p>In my attachment protocol, the file name is an arbitrary length,
null-terminated byte string (preferably UTF-8), much like a key field,
with the previously-mentioned anti-shenanigans restrictions. The file name
is followed by a byte, 0 or 1, indicating if the content is compressed
using PNG’s officially-supported compression format. The rest is the
arbitrary content bytes, which presumably the recipient will know how to
use.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LLLL atCh example.txt 0 F ...contents... CCCC
</code></pre></div></div>

<p>I expect any experienced programmer could write a basic attachment
extractor in their language of choice inside of 30 or so minutes. Hooking
up a DEFLATE library for decompression would be the most difficult part.</p>
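<p>The parsing half of such an extractor, minus decompression, might look
something like the sketch below. It is not taken from <code class="language-plaintext highlighter-rouge">pngattach</code>: the
struct and function names are hypothetical, and treating both slash
characters as “directories” and rejecting empty names are assumptions.</p>

```c
#include <stddef.h>
#include <string.h>

struct attachment {
    const char *name;           // NUL-terminated, points into the body
    int compressed;             // flag byte: 0 or 1
    const unsigned char *data;  // content bytes, possibly compressed
    size_t len;                 // number of content bytes
};

// Parse the body of an atCh chunk. Returns 1 on success, 0 if malformed.
static int atch_parse(struct attachment *a, const unsigned char *body,
                      size_t len)
{
    const unsigned char *nul = memchr(body, 0, len);
    if (!nul || nul == body) {
        return 0;  // unterminated or empty file name
    }
    size_t namelen = (size_t)(nul - body);
    if (len - namelen < 2) {
        return 0;  // no room for the flag byte
    }
    if (body[0] == '.') {
        return 0;  // no hidden files
    }
    for (size_t i = 0; i < namelen; i++) {
        if (body[i] < ' ' || body[i] == '/' || body[i] == '\\') {
            return 0;  // control byte or directory separator
        }
    }
    if (nul[1] > 1) {
        return 0;  // flag must be 0 or 1
    }
    a->name       = (const char *)body;
    a->compressed = nul[1];
    a->data       = nul + 2;
    a->len        = len - namelen - 2;
    return 1;
}
```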

<p>Since it supports multiple attachments and behaves like an archive format,
my tool supports flags much like <code class="language-plaintext highlighter-rouge">tar</code>: <code class="language-plaintext highlighter-rouge">-c</code> to create attachments
(default and implicit), <code class="language-plaintext highlighter-rouge">-t</code> to list attachments, and <code class="language-plaintext highlighter-rouge">-x</code> to extract
attachments. PNG data is always passed on standard input and standard
output.</p>

<p>For example, to render a Graphviz diagram and attach the source all at
once:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dot -Tpng graph.dot | pngattach graph.dot &gt;graph.png
</code></pre></div></div>

<p>Later on someone might extract it and tweak it, like so (<code class="language-plaintext highlighter-rouge">-v</code> verbose,
lists files as they’re extracted, like <code class="language-plaintext highlighter-rouge">tar</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pngattach -xv &lt;graph.png
graph.dot
$ vi graph.dot
$ dot -Tpng graph.dot &gt;graph.png
</code></pre></div></div>

<p>Like <code class="language-plaintext highlighter-rouge">tar</code>, it can also write attachments to standard output with <code class="language-plaintext highlighter-rouge">-O</code>.
For example, to re-render the image as an SVG:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pngattach -xO &lt;graph.png | dot -Tsvg &gt;graph.svg
</code></pre></div></div>

<p>Strictly processing standard input to output, rather than taking the input
as an argument, is something I’ve been trying lately. I’m pretty happy
with my <a href="/blog/2020/08/01/">command line design</a> for <code class="language-plaintext highlighter-rouge">pngattach</code>. The real test will
happen in the future, when I’ve forgotten the details and have to figure
it out again from my own documentation.</p>

<p>Curiously, lots of common software refuses to handle PNGs containing large
chunks, and so your PNG may not display if you attach a file even as small
as a few MiB. A defense against denial of service?</p>

<h3 id="example-png">Example PNG</h3>

<p>I haven’t gone back and embedded attachments in any older articles, but I
may do so in future articles. If you wanted to try it out for yourself,
either with my tool or writing your own for fun, this PNG contains a
compressed attachment:</p>

<p><img src="/img/atch-test.png" alt="" /></p>

<p>I produced it like so (with the help of <a href="https://www.imagemagick.org/">ImageMagick</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo P3 1 1 1 0 1 0 |
      convert ppm:- -resize 200 png:- |
      pngattach message.txt &gt;atch-test.png
</code></pre></div></div>

<h3 id="error-handling-addendum">Error handling (addendum)</h3>

<p>Another technique I’ve been trying is Go-style error value returns in C
programs, where the errors-as-values are <code class="language-plaintext highlighter-rouge">const char *</code> pointers to static
string buffers. The contents contain an error message to be displayed to
the user, and errors may be wrapped in more context (what file, what
operation, etc.) as the stack unwinds. A null pointer means no error, i.e.
nil. I’ve used this extensively in <code class="language-plaintext highlighter-rouge">pngattach</code>. Examples of the style:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">nelem</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">return</span> <span class="s">"out of memory"</span><span class="p">;</span>  <span class="c1">// overflow</span>
    <span class="p">}</span>

    <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">nelem</span><span class="o">*</span><span class="nf">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="s">"out of memory"</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// ...</span>

    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">fwrite</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
        <span class="k">return</span> <span class="s">"write error"</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>An <code class="language-plaintext highlighter-rouge">errwrap()</code> function builds a new error string in a static buffer. This
simple solution wouldn’t work in a multi-threaded program, but that’s not
the case here. Mine toggles between two static buffers so that it can wrap
recursively.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">errwrap</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pre</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">post</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">char</span> <span class="n">errtmp</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">256</span><span class="p">],</span> <span class="n">i</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="n">i</span> <span class="o">=</span> <span class="o">!</span><span class="n">i</span><span class="p">;</span>  <span class="c1">// toggle between two static buffers</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">errtmp</span><span class="p">[</span><span class="n">n</span><span class="p">],</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">errtmp</span><span class="p">[</span><span class="n">n</span><span class="p">]),</span> <span class="s">"%s: %s"</span><span class="p">,</span> <span class="n">pre</span><span class="p">,</span> <span class="n">post</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">errtmp</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then I can do stuff like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">FILE</span> <span class="o">*</span><span class="n">f</span> <span class="o">=</span> <span class="n">fopen</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">errwrap</span><span class="p">(</span><span class="s">"failed to open file"</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>And that can keep being wrapped on the way up:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">err</span> <span class="o">=</span> <span class="n">png_write</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">errwrap</span><span class="p">(</span><span class="s">"writing PNG"</span><span class="p">,</span> <span class="n">err</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>So that ultimately the user sees something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pngattach: writing PNG: failed to open file: example.png
</code></pre></div></div>

<p>That’s always printed by a single error printout block at the top level,
where all errors are ultimately routed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>

    <span class="n">err</span> <span class="o">=</span> <span class="n">run</span><span class="p">(</span><span class="n">options</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"pngattach: %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">err</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Some sanity for C and C++ development on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/12/30/"/>
    <id>urn:uuid:2e417030-915f-4897-99ff-2a0dafd0ac89</id>
    <updated>2021-12-30T23:25:53Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>A hard reality of C and C++ software development on Windows is that there
has never been a good, native C or C++ standard library implementation for
the platform. A standard library should abstract over the underlying host
facilities in order to ease portable software development. On Windows, C
and C++ are so poorly hooked up to operating system interfaces that most
portable or mostly-portable software — programs which work perfectly
elsewhere — are subtly broken on Windows, particularly outside of the
English-speaking world. The reasons are almost certainly political,
originally motivated by vendor lock-in, rather than technical, which adds insult
to injury. This article is about what’s wrong, how it’s wrong, and some
easy techniques to deal with it in portable software.</p>

<p>There are <a href="/blog/2016/06/13/">multiple C implementations</a>, so how could they all be
bad, even the <a href="/blog/2018/04/13/">early ones</a>? Microsoft’s C runtime has defined how
the standard library should work on the platform, and everyone else
followed along for the sake of compatibility. I’m excluding <a href="https://www.cygwin.com/">Cygwin</a> and
its major fork, <a href="https://www.msys2.org/">MSYS2</a>, even though they don’t inherit any of these flaws. They
change so much that they’re effectively whole new platforms, not truly
“native” to Windows.</p>

<p>In practice, C++ standard libraries are implemented on top of a C standard
library, which is why C++ shares the same problems. CPython dodges these
issues: Though written in C, on Windows it bypasses the broken C standard
library and directly calls the proprietary interfaces. Other language
implementations, such as “gc” Go, simply aren’t built on C at all, and
instead do things correctly in the first place — the behaviors the C
runtimes should have had all along.</p>

<p>If you’re just working on one large project, bypassing the C runtime isn’t
such a big deal, and you’re likely already doing so to access important
platform functionality. You don’t really even need a C runtime. However,
if you write many small programs, <a href="https://github.com/skeeto/scratch">as I do</a>, writing the same
special Windows support for each one ends up being most of the work, and
honestly makes properly supporting Windows not worth the trouble. I end up
just accepting the broken defaults most of the time.</p>

<p>Before diving into the details, if you’re looking for a quick-and-easy
solution for the Mingw-w64 toolchain, <a href="/blog/2020/05/15/">including w64devkit</a>, which
magically makes your C and C++ console programs behave well on Windows,
I’ve put together a “library” named <strong><a href="https://github.com/skeeto/scratch/tree/master/libwinsane">libwinsane</a></strong>. It solves all
problems discussed in this article, except for one. No source changes
required, simply link it into your program.</p>

<h3 id="what-exactly-is-broken">What exactly is broken?</h3>

<p>The Windows API comes in two flavors: narrow with an “A” (“ANSI”) suffix,
and wide (Unicode, UTF-16) with a “W” suffix. The former is the legacy
API, where an active <em>code page</em> maps 256 bytes onto (up to) 256 specific
characters. On typical machines configured for European languages, this
means <a href="https://en.wikipedia.org/wiki/Windows-1252">code page 1252</a>. <a href="http://simonsapin.github.io/wtf-8/">Roughly speaking</a>, Windows
internally uses UTF-16, and calls through the narrow interface use the
active code page to translate the narrow strings to wide strings. The
result is that calls through the narrow API have limited access to the
system.</p>

<p>The UTF-8 encoding was invented in 1992 and standardized by January 1993.
UTF-8 was adopted by the unix world over the following years due to <a href="/blog/2017/10/06/#what-is-utf-8">its
backwards-compatibility</a> with its existing interfaces. Programs
could read and write Unicode data, access Unicode paths, pass Unicode
arguments, and get and set Unicode environment variables without needing
to change anything. Today UTF-8 has become the dominant text encoding
format in the world, in large part due to the world wide web.</p>

<p>In July 1993, Microsoft introduced the wide Windows API with the release
of Windows NT 3.1, placing all their bets on UCS-2 (later UTF-16) rather
than UTF-8. This turned out to be a mistake, since <a href="http://utf8everywhere.org/">UTF-16 is inferior to
UTF-8 in practically every way</a>, though admittedly some problems
weren’t so obvious at the time.</p>

<p>The major problem: <strong>The C and C++ standard libraries only hook up to the
narrow Windows interfaces</strong>. The standard library, and therefore typical
portable software on Windows, cannot handle anything but ASCII. The
effective result is that these programs:</p>

<ul>
  <li>Cannot accept non-ASCII arguments</li>
  <li>Cannot get/set non-ASCII environment variables</li>
  <li>Cannot access non-ASCII paths</li>
  <li>Cannot read and write non-ASCII on a console</li>
</ul>

<p>Doing any of these requires calling proprietary functions, treating
Windows as a special target. It’s part of what makes correctly porting
software to Windows so painful.</p>

<p>The sensible solution would have been for the C runtime to speak UTF-8 and
connect to the wide API. Alternatively, the narrow API could have been
changed over to UTF-8, phasing out the old code page concept. In theory
this is what the UTF-8 “code page” is about, though it doesn’t always
work. There would have been compatibility problems with abruptly making
such a change, but until very recently, <em>this wasn’t even an option</em>. Why
couldn’t there be a switch I could flip to get sane behavior that works
like every other platform?</p>

<h3 id="how-to-mostly-fix-unicode-support">How to mostly fix Unicode support</h3>

<p>In 2019, Microsoft introduced a feature to allow programs to <a href="https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page">request
UTF-8 as their active code page on start</a>, along with supporting
UTF-8 on more narrow API functions. This is like the magic switch I
wanted, except that it involves embedding some ugly XML into your binary
in a particular way. At least it’s now an option.</p>

<p>For Mingw-w64, that means writing a resource file like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;winuser.h&gt;
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "utf8.xml"
</code></pre></div></div>

<p>Compiling it with <code class="language-plaintext highlighter-rouge">windres</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ windres -o manifest.o manifest.rc
</code></pre></div></div>

<p>Then linking that into your program. Amazingly it mostly works! Programs
can access Unicode arguments, Unicode environment variables, and Unicode
paths, including with <code class="language-plaintext highlighter-rouge">fopen</code>, just as it’s worked on other platforms for
decades. Since the active code page is set at load time, it happens before
<code class="language-plaintext highlighter-rouge">argv</code> is constructed (from <code class="language-plaintext highlighter-rouge">GetCommandLineA</code>), which is why that works
out.</p>

<p>Alternatively you could create a “side-by-side assembly” placing that XML
in a file with the same name as your EXE but with <code class="language-plaintext highlighter-rouge">.manifest</code> suffix
(after the <code class="language-plaintext highlighter-rouge">.exe</code> suffix), then placing that next to your EXE. Just be
mindful that there’s a “side-by-side” cache (WinSxS), and so it might not
immediately pick up your changes.</p>

<p>What <em>doesn’t</em> work is console input and output since the console is
external to the process, and so isn’t covered by the process’s active code
page. It must be configured separately using a proprietary call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SetConsoleOutputCP</span><span class="p">(</span><span class="n">CP_UTF8</span><span class="p">);</span>
</code></pre></div></div>

<p>Annoying, but at least it’s not <em>that</em> painful. This only covers output,
though, meaning programs can only print UTF-8. Unfortunately <a href="https://github.com/microsoft/terminal/issues/4551#issuecomment-585487802">UTF-8 input
still doesn’t work</a>, and setting the input code page doesn’t do
anything despite reporting success:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SetConsoleCP</span><span class="p">(</span><span class="n">CP_UTF8</span><span class="p">);</span>  <span class="c1">// doesn't work</span>
</code></pre></div></div>

<p>If you care about reading interactive Unicode input, you’re <a href="/blog/2020/05/04/">stuck
bypassing the C runtime</a> since it’s still broken.</p>

<h3 id="text-stream-translation">Text stream translation</h3>

<p>Another long-standing issue is that C and C++ on Windows have distinct
“text” and “binary” streams, which it inherited from DOS. Mainly this
means automatic newline conversion between CRLF and LF. The C standard
explicitly allows for this, though unix-like platforms have never actually
distinguished between text and binary streams.</p>

<p>The standard also specifies that standard input, output, and error are all
opened as text streams, and there’s no portable method to change the stream
mode to binary — a serious deficiency with the standard. On unix-likes
this doesn’t matter, but on Windows it means programs can’t read or write
binary data on standard streams without calling a non-standard function.
It also means reading and writing standard streams is slow, <a href="/blog/2021/12/04/">frequently a
bottleneck</a> unless I route around it.</p>

<p>Personally, I like <a href="/blog/2020/06/29/">writing binary data to standard output</a>,
<a href="/blog/2020/11/24/">including video</a>, and sometimes <a href="/blog/2017/07/02/">binary filters</a> that also read
binary input. I do it so often that in probably half my C programs I have
this snippet in <code class="language-plaintext highlighter-rouge">main</code> just so they work correctly on Windows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="cp">#ifdef _WIN32
</span>    <span class="kt">int</span> <span class="nf">_setmode</span><span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span>
    <span class="n">_setmode</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mh">0x8000</span><span class="p">);</span>
    <span class="n">_setmode</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mh">0x8000</span><span class="p">);</span>
    <span class="cp">#endif
</span></code></pre></div></div>

<p>That incantation sets standard input and output in the C runtime to binary
mode without the need to include a header, making it compact, simple, and
self-contained.</p>

<p>This built-in newline translation, along with the Windows standard text
editor, Notepad, <a href="https://devblogs.microsoft.com/commandline/extended-eol-in-notepad/">lagging decades behind</a>, meant that many other
programs, including Git, grew their own, annoying, newline conversion
<a href="https://github.com/skeeto/w64devkit/issues/10">misfeatures</a> that cause <a href="https://github.com/skeeto/binitools/commit/2efd690c3983856c9633b0be66d57483491d1e10">other problems</a>.</p>

<h3 id="libwinsane">libwinsane</h3>

<p>I introduced libwinsane at the beginning of the article, which fixes all
this simply by being linked into a program. It includes the magic XML
manifest <code class="language-plaintext highlighter-rouge">.rsrc</code> section, configures the console for UTF-8 output, and
sets standard streams to binary before <code class="language-plaintext highlighter-rouge">main</code> (via a GCC constructor). I
called it a “library”, but it’s actually a single object file. It can’t be
a static library since it must be linked into the program despite not
actually being referenced by the program.</p>

<p>So normally this program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="n">argv</span><span class="p">[</span><span class="n">argc</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
    <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">arg</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%zu %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">arg</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled and run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:\&gt;cc -o example example.c
C:\&gt;example π
1 p
</code></pre></div></div>

<p>As usual, the Unicode argument is silently mangled into one byte. Linked
with libwinsane, it just works like everywhere else:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:\&gt;gcc -o example example.c libwinsane.o
C:\&gt;example π
2 π
</code></pre></div></div>
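
<p>The <code class="language-plaintext highlighter-rouge">2</code> in that output is a byte count, not a character count: π is
U+03C0, which UTF-8 encodes as the two bytes 0xCF 0x80. A minimal sketch of
that arithmetic follows. The name <code class="language-plaintext highlighter-rouge">utf8_encode</code> is a hypothetical helper
written for illustration and is not part of libwinsane.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: encode one Unicode code point as UTF-8 into out[4].
// Returns the number of bytes written, or 0 if cp is not encodable
// (a surrogate or a value beyond U+10FFFF).
size_t utf8_encode(unsigned char out[4], uint32_t cp)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff) {
            return 0;  // surrogates are not valid scalar values
        }
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    } else if (cp < 0x110000) {
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">utf8_encode(out, 0x03c0)</code> returns 2 and stores 0xCF 0x80, the
same two bytes that <code class="language-plaintext highlighter-rouge">strlen</code> counts once the argument arrives as UTF-8.</p>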

<p>If you’re maintaining a substantial program, you probably want to copy and
integrate the necessary parts of libwinsane into your project and build,
rather than always link against this loose object file. This is more for
convenience and for succinctly capturing the concept. You may even want to
<a href="https://github.com/skeeto/hastyhex/blob/f03b6e0f/hastyhex.c#L298-L309">enable ANSI escape processing</a> in your version.</p>

<p><strong>Update December 2024</strong>: Pavel Galkin <a href="https://lists.sr.ht/~skeeto/public-inbox/%3Cdf749edc-0413-4735-9cf2-c77db202cc6e@app.fastmail.com%3E">demonstrates how <code class="language-plaintext highlighter-rouge">libwinsane.o</code>
changes the console state</a>, which affects all processes associated
with the terminal. This is mostly unavoidable, and it’s one reason I’ve
since concluded that UTF-8 manifests are a poor solution. Better to <a href="/blog/2023/01/18/">solve
the problem using a platform layer</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Fast CSV processing with SIMD</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/12/04/"/>
    <id>urn:uuid:ba6e0ccf-1e11-4c5d-bc53-dd11fbc6da6c</id>
    <updated>2021-12-04T01:13:33Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=29439403">on Hacker News</a>.</em></p>

<p>I recently learned of <a href="https://github.com/dbro/csvquote">csvquote</a>, a tool that encodes troublesome
<a href="https://datatracker.ietf.org/doc/html/rfc4180">CSV</a> characters such that unix tools can correctly process them. It
reverses the encoding at the end of the pipeline, recovering the original
input. The original implementation handles CSV quotes using the
straightforward, naive method. However, there’s a better approach that is
not only simpler, but around 3x faster on modern hardware. Even more,
there’s yet another approach using SIMD intrinsics, plus some bit
twiddling tricks, which increases the processing speed by an order of
magnitude. <a href="https://github.com/skeeto/scratch/tree/master/csvquote"><strong>My csvquote implementation</strong></a> includes both
approaches.</p>

<!--more-->

<h3 id="background">Background</h3>

<p>Records in CSV data are separated by line feeds, and fields are separated
by commas. Fields may be quoted.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaa,bbb,ccc
xxx,"yyy",zzz
</code></pre></div></div>

<p>Fields containing a line feed (U+000A), quotation mark (U+0022), or comma
(U+002C), must be quoted, otherwise they would be ambiguous with the CSV
formatting itself. Quoted quotation marks are turned into a pair of
quotes. For example, here are two records with two fields apiece:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1919–1921, 1923, 1926"
"Frankenstein;
or, The Modern Prometheus",Mary Shelley
</code></pre></div></div>

<p>A CSV-unaware tool splitting on commas and line feeds (e.g. <code class="language-plaintext highlighter-rouge">awk</code>) would
process these records improperly. So csvquote translates quoted line feeds
into record separators (U+001E) and commas into unit separators (U+001F).
These control characters rarely appear in normal text data, and can be
trivially processed in UTF-8-encoded text without decoding or encoding.
The above records become:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1919–1921\x1f 1923\x1f 1926"
"Frankenstein;\x1eor\x1f The Modern Prometheus",Mary Shelley
</code></pre></div></div>

<p>I’ve used <code class="language-plaintext highlighter-rouge">\x1e</code> and <code class="language-plaintext highlighter-rouge">\x1f</code> here to illustrate the control characters.</p>

<p>The data is exactly the same length since it’s a straight byte-for-byte
replacement. Quotes are left entirely untouched. The challenge is parsing
the quotes to track whether the two special characters fall inside or
outside pairs of quotes.</p>

<h3 id="state-machine-improvements">State machine improvements</h3>

<p>The original csvquote walks the input a byte at a time and is in one of
three states:</p>

<ol>
  <li>Outside quotes (initial state)</li>
  <li>Inside quotes</li>
  <li>On a possibly “escaped” quote (the first <code class="language-plaintext highlighter-rouge">"</code> in a <code class="language-plaintext highlighter-rouge">""</code>)</li>
</ol>

<p>Since <a href="/blog/2020/12/31/">I love state machines so much</a>, here it is translated into a
switch-based state machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Return the next state given an input character.</span>
<span class="kt">int</span> <span class="nf">next</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">3</span> <span class="o">:</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">3</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><a href="/img/csv/csv3.dot"><img src="/img/csv/csv3.png" alt="" /></a></p>

<p>The real program also has more conditions for potentially making a
replacement. It’s an awful lot of <a href="/blog/2017/10/06/">performance-killing branching</a>.</p>

<p>However, this <a href="https://vimeo.com/644068002">context</a> is about finding “in” and “out” — not validating
the CSV — so the “escape” state is unnecessary. I need only match up pairs
of quotes. An “escaped” quote can be considered terminating a quoted
region and immediately starting a new quoted region. That means there are
just the first two states in a trivial arrangement:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">next</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><a href="/img/csv/csv2.dot"><img src="/img/csv/csv2.png" alt="" /></a></p>

<p>Since the text can be processed as bytes, there are only 256 possible
inputs. With 2 states and 256 inputs, this state machine, <em>with</em>
replacement machinery, can be implemented with a 512-byte table and <em>no
branches</em>. Here’s the table initialization:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">table</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">256</span><span class="p">];</span>

<span class="kt">void</span> <span class="nf">init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">256</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">table</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="sc">'\n'</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x1e</span><span class="p">;</span>
    <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="sc">','</span><span class="p">]</span>  <span class="o">=</span> <span class="mh">0x1f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the first state, characters map onto themselves. In the second state,
characters map onto their replacements. This is the <em>entire</em> encoder and
decoder:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">encode</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span> <span class="o">^=</span> <span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'"'</span><span class="p">);</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Well, strictly speaking, the decoder need not process quotes. By my
benchmark (<code class="language-plaintext highlighter-rouge">csvdump</code> in my implementation) this processes at ~1 GiB/s on
my laptop — 3x faster than the original. However, there’s still
low-hanging fruit to be picked!</p>
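
<p>On the remark that the decoder need not process quotes: assuming the
original input contained no U+001E or U+001F of its own, every such byte in
the encoded stream came from a replacement, so decoding can ignore quote
state entirely. A minimal sketch under that assumption follows. The names
<code class="language-plaintext highlighter-rouge">undo</code>, <code class="language-plaintext highlighter-rouge">decode_init</code>, and <code class="language-plaintext highlighter-rouge">decode</code> are illustrative and are not taken
from the actual implementation.</p>

```c
#include <stddef.h>

// Hypothetical quote-agnostic decoder: a single 256-entry row, no state.
static unsigned char undo[256];

void decode_init(void)
{
    for (int i = 0; i < 256; i++) {
        undo[i] = (unsigned char)i;  // every byte maps onto itself...
    }
    undo[0x1e] = '\n';  // ...except record separator, back to line feed
    undo[0x1f] = ',';   // ...and unit separator, back to comma
}

void decode(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] = undo[buf[i]];
    }
}
```

<p>This is the reason rejecting inputs that already contain the two control
characters matters: without that check this decoder cannot tell an original
U+001F from an encoded comma.</p>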

<h3 id="simd-and-twos-complement">SIMD and two’s complement</h3>

<p>Any decent SIMD implementation is going to make use of masking. Find the
quotes, compute a mask over quoted regions, compute another mask for
replacement matches, combine the masks, then use that mask to blend the
input with the replacements. Roughly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quotes    = find_quoted_regions(input)
linefeeds = input == '\n'
commas    = input == ','
output    = blend(input, '\x1e', quotes &amp; linefeeds)
output    = blend(output, '\x1f', quotes &amp; commas)
</code></pre></div></div>

<p>The hard part is computing the quote mask, somehow handling quoted
regions that straddle SIMD chunks (not pictured), <em>and</em> doing all that without
resorting to slow byte-at-a-time operations. Fortunately there are some
bitwise tricks that can resolve each issue.</p>

<p>Imagine I load 32 bytes into a SIMD register (e.g. AVX2), and I compute a
32-bit mask where each bit corresponds to one byte. If that byte contains
a quote, the corresponding bit is set.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001010
</code></pre></div></div>

<p>That last/lowest 1 corresponds to the beginning of a quoted region. For my
mask, I’d like to set all bits following that bit. I can do this by
subtracting 1.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001001
</code></pre></div></div>

<p>Using the <a href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan">Kernighan technique</a> I can also remove this bit from the
original input by ANDing them together.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001000
</code></pre></div></div>

<p>Now I’m left with a new bottom bit. If I repeat this, I build up layers of
masks, one for each input quote.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>10000000000000011000011000001001
10000000000000011000011000000111
10000000000000011000010111111111
10000000000000011000001111111111
10000000000000010111111111111111
10000000000000001111111111111111
01111111111111111111111111111111
</code></pre></div></div>

<p>Remember how I use XOR in the state machine above to toggle between
states? If I XOR all these together, I toggle the quotes on and off,
building up quoted regions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
01111111111111100111100111110001
</code></pre></div></div>

<p>However, for reasons I’ll explain shortly, it’s critical that the opening
quote is included in this mask. If I XOR the pre-subtracted value with the
mask when I compute the mask, I can toggle the remaining quotes on and off
such that the opening quotes are included. Here’s my function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">find_quoted_regions</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="n">x</span><span class="p">;</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">x</span> <span class="o">&amp;=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which gives me exactly what I want:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
11111111111111101111101111110011
</code></pre></div></div>

<p>It’s important that the opening quote is included because it means a
region that begins on the last byte will have that last bit set. I can use
that last bit to determine if the next chunk begins in a quoted state. If
a chunk begins in a quoted state, I need only NOT the whole result to
reverse the quoted regions.</p>

<p>How can I “sign extend” a 1 into all bits set, or do nothing for zero?
Negate it!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">uint32_t</span> <span class="n">carry</span>  <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="n">prev</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">quotes</span> <span class="o">=</span> <span class="n">find_quoted_regions</span><span class="p">(</span><span class="n">input</span><span class="p">)</span> <span class="o">^</span> <span class="n">carry</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="n">prev</span> <span class="o">=</span> <span class="n">quotes</span><span class="p">;</span>
</code></pre></div></div>

<p>That takes care of computing quoted regions and chaining them between
chunks. The loop will unfortunately cause branch prediction penalties if
the input has lots of quotes, but I couldn’t find a way around this.</p>

<p>However, I’ve made a serious mistake. I’m using <code class="language-plaintext highlighter-rouge">_mm256_movemask_epi8</code> and
it puts the first byte in the lowest bit. Doh! That means it looks like
this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1","htuR ""ebaB"" namreH egroeG"
01010000011000011000000000000001
</code></pre></div></div>
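
<p>For reference, that bit order can be modeled in plain C without any
intrinsics. This is only a scalar sketch of what a byte-wise compare followed
by a movemask produces, and <code class="language-plaintext highlighter-rouge">quote_mask</code> is a name made up for the
illustration.</p>

```c
#include <stdint.h>

// Scalar model of comparing 32 bytes against '"' and collecting the
// results movemask-style: byte i of the chunk controls bit i of the mask.
uint32_t quote_mask(const unsigned char chunk[32])
{
    uint32_t mask = 0;
    for (int i = 0; i < 32; i++) {
        mask |= (uint32_t)(chunk[i] == '"') << i;
    }
    return mask;
}
```

<p>Fed the 32 example bytes, this returns 0x50618001, which is the bit
pattern shown above with the first byte in the lowest bit.</p>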

<p>There’s no efficient way to flip the bits around, so I just need to find a
way to work in the other direction. To flip the bits to the left of a set
bit, negate it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000000000000000000010000000 = +0x00000080
11111111111111111111111110000000 = -0x00000080
</code></pre></div></div>

<p>Unlike before, this keeps the original bit set, so I need to XOR the
original value into the input to flip the quotes. This is as simple as
initializing to the input rather than zero. The new loop:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">find_quoted_regions</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="o">-</span><span class="n">x</span> <span class="o">^</span> <span class="n">x</span><span class="p">;</span>
        <span class="n">x</span> <span class="o">&amp;=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1","htuR ""ebaB"" namreH egroeG"
11001111110111110111111111111111
</code></pre></div></div>

<p>The carry now depends on the high bit rather than the low bit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uint32_t carry = -(prev &gt;&gt; 31);
</code></pre></div></div>
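
<p>Putting the pieces together, here is a scalar sketch of the chunked
encoder with the carry threaded between 32-byte chunks. It is not the SIMD
code: <code class="language-plaintext highlighter-rouge">match_mask</code> and <code class="language-plaintext highlighter-rouge">encode_chunked</code> are hypothetical names, the masks
are built a byte at a time, and the inner replacement loop stands in for the
SIMD blend. Only the mask arithmetic mirrors the text.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Toggle mask with the first byte in the lowest bit (second version above).
static uint32_t find_quoted_regions(uint32_t x)
{
    uint32_t r = x;
    while (x) {
        r ^= -x ^ x;
        x &= x - 1;
    }
    return r;
}

// Bit i is set when chunk[i] == c, for the n <= 32 bytes present.
static uint32_t match_mask(const unsigned char *chunk, size_t n, int c)
{
    uint32_t m = 0;
    for (size_t i = 0; i < n; i++) {
        m |= (uint32_t)(chunk[i] == c) << i;
    }
    return m;
}

// Encode in 32-byte chunks, carrying quote state across chunk boundaries.
void encode_chunked(unsigned char *buf, size_t len)
{
    uint32_t prev = 0;
    for (size_t off = 0; off < len; off += 32) {
        size_t n = len - off < 32 ? len - off : 32;
        unsigned char *chunk = buf + off;

        uint32_t carry  = -(prev >> 31);  // all ones if still inside quotes
        uint32_t quotes = find_quoted_regions(match_mask(chunk, n, '"')) ^ carry;
        uint32_t lf     = quotes & match_mask(chunk, n, '\n');
        uint32_t comma  = quotes & match_mask(chunk, n, ',');

        // Stand-in for the SIMD blend: apply replacements where masks are set.
        for (size_t i = 0; i < n; i++) {
            if (lf    >> i & 1) chunk[i] = 0x1e;
            if (comma >> i & 1) chunk[i] = 0x1f;
        }
        prev = quotes;
    }
}
```

<p>The output should match the table-driven encoder byte for byte, including
when a quoted field straddles a chunk boundary, since bit 31 of each chunk’s
mask records whether the chunk ended inside quotes.</p>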

<h3 id="reversing-movemask">Reversing movemask</h3>

<p>The next problem: for reasons I don’t understand, AVX2 does not include
the inverse of <code class="language-plaintext highlighter-rouge">_mm256_movemask_epi8</code>. Converting the bit-mask back into a
byte-mask requires some clever shuffling. Fortunately <a href="https://web.archive.org/web/20150506071030/https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb">I’m not the first
to have this problem</a>, and so I didn’t have to figure it out from
scratch.</p>

<p>First fill the 32-byte register with repeated copies of the 32-bit mask.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abcdabcdabcdabcdabcdabcdabcdabcd
</code></pre></div></div>

<p>Shuffle the bytes so that the first 8 register bytes have the same copy of
the first bit-mask byte, etc.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaaaaaaabbbbbbbbccccccccdddddddd
</code></pre></div></div>

<p>In byte 0, I care only about bit 0, in byte 1 I care only about bit 1,
… in byte N I care only about bit <code class="language-plaintext highlighter-rouge">N%8</code>. I can pre-compute a mask to
isolate each of these bits and produce a proper byte-wise mask from the
bit-mask. Fortunately all this isn’t too bad: four instructions instead of
the one I had wanted. It looks like a lot of code, but it’s really only a
few instructions.</p>
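
<p>To make the steps concrete, here is a rough pure-Python emulation of the idea, not the AVX2 code itself. The name <code class="language-plaintext highlighter-rouge">inverse_movemask</code> is made up for this sketch, and it assumes bit 0 of the mask corresponds to byte 0 of the register, as described above.</p>

```python
def inverse_movemask(mask32):
    """Expand a 32-bit bit-mask into 32 bytes, each 0x00 or 0xff."""
    # Step 1: broadcast the four mask bytes across the 32-byte "register".
    reg = mask32.to_bytes(4, "little") * 8
    # Step 2: shuffle so bytes 0..7 hold mask byte 0, 8..15 hold byte 1, etc.
    reg = bytes(reg[n // 8] for n in range(32))
    # Step 3: precomputed selector that isolates bit n%8 in byte n.
    select = bytes(1 << (n % 8) for n in range(32))
    # Step 4: AND with the selector and compare, giving a byte-wise mask.
    return bytes(0xff if r & s == s else 0x00 for r, s in zip(reg, select))
```

<p>Each step stands in for roughly one SIMD operation (broadcast, byte shuffle, AND, compare), which is where the count of four comes from.</p>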

<h3 id="results">Results</h3>

<p>In my benchmark, which includes randomly occurring quoted fields, the SIMD
version processes at ~4 GiB/s — 10x faster than the original. I haven’t
profiled, but I expect mispredictions on the bit-mask loop are the main
obstacle preventing the hypothetical 32x speedup.</p>

<p>My version also optionally rejects inputs containing the two special
control characters since the encoding would be irreversible. This is
implemented in SIMD when available, and it slows processing by around 10%.</p>

<h3 id="followup-pclmulqdq">Followup: PCLMULQDQ</h3>

<p>Geoff Langdale and others have <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCABwTFSrDpNkmJs6TpkAfofcZq6e8YWaJUur20xZBz7mDBnvQ2w%40mail.gmail.com%3E">graciously pointed out PCLMULQDQ</a>,
which can <a href="https://wunkolo.github.io/post/2020/05/pclmulqdq-tricks/">compute the quote masks using carryless multiplication</a>
(<a href="https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/">also</a>) entirely in SIMD and without a loop. I haven’t yet worked out
exactly how to apply it, but it should be much faster.</p>
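
<p>As a rough illustration of the arithmetic behind that trick, and not a worked-out SIMD implementation: carry-less multiplying a mask of quote positions by all ones produces a prefix XOR, which marks every position inside a quoted region. This sketch emulates the multiply in plain Python. The names <code class="language-plaintext highlighter-rouge">clmul</code> and <code class="language-plaintext highlighter-rouge">quote_regions</code> are hypothetical, and bit 0 is assumed to be the first byte of the chunk.</p>

```python
def clmul(a, b):
    """Carry-less (XOR) multiplication of two non-negative integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a      # "add" without carries
        a <<= 1
        b >>= 1
    return r


def quote_regions(quotes, bits=32):
    """Prefix XOR of a quote bit-mask via carry-less multiply by all ones.

    Bit i of the result is the XOR of quote bits 0..i, so it is set from
    each opening quote up to, but not including, the matching closing quote.
    """
    allones = (1 << bits) - 1
    return clmul(quotes, allones) & allones
```

<p>A carry between chunks would still be needed when a quoted field straddles a chunk boundary.</p>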

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>OpenBSD's pledge and unveil from Python</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/09/15/"/>
    <id>urn:uuid:cd3857dd-270c-430e-824d-6512688687a3</id>
    <updated>2021-09-15T02:46:56Z</updated>
    <category term="bsd"/><category term="c"/><category term="python"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=28535255">on Hacker News</a>.</em></p>

<p>Years ago, OpenBSD gained two new security system calls, <a href="https://man.openbsd.org/pledge.2"><code class="language-plaintext highlighter-rouge">pledge(2)</code></a>
(originally <a href="https://www.openbsd.org/papers/tame-fsec2015/mgp00001.html"><code class="language-plaintext highlighter-rouge">tame(2)</code></a>) and <a href="https://man.openbsd.org/unveil.2"><code class="language-plaintext highlighter-rouge">unveil(2)</code></a>. In both, an application
surrenders capabilities at run-time. The idea is to perform initialization
like usual, then drop capabilities before handling untrusted input,
limiting unwanted side effects. This feature is applicable even where type
safety isn’t an issue, such as Python, where a program might still get
tricked into accessing sensitive files or making network connections when
it shouldn’t. So how can a Python program access these system calls?</p>

<p>As <a href="/blog/2021/06/29/">discussed previously</a>, it’s quite easy to access C APIs from
Python through its <a href="https://docs.python.org/3/library/ctypes.html"><code class="language-plaintext highlighter-rouge">ctypes</code></a> package, and this is no exception.
In this article I show how to do it. Here’s the full source if you want to
dive in: <a href="https://github.com/skeeto/scratch/tree/master/misc/openbsd.py"><strong><code class="language-plaintext highlighter-rouge">openbsd.py</code></strong></a>.</p>

<!--more-->

<p>I’ve chosen these extra constraints:</p>

<ul>
  <li>
    <p>As extra safety features, unnecessary for correctness, attempts to call
these functions on systems where they don’t exist will silently do
nothing, as though they succeeded. They’re provided as a best effort.</p>
  </li>
  <li>
    <p>Systems other than OpenBSD may support these functions, now or in the
future, and it would be nice to automatically make use of them when
available. This means no checking for OpenBSD specifically but instead
<em>feature sniffing</em> for their presence.</p>
  </li>
  <li>
    <p>The interfaces should be Pythonic as though they were implemented in
Python itself. Raise exceptions for errors, and accept strings since
they’re more convenient than bytes.</p>
  </li>
</ul>

<p>For reference, here are the function prototypes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">pledge</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">promises</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">execpromises</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">unveil</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">permissions</span><span class="p">);</span>
</code></pre></div></div>

<p>The <a href="https://flak.tedunangst.com/post/string-interfaces">string-oriented interface of <code class="language-plaintext highlighter-rouge">pledge</code></a> will make this a whole
lot easier to implement.</p>

<h3 id="finding-the-functions">Finding the functions</h3>

<p>The first step is to grab functions through <code class="language-plaintext highlighter-rouge">ctypes</code>. Like a lot of Python
documentation, this area is frustratingly imprecise and under-documented.
I want to grab a handle to the already-linked libc and search for either
function. However, getting that handle is a little different on each
platform, and in the process I saw four different exceptions, only one of
which is documented.</p>

<p>I came up with passing None to <code class="language-plaintext highlighter-rouge">ctypes.CDLL</code>, which ultimately just passes
<code class="language-plaintext highlighter-rouge">NULL</code> to <a href="https://man.openbsd.org/dlopen.3"><code class="language-plaintext highlighter-rouge">dlopen(3)</code></a>. That’s really all I wanted. Currently on
Windows this is a TypeError. Once the handle is in hand, try to access the
<code class="language-plaintext highlighter-rouge">pledge</code> attribute, which will fail with AttributeError if it doesn’t
exist. In the event of any exception, just assume the behavior isn’t
available. If found, I also define the function prototype for <code class="language-plaintext highlighter-rouge">ctypes</code>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_pledge</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">_pledge</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">CDLL</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">use_errno</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">pledge</span>
    <span class="n">_pledge</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_int</span>
    <span class="n">_pledge</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_char_p</span><span class="p">,</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_char_p</span>
<span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
    <span class="n">_pledge</span> <span class="o">=</span> <span class="bp">None</span>
</code></pre></div></div>

<p>Catching a broad Exception isn’t great, but it’s the best we can do since
the documentation is incomplete. From this block I’ve seen TypeError,
AttributeError, FileNotFoundError, and OSError. I wouldn’t be surprised if
there are more possibilities, and I don’t want to risk missing them.</p>

<p>Note that I’m catching Exception rather than using a bare <code class="language-plaintext highlighter-rouge">except</code>. My
code will not catch KeyboardInterrupt nor SystemExit. This is deliberate,
and I never want to catch these.</p>

<p>The same story for <code class="language-plaintext highlighter-rouge">unveil</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_unveil</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">_unveil</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">CDLL</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">use_errno</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">unveil</span>
    <span class="n">_unveil</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_int</span>
    <span class="n">_unveil</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_char_p</span><span class="p">,</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_char_p</span>
<span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
    <span class="n">_unveil</span> <span class="o">=</span> <span class="bp">None</span>
</code></pre></div></div>

<h3 id="pythonic-wrappers">Pythonic wrappers</h3>

<p>The next and final step is to wrap the low-level calls in interfaces that
hide their C and <code class="language-plaintext highlighter-rouge">ctypes</code> nature.</p>

<p>Python strings must be encoded to bytes before they can be passed to C
functions. Rather than make the caller worry about this, we’ll let them
pass friendly strings and have the wrapper do the conversion. Either may
also be <code class="language-plaintext highlighter-rouge">NULL</code>, so None is allowed.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">pledge</span><span class="p">(</span><span class="n">promises</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="n">execpromises</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">_pledge</span><span class="p">:</span>
        <span class="k">return</span>  <span class="c1"># unimplemented
</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_pledge</span><span class="p">(</span><span class="bp">None</span> <span class="k">if</span> <span class="n">promises</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">promises</span><span class="p">.</span><span class="n">encode</span><span class="p">(),</span>
                <span class="bp">None</span> <span class="k">if</span> <span class="n">execpromises</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">execpromises</span><span class="p">.</span><span class="n">encode</span><span class="p">())</span>
    <span class="k">if</span> <span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
        <span class="n">errno</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">get_errno</span><span class="p">()</span>
        <span class="k">raise</span> <span class="nb">OSError</span><span class="p">(</span><span class="n">errno</span><span class="p">,</span> <span class="n">os</span><span class="p">.</span><span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">))</span>
</code></pre></div></div>

<p>As usual, a return of -1 means there was an error, in which case we fetch
<code class="language-plaintext highlighter-rouge">errno</code> and raise the appropriate OSError.</p>

<p><code class="language-plaintext highlighter-rouge">unveil</code> works a little differently since the first argument is a path.
Python functions that accept paths, such as <code class="language-plaintext highlighter-rouge">open</code>, generally accept
either strings or bytes. On unix-like systems, <a href="https://simonsapin.github.io/wtf-8/">paths are fundamentally
bytestrings</a> and not necessarily Unicode, so it’s necessary to accept
bytes. Since strings are nearly always more convenient, they take both.
The <code class="language-plaintext highlighter-rouge">unveil</code> wrapper here will do the same. If it’s a string, encode it,
otherwise pass it straight through.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">unveil</span><span class="p">(</span><span class="n">path</span><span class="p">:</span> <span class="n">Union</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">,</span> <span class="bp">None</span><span class="p">],</span> <span class="n">permissions</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">_unveil</span><span class="p">:</span>
        <span class="k">return</span>  <span class="c1"># unimplemented
</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_unveil</span><span class="p">(</span><span class="n">path</span><span class="p">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="nb">str</span><span class="p">)</span> <span class="k">else</span> <span class="n">path</span><span class="p">,</span>
                <span class="bp">None</span> <span class="k">if</span> <span class="n">permissions</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">permissions</span><span class="p">.</span><span class="n">encode</span><span class="p">())</span>
    <span class="k">if</span> <span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
        <span class="n">errno</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">get_errno</span><span class="p">()</span>
        <span class="k">raise</span> <span class="nb">OSError</span><span class="p">(</span><span class="n">errno</span><span class="p">,</span> <span class="n">os</span><span class="p">.</span><span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">))</span>
</code></pre></div></div>

<p>That’s it!</p>

<h3 id="trying-it-out">Trying it out</h3>

<p>Let’s start with <code class="language-plaintext highlighter-rouge">unveil</code>. Initially a process has access to the whole
file system with the usual restrictions. On the first call to <code class="language-plaintext highlighter-rouge">unveil</code>
it’s immediately restricted to some subset of the tree. Each call reveals
a little more until a final <code class="language-plaintext highlighter-rouge">NULL</code> which locks it in place for the rest of
the process’s existence.</p>

<p>Suppose a program has been tricked into accessing your shell history,
perhaps by mishandling a path:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hackme</span><span class="p">():</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">pathlib</span><span class="p">.</span><span class="n">Path</span><span class="p">.</span><span class="n">home</span><span class="p">()</span> <span class="o">/</span> <span class="s">".bash_history"</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"You've been hacked!"</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">FileNotFoundError</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"Blocked by unveil."</span><span class="p">)</span>

<span class="n">hackme</span><span class="p">()</span>
</code></pre></div></div>

<p>If you’re a Bash user, this prints:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You've been hacked!
</code></pre></div></div>

<p>Using our new feature to restrict the program’s access first:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># restrict access to static program data
</span><span class="n">unveil</span><span class="p">(</span><span class="s">"/usr/share"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span>
<span class="n">unveil</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>

<span class="n">hackme</span><span class="p">()</span>
</code></pre></div></div>

<p>On OpenBSD this now prints:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Blocked by unveil.
</code></pre></div></div>

<p>Working just as it should!</p>

<p>With <code class="language-plaintext highlighter-rouge">pledge</code> we declare what abilities we’d like to keep by supplying a
list of promises, <em>pledging</em> to use only those abilities afterward. A
common case is the <code class="language-plaintext highlighter-rouge">stdio</code> promise which allows reading and writing of
open files, but not <em>opening</em> files. A program might open its log file,
then drop the ability to open files while retaining the ability to write
to its log.</p>
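
<p>That log file pattern might look something like the following sketch. The function name <code class="language-plaintext highlighter-rouge">run_with_log</code> is hypothetical, and the <code class="language-plaintext highlighter-rouge">pledge</code> wrapper is passed in as a parameter so the example stands alone. It uses the <code class="language-plaintext highlighter-rouge">error</code> promise, covered below, so that a blocked call is expected to raise rather than kill the process.</p>

```python
def run_with_log(pledge, log_path):
    """Open a log, drop abilities, then keep logging.

    Returns True if a file could still be opened afterward, which is
    expected only where pledge is a no-op.
    """
    with open(log_path, "a") as log:    # acquire resources before pledging
        pledge("stdio error", None)     # keep only I/O on already-open files
        log.write("pledged\n")          # writing to the open log still works
        log.flush()
        try:
            with open(log_path):        # opening files was not promised
                return True
        except OSError:
            return False                # the expected outcome under pledge
```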

<p>An invalid or unknown promise is an error. Does that work?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt;&gt;&gt; pledge("doesntexist", None)
OSError: [Errno 22] Invalid argument
</code></pre></div></div>

<p>So far so good. How about the functionality itself?</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pledge</span><span class="p">(</span><span class="s">"stdio"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="n">hackme</span><span class="p">()</span>
</code></pre></div></div>

<p>The program is instantly killed when making the disallowed system call:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Abort trap (core dumped)
</code></pre></div></div>

<p>If you want something a little softer, include the <code class="language-plaintext highlighter-rouge">error</code> promise:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pledge</span><span class="p">(</span><span class="s">"stdio error"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="n">hackme</span><span class="p">()</span>
</code></pre></div></div>

<p>Instead it’s an exception, which is a lot easier to debug in Python:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OSError: [Errno 78] Function not implemented
</code></pre></div></div>

<p>The core dump isn’t going to be much help to a Python program, so you
probably always want to use this promise. In general you need to be extra
careful about <code class="language-plaintext highlighter-rouge">pledge</code> in complex runtimes like Python’s which may
reasonably need to do many arbitrary, undocumented things at any time.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Billions of Code Name Permutations in 32 bits</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/09/14/"/>
    <id>urn:uuid:bc17a779-bee1-4a60-80d1-5c5cfd8fd638</id>
    <updated>2021-09-14T21:06:59Z</updated>
    <category term="c"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>My friend over at Possibly Wrong <a href="https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/">created a code name generator</a>. By
coincidence I happened to be thinking about code names myself while
recently replaying <a href="https://en.wikipedia.org/wiki/XCOM:_Enemy_Within"><em>XCOM: Enemy Within</em></a> (2012/2013). The game
generates a random code name for each mission, and I wondered how often it
repeats. The <a href="https://www.ufopaedia.org/index.php/Mission_Names_(EU2012)">UFOpaedia page on the topic</a> gives the word lists: 53
adjectives and 76 nouns, for a total of 4,028 possible code names. A
typical game has around 60 missions, and if code names are generated
naively on the fly, then per the birthday paradox over a third of all games
will see a repeated mission code name! Fortunately this is easy to avoid,
and the particular configuration here lends itself to an interesting
implementation.</p>
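
<p>The birthday estimate can be checked directly. This sketch assumes every mission draws uniformly and independently from all 4,028 names, and the function name is made up for illustration. Under those assumptions the chance of at least one repeat comes out to roughly 36% for 60 missions, and it keeps climbing as a game runs longer.</p>

```python
def repeat_probability(draws, names=4028):
    """Chance that at least one name repeats among uniform random draws."""
    p_unique = 1.0
    for k in range(draws):
        # The k-th draw must avoid the k names already used.
        p_unique *= (names - k) / names
    return 1.0 - p_unique


if __name__ == "__main__":
    for missions in (30, 60, 90):
        print(missions, round(repeat_probability(missions), 3))
```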

<p>Mission code names are built using “<em>adjective</em> <em>noun</em>”. Some examples
from the game’s word list:</p>

<ul>
  <li>Fading Hammer</li>
  <li>Fallen Jester</li>
  <li>Hidden Crown</li>
</ul>

<p>To generate a code name, we could select a random adjective and a random
noun, but as discussed it wouldn’t take long for a collision. The naive
approach is to keep a database of previously-generated names, and to
consult this database when generating new names. That works, but there’s
an even better solution: use a random permutation. Done well, we don’t
need to keep track of previous names, and the generator won’t repeat until
it’s exhausted all possibilities.</p>

<p>Further, the total number of possible code names, 4,028, is suspiciously
shy of 4,096, a power of two (<code class="language-plaintext highlighter-rouge">2**12</code>). That makes designing and
implementing an efficient permutation that much easier.</p>

<h3 id="a-linear-congruential-generator">A linear congruential generator</h3>

<p>A classic, obvious solution is a <a href="/blog/2019/11/19/">linear congruential generator</a>
(LCG). A full-period, 12-bit LCG is nothing more than a permutation of the
numbers 0 to 4,095. When generating names, we can skip over the extra 68
values and pretend it’s a permutation of 4,028 elements. An LCG is
constructed like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(n) = (f(n-1)*A + C) % M
</code></pre></div></div>

<p>Typically the seed is used for <code class="language-plaintext highlighter-rouge">f(0)</code>. M is selected based on the problem
space or implementation efficiency, and usually a power of two. In this
case it will be 4,096. Then there are some rules for choosing A and C.</p>

<p>Simply choosing a random <code class="language-plaintext highlighter-rouge">f(0)</code> per game isn’t great. The code name order
will always be the same, and we’re only choosing where in the cycle to
start. It would be better to vary the permutation itself, which we can do
by also choosing unique A and C constants per game.</p>

<p>Choosing C is easy: It must be relatively prime with M, i.e. it must be
odd. Since it’s addition modulo M, there’s no reason to choose <code class="language-plaintext highlighter-rouge">C &gt;= M</code>
since the results are identical to a smaller C. If we think of C as a
12-bit integer, 1 bit is locked in, and the other 11 bits are free to
vary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxxxx1
</code></pre></div></div>

<p>Choosing A is more complicated: it must be odd, <code class="language-plaintext highlighter-rouge">A-1</code> must be divisible by 4,
and <code class="language-plaintext highlighter-rouge">A % 8</code> should equal 5 (better results). Again, thinking of
this in terms of a 12-bit number, this locks in 3 bits and leaves 9 bits
free:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxx101
</code></pre></div></div>

<p>This ensures all the <em>must</em> and <em>should</em> properties of A.</p>
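
<p>The full-period claim is easy to spot check by brute force. In this sketch, with the hypothetical helper <code class="language-plaintext highlighter-rouge">lcg_period</code>, any A ending in binary 101 paired with any odd C should cycle through all 4,096 values before repeating, while an A that breaks the divisible-by-4 rule should not. These are the Hull–Dobell conditions specialized to a power-of-two modulus.</p>

```python
def lcg_period(a, c, m=4096, seed=0):
    """Count steps until the LCG first returns to its seed.

    Assumes a is odd, which makes each step a bijection modulo a power
    of two, so the sequence is guaranteed to come back to the seed.
    """
    s, n = seed, 0
    while True:
        s = (s * a + c) % m
        n += 1
        if s == seed:
            return n


if __name__ == "__main__":
    print(lcg_period(0b101010101101, 0b111000111001))  # A ends in 101, C odd
    print(lcg_period(3, 1))                            # A-1 not divisible by 4
```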

<p>Finally <code class="language-plaintext highlighter-rouge">0 &lt;= f(0) &lt; M</code>. Because of modular arithmetic, larger values are
redundant, and all possible values are valid since the LCG, being
full-period, will cycle through all of them. This is just choosing the
starting point in a particular permutation cycle. As a 12-bit number, all
12 bits are free:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxxxxx
</code></pre></div></div>

<p>That’s <code class="language-plaintext highlighter-rouge">9 + 11 + 12 = 32</code> free bits to fill randomly: again, how
incredibly convenient! Every 32-bit integer defines some unique code name
permutation… <em>almost</em>. Any 32-bit descriptor where <code class="language-plaintext highlighter-rouge">f(0) &gt;= 4028</code> will
collide with at least one other due to skipping, and so around 1.7% of the
state space is redundant. A small loss that should shrink with slightly
better word list planning. I don’t think anyone will notice.</p>

<h3 id="slice-and-dice">Slice and dice</h3>

<p><a href="/blog/2020/12/31/">I love compact state machines</a>, and this is an opportunity to put one
to good use. My code name generator will be just one function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>This takes one of those 32-bit permutation descriptors, writes the first
code name to <code class="language-plaintext highlighter-rouge">buf</code>, and returns a descriptor for another permutation that
starts with the next name. All we have to do is keep track of that 32-bit
number and we’ll never need to worry about repeating code names until all
have been exhausted.</p>

<p>First, let’s extract A, C, and <code class="language-plaintext highlighter-rouge">f(0)</code>, which I’m calling S. The low bits
are A, middle bits are C, and high bits are S. Note the OR with 1 and 5 to
lock in the hard-set bits.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&lt;&lt;</span>  <span class="mi">3</span> <span class="o">|</span> <span class="mi">5</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">//  9 bits</span>
<span class="kt">long</span> <span class="n">c</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">// 11 bits</span>
<span class="kt">long</span> <span class="n">s</span> <span class="o">=</span>  <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>               <span class="c1">// 12 bits</span>
</code></pre></div></div>

<p>Next iterate the LCG until we have a number in range:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">do</span> <span class="p">{</span>
    <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;=</span> <span class="mi">4028</span><span class="p">);</span>
</code></pre></div></div>

<p>Once we have an appropriate LCG state, compute the adjective/noun indexes
and build a code name:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">s</span> <span class="o">%</span> <span class="mi">53</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">s</span> <span class="o">/</span> <span class="mi">53</span><span class="p">;</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
</code></pre></div></div>

<p>Finally assemble the next 32-bit state. Since A and C don’t change, these
are passed through while the old S is masked out and replaced with the new
S.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
</code></pre></div></div>
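
<p>Before assembling the C function, the no-repeat property can be sanity checked with a quick Python transliteration of the steps so far. This is only a sketch: <code class="language-plaintext highlighter-rouge">step</code> and <code class="language-plaintext highlighter-rouge">cycle</code> are hypothetical names, the word lists are left out, and it tracks just the LCG value that would be split into the adjective and noun indexes.</p>

```python
def step(state, count=4028):
    """Advance a 32-bit descriptor; return (next_state, name_index)."""
    a = (state << 3 | 5) & 0xfff    # low 9 bits of state, plus fixed 101
    c = (state >> 8 | 1) & 0xfff    # middle 11 bits of state, plus fixed 1
    s = state >> 20                 # high 12 bits of state
    while True:
        s = (s * a + c) & 0xfff
        if s < count:               # skip the values past the word lists
            break
    return (state & 0xfffff) | s << 20, s


def cycle(state, count=4028):
    """Collect the next `count` name indexes produced from a descriptor."""
    seen = []
    for _ in range(count):
        state, s = step(state, count)
        seen.append(s)
    return seen
```

<p>For any 32-bit starting descriptor, <code class="language-plaintext highlighter-rouge">sorted(cycle(state))</code> should equal <code class="language-plaintext highlighter-rouge">list(range(4028))</code>: every name exactly once before any repeat.</p>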

<p>Putting it all together:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">adjvs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">nouns</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>

<span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&lt;&lt;</span>  <span class="mi">3</span> <span class="o">|</span> <span class="mi">5</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">//  9 bits</span>
    <span class="kt">long</span> <span class="n">c</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">// 11 bits</span>
    <span class="kt">long</span> <span class="n">s</span> <span class="o">=</span>  <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>               <span class="c1">// 12 bits</span>

    <span class="k">do</span> <span class="p">{</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;=</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">)</span><span class="o">*</span><span class="n">COUNTOF</span><span class="p">(</span><span class="n">nouns</span><span class="p">));</span>

    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">s</span> <span class="o">%</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">s</span> <span class="o">/</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
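
<p>A quick way to sanity check this construction is to count how many steps the inner LCG takes to return to its starting S. The sketch below does that for an arbitrary 32-bit state. The name <code class="language-plaintext highlighter-rouge">lcg_period</code> is made up for this illustration, while the A, C, and S extraction is copied from <code class="language-plaintext highlighter-rouge">codename</code>. Since A always ends in the bits 101 and C is always odd, the usual full-period conditions for a power-of-two modulus should hold for every input.</p>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: number of LCG steps until S returns to its
// starting value, using the same A/C/S layout as codename().
int lcg_period(uint32_t state)
{
    long a  = (state <<  3 | 5) & 0xfff;
    long c  = (state >>  8 | 1) & 0xfff;
    long s0 =  state >> 20;

    // A must be 1 (mod 4) and C must be odd for a full cycle.
    assert(a % 4 == 1 && c % 2 == 1);

    long s = s0;
    int  n = 0;
    do {
        s = (s*a + c) & 0xfff;
        n++;
    } while (s != s0 && n < 4096);
    return n;
}
```

<p>Under those conditions, <code class="language-plaintext highlighter-rouge">lcg_period(0)</code> and <code class="language-plaintext highlighter-rouge">lcg_period(0xffffffff)</code> both come out to 4096, i.e. every 12-bit S is visited once before the cycle repeats.</p>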

<p>The caller just needs to generate an initial 32-bit integer. Any 32-bit
integer is valid, even zero, so this could just be, say, the current Unix time
(<code class="language-plaintext highlighter-rouge">time(2)</code>), but adjacent values will have similar-ish permutations. I
intentionally placed S in the high bits, which are least likely to vary,
since it only affects where the cycle begins, while A and C have a much
more dramatic impact and so are placed at more variable locations.</p>

<p>Regardless, it would be better to hash such an input so that adjacent time
values map to distant states. It also helps hide poorer (less random)
choices for A multipliers. I happen to have <a href="/blog/2018/07/31/">designed some great functions
for exactly this purpose</a>. Here’s one of my best:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">+=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd168aaadU</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xaf723597U</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This would be perfectly reasonable for generating all possible names in a
random order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="n">state</span> <span class="o">=</span> <span class="n">hash32</span><span class="p">(</span><span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">4028</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
    <span class="n">state</span> <span class="o">=</span> <span class="n">codename</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
    <span class="n">puts</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
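
<p>To get a feel for what the hash buys here, the following sketch counts how many bits differ between the hashed states of two inputs. <code class="language-plaintext highlighter-rouge">hash32</code> is copied from above, and <code class="language-plaintext highlighter-rouge">bits_differ</code> is a helper name invented for this illustration. Every step of <code class="language-plaintext highlighter-rouge">hash32</code> is reversible, so two different timestamps can never collapse into the same state.</p>

```c
#include <stdint.h>

static uint32_t
hash32(uint32_t x)
{
    x += 0x3243f6a8U; x ^= x >> 15;
    x *= 0xd168aaadU; x ^= x >> 15;
    x *= 0xaf723597U; x ^= x >> 15;
    return x;
}

// Hypothetical helper: count the bits that differ between the hashed
// states of two inputs, e.g. two consecutive time(2) values.
int bits_differ(uint32_t t0, uint32_t t1)
{
    uint32_t d = hash32(t0) ^ hash32(t1);
    int n = 0;
    for (; d; d &= d - 1) {  // clear the lowest set bit on each pass
        n++;
    }
    return n;
}
```

<p>For a well-mixed hash, <code class="language-plaintext highlighter-rouge">bits_differ(t, t + 1)</code> should hover around half of the 32 bits, whereas the raw timestamps typically differ only in a few of the lowest bits.</p>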

<p>To further help cover up poorer A multipliers, it’s better for the word
list to be pre-shuffled in its static storage. If that underlying order
happens to show through, at least it will be less obvious (i.e. not in
alphabetical order). Shuffling the string list in my source is just a few
keystrokes in Vim, so this is easy enough.</p>

<h3 id="robustness">Robustness</h3>

<p>If you’re set on making the <code class="language-plaintext highlighter-rouge">codename</code> function easier to use such that
consumers don’t need to think about hashes, you could “encode” and
“decode” the descriptor going in and out of the function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">state</span> <span class="o">+=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">*=</span> <span class="mh">0x9e485565U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">*=</span> <span class="mh">0xef1d6b47U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>

    <span class="c1">// ...</span>

    <span class="n">state</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">state</span> <span class="o">*=</span> <span class="mh">0xeb00ce77U</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">state</span> <span class="o">*=</span> <span class="mh">0x88ccd46dU</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span> <span class="n">state</span> <span class="o">-=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">state</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This permutes the state coming in, and reverses that permutation on the
way out (read: inverse hash). This breaks up similar starting points.</p>
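
<p>That the two halves are inverses of one another is easy to verify in isolation. In this sketch the names <code class="language-plaintext highlighter-rouge">state_encode</code>, <code class="language-plaintext highlighter-rouge">state_decode</code>, and <code class="language-plaintext highlighter-rouge">check_roundtrip</code> are invented for illustration, while the constants and shifts are the ones used in <code class="language-plaintext highlighter-rouge">codename</code> above. Each multiplier on the way out is the modular inverse (mod 2<sup>32</sup>) of the matching multiplier on the way in, and the xorshifts by 16 and 17 undo themselves.</p>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical name: the permutation applied on the way in.
uint32_t state_encode(uint32_t x)
{
    x += 0x3243f6a8U; x ^= x >> 17;
    x *= 0x9e485565U; x ^= x >> 16;
    x *= 0xef1d6b47U; x ^= x >> 16;
    return x;
}

// Hypothetical name: the inverse permutation applied on the way out.
uint32_t state_decode(uint32_t x)
{
    x ^= x >> 16; x *= 0xeb00ce77U;
    x ^= x >> 16; x *= 0x88ccd46dU;
    x ^= x >> 17; x -= 0x3243f6a8U;
    return x;
}

// Abort if the round trip fails for this input, in either direction.
void check_roundtrip(uint32_t x)
{
    assert(state_decode(state_encode(x)) == x);
    assert(state_encode(state_decode(x)) == x);
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">check_roundtrip</code> on any 32-bit value returns quietly, since the caller-visible state is always recovered exactly.</p>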

<h3 id="a-random-access-code-name-permutation">A random-access code name permutation</h3>

<p>Of course this isn’t the only way to build a permutation. I recently
picked up another trick: <a href="https://andrew-helmer.github.io/permute/">Kensler permutation</a>. The key insight
is cycle-walking: permute a larger domain (e.g. 4,096 elements) and, whenever
the result lands outside the smaller domain (e.g. 4,028 elements), permute
again until it lands inside. This allows random access to a permutation of
the smaller domain.</p>

<p>Here’s such a code name generator built around a bespoke 12-bit
xorshift-multiply permutation. I used 4 “rounds” since xorshift-multiply
is less effective the smaller the permutation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Generate the nth code name for this seed.</span>
<span class="kt">void</span> <span class="nf">codename_n</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">seed</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x325</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x3f5</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0xa89</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x85b</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">)</span><span class="o">*</span><span class="n">COUNTOF</span><span class="p">(</span><span class="n">nouns</span><span class="p">));</span>

    <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="n">i</span> <span class="o">%</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">i</span> <span class="o">/</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">a</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">b</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>While this is more flexible, avoids poorer permutations, and doesn’t have
state space collisions, I still have a soft spot for my LCG-based state
machine generator.</p>
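
<p>The “no collisions” property can be checked by brute force. The sketch below strips the string formatting out of <code class="language-plaintext highlighter-rouge">codename_n</code> and confirms that, for a given seed, each index is produced exactly once. The names <code class="language-plaintext highlighter-rouge">permute12</code>, <code class="language-plaintext highlighter-rouge">is_permutation</code>, and <code class="language-plaintext highlighter-rouge">NAMES</code> are invented for this illustration, and the word list product is assumed to be the 4,028 used in the earlier loop.</p>

```c
#include <stdint.h>
#include <string.h>

enum { NAMES = 4028 };  // assumed COUNTOF(adjvs)*COUNTOF(nouns)

// The index permutation from codename_n, without the string formatting.
// Assumes 0 <= n < NAMES.
static uint32_t permute12(uint32_t seed, int n)
{
    uint32_t i = n;
    do {
        i ^= i >> 6; i ^= seed >>  0; i *= 0x325; i &= 0xfff;
        i ^= i >> 6; i ^= seed >>  8; i *= 0x3f5; i &= 0xfff;
        i ^= i >> 6; i ^= seed >> 16; i *= 0xa89; i &= 0xfff;
        i ^= i >> 6; i ^= seed >> 24; i *= 0x85b; i &= 0xfff;
        i ^= i >> 6;
    } while (i >= NAMES);  // cycle-walk back into range
    return i;
}

// Returns 1 if every index in [0, NAMES) is produced exactly once.
int is_permutation(uint32_t seed)
{
    unsigned char seen[NAMES];
    memset(seen, 0, sizeof(seen));
    for (int n = 0; n < NAMES; n++) {
        uint32_t i = permute12(seed, n);
        if (seen[i]) {
            return 0;
        }
        seen[i] = 1;
    }
    return 1;
}
```

<p>Each round is a bijection on 12-bit values (xorshift, xor with seed bits, multiply by an odd constant), and cycle-walking preserves that on the smaller range, so <code class="language-plaintext highlighter-rouge">is_permutation</code> returns 1 regardless of seed.</p>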

<h3 id="source-code">Source code</h3>

<p>You can find the complete, working source code with both generators here:
<a href="https://github.com/skeeto/scratch/tree/master/misc/codename.c"><strong><code class="language-plaintext highlighter-rouge">codename.c</code></strong></a>. I used <a href="https://en.wikipedia.org/wiki/Secret_Service_code_name">real US Secret Service code names</a> for
my word list. Some sample outputs:</p>

<ul>
  <li>PLASTIC HUMMINGBIRD</li>
  <li>BLACK VENUS</li>
  <li>SILENT SUNBURN</li>
  <li>BRONZE AUTHOR</li>
  <li>FADING MARVEL</li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Test cross-architecture without leaving home</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/08/21/"/>
    <id>urn:uuid:ac34f8a0-af73-4301-b21b-5a47d48e3069</id>
    <updated>2021-08-21T23:59:33Z</updated>
    <category term="c"/><category term="go"/><category term="debian"/><category term="trick"/>
    <content type="html">
      <![CDATA[<p>I like to test my software across different environments, on <a href="/blog/2020/05/15/">strange
platforms</a>, and with <a href="/blog/2018/04/13/">alternative implementations</a>. Each has its
own quirks and oddities that can shake bugs out earlier. C is particularly
good at this since it has such a wide selection of compilers and runs on
everything. For instance I count at least 7 distinct C compilers in Debian
alone. One advantage of <a href="/blog/2017/03/30/">writing portable software</a> is access to a
broader testing environment, and it’s one reason I prefer to target
standards rather than specific platforms.</p>

<p>However, I’ve long struggled with architecture diversity. My work and
testing have been almost entirely on x86, with ARM as a distant second
(Raspberry Pi and friends). Big endian hosts are particularly rare.
However, I recently learned a trick for quickly and conveniently accessing
many different architectures without even leaving my laptop: <a href="https://wiki.debian.org/QemuUserEmulation">QEMU User
Emulation</a>. Debian and its derivatives support this very well and
require almost no setup or configuration.</p>

<!--more-->

<h3 id="cross-compilation-example">Cross-compilation Example</h3>

<p>While there are many options, my main cross-testing architecture has been
PowerPC. It’s 32-bit big endian, while I’m generally working on 64-bit
little endian, which is exactly the sort of mismatch I’m going for. I use
a Debian-supplied cross-compiler and qemu-user tools. The <a href="https://en.wikipedia.org/wiki/Binfmt_misc">binfmt</a>
support is especially slick, so that’s how I usually use it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># apt install gcc-powerpc-linux-gnu qemu-user-binfmt
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">binfmt_misc</code> is a kernel module that teaches Linux how to recognize
arbitrary binary formats. For instance, there’s a Wine binfmt so that
Linux programs can transparently <code class="language-plaintext highlighter-rouge">exec(3)</code> Windows <code class="language-plaintext highlighter-rouge">.exe</code> binaries. In the
case of QEMU User Mode, binaries for foreign architectures are loaded into
a QEMU virtual machine configured in user mode. In user mode there’s no
guest operating system, and instead the virtual machine translates guest
system calls to the host operating system.</p>

<p>The first package gives me <code class="language-plaintext highlighter-rouge">powerpc-linux-gnu-gcc</code>. The prefix is the
<a href="https://wiki.debian.org/Multiarch/Tuples">architecture tuple</a> describing the instruction set and system ABI.
To try this out, I have a little test program that inspects its execution
environment:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">w</span> <span class="o">=</span> <span class="s">"?"</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="n">w</span> <span class="o">=</span> <span class="s">"8"</span><span class="p">;</span>  <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="n">w</span> <span class="o">=</span> <span class="s">"16"</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">4</span><span class="p">:</span> <span class="n">w</span> <span class="o">=</span> <span class="s">"32"</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">8</span><span class="p">:</span> <span class="n">w</span> <span class="o">=</span> <span class="s">"64"</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="s">"?"</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="o">*</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)(</span><span class="kt">int</span> <span class="p">[]){</span><span class="mi">1</span><span class="p">})</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">0</span><span class="p">:</span> <span class="n">b</span> <span class="o">=</span> <span class="s">"big"</span><span class="p">;</span>    <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="n">b</span> <span class="o">=</span> <span class="s">"little"</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">printf</span><span class="p">(</span><span class="s">"%s-bit, %s endian</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">b</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When I run this natively on x86-64:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc test.c
$ ./a.out
64-bit, little endian
</code></pre></div></div>

<p>Running it on PowerPC via QEMU:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ powerpc-linux-gnu-gcc -static test.c
$ ./a.out
32-bit, big endian
</code></pre></div></div>

<p>Thanks to binfmt, I could execute it as though the PowerPC binary were a
native binary. With just a couple of environment variables in the right
place, I could pretend I’m developing on PowerPC — aside from emulation
performance penalties of course.</p>

<p>However, you might have noticed I pulled a sneaky on ya: <code class="language-plaintext highlighter-rouge">-static</code>. So far
what I’ve shown only works with static binaries. There’s no dynamic loader
available to run dynamically-linked binaries. Fortunately this is easy to
fix in two steps. The first step is to install the dynamic linker for
PowerPC:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># apt install libc6-powerpc-cross
</code></pre></div></div>

<p>The second is to tell QEMU where to find it since, unfortunately, it
cannot currently do so on its own.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export QEMU_LD_PREFIX=/usr/powerpc-linux-gnu
</code></pre></div></div>

<p>Now I can leave out the <code class="language-plaintext highlighter-rouge">-static</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ powerpc-linux-gnu-gcc test.c
$ ./a.out
32-bit, big endian
</code></pre></div></div>
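
<p>As a taste of the kind of defect a big endian target exposes, consider decoding a little endian 32-bit field from a byte buffer. This sketch is illustrative only and the function names are invented. The first version depends on host byte order and quietly produces a different value on PowerPC, while the second gives the same answer everywhere.</p>

```c
#include <stdint.h>
#include <string.h>

// Host-order load: matches a little endian format only on little
// endian hosts.
uint32_t load32_host(const unsigned char *p)
{
    uint32_t x;
    memcpy(&x, p, sizeof(x));
    return x;
}

// Portable load: decodes little endian bytes on any host.
uint32_t load32_le(const unsigned char *p)
{
    return (uint32_t)p[0] <<  0 | (uint32_t)p[1] <<  8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}
```

<p>Given the bytes <code class="language-plaintext highlighter-rouge">78 56 34 12</code>, <code class="language-plaintext highlighter-rouge">load32_le</code> returns <code class="language-plaintext highlighter-rouge">0x12345678</code> on both hosts, while <code class="language-plaintext highlighter-rouge">load32_host</code> returns <code class="language-plaintext highlighter-rouge">0x12345678</code> on x86-64 and <code class="language-plaintext highlighter-rouge">0x78563412</code> on big endian PowerPC.</p>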

<p>A practical example: Remember <a href="https://github.com/skeeto/binitools">binitools</a>? I’m now ready to run its
<a href="/blog/2019/01/25/">fuzz-generated test suite</a> on this cross-testing platform.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/skeeto/binitools
$ cd binitools/
$ make check CC=powerpc-linux-gnu-gcc
...
PASS: 668/668
</code></pre></div></div>

<p>Or if I’m going to be running <code class="language-plaintext highlighter-rouge">make</code> often:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export CC=powerpc-linux-gnu-gcc
$ make -e check
</code></pre></div></div>

<p>Recall: <a href="/blog/2017/08/20/">make’s <code class="language-plaintext highlighter-rouge">-e</code> flag</a> gives environment variables precedence over
assignments in the Makefile, so I don’t need to pass <code class="language-plaintext highlighter-rouge">CC=...</code> on the command line each time.</p>

<p>When setting up a test suite for your own programs, consider how difficult
it would be to run the tests under customized circumstances like this. The
easier it is to run your tests, the more they’re going to be run. I’ve run
into many projects with such overly-complex test builds that even enabling
sanitizers in the test suite was a pain, let alone cross-architecture
testing.</p>

<p>Dependencies? There might be a way to use <a href="https://wiki.debian.org/Multiarch/HOWTO">Debian’s multiarch support</a>
to install these packages, but I haven’t been able to figure it out. You
likely need to build dependencies yourself using the cross compiler.</p>

<h3 id="testing-with-go">Testing with Go</h3>

<p>None of this is limited to C (or even C++). I’ve also successfully used
this to test Go libraries and programs cross-architecture. This isn’t
nearly as important since it’s harder to write unportable Go than C — e.g.
<a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">dumb pointer tricks</a> are literally labeled “unsafe”. However, Go
(gc) trivializes cross-compilation and is statically compiled, so it’s
incredibly simple. Once you’ve installed <code class="language-plaintext highlighter-rouge">qemu-user-binfmt</code> it’s entirely
transparent:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ GOARCH=mips64 go test
</code></pre></div></div>

<p>That’s all there is to cross-platform testing. If for some reason binfmt
doesn’t work (WSL) or you don’t want to install it, there’s just one extra
step (package named <code class="language-plaintext highlighter-rouge">example</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ GOARCH=mips64 go test -c
$ qemu-mips64-static example.test
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-c</code> option builds a test binary but doesn’t run it, instead allowing
you to choose where and how to run it.</p>

<p>It even works <a href="/blog/2021/06/29/">with cgo</a> — if you’re willing to jump through the same
hoops as with C of course:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>

<span class="c">// #include &lt;stdint.h&gt;</span>
<span class="c">// uint16_t v = 0x1234;</span>
<span class="c">// char *hi = (char *)&amp;v + 0;</span>
<span class="c">// char *lo = (char *)&amp;v + 1;</span>
<span class="k">import</span> <span class="s">"C"</span>
<span class="k">import</span> <span class="s">"fmt"</span>

<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
	<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"%02x %02x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="o">*</span><span class="n">C</span><span class="o">.</span><span class="n">hi</span><span class="p">,</span> <span class="o">*</span><span class="n">C</span><span class="o">.</span><span class="n">lo</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">go run</code> on x86-64:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ CGO_ENABLED=1 go run example.go
34 12
</code></pre></div></div>

<p>Via QEMU User Mode:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export CGO_ENABLED=1
$ export GOARCH=mips64
$ export CC=mips64-linux-gnuabi64-gcc
$ export QEMU_LD_PREFIX=/usr/mips64-linux-gnuabi64
$ go run example.go
12 34
</code></pre></div></div>

<p>I was pleasantly surprised how well this all works.</p>

<h3 id="one-dimension">One dimension</h3>

<p>Despite the variety, all these architectures are still “running” the same
operating system, Linux, and so they only vary on one dimension. For most
programs primarily targeting x86-64 Linux, PowerPC Linux is practically
the same thing, while x86-64 OpenBSD is foreign territory despite sharing
an architecture and ABI (<a href="/blog/2016/11/17/">System V</a>). Testing across operating
systems still requires spending the time to install, configure, and
maintain these extra hosts. That’s an article for another time.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>strcpy: a niche function you don't need</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/07/30/"/>
    <id>urn:uuid:ce6e3b54-55e4-465c-8ea2-2948a2c3ce4d</id>
    <updated>2021-07-30T19:37:48Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>The C <a href="https://man7.org/linux/man-pages/man3/strcpy.3.html"><code class="language-plaintext highlighter-rouge">strcpy</code></a> function is a common sight in typical C programs.
It’s also a source of buffer overflow defects, so linters and code
reviewers commonly recommend alternatives such as <a href="https://man7.org/linux/man-pages/man3/strncpy.3.html"><code class="language-plaintext highlighter-rouge">strncpy</code></a>
(difficult to use correctly; mismatched semantics), <a href="https://man.openbsd.org/strlcpy.3"><code class="language-plaintext highlighter-rouge">strlcpy</code></a>
(non-standard, <a href="https://nrk.neocities.org/articles/not-a-fan-of-strlcpy.html">flawed</a>), or C11’s optional <code class="language-plaintext highlighter-rouge">strcpy_s</code> (<a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1967.htm">no correct or
practical implementations</a>). Besides their individual shortcomings,
these answers are incorrect. <code class="language-plaintext highlighter-rouge">strcpy</code> and friends are, at best, incredibly
niche, and the correct replacement is <code class="language-plaintext highlighter-rouge">memcpy</code>.</p>

<p>If <code class="language-plaintext highlighter-rouge">strcpy</code> is not easily replaced with <code class="language-plaintext highlighter-rouge">memcpy</code> then the code is
fundamentally wrong. Either it’s not using <code class="language-plaintext highlighter-rouge">strcpy</code> correctly or it’s
doing something dumb and should be rewritten. Highlighting such problems
is part of what makes <code class="language-plaintext highlighter-rouge">memcpy</code> such an effective replacement.</p>

<p>Note: Everything here applies just as much to <a href="https://man7.org/linux/man-pages/man3/strcat.3.html"><code class="language-plaintext highlighter-rouge">strcat</code></a> and
friends.</p>

<p>Clarification update: This article is about correctness (objective), not
safety (subjective). If the word “safety” comes to mind then you’ve missed
the point.</p>

<h3 id="common-cases">Common cases</h3>

<p>Buffer overflows arise when the destination is smaller than the source.
Safe use of <code class="language-plaintext highlighter-rouge">strcpy</code> requires <em>a priori</em> knowledge of an upper bound on the
source string length. Usually this knowledge is the exact source string
length. If so, <code class="language-plaintext highlighter-rouge">memcpy</code> is not only a trivial substitute, it’s faster
since it will not simultaneously search for a null terminator.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="nf">my_strdup</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">strcpy</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">s</span><span class="p">);</span>  <span class="c1">// BAD</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">my_strdup_v2</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>  <span class="c1">// GOOD</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A more benign case is a static source string, i.e. trusted input.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">err</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
<span class="p">};</span>

<span class="kt">void</span> <span class="nf">set_oom</span><span class="p">(</span><span class="k">struct</span> <span class="n">err</span> <span class="o">*</span><span class="n">err</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">strcpy</span><span class="p">(</span><span class="n">err</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="s">"out of memory"</span><span class="p">);</span>  <span class="c1">// BAD</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The size is a compile time constant, so exploit it as such! Even more, a
static assertion (C11) can catch mistakes at compile time rather than run
time.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">set_oom_v2</span><span class="p">(</span><span class="k">struct</span> <span class="n">err</span> <span class="o">*</span><span class="n">err</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="n">oom</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"out of memory"</span><span class="p">;</span>
    <span class="n">static_assert</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">err</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">oom</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">err</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="n">oom</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">oom</span><span class="p">));</span>
<span class="p">}</span>

<span class="c1">// Or using a macro:</span>

<span class="kt">void</span> <span class="nf">set_oom_v3</span><span class="p">(</span><span class="k">struct</span> <span class="n">err</span> <span class="o">*</span><span class="n">err</span><span class="p">)</span>
<span class="p">{</span>
    <span class="cp">#define OOM "out of memory"
</span>    <span class="n">static_assert</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">err</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">OOM</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">err</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="n">OOM</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">OOM</span><span class="p">));</span>
<span class="p">}</span>

<span class="c1">// Or assignment (implicit memcpy):</span>

<span class="kt">void</span> <span class="nf">set_oom_v4</span><span class="p">(</span><span class="k">struct</span> <span class="n">err</span> <span class="o">*</span><span class="n">err</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">err</span> <span class="n">oom</span> <span class="o">=</span> <span class="p">{</span><span class="s">"out of memory"</span><span class="p">};</span>
    <span class="o">*</span><span class="n">err</span> <span class="o">=</span> <span class="n">oom</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This covers the vast majority of cases of already-correct <code class="language-plaintext highlighter-rouge">strcpy</code>.</p>

<h3 id="less-common-cases">Less common cases</h3>

<p><code class="language-plaintext highlighter-rouge">strcpy</code> can still be correct without knowing the exact source string
length. It is enough to know its <em>upper bound</em> does not exceed the
destination length. In this example — assuming the input is guaranteed to
be null-terminated — this <code class="language-plaintext highlighter-rouge">strcpy</code> is correct without ever knowing the
source string length:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">reply</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">log</span> <span class="p">{</span>
    <span class="kt">time_t</span> <span class="n">timestamp</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">message</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
<span class="p">};</span>

<span class="kt">void</span> <span class="nf">log_reply</span><span class="p">(</span><span class="k">struct</span> <span class="n">log</span> <span class="o">*</span><span class="n">e</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">reply</span> <span class="o">*</span><span class="n">r</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">e</span><span class="o">-&gt;</span><span class="n">timestamp</span> <span class="o">=</span> <span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="n">strcpy</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="n">r</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is a rare case where <code class="language-plaintext highlighter-rouge">strncpy</code> has the right semantics. It zeros out
unused destination bytes, destroying any previous contents.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">strncpy</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="n">r</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">));</span>

    <span class="c1">// In this case, same as:</span>
    <span class="n">memset</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">));</span>
    <span class="n">strcpy</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="n">r</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s not a general <code class="language-plaintext highlighter-rouge">strcpy</code> replacement because <code class="language-plaintext highlighter-rouge">strncpy</code> might not write
a null terminator. If the source string does not null-terminate within the
destination length, then neither will the destination string.</p>
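
<p>The missing terminator is easy to observe. This is a small sketch, and the helper name <code class="language-plaintext highlighter-rouge">strncpy_terminates</code> is invented for the demonstration: it copies with <code class="language-plaintext highlighter-rouge">strncpy</code> and then checks whether a null byte landed anywhere in the destination.</p>

```c
#include <string.h>

// Hypothetical demo helper: copies src into dst[0..n) with strncpy, then
// returns 1 if the destination contains a null terminator, otherwise 0.
int strncpy_terminates(char *dst, size_t n, const char *src)
{
    strncpy(dst, src, n);
    return memchr(dst, 0, n) != NULL;
}
```

<p>With a 4-byte destination, a 3-character source leaves room for the terminator, while a 4-character source fills the buffer and leaves it unterminated.</p>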

<p>As before, we can do better with <code class="language-plaintext highlighter-rouge">memcpy</code>!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">static_assert</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">r</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">));</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">e</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="n">r</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">r</span><span class="o">-&gt;</span><span class="n">message</span><span class="p">));</span>
</code></pre></div></div>

<p>This unconditionally copies 32 bytes. But doesn’t it waste time copying
bytes it won’t need? No! On modern hardware it’s far better to copy a
large, fixed number of bytes than a small, variable number of bytes. After
all, <a href="/blog/2017/10/06/">branching is expensive</a>. Searching for and handling that null
terminator has a cost. This fixed-size copy is literally two instructions
on x86-64 (output of <code class="language-plaintext highlighter-rouge">clang -march=x86-64-v3 -O3</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">vmovups</span>  <span class="nv">ymm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
<span class="nf">vmovups</span>  <span class="p">[</span><span class="nb">rdi</span> <span class="o">+</span> <span class="mi">8</span><span class="p">],</span> <span class="nv">ymm0</span>
</code></pre></div></div>

<p>It’s faster and there’s no <code class="language-plaintext highlighter-rouge">strcpy</code> to attract complaints.</p>

<h3 id="niche-cases">Niche cases</h3>

<p>So where <em>is</em> <code class="language-plaintext highlighter-rouge">strcpy</code> useful? Only where all of the following apply:</p>

<ol>
  <li>
    <p>You only know the upper bound of the source string.</p>
  </li>
  <li>
    <p>It’s undesirable to read beyond that length. Maybe storage is limited
to the exact length of the string, or the upper bound is very large so
an unconditional copy is too expensive.</p>
  </li>
  <li>
    <p>The source string is so long, and the function so hot, that it’s worth
avoiding two passes: <code class="language-plaintext highlighter-rouge">strlen</code> followed by <code class="language-plaintext highlighter-rouge">memcpy</code>.</p>
  </li>
</ol>

<p>These circumstances are very unusual, which makes <code class="language-plaintext highlighter-rouge">strcpy</code> a niche function
you probably don’t need. This is the best case I can imagine, and it’s
pretty dumb:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">doc</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">id</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">body</span><span class="p">[</span><span class="mi">1L</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">];</span>
<span class="p">};</span>

<span class="c1">// Create a new document from a buffer.</span>
<span class="c1">//</span>
<span class="c1">// If body is more than 1MiB, the behavior is undefined.</span>
<span class="k">struct</span> <span class="n">doc</span> <span class="o">*</span><span class="nf">doc_create</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">body</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">doc</span> <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">c</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">c</span><span class="o">-&gt;</span><span class="n">id</span> <span class="o">=</span> <span class="n">id_gen</span><span class="p">();</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">strlen</span><span class="p">(</span><span class="n">body</span><span class="p">)</span> <span class="o">&lt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">c</span><span class="o">-&gt;</span><span class="n">body</span><span class="p">));</span>
        <span class="n">strcpy</span><span class="p">(</span><span class="n">c</span><span class="o">-&gt;</span><span class="n">body</span><span class="p">,</span> <span class="n">body</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you’re dealing with such large null-terminated strings that (2) and (3)
apply then you’re already doing something fundamentally wrong and
self-contradictory. The pointer and length should be <a href="/blog/2019/06/30/">kept and passed
together</a>. It’s especially essential for a hot function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">doc_v2</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">id</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">body</span><span class="p">[];</span>
<span class="p">};</span>
</code></pre></div></div>
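
<p>With the length stored alongside the body, construction might look something like the following. This is only a sketch under stated assumptions: the constructor name <code class="language-plaintext highlighter-rouge">doc_v2_create</code> and its signature are hypothetical, and the ID is passed in as a parameter because <code class="language-plaintext highlighter-rouge">id_gen</code> is not defined here.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct doc_v2 {
    unsigned long long id;
    size_t len;
    char body[];
};

// Hypothetical constructor. The caller already knows the length, so there
// is no terminator search and no fixed upper bound on the body size.
// body must point to at least len readable bytes.
struct doc_v2 *doc_v2_create(unsigned long long id, const char *body, size_t len)
{
    if (len > SIZE_MAX - sizeof(struct doc_v2)) {
        return 0;  // the allocation size would overflow
    }
    struct doc_v2 *c = malloc(sizeof(*c) + len);
    if (c) {
        c->id = id;
        c->len = len;
        memcpy(c->body, body, len);  // exactly len bytes, no terminator stored
    }
    return c;
}
```

<p>The allocation is sized to the actual body, and the copy is a single <code class="language-plaintext highlighter-rouge">memcpy</code> of a known length.</p>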

<h3 id="bonus-_s-isnt-helping-you">Bonus: <code class="language-plaintext highlighter-rouge">*_s</code> isn’t helping you</h3>

<p>C11 introduced “safe” string functions as an optional “Annex K”, each
named with a <code class="language-plaintext highlighter-rouge">_s</code> suffix to its “unsafe” counterpart. Here is the
prototype for <code class="language-plaintext highlighter-rouge">strcpy_s</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">errno_t</span> <span class="nf">strcpy_s</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">s1</span><span class="p">,</span>
                 <span class="n">rsize_t</span> <span class="n">s1max</span><span class="p">,</span>
                 <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">s2</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">rsize_t</code> is a <code class="language-plaintext highlighter-rouge">size_t</code> with a “restricted” range (<code class="language-plaintext highlighter-rouge">RSIZE_MAX</code>,
probably <code class="language-plaintext highlighter-rouge">SIZE_MAX/2</code>) intended to catch integer underflows. If you
<a href="/blog/2017/07/19/">accidentally compute a negative length</a>, it will be a very large
number in unsigned form. (An indicator that <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf"><code class="language-plaintext highlighter-rouge">size_t</code> should have
originally been defined as signed</a>.) This will be outside the
restricted range, and so the operation isn’t attempted due to a likely
underflow.</p>
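
<p>The idea behind the restricted range can be sketched in a few lines. The names <code class="language-plaintext highlighter-rouge">MY_RSIZE_MAX</code>, <code class="language-plaintext highlighter-rouge">rsize_ok</code>, and <code class="language-plaintext highlighter-rouge">remaining_ok</code> are invented for illustration, and the limit of <code class="language-plaintext highlighter-rouge">SIZE_MAX/2</code> is an assumption, not what any particular implementation uses.</p>

```c
#include <stddef.h>
#include <stdint.h>

// Assumption for illustration only: a restricted maximum of SIZE_MAX/2.
// A real implementation may define RSIZE_MAX differently.
#define MY_RSIZE_MAX (SIZE_MAX / 2)

// Returns 1 if n falls within the restricted range, otherwise 0.
int rsize_ok(size_t n)
{
    return n <= MY_RSIZE_MAX;
}

// Hypothetical caller: computes the remaining capacity of a buffer. If
// used > cap, the unsigned subtraction wraps around to a huge value,
// which the range check then rejects.
int remaining_ok(size_t cap, size_t used)
{
    size_t remaining = cap - used;
    return rsize_ok(remaining);
}
```

<p>A wrapped-around length lands in the upper half of the <code class="language-plaintext highlighter-rouge">size_t</code> range, so a single comparison is enough to flag it.</p>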

<p>These “safe” functions were modeled after functions of the same name in
MSVC. However, as noted, there are no practical implementations of Annex
K. The functions in MSVC have different semantics and behavior, and they
do not attempt to implement the standard.</p>

<p>Worse, they don’t even do what’s promised in <a href="https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strcpy-s-wcscpy-s-mbscpy-s?view=msvc-160">their documentation</a>.
The following program should cause a runtime-constraint violation since
<code class="language-plaintext highlighter-rouge">-1</code> is an invalid <code class="language-plaintext highlighter-rouge">rsize_t</code> in any reasonable implementation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define __STDC_WANT_LIB_EXT1__ 1
#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">errno_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">strcpy_s</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="s">"hello"</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%d %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">r</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With the latest MSVC as of this writing (VS 2019), this program prints “<code class="language-plaintext highlighter-rouge">0
hello</code>”. Using <code class="language-plaintext highlighter-rouge">strcpy_s</code> did not make my program any safer than had I
just used <code class="language-plaintext highlighter-rouge">strcpy</code>. If anything, it’s <em>less safe</em> due to a false sense of
security. Don’t use these functions.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>More DLL fun with w64devkit: Go, assembly, and Python</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/06/29/"/>
    <id>urn:uuid:b2c53451-b12a-4f1a-a475-6c81096c9b5a</id>
    <updated>2021-06-29T21:50:30Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>My previous article explained <a href="/blog/2021/05/31/">how to work with dynamic-link libraries
(DLLs) using w64devkit</a>. These techniques also apply to other
circumstances, including with languages and ecosystems outside of C and
C++. In particular, <a href="/blog/2020/05/15/">w64devkit</a> is a great complement to Go and reliably
fulfills all the needs of <a href="https://golang.org/cmd/cgo/">cgo</a>, Go’s C interop, and can even
bootstrap Go itself. As before, this article is in large part an exercise
in capturing practical information I’ve picked up over time.</p>

<h3 id="go-bootstrap-and-cgo">Go: bootstrap and cgo</h3>

<p>The primary Go implementation, confusingly <a href="https://golang.org/doc/faq#What_compiler_technology_is_used_to_build_the_compilers">named “gc”</a>, is an
<a href="/blog/2020/01/21/">incredible piece of software engineering</a>. This is apparent when
building the Go toolchain itself, a process that is fast, reliable, easy,
and simple. It was originally written in C, but was re-written in Go
starting with Go 1.5. The C compiler in w64devkit can build the original C
implementation which then can be used to bootstrap any more recent
version. It’s so easy that I personally never use official binary releases
and always bootstrap from source.</p>

<p>You will need the Go 1.4 source, <a href="https://dl.google.com/go/go1.4-bootstrap-20171003.tar.gz">go1.4-bootstrap-20171003.tar.gz</a>.
This “bootstrap” tarball is the last Go 1.4 release plus a few additional
bugfixes. You will also need the source of the actual version of Go you
want to use, such as Go 1.16.5 (latest version as of this writing).</p>

<p>Start by building Go 1.4 using w64devkit. On Windows, Go is built using a
batch script and no special build system is needed. Since it shouldn’t be
invoked with the BusyBox ash shell, I use <a href="/blog/2021/02/08/"><code class="language-plaintext highlighter-rouge">cmd.exe</code></a> explicitly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xf go1.4-bootstrap-20171003.tar.gz
$ mv go/ bootstrap
$ (cd bootstrap/src/ &amp;&amp; cmd /c make)
</code></pre></div></div>

<p>In about 30 seconds you’ll have a fully-working Go 1.4 toolchain. Next use
it to build the desired toolchain. You can move this new toolchain after
it’s built if necessary.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export GOROOT_BOOTSTRAP="$PWD/bootstrap"
$ tar xf go1.16.5.src.tar.gz
$ (cd go/src/ &amp;&amp; cmd /c make)
</code></pre></div></div>

<p>At this point you can delete the bootstrap toolchain. You probably also
want to put Go on your PATH.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ rm -rf bootstrap/
$ printf 'PATH="$PATH;%s/go/bin"\n' "$PWD" &gt;&gt;~/.profile
$ source ~/.profile
</code></pre></div></div>

<p>Not only is Go now available, so is the full power of cgo. (Including <a href="https://dave.cheney.net/2016/01/18/cgo-is-not-go">its
costs</a> if used.)</p>

<h3 id="vim-suggestions">Vim suggestions</h3>

<p>Since w64devkit is oriented so much around Vim, here’s my personal Vim
configuration for Go. I don’t need or want fancy plugins, just access to
<code class="language-plaintext highlighter-rouge">goimports</code> and a couple of corrections to Vim’s built-in Go support (<code class="language-plaintext highlighter-rouge">[[</code>
and <code class="language-plaintext highlighter-rouge">]]</code> navigation). The included <code class="language-plaintext highlighter-rouge">ctags</code> understands Go, so tags
navigation works the same as it does with C. <code class="language-plaintext highlighter-rouge">\i</code> saves the current
buffer, runs <code class="language-plaintext highlighter-rouge">goimports</code>, and populates the quickfix list with any errors.
Similarly <code class="language-plaintext highlighter-rouge">:make</code> invokes <code class="language-plaintext highlighter-rouge">go build</code> and, as expected, populates the
quickfix list.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code>autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="k">setlocal</span> <span class="nb">makeprg</span><span class="p">=</span><span class="k">go</span>\ build
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">silent</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">&lt;</span>leader<span class="p">&gt;</span><span class="k">i</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">update</span> \<span class="p">|</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">cexpr</span> <span class="nb">system</span><span class="p">(</span><span class="s2">"goimports -w "</span> <span class="p">.</span> <span class="nb">expand</span><span class="p">(</span><span class="s2">"%"</span><span class="p">))</span> \<span class="p">|</span>
<span class="se">    \</span> <span class="p">:</span><span class="k">silent</span> <span class="k">edit</span><span class="p">&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">[[</span>
<span class="se">    \</span> ?^\<span class="p">(</span>func\\<span class="p">|</span>var\\<span class="p">|</span><span class="nb">type</span>\\<span class="p">|</span><span class="k">import</span>\\<span class="p">|</span>package\<span class="p">)</span>\<span class="p">&gt;&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
autocmd <span class="nb">FileType</span> <span class="k">go</span> <span class="nb">map</span> <span class="p">&lt;</span><span class="k">buffer</span><span class="p">&gt;</span> <span class="p">]]</span>
<span class="se">    \</span> /^\<span class="p">(</span>func\\<span class="p">|</span>var\\<span class="p">|</span><span class="nb">type</span>\\<span class="p">|</span><span class="k">import</span>\\<span class="p">|</span>package\<span class="p">)</span>\<span class="p">&gt;&lt;</span><span class="k">cr</span><span class="p">&gt;</span>
</code></pre></div></div>

<p>Go only comes with <code class="language-plaintext highlighter-rouge">gofmt</code> but <code class="language-plaintext highlighter-rouge">goimports</code> is just one command away, so
there’s little excuse not to have it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ go install golang.org/x/tools/cmd/goimports@latest
</code></pre></div></div>

<p>Thanks to GOPROXY, all Go dependencies are accessible without (or before)
installing Git, so this tool installation works with nothing more than
w64devkit and a bootstrapped Go toolchain.</p>

<h3 id="cgo-dlls">cgo DLLs</h3>

<p>The intricacies of cgo are beyond the scope of this article, but the gist
is that a Go source file contains C source in a comment followed by
<code class="language-plaintext highlighter-rouge">import "C"</code>. The imported <code class="language-plaintext highlighter-rouge">C</code> object provides access to C types and
functions. Go functions marked with an <code class="language-plaintext highlighter-rouge">//export</code> comment, as well as the
commented C code, are accessible to C. The latter means we can use Go to
implement a C interface in a DLL, and the caller will have no idea they’re
actually talking to Go.</p>

<p>To illustrate, here’s a little C interface. To keep it simple, I’ve
specifically sidestepped some more complicated issues, particularly
involving memory management.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Which DLL am I running?</span>
<span class="kt">int</span> <span class="nf">version</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>

<span class="c1">// Generate 64 bits from a CSPRNG.</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="nf">rand64</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>

<span class="c1">// Compute the Euclidean norm.</span>
<span class="kt">float</span> <span class="nf">dist</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s a C implementation which I’m calling “version 1”.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;math.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;ntsecapi.h&gt;</span><span class="cp">
</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">int</span>
<span class="nf">version</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>

<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span>
<span class="nf">rand64</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">RtlGenRandom</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">x</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">float</span>
<span class="nf">dist</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As discussed in the previous article, each function is exported using
<code class="language-plaintext highlighter-rouge">__declspec(dllexport)</code> so that it’s available for import. The build
command is the same as before:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -s -o hello1.dll hello1.c
</code></pre></div></div>

<p>Side note: This could be trivially converted into a C++ implementation
just by adding <code class="language-plaintext highlighter-rouge">extern "C"</code> to each declaration. It disables C++ features
like name mangling, and follows the C ABI so that the C++ functions appear
as C functions. Compiling the C++ DLL is exactly the same.</p>

<p>Suppose we wanted to implement this in Go instead of C. We already have
all the tools needed to do so. Here’s a Go implementation, “version 2”:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>

<span class="k">import</span> <span class="s">"C"</span>
<span class="k">import</span> <span class="p">(</span>
	<span class="s">"crypto/rand"</span>
	<span class="s">"encoding/binary"</span>
	<span class="s">"math"</span>
<span class="p">)</span>

<span class="c">//export version</span>
<span class="k">func</span> <span class="n">version</span><span class="p">()</span> <span class="n">C</span><span class="o">.</span><span class="kt">int</span> <span class="p">{</span>
	<span class="k">return</span> <span class="m">2</span>
<span class="p">}</span>

<span class="c">//export rand64</span>
<span class="k">func</span> <span class="n">rand64</span><span class="p">()</span> <span class="n">C</span><span class="o">.</span><span class="n">ulonglong</span> <span class="p">{</span>
	<span class="k">var</span> <span class="n">buf</span> <span class="p">[</span><span class="m">8</span><span class="p">]</span><span class="kt">byte</span>
	<span class="n">rand</span><span class="o">.</span><span class="n">Read</span><span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="o">:</span><span class="p">])</span>
	<span class="n">r</span> <span class="o">:=</span> <span class="n">binary</span><span class="o">.</span><span class="n">LittleEndian</span><span class="o">.</span><span class="n">Uint64</span><span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="o">:</span><span class="p">])</span>
	<span class="k">return</span> <span class="n">C</span><span class="o">.</span><span class="n">ulonglong</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="p">}</span>

<span class="c">//export dist</span>
<span class="k">func</span> <span class="n">dist</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span><span class="p">)</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span> <span class="p">{</span>
	<span class="k">return</span> <span class="n">C</span><span class="o">.</span><span class="n">float</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="kt">float64</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">)))</span>
<span class="p">}</span>

<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note the use of C types for all arguments and return values. The <code class="language-plaintext highlighter-rouge">main</code>
function is required since this is the main package, but it will never be
called. The DLL is built like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ go build -buildmode=c-shared -o hello2.dll hello2.go
</code></pre></div></div>

<p>Without the <code class="language-plaintext highlighter-rouge">-o</code> option, the output file will lack a <code class="language-plaintext highlighter-rouge">.dll</code> extension.
That works fine since the extension is mostly only convention on Windows,
but a DLL without one may be confusing.</p>

<p>What if we need an import library? This will be required when linking with
the MSVC toolchain. In the previous article we asked Binutils to generate
one using <code class="language-plaintext highlighter-rouge">--out-implib</code>. For Go we have to handle this ourselves via
<code class="language-plaintext highlighter-rouge">gendef</code> and <code class="language-plaintext highlighter-rouge">dlltool</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gendef hello2.dll
$ dlltool -l hello2.lib -d hello2.def
</code></pre></div></div>

<p>The only way anyone upgrading would know version 2 was implemented in Go
is that the DLL is a lot bigger (a few MB vs. a few kB) since it now
contains an entire Go runtime.</p>

<h3 id="nasm-assembly-dll">NASM assembly DLL</h3>

<p>We could also go the other direction and implement the DLL using plain
assembly. It won’t even require linking against a C runtime.</p>

<p>w64devkit includes two assemblers: GAS (Binutils) which is used by GCC,
and NASM which has <a href="https://elronnd.net/writ/2021-02-13_att-asm.html">friendlier syntax</a>. I prefer the latter whenever
possible — exactly why I included NASM in the distribution. So here’s how
I implemented “version 3” in NASM assembly.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">bits</span> <span class="mi">64</span>

<span class="nf">section</span> <span class="nv">.text</span>

<span class="nf">global</span> <span class="nb">Dl</span><span class="nv">lMainCRTStartup</span>
<span class="nf">export</span> <span class="nb">Dl</span><span class="nv">lMainCRTStartup</span>
<span class="nl">DllMainCRTStartup:</span>
	<span class="nf">mov</span> <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
	<span class="nf">ret</span>

<span class="nf">global</span> <span class="nv">version</span>
<span class="nf">export</span> <span class="nv">version</span>
<span class="nl">version:</span>
	<span class="nf">mov</span> <span class="nb">eax</span><span class="p">,</span> <span class="mi">3</span>
	<span class="nf">ret</span>

<span class="nf">global</span> <span class="nv">rand64</span>
<span class="nf">export</span> <span class="nv">rand64</span>
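<span class="nl">rand64:</span>
	<span class="nf">rdrand</span> <span class="nb">rax</span>
	<span class="nf">jnc</span> <span class="nv">rand64</span>  <span class="c1">; CF=0 means no random value was ready, so retry</span>
	<span class="nf">ret</span>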

<span class="nf">global</span> <span class="nb">di</span><span class="nv">st</span>
<span class="nf">export</span> <span class="nb">di</span><span class="nv">st</span>
<span class="nl">dist:</span>
	<span class="nf">mulss</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
	<span class="nf">mulss</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
	<span class="nf">addss</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
	<span class="nf">sqrtss</span> <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
	<span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">global</code> directive is common in NASM assembly and causes the named
symbol to have the external linkage needed when linking the DLL. The
<code class="language-plaintext highlighter-rouge">export</code> directive is Windows-specific and is equivalent to <code class="language-plaintext highlighter-rouge">dllexport</code> in
C.</p>

<p>Every DLL must have an entrypoint, usually named <code class="language-plaintext highlighter-rouge">DllMainCRTStartup</code>. The
return value indicates whether the DLL loaded successfully, which is why the
stub above simply returns 1. So far this has been handled automatically by
the C implementation, but at this low level we must define it explicitly.</p>

<p>Here’s how to assemble and link the DLL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nasm -fwin64 -o hello3.o hello3.s
$ ld -shared -s -o hello3.dll hello3.o
</code></pre></div></div>

<h3 id="call-the-dlls-from-python">Call the DLLs from Python</h3>

<p>Python has a nice, built-in C interop, <code class="language-plaintext highlighter-rouge">ctypes</code>, that allows Python to
call arbitrary C functions in shared libraries, including DLLs, without
writing C to glue it together. To tie this all off, here’s a Python
program that loads all of the DLLs above and invokes each of the
functions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ctypes</span>

<span class="k">def</span> <span class="nf">load</span><span class="p">(</span><span class="n">version</span><span class="p">):</span>
    <span class="n">hello</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">CDLL</span><span class="p">(</span><span class="sa">f</span><span class="s">"./hello</span><span class="si">{</span><span class="n">version</span><span class="si">}</span><span class="s">.dll"</span><span class="p">)</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_int</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">()</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">(</span><span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span><span class="p">,</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_float</span><span class="p">)</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">.</span><span class="n">restype</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">c_ulonglong</span>
    <span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">()</span>
    <span class="k">return</span> <span class="n">hello</span>

<span class="k">for</span> <span class="n">hello</span> <span class="ow">in</span> <span class="n">load</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">load</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">load</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"version"</span><span class="p">,</span> <span class="n">hello</span><span class="p">.</span><span class="n">version</span><span class="p">())</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"rand   "</span><span class="p">,</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">hello</span><span class="p">.</span><span class="n">rand64</span><span class="p">()</span><span class="si">:</span><span class="mi">016</span><span class="n">x</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"dist   "</span><span class="p">,</span> <span class="n">hello</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
</code></pre></div></div>

<p>After loading the DLL with <code class="language-plaintext highlighter-rouge">CDLL</code> the program defines each function
prototype so that Python knows how to call it. Unfortunately it’s not
possible to build Python with w64devkit, so you’ll also need to install
the standard CPython distribution in order to run it. Here’s the output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python finale.py
version 1
rand    b011ea9bdbde4bdf
dist    5.0
version 2
rand    f7c86ff06ae3d1a2
dist    5.0
version 3
rand    2a35a05b0482c898
dist    5.0
</code></pre></div></div>

<p>That output is the result of four different languages interfacing in one
process: C, Go, x86-64 assembly, and Python. Pretty neat if you ask me!</p>
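
<p>A footnote on the prototypes, offered as a minimal sketch rather than part
of the programs above: <code class="language-plaintext highlighter-rouge">ctypes</code> assumes a function returns a C <code class="language-plaintext highlighter-rouge">int</code> unless
<code class="language-plaintext highlighter-rouge">restype</code> says otherwise, so without the <code class="language-plaintext highlighter-rouge">c_ulonglong</code> declaration the
result of <code class="language-plaintext highlighter-rouge">rand64</code> would be cut down to its low bits. The snippet needs no
DLL. It pushes a sample value through the <code class="language-plaintext highlighter-rouge">ctypes</code> types directly, which
only models the conversion, and the variable names are made up for
illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ctypes

# A 64-bit value like those rand64 returns (copied from the output above).
full = 0xb011ea9bdbde4bdf

# restype = c_ulonglong keeps all 64 bits.
kept = ctypes.c_ulonglong(full).value

# Left unset, restype defaults to a C int. Converting through c_int models
# what would survive: only the low bits, read as a signed number.
lost = ctypes.c_int(full).value

# argtypes = (c_float, c_float) is what lets dist(3, 4) accept Python ints:
# ctypes converts each argument to a C float before the call.
x = ctypes.c_float(3).value

print(f"{kept:016x}")
print(lost)
print(x)
</code></pre></div></div>

<p>The same reasoning applies to <code class="language-plaintext highlighter-rouge">dist</code>: with the default <code class="language-plaintext highlighter-rouge">int</code> return type,
Python would not read the result as a floating point value at all.</p>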

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to build and use DLLs on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/05/31/"/>
    <id>urn:uuid:6b64024a-6945-4bff-8226-33b9357babda</id>
    <updated>2021-05-31T02:13:40Z</updated>
    <category term="win32"/><category term="c"/><category term="cpp"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>I’ve recently been involved with a couple of discussions about Windows’
dynamic linking. One was with <a href="https://begriffs.com/">Joe Nelson</a> in considering how to make
<a href="https://github.com/begriffs/libderp">libderp</a> accessible on Windows, and the other was about <a href="/blog/2020/05/15/">w64devkit</a>,
my Mingw-w64 distribution. I use these techniques so infrequently that I
need to figure it all out again each time I need it. Unfortunately there’s
a whole lot of outdated and incorrect information online which gets in the
way every time this happens. While it’s all fresh in my head, I will now
document what I know works.</p>

<p>In this article, all commands and examples are being run in the context of
w64devkit (1.8.0).</p>

<h3 id="mingw-w64">Mingw-w64</h3>

<p>If all you care about is the GNU toolchain then DLLs are straightforward,
working mostly like shared objects on other platforms. To illustrate,
let’s build a “square” library with one “exported” function, <code class="language-plaintext highlighter-rouge">square</code>,
that returns the square of its input (<code class="language-plaintext highlighter-rouge">square.c</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The header file (<code class="language-plaintext highlighter-rouge">square.h</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifndef SQUARE_H
#define SQUARE_H
</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>

<span class="cp">#endif
</span></code></pre></div></div>

<p>To build a stripped, size-optimized DLL, <code class="language-plaintext highlighter-rouge">square.dll</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -s -o square.dll square.c
</code></pre></div></div>

<p>Now a test program to link against it (<code class="language-plaintext highlighter-rouge">main.c</code>), which “imports” <code class="language-plaintext highlighter-rouge">square</code>
from <code class="language-plaintext highlighter-rouge">square.dll</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">"square.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">square</span><span class="p">(</span><span class="mi">2</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Linking and testing it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -Os -s main.c square.dll
$ ./a
4
</code></pre></div></div>

<p>It’s that simple. Or more traditionally, using the <code class="language-plaintext highlighter-rouge">-l</code> flag:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -Os -s -L. main.c -lsquare
</code></pre></div></div>

<p>Given <code class="language-plaintext highlighter-rouge">-lxyz</code> GCC will look for <code class="language-plaintext highlighter-rouge">xyz.dll</code> in the library path.</p>

<h4 id="viewing-exported-symbols">Viewing exported symbols</h4>

<p>Printing a list of the functions a DLL exports is not so
straightforward. For ELF shared objects there’s <code class="language-plaintext highlighter-rouge">nm -D</code>, but despite what
the internet will tell you, this tool does not support DLLs. <code class="language-plaintext highlighter-rouge">objdump</code>
will print the exports as part of the “private” headers (<code class="language-plaintext highlighter-rouge">-p</code>). A bit of
<code class="language-plaintext highlighter-rouge">awk</code> can cut this down to just a list of exports. Since we’ll need this a
few times, here’s a script, <code class="language-plaintext highlighter-rouge">exports.sh</code>, that composes <code class="language-plaintext highlighter-rouge">objdump</code> and
<code class="language-plaintext highlighter-rouge">awk</code> into the tool I want:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="nb">printf</span> <span class="s1">'LIBRARY %s\nEXPORTS\n'</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>
objdump <span class="nt">-p</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> | <span class="nb">awk</span> <span class="s1">'/^$/{t=0} {if(t)print$NF} /^\[O/{t=1}'</span>
</code></pre></div></div>

<p>Running this on <code class="language-plaintext highlighter-rouge">square.dll</code> above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
square
</code></pre></div></div>

<p>This can be helpful when debugging. It also works outside of Windows, such
as on Linux. By the way, the output format is no accident: This is the
<a href="https://sourceware.org/binutils/docs/binutils/def-file-format.html"><code class="language-plaintext highlighter-rouge">.def</code> file format</a> (<a href="https://www.willus.com/mingw/yongweiwu_stdcall.html">also</a>), which will be particularly
useful in a moment.</p>
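
<p>If the <code class="language-plaintext highlighter-rouge">awk</code> one-liner is hard to read, the same filter can be sketched in
Python. This is an illustration under assumptions, not a replacement for
the script: the function names <code class="language-plaintext highlighter-rouge">exports</code> and <code class="language-plaintext highlighter-rouge">to_def</code> are hypothetical, the
sample text is a hand-written stand-in for <code class="language-plaintext highlighter-rouge">objdump -p</code> output rather than a
captured run, and whitespace-only lines are treated as blank, which is
slightly more lenient than the <code class="language-plaintext highlighter-rouge">awk</code> pattern.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def exports(dump_text):
    """Return the export names found in the output of objdump -p."""
    names = []
    in_table = False
    for line in dump_text.splitlines():
        if not line.strip():
            in_table = False                # a blank line ends the table
        elif in_table:
            names.append(line.split()[-1])  # the name is the last field
        if line.startswith("[O"):
            in_table = True                 # the table starts after this line
    return names

def to_def(library, names):
    """Format the names the way exports.sh prints them."""
    return "\n".join([f"LIBRARY {library}", "EXPORTS", *names])

# Hand-written stand-in for the relevant part of the objdump output.
sample = "[Ordinal/Name Pointer] Table\n\t[   0] square\n\nother text\n"
print(to_def("square.dll", exports(sample)))
</code></pre></div></div>

<p>As in the shell script, the header line switches the filter on only after
it has been skipped, so the table heading itself is never printed.</p>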

<p>Mingw-w64 has a <code class="language-plaintext highlighter-rouge">gendef</code> tool to produce the above output, and this tool
is now included in w64devkit. To print the exports to standard output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gendef - square.dll
LIBRARY "square.dll"
EXPORTS
square
</code></pre></div></div>

<p>Alternatively Visual Studio provides <code class="language-plaintext highlighter-rouge">dumpbin</code>. It’s not as concise as
<code class="language-plaintext highlighter-rouge">exports.sh</code> but it’s a lot less verbose than <code class="language-plaintext highlighter-rouge">objdump -p</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dumpbin /nologo /exports square.dll
...
          1    0 000012B0 square
...
</code></pre></div></div>

<h4 id="mingw-w64-improved">Mingw-w64 (improved)</h4>

<p>You can get by without knowing anything more, which is usually enough for
those looking to support Windows as a secondary platform, even just as a
cross-compilation target. However, with a bit more work we can do better.
Imagine doing the above with a non-trivial program. GCC doesn’t know which
functions are part of the API and which are not. Obviously static
functions should not be exported, but what about non-static functions
visible between translation units (i.e. object files)?</p>

<p>For instance, suppose <code class="language-plaintext highlighter-rouge">square.c</code> also has this function which is not part
of its API but may be called by another translation unit.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">internal_func</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{}</span>
</code></pre></div></div>

<p>Now when I rebuild and list the exports:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
internal_func
square
</code></pre></div></div>

<p>On the other side, when I build <code class="language-plaintext highlighter-rouge">main.c</code> how does it know which functions
are imported from a DLL and which will be found in another translation
unit? GCC makes it work regardless, but it can generate more efficient
code if it knows at compile time (vs. link time).</p>

<p>On Windows both are solved by adding <code class="language-plaintext highlighter-rouge">__declspec</code> notation on both sides.
In <code class="language-plaintext highlighter-rouge">square.c</code> the exports are marked as <code class="language-plaintext highlighter-rouge">dllexport</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllexport</span><span class="p">)</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">internal_func</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{}</span>
</code></pre></div></div>

<p>In the header, it’s marked as an import:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>
</code></pre></div></div>

<p>The mere presence of <code class="language-plaintext highlighter-rouge">dllexport</code> tells the linker to only export those
functions marked as exports, and so <code class="language-plaintext highlighter-rouge">internal_func</code> disappears from the
exports list. Convenient!</p>

<p>On the import side, during compilation of the original program, GCC
assumed <code class="language-plaintext highlighter-rouge">square</code> wasn’t an import and generated a local function call.
When the linker later resolved the symbol to the DLL, it generated a
trampoline to fill in as that local function (like a <a href="https://www.airs.com/blog/archives/41">PLT</a>). With
<code class="language-plaintext highlighter-rouge">dllimport</code>, GCC knows it’s an imported function and so doesn’t go through
a trampoline.</p>

<p>While generally unnecessary for the GNU toolchain, it’s good hygiene to
use <code class="language-plaintext highlighter-rouge">__declspec</code>. It’s also mandatory when using <a href="https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B">MSVC</a>, in case you
care about that as well.</p>

<h3 id="msvc">MSVC</h3>

<p>Mingw-w64-compiled DLLs will work with <code class="language-plaintext highlighter-rouge">LoadLibrary</code> out of the box, which
is sufficient in many cases, such as for dynamically-loaded plugins. For
example (<code class="language-plaintext highlighter-rouge">loadlib.c</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">LoadLibrary</span><span class="p">(</span><span class="s">"square.dll"</span><span class="p">);</span>
    <span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">square</span><span class="p">)(</span><span class="kt">long</span><span class="p">)</span> <span class="o">=</span> <span class="n">GetProcAddress</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"square"</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">square</span><span class="p">(</span><span class="mi">2</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled with MSVC <code class="language-plaintext highlighter-rouge">cl</code> (via <a href="/blog/2016/06/13/#visual-c"><code class="language-plaintext highlighter-rouge">vcvars.bat</code></a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo loadlib.c
$ ./loadlib
4
</code></pre></div></div>

<p>However, the MSVC linker, unlike Binutils <code class="language-plaintext highlighter-rouge">ld</code>, cannot link directly with
DLLs. It requires an <em>import library</em>. Conventionally this matches the DLL
name but has a <code class="language-plaintext highlighter-rouge">.lib</code> extension — <code class="language-plaintext highlighter-rouge">square.lib</code> in this case. The Mingw-w64
ecosystem conventionally uses <code class="language-plaintext highlighter-rouge">.dll.a</code>, as in <code class="language-plaintext highlighter-rouge">square.dll.a</code>, in order to
distinguish it from a static library, but it’s the same format. The most
convenient way to get an import library is to ask GCC to generate one at
link-time via <code class="language-plaintext highlighter-rouge">--out-implib</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Wl,--out-implib,square.lib -o square.dll square.c
</code></pre></div></div>

<p>Back to <code class="language-plaintext highlighter-rouge">cl</code>, just add <code class="language-plaintext highlighter-rouge">square.lib</code> as another input. You don’t actually
need <code class="language-plaintext highlighter-rouge">square.dll</code> present at link time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /Os main.c square.lib
$ ./main
4
</code></pre></div></div>

<p>What if you already have the DLL and you just need an import library? GNU
Binutils’ <code class="language-plaintext highlighter-rouge">dlltool</code> can do this, though not without help. It cannot
generate an import library from a DLL alone since it requires a <code class="language-plaintext highlighter-rouge">.def</code>
file enumerating the exports. (Why?) What luck that we have a tool for
this!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./exports.sh square.dll &gt;square.def
$ dlltool --input-def square.def --output-lib square.lib
</code></pre></div></div>

<h3 id="reversing-directions">Reversing directions</h3>

<p>Going the other way, building a DLL with MSVC and linking it with
Mingw-w64, is nearly as easy as the pure Mingw-w64 case, though it
requires that all exports are tagged with <code class="language-plaintext highlighter-rouge">dllexport</code>. The <code class="language-plaintext highlighter-rouge">/LD</code> option
(case sensitive) is just like GCC’s <code class="language-plaintext highlighter-rouge">-shared</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cl /nologo /LD /Os square.c
$ cc -Os -s main.c square.dll
$ ./a
4
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">cl</code> outputs three files: <code class="language-plaintext highlighter-rouge">square.dll</code>, <code class="language-plaintext highlighter-rouge">square.lib</code>, and <code class="language-plaintext highlighter-rouge">square.exp</code>.
The last can be discarded, and the second will be needed if linking with
MSVC, but as before, Mingw-w64 requires only the first.</p>

<p>This all demonstrates that Mingw-w64 and MSVC are quite interoperable — at
least for C interfaces that <a href="/blog/2023/08/27/">don’t share CRT objects</a>.</p>

<h3 id="tying-it-all-together">Tying it all together</h3>

<p>If your program is designed to be portable, those <code class="language-plaintext highlighter-rouge">__declspec</code> will get in
the way. That can be tidied up with some macros, but even better, those
macros can be used to control ELF symbol visibility so that the library
has good hygiene on, say, Linux as well.</p>

<p>The strategy will be to mark all API functions with <code class="language-plaintext highlighter-rouge">SQUARE_API</code> and
expand that to whatever is necessary at the time. When building a library,
it will expand to <code class="language-plaintext highlighter-rouge">dllexport</code>, or default visibility on unix-likes. When
consuming a library it will expand to <code class="language-plaintext highlighter-rouge">dllimport</code>, or nothing outside of
Windows. The new <code class="language-plaintext highlighter-rouge">square.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifndef SQUARE_H
#define SQUARE_H
</span>
<span class="cp">#if defined(SQUARE_BUILD)
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllexport)
#  elif defined(__ELF__)
#    define SQUARE_API __attribute__ ((visibility ("default")))
#  else
#    define SQUARE_API
#  endif
#else
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllimport)
#  else
#    define SQUARE_API
#  endif
#endif
</span>
<span class="n">SQUARE_API</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>

<span class="cp">#endif
</span></code></pre></div></div>

<p>The new <code class="language-plaintext highlighter-rouge">square.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SQUARE_BUILD
#include</span> <span class="cpf">"square.h"</span><span class="cp">
</span>
<span class="n">SQUARE_API</span>
<span class="kt">long</span> <span class="nf">square</span><span class="p">(</span><span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">main.c</code> remains the same. When compiling on unix-like systems, add
<code class="language-plaintext highlighter-rouge">-fvisibility=hidden</code> to hide all symbols by default so that this macro
can reveal them.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -Os -fvisibility=hidden -s -o libsquare.so square.c
$ cc -Os -s main.c ./libsquare.so
$ ./a.out
4
</code></pre></div></div>

<h3 id="makefile-ideas">Makefile ideas</h3>

<p>While Mingw-w64 hides a lot of the differences between Windows and
unix-like systems, when it comes to dynamic libraries it can only do so
much, especially if you care about import libraries. If I were maintaining
a dynamic library — unlikely since I strongly prefer embedding or static
linking — I’d probably just use different <a href="/blog/2017/08/20/">Makefiles</a> per toolchain
and target. Aside from the <code class="language-plaintext highlighter-rouge">SQUARE_API</code> type of macros, the source code
can fortunately remain fairly agnostic about it.</p>

<p>Here’s what I might use as <code class="language-plaintext highlighter-rouge">NMakefile</code> for MSVC <code class="language-plaintext highlighter-rouge">nmake</code>:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>     <span class="o">=</span> cl /nologo
<span class="nv">CFLAGS</span> <span class="o">=</span> /Os

<span class="nl">all</span><span class="o">:</span> <span class="nf">main.exe square.dll square.lib</span>

<span class="nl">main.exe</span><span class="o">:</span> <span class="nf">main.c square.h square.lib</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> main.c square.lib

<span class="nl">square.dll</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> /LD <span class="nv">$(CFLAGS)</span> square.c

<span class="nl">square.lib</span><span class="o">:</span> <span class="nf">square.dll</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="p">-</span>del /f main.exe square.dll square.lib square.exp
</code></pre></div></div>

<p>Usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nmake /nologo /f NMakefile
</code></pre></div></div>

<p>For w64devkit and cross-compiling, <code class="language-plaintext highlighter-rouge">Makefile.w64</code>, which includes
import library generation for the sake of MSVC consumers:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>      <span class="o">=</span> cc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Os</span>
<span class="nv">LDFLAGS</span> <span class="o">=</span> <span class="nt">-s</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">main.exe square.dll square.lib</span>

<span class="nl">main.exe</span><span class="o">:</span> <span class="nf">main.c square.dll square.h</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> main.c square.dll <span class="nv">$(LDLIBS)</span>

<span class="nl">square.dll</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> <span class="nt">-shared</span> <span class="nt">-Wl</span>,--out-implib,<span class="err">$</span><span class="o">(</span>@:dll<span class="o">=</span>lib<span class="o">)</span> <span class="se">\</span>
	    <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> square.c <span class="nv">$(LDLIBS)</span>

<span class="nl">square.lib</span><span class="o">:</span> <span class="nf">square.dll</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="nb">rm</span> <span class="nt">-f</span> main.exe square.dll square.lib
</code></pre></div></div>

<p>Usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make -f Makefile.w64
</code></pre></div></div>

<p>And a <code class="language-plaintext highlighter-rouge">Makefile</code> for everyone else:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CC</span>      <span class="o">=</span> cc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Os</span> <span class="nt">-fvisibility</span><span class="o">=</span>hidden
<span class="nv">LDFLAGS</span> <span class="o">=</span> <span class="nt">-s</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">main libsquare.so</span>

<span class="nl">main</span><span class="o">:</span> <span class="nf">main.c libsquare.so square.h</span>
	<span class="nv">$(CC)</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> main.c ./libsquare.so <span class="nv">$(LDLIBS)</span>

<span class="nl">libsquare.so</span><span class="o">:</span> <span class="nf">square.c square.h</span>
	<span class="nv">$(CC)</span> <span class="nt">-shared</span> <span class="nv">$(CFLAGS)</span> <span class="nv">$(LDFLAGS)</span> <span class="nt">-o</span> <span class="nv">$@</span> square.c <span class="nv">$(LDLIBS)</span>

<span class="nl">clean</span><span class="o">:</span>
	<span class="nb">rm</span> <span class="nt">-f</span> main libsquare.so
</code></pre></div></div>

<p>Now that I have this article, I’m glad I won’t have to figure this all out
again next time I need it!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>A guide to Windows application development using w64devkit</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/03/11/"/>
    <id>urn:uuid:b04dbe3d-2e79-4afd-ad20-6ce0b232242e</id>
    <updated>2021-03-11T01:40:31Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>There’s a trend of building services where a monolithic application is
better suited, or using JavaScript and Python then being stumped by their
troublesome deployment story. This leads to solutions like <a href="https://deftly.net/posts/2017-06-01-measuring-the-weight-of-an-electron.html">bundling an
entire web browser</a> with an application, or using containers to
circumscribe <a href="https://research.swtch.com/deps">a sprawling dependency tree made of mystery meat</a>.</p>

<p>My <a href="/blog/2020/05/15/">small development distribution</a> for Windows, <a href="https://github.com/skeeto/w64devkit">w64devkit</a>,
is my own little way of pushing back against this trend where it affects
me most. Following in the footsteps of projects like <a href="https://handmadehero.org/">Handmade Hero</a>
and <a href="https://www.youtube.com/playlist?list=PLlaINRtydtNWuRfd4Ra3KeD6L9FP_tDE7">Making a Video Game from Scratch</a>, this is my guide to
no-nonsense software development using my development kit. It’s an
overview of the tooling and development workflow, and I’ve tried not to
assume too much knowledge of the reader. Being a guide rather than a manual,
it is incomplete on its own, and I link to substantial external resources
to fill in the gaps. The guide is capped with a small game I wrote
entirely using my development kit, serving as a demonstration of what
sorts of things are not only possible, but quite reasonably attainable.</p>

<!--more-->

<video src="https://nullprogram.s3.amazonaws.com/asteroids/asteroids.mp4" width="600" height="600" controls="">
</video>

<p>Game repository: <a href="https://github.com/skeeto/asteroids-demo">https://github.com/skeeto/asteroids-demo</a><br />
Guide to source: <a href="https://idle.nprescott.com/2021/understanding-asteroids.html">Understanding Asteroids</a></p>

<h3 id="initial-setup">Initial setup</h3>

<p>Of course you cannot use the development kit if you don’t have it yet. Go
to the <a href="https://github.com/skeeto/w64devkit/releases">releases section</a> and download the latest release. It will be
a .zip file named <code class="language-plaintext highlighter-rouge">w64devkit-x.y.z.zip</code> where <code class="language-plaintext highlighter-rouge">x.y.z</code> is the version.</p>

<p>You will need to unzip the development kit before using it. Windows has
built-in support for .zip files, so you can either right-click to access
“Extract All…” or navigate into it as a folder then drag-and-drop the
<code class="language-plaintext highlighter-rouge">w64devkit</code> directory somewhere outside the .zip file. It doesn’t care
where it’s unzipped (that is, it’s “portable”), so put it wherever is
convenient: your desktop, user profile directory, a thumb drive, etc. You
can move it later if you change your mind just so long as you’re not
actively running it. If you decide you don’t need it anymore then delete
it.</p>

<h3 id="entering-the-development-environment">Entering the development environment</h3>

<p>There is a <code class="language-plaintext highlighter-rouge">w64devkit.exe</code> in the unzipped <code class="language-plaintext highlighter-rouge">w64devkit</code> directory. This is
the easiest way to enter the development environment, and will not require
system configuration changes. This program puts the kit’s programs in the
<code class="language-plaintext highlighter-rouge">PATH</code> environment variable then runs a Bourne shell — the standard unix
shell. Aside from the text editor, this is the primary interface for
developing software. In time you may even extend this environment with
your own tools.</p>

<p>If you want an additional “terminal” window, run <code class="language-plaintext highlighter-rouge">w64devkit.exe</code> again. If
you use it a lot, you may want to create a shortcut and even pin it to
your task bar.</p>

<p>Whether on Windows or unix-like systems, when you type a command into the
system shell it uses the <code class="language-plaintext highlighter-rouge">PATH</code> environment variable to locate the actual
program to run for that command. In practice, the <code class="language-plaintext highlighter-rouge">PATH</code> variable is a
concatenation of multiple directories, and the shell searches these
directories in order. On unix-like systems, <code class="language-plaintext highlighter-rouge">PATH</code> elements are separated
by colons. However, Windows uses colons to delimit drive letters, so its
<code class="language-plaintext highlighter-rouge">PATH</code> elements are separated by semicolons.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Prepending to PATH on unix</span>
<span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>

<span class="c"># Prepending to PATH on Windows (w64devkit)</span>
<span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/bin;</span><span class="nv">$PATH</span><span class="s2">"</span>
</code></pre></div></div>
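
<p>To make the search order concrete, here is a rough sketch of that lookup
written as a shell function. The name <code class="language-plaintext highlighter-rouge">lookup</code> is made up for illustration,
and it splits on the unix separator. A real shell also checks functions and
builtins before searching <code class="language-plaintext highlighter-rouge">PATH</code>, which this sketch ignores. In practice the
standard <code class="language-plaintext highlighter-rouge">command -v</code> answers the same question.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: print the first match for a command name on PATH.
# The body runs in a subshell (note the parentheses) so that the IFS and
# set -f changes do not leak into the calling shell.
lookup() (
    IFS=:      # unix separator; w64devkit's shell uses a semicolon
    set -f     # do not glob-expand directory names
    for dir in $PATH; do
        if [ -z "$dir" ]; then dir=.; fi   # empty element: current directory
        if [ -f "$dir/$1" ] &amp;&amp; [ -x "$dir/$1" ]; then
            printf '%s\n' "$dir/$1"
            exit 0
        fi
    done
    exit 1
)

lookup sh
</code></pre></div></div>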

<p>For more advanced users: Rather than use <code class="language-plaintext highlighter-rouge">w64devkit.exe</code>, you could “Edit
environment variables for your account” and manually add w64devkit’s <code class="language-plaintext highlighter-rouge">bin</code>
directory to your <code class="language-plaintext highlighter-rouge">PATH</code>, making the tools generally available everywhere
on your system. If you’ve gone this route, you can start a Bourne shell at
any time with <code class="language-plaintext highlighter-rouge">sh -l</code>. (The <code class="language-plaintext highlighter-rouge">-l</code> option requests a login shell.)</p>

<p>Also borrowed from the unix world is the concept of a <em>home directory</em>,
specified by the <code class="language-plaintext highlighter-rouge">HOME</code> environment variable. By default this will be your
user profile directory, typically <code class="language-plaintext highlighter-rouge">C:/Users/$USER</code>. Login shells always
start in the home directory. This directory is often indicated by tilde
(<code class="language-plaintext highlighter-rouge">~</code>), and many programs automatically expand a leading tilde to the home
directory.</p>
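
<p>In the shell itself, tilde expansion applies only to an unquoted tilde at
the start of a word. A quick way to see the difference, assuming <code class="language-plaintext highlighter-rouge">HOME</code> is
set as described above:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo ~          # unquoted: the shell substitutes the value of HOME
echo "~"        # quoted: printed as a literal ~
echo "$HOME"    # same as the first line
</code></pre></div></div>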

<h3 id="shell-basics">Shell basics</h3>

<p>The shell is a command interpreter. It’s named such because <a href="https://www.youtube.com/watch?v=tc4ROCJYbm0&amp;t=4m57s">it was
originally a <em>shell</em> around the operating system kernel</a> — the user
interface to the kernel. Your system’s graphical interface — Windows
Explorer, or <code class="language-plaintext highlighter-rouge">Explorer.exe</code> — is really just a kind of shell, too. That
shell is oriented around the mouse and graphics. This is fine for some
tasks, but a keyboard-oriented command shell is far better suited for
development tasks. It’s more efficient, but more importantly its features
are composable: Complex operations and processes can be <a href="https://www.youtube.com/watch?v=bKzonnwoR2I">constructed
from</a> simple, easy-to-understand tools. Embrace it!</p>

<p>In the shell you can navigate between directories with <code class="language-plaintext highlighter-rouge">cd</code>, make
directories with <code class="language-plaintext highlighter-rouge">mkdir</code>, remove files with <code class="language-plaintext highlighter-rouge">rm</code>, run regular expression text
searches with <code class="language-plaintext highlighter-rouge">grep</code>, etc. Run <code class="language-plaintext highlighter-rouge">busybox</code> to see a listing of the available
standard commands. Unfortunately there are no manual pages, but you can
access basic usage information for any command with <code class="language-plaintext highlighter-rouge">busybox CMD --help</code>.</p>
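
<p>As a quick taste, a throwaway session using those commands might look
like the following. The directory and file names (<code class="language-plaintext highlighter-rouge">demo</code>, <code class="language-plaintext highlighter-rouge">words.txt</code>) are
made up for the example:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir demo                  # make a directory
cd demo                     # enter it
printf 'alpha\nbeta\ngamma\n' &gt;words.txt
grep -n '^b' words.txt      # prints "2:beta"
rm words.txt                # remove the file
cd ..
rmdir demo                  # remove the now-empty directory
</code></pre></div></div>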

<p>Windows’ standard command shell is <code class="language-plaintext highlighter-rouge">cmd.exe</code>. Unfortunately this shell is
terrible and exists mostly for legacy compatibility. The intended
replacement is PowerShell for users who regularly use a shell. However,
PowerShell is fundamentally broken, does virtually everything incorrectly,
and manages to be even worse than <code class="language-plaintext highlighter-rouge">cmd.exe</code>. Besides, sticking to POSIX
shell conventions significantly improves build portability, and unix tool
knowledge is transferable to basically every other operating system.</p>

<p>Unix’s standard shell was the Bourne shell, <code class="language-plaintext highlighter-rouge">sh</code>. The shells in use today
are Bourne shell clones with a superset of its features. The most popular
interactive shells are Bash and Zsh. On Linux, dash (Debian Almquist
shell) has become popular for non-interactive use (scripting). The shell
included with w64devkit is the BusyBox fork of the Almquist shell (<code class="language-plaintext highlighter-rouge">ash</code>),
closely related to dash. The Almquist shell has almost no non-interactive
features beyond the standard Bourne shell, and so as far as scripts are
concerned can be regarded as a plain Bourne shell clone. That’s why I
typically refer to it by the name <code class="language-plaintext highlighter-rouge">sh</code>.</p>

<p>However, BusyBox’s Almquist shell has interactive features much like Bash,
and Bash users should be quite comfortable. It’s not just tab-completion
but a slew of Emacs-like keybindings:</p>

<ul>
  <li><kbd>Ctrl-r</kbd>: search backwards in history</li>
  <li><kbd>Ctrl-s</kbd>: search forwards in history</li>
  <li><kbd>Ctrl-p</kbd>: previous command (Up)</li>
  <li><kbd>Ctrl-n</kbd>: next command (Down)</li>
  <li><kbd>Ctrl-a</kbd>: cursor to the beginning of line (Home)</li>
  <li><kbd>Ctrl-e</kbd>: cursor to the end of line (End)</li>
  <li><kbd>Alt-b</kbd>: cursor back one word</li>
  <li><kbd>Alt-f</kbd>: cursor forward one word</li>
  <li><kbd>Ctrl-l</kbd>: clear the screen</li>
  <li><kbd>Alt-d</kbd>: delete word after the cursor</li>
  <li><kbd>Ctrl-w</kbd>: delete the word before the cursor</li>
  <li><kbd>Ctrl-k</kbd>: delete to the end of the line</li>
  <li><kbd>Ctrl-u</kbd>: delete to the beginning of the line</li>
  <li><kbd>Ctrl-f</kbd>: cursor forward one character (Right)</li>
  <li><kbd>Ctrl-b</kbd>: cursor backward one character (Left)</li>
  <li><kbd>Ctrl-d</kbd>: delete character under the cursor (Delete)</li>
  <li><kbd>Ctrl-h</kbd>: delete character before the cursor (Backspace)</li>
</ul>

<p>Take special note of Ctrl-r, which is the most important and powerful
shortcut of the bunch. Frequent use is a good habit. Don’t mash the up
arrow to search through the command history.</p>

<p>Special note for Cygwin and MSYS2 users: the shell is aware of Windows
paths and does not present a virtual unix file system scheme. This has
important consequences for scripting, both good and bad. The shell even
supports backslash as a directory separator, though you should of course
prefer forward slashes.</p>

<h4 id="shell-customization">Shell customization</h4>

<p>Login shells (<code class="language-plaintext highlighter-rouge">-l</code>) evaluate the contents of <code class="language-plaintext highlighter-rouge">~/.profile</code> on startup. This
is your chance to customize the shell configuration, such as setting
environment variables or defining aliases and functions. For instance, if
you wanted the prompt to show the working directory in bold yellow you’d set
<code class="language-plaintext highlighter-rouge">PS1</code> in your <code class="language-plaintext highlighter-rouge">~/.profile</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">PS1</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span><span class="nb">printf</span> <span class="s1">'\x1b[33;1m\\w\x1b[0m$ '</span><span class="si">)</span><span class="s2">"</span>
</code></pre></div></div>
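
<p>The numbers in that escape sequence are standard ANSI color codes: 33
selects yellow, 32 green, 1 bold, and 0 resets. As a variation, and only a
sketch, this version picks green and spells the escape character in octal
(<code class="language-plaintext highlighter-rouge">\033</code>), the form POSIX <code class="language-plaintext highlighter-rouge">printf</code> specifies, rather than relying on the
<code class="language-plaintext highlighter-rouge">\x1b</code> hex extension. As before, it assumes a shell that expands <code class="language-plaintext highlighter-rouge">\w</code> in
prompts:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 32 = green, 1 = bold, 0 = reset; \033 is the escape character in octal
PS1="$(printf '\033[32;1m\\w\033[0m$ ')"
</code></pre></div></div>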

<p>If you find yourself using the same command sequences or set of options
again and again, you might consider putting those commands into a script,
and then installing that script somewhere on your <code class="language-plaintext highlighter-rouge">PATH</code> so that you can
run it as a new command. First make a directory to hold your scripts, say
in <code class="language-plaintext highlighter-rouge">~/bin</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/bin
</code></pre></div></div>

<p>In <code class="language-plaintext highlighter-rouge">~/.profile</code> prepend it to your <code class="language-plaintext highlighter-rouge">PATH</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/bin;</span><span class="nv">$PATH</span><span class="s2">"</span>
</code></pre></div></div>

<p>If you don’t want to start a fresh shell to try it out, then load the new
configuration in your current shell:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">source</span> ~/.profile
</code></pre></div></div>

<p>Suppose you keep getting the <code class="language-plaintext highlighter-rouge">tar</code> switches mixed up and you’d like to
just have an <code class="language-plaintext highlighter-rouge">untar</code> command that does the right thing. Create a file
named <code class="language-plaintext highlighter-rouge">untar</code> or <code class="language-plaintext highlighter-rouge">untar.sh</code> in <code class="language-plaintext highlighter-rouge">~/bin</code> with these contents:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="nb">tar</span> <span class="nt">-xaf</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
</code></pre></div></div>

<p>Now a command like <code class="language-plaintext highlighter-rouge">untar something.tar.gz</code> will extract the archive
contents.</p>
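
<p>The quotes around <code class="language-plaintext highlighter-rouge">$@</code> matter: <code class="language-plaintext highlighter-rouge">"$@"</code> expands to the script’s arguments
exactly as given, one word per argument, so archive names containing spaces
survive intact. A small sketch, with a made-up function <code class="language-plaintext highlighter-rouge">show</code> standing in
for the script:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: print each argument on its own line, in brackets
show() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

show "my archive.tar.gz" -C out
# prints:
# [my archive.tar.gz]
# [-C]
# [out]
</code></pre></div></div>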

<p>To learn more about Bourne shell scripting, the POSIX <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html">shell command
language specification</a> is a good reference. All of the features
listed in that document are available to your shell scripts.</p>

<h3 id="text-editing">Text editing</h3>

<p>The development kit includes the powerful and popular text editor
<a href="https://www.vim.org/">Vim</a>. It takes effort to learn, but is well worth the investment.
It’s packed with features, but since you only need a small number of them
on a regular basis it’s not as daunting as it might appear. Using Vim
effectively, you will write and edit text so much more quickly than
before. That includes not just code, but prose: READMEs, documentation,
etc.</p>

<p>(The catch: Non-modal editing will forever feel frustratingly inefficient.
That’s not because you will become unpracticed at it, or even have trouble
code switching between input styles, but because you’ll now be aware how
bad it is. Ignorance is bliss.)</p>

<p>Vim includes its own tutorial for absolute beginners which you can access
with the <code class="language-plaintext highlighter-rouge">vimtutor</code> command. It will run in the console window and guide
you through the basics in about half an hour. Do not be afraid to return
to the tutorial at any time since this is the stuff you need to know by
heart.</p>

<p>When it comes time to actually use Vim to write code, you can continue
using the terminal interface (<code class="language-plaintext highlighter-rouge">vim</code>), or you can run the
graphical interface (<code class="language-plaintext highlighter-rouge">gvim</code>). The latter is recommended since it has some
nice quality-of-life features, but it’s not strictly necessary. When
starting the GUI, put an ampersand (<code class="language-plaintext highlighter-rouge">&amp;</code>) on the command so that it runs in
the background. For instance this brings up the editor with two files open
but leaves the shell running in the foreground so you can continue using
it while you edit:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gvim main.c Makefile &amp;
</code></pre></div></div>

<p>Vim’s defaults are good but imperfect. Before getting started with
actually editing code you should establish at least the following minimal
configuration in <code class="language-plaintext highlighter-rouge">~/_vimrc</code>. (To understand these better, use <code class="language-plaintext highlighter-rouge">:help</code> to
jump to the built-in documentation.)</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">set</span> <span class="nb">hidden</span> <span class="nb">encoding</span><span class="p">=</span>utf<span class="m">-8</span> <span class="nb">shellslash</span>
<span class="k">filetype</span> plugin <span class="nb">indent</span> <span class="k">on</span>
<span class="nb">syntax</span> <span class="k">on</span>
</code></pre></div></div>

<p>The graphical interface defaults to a white background. Many people prefer
“dark mode” when editing code, so inverting this is simply a matter of
choosing a dark color scheme. Vim comes with a handful of color schemes,
around half of which have dark backgrounds. Use <code class="language-plaintext highlighter-rouge">:colorscheme</code> to change
it, and put it in your <code class="language-plaintext highlighter-rouge">~/_vimrc</code> to persist it.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">colorscheme</span> slate
</code></pre></div></div>

<p>The default graphical interface includes a menu bar and tool bar. There
are better ways to accomplish all these operations, none of which require
touching the mouse, so consider removing all that junk:</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">set</span> <span class="nb">guioptions</span><span class="p">=</span>ac
</code></pre></div></div>

<p>Finally, since the development kit is oriented around C and C++, here’s my
own entire Vim configuration for C which makes it obey my own style:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set cinoptions+=t0,l1,:0 cinkeys-=0#
</code></pre></div></div>

<p>Once you’re comfortable with the basics, the best next step is to read
<a href="https://pragprog.com/titles/dnvim2/practical-vim-second-edition/"><em>Practical Vim: Edit Text at the Speed of Thought</em></a> by Drew Neil.
It’s an opinionated guide to Vim that instills good habits. If you want
something cost-free to whet your appetite, check out <a href="https://www.moolenaar.net/habits.html"><em>Seven habits of
effective text editing</em></a>.</p>

<h3 id="writing-an-application">Writing an application</h3>

<p>We’ve established a shell and text editor. Next is the development
workflow for writing an actual application. Ultimately you will invoke a
compiler from within Vim, which will parse compiler messages and take you
directly to the parts of your source code that need attention. Before we
get that far, let’s start with the basics.</p>

<p>The classic example is the “hello world” program, which we’ll suppose is
in a file called <code class="language-plaintext highlighter-rouge">hello.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"Hello, world!"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>While this development kit provides a version of the GNU compiler, <code class="language-plaintext highlighter-rouge">gcc</code>,
this guide mostly speaks of it in terms of the generic unix C compiler
name, <code class="language-plaintext highlighter-rouge">cc</code>. Unix-like systems install <code class="language-plaintext highlighter-rouge">cc</code> as an alias for the system’s
default C compiler, and w64devkit is no exception.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cc <span class="nt">-o</span> hello.exe hello.c
</code></pre></div></div>

<p>This command creates <code class="language-plaintext highlighter-rouge">hello.exe</code> from <code class="language-plaintext highlighter-rouge">hello.c</code>. Since this is not (yet?)
on your <code class="language-plaintext highlighter-rouge">PATH</code>, you must invoke it via a path name (i.e. the command must
include a slash); otherwise the shell will search for it via the
<code class="language-plaintext highlighter-rouge">PATH</code> variable. Typically this means putting <code class="language-plaintext highlighter-rouge">./</code> in front of the program
name, meaning “run the program in the current directory”. As a convenience
you do not need to include the <code class="language-plaintext highlighter-rouge">.exe</code> extension:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./hello
</code></pre></div></div>
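
<p>The same rule is easy to observe with any executable. This sketch uses a
throwaway script with a deliberately unusual, made-up name
(<code class="language-plaintext highlighter-rouge">hello-demo-script</code>) so that nothing on <code class="language-plaintext highlighter-rouge">PATH</code> is likely to match it. The
<code class="language-plaintext highlighter-rouge">chmod</code> step is what unix-like systems need to mark the file executable:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>printf '#!/bin/sh\necho "Hello, world!"\n' &gt;hello-demo-script
chmod +x hello-demo-script

# No slash: the shell searches PATH, which normally excludes this directory
hello-demo-script 2&gt;/dev/null || echo "not found on PATH"

# With a slash: the shell runs exactly the file named
./hello-demo-script

rm hello-demo-script
</code></pre></div></div>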

<p>Unlike the <code class="language-plaintext highlighter-rouge">untar</code> shell script from before, this <code class="language-plaintext highlighter-rouge">hello.exe</code> is entirely
independent of w64devkit. You can share it with anyone running Windows and
they’ll be able to execute it. There’s a little bit of runtime embedded in
the executable, but the bulk of the runtime is in the operating system
itself. I want to highlight this point because <em>most programming languages
don’t work like this</em>, or at least doing so is unnatural with lots of
compromises. The users of your software do not need to install a runtime
or other supporting software. They just run the executable you give them!</p>

<p>That executable is probably pretty small, less than 50kB — basically a
miracle by today’s standards. Sure, it’s hardly doing anything right now,
but you can add a whole lot more functionality without that executable
getting much bigger. In fact, it’s entirely unoptimized right now and
could be even smaller. Passing the <code class="language-plaintext highlighter-rouge">-Os</code> flag tells the compiler to
optimize for size and the <code class="language-plaintext highlighter-rouge">-s</code> flag tells the linker to strip out unneeded
information.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cc <span class="nt">-Os</span> <span class="nt">-s</span> <span class="nt">-o</span> hello.exe hello.c
</code></pre></div></div>

<p>That cuts the program down to around a third of its previous size. If
necessary you can still do even better than this, but that’s outside the
scope of this guide.</p>

<p>So far the program could still be valid enough to compile but contain
obvious mistakes. The compiler can warn about many of these mistakes, and
so it’s always worth enabling these warnings. This requires two flags:
<code class="language-plaintext highlighter-rouge">-Wall</code> (“all” warnings) and <code class="language-plaintext highlighter-rouge">-Wextra</code> (extra warnings).</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cc <span class="nt">-Wall</span> <span class="nt">-Wextra</span> <span class="nt">-o</span> hello.exe hello.c
</code></pre></div></div>
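
<p>As a rough illustration of what these flags catch, here’s a small hypothetical sketch (the function names are made up for this example). The code is correct as written, and each comment describes a slip that would still compile yet draw a warning.</p>

```c
#include <stdio.h>

/* Returns 1 if x is zero. Writing "if (x = 0)" here instead would still
 * compile, but -Wall (via -Wparentheses) warns about an assignment
 * used as a condition. */
int is_zero(int x)
{
    if (x == 0) {
        return 1;
    }
    return 0;
}

/* Formats a count into buf. Passing a long to "%d" would compile too,
 * but -Wall (via -Wformat) flags the mismatched argument type. */
int format_count(char *buf, size_t len, int count)
{
    return snprintf(buf, len, "%d items", count);
}
```

<p>Try introducing one of the described mistakes and compile with and without the warning flags to see the difference.</p>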

<p>When you’re working on a program, you often don’t want optimization
enabled since it makes it more difficult to debug. However, some warnings
aren’t fired unless optimization is enabled. Fortunately there’s an
optimization level to resolve this, <code class="language-plaintext highlighter-rouge">-Og</code> (optimize for debugging).
Combine this with <code class="language-plaintext highlighter-rouge">-g3</code> to embed debug information in the program. This
will be handy later.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cc <span class="nt">-Wall</span> <span class="nt">-Wextra</span> <span class="nt">-Og</span> <span class="nt">-g3</span> <span class="nt">-o</span> hello.exe hello.c
</code></pre></div></div>

<p>These are the compiler flags you typically want to enable while developing
your software. When you distribute it, you’d use either <code class="language-plaintext highlighter-rouge">-Os -s</code> (optimize
for size) or <code class="language-plaintext highlighter-rouge">-O3 -s</code> (optimize for speed).</p>

<h4 id="makefiles">Makefiles</h4>

<p>I mentioned running the compiler from Vim. This isn’t done directly but
via a special build script called a Makefile. You invoke the <code class="language-plaintext highlighter-rouge">make</code> program
from Vim, which invokes the compiler as above. The simplest Makefile would
look like this, in a file literally named <code class="language-plaintext highlighter-rouge">Makefile</code>:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">hello.exe</span><span class="o">:</span> <span class="nf">hello.c</span>
    <span class="err">cc</span> <span class="err">-Wall</span> <span class="err">-Wextra</span> <span class="err">-Og</span> <span class="err">-g3</span> <span class="err">-o</span> <span class="err">hello.exe</span> <span class="err">hello.c</span>
</code></pre></div></div>

<p>This tells <code class="language-plaintext highlighter-rouge">make</code> that the file named <code class="language-plaintext highlighter-rouge">hello.exe</code> is derived from another
file called <code class="language-plaintext highlighter-rouge">hello.c</code>, and the tab-indented line is the recipe for doing
so. Running the <code class="language-plaintext highlighter-rouge">make</code> command will run the compiler command if and only
if <code class="language-plaintext highlighter-rouge">hello.c</code> is newer than <code class="language-plaintext highlighter-rouge">hello.exe</code>.</p>
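
<p>To make the “newer than” rule concrete, here’s a minimal sketch of that decision in C. It assumes a single prerequisite and whole-second timestamps, and it also treats a missing target as out of date. The function name <code class="language-plaintext highlighter-rouge">needs_rebuild</code> is hypothetical, and a real <code class="language-plaintext highlighter-rouge">make</code> handles many more cases.</p>

```c
#include <sys/stat.h>

/* Returns 1 if target should be rebuilt from source, 0 if it is up to
 * date, and -1 if source itself is missing. */
int needs_rebuild(const char *target, const char *source)
{
    struct stat t, s;
    if (stat(source, &s) != 0) {
        return -1;  /* nothing to build from */
    }
    if (stat(target, &t) != 0) {
        return 1;   /* target does not exist yet */
    }
    /* Rebuild only when the source was modified after the target. */
    return s.st_mtime > t.st_mtime;
}
```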

<p>To run <code class="language-plaintext highlighter-rouge">make</code> from Vim, use the <code class="language-plaintext highlighter-rouge">:make</code> command inside Vim. It will not
only run <code class="language-plaintext highlighter-rouge">make</code> but also capture its output in an internal buffer called
the <em>quickfix list</em>. If there is any warning or error, Vim will jump to
it. Use <code class="language-plaintext highlighter-rouge">:cn</code> (next) and <code class="language-plaintext highlighter-rouge">:cp</code> (prev) to move between issues and correct
them, or <code class="language-plaintext highlighter-rouge">:cc</code> to re-display the current issue. When you’re done fixing
the issues, run <code class="language-plaintext highlighter-rouge">:make</code> again to start the cycle over.</p>

<p>Try that now by changing the printed message and recompiling from within
Vim. Intentionally create an error (bad syntax, too many arguments, etc.)
and see what happens.</p>

<p>Makefiles are a powerful and conventional way to build C and C++ software.
Since the development kit includes the standard set of unix utilities,
it’s very easy to write portable Makefiles that work across a variety of
operating systems and environments. Your software isn’t necessarily tied
to Windows just because you’re using a Windows-based development
environment. If you want to learn how Makefiles work and how to use them
effectively, read <a href="/blog/2017/08/20/"><em>A Tutorial on Portable Makefiles</em></a>. From here on
I’ll assume you’ve read that tutorial.</p>

<p>Ultimately I’d probably write my “hello world” Makefile like so:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nv">CC</span>      <span class="o">=</span> cc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Wall</span> <span class="nt">-Wextra</span> <span class="nt">-Og</span> <span class="nt">-g3</span>
<span class="nv">LDFLAGS</span> <span class="o">=</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>
<span class="nv">EXE</span>     <span class="o">=</span> .exe

<span class="nl">hello$(EXE)</span><span class="o">:</span> <span class="nf">hello.c</span>
    <span class="err">$(CC)</span> <span class="err">$(CFLAGS)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">$@</span> <span class="err">hello.c</span> <span class="err">$(LDLIBS)</span>
</code></pre></div></div>

<p>When building a release, optimize for size or speed:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nv">CFLAGS</span><span class="o">=</span><span class="nt">-Os</span> <span class="nv">LDFLAGS</span><span class="o">=</span><span class="nt">-s</span>
</code></pre></div></div>

<p>This is very much a Windows-first style of Makefile, but still allows it
to be comfortably used on other systems. On Linux this <code class="language-plaintext highlighter-rouge">make</code> invocation
strips away the <code class="language-plaintext highlighter-rouge">.exe</code> extension:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nv">EXE</span><span class="o">=</span>
</code></pre></div></div>

<p>For a Windows-second Makefile, remove the line with <code class="language-plaintext highlighter-rouge">EXE = .exe</code>. This
allows <code class="language-plaintext highlighter-rouge">EXE</code> to come from the environment. So, for instance, I already
define the <code class="language-plaintext highlighter-rouge">EXE</code> environment variable in my w64devkit <code class="language-plaintext highlighter-rouge">~/.profile</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">EXE</span><span class="o">=</span>.exe
</code></pre></div></div>

<p>On Linux running <code class="language-plaintext highlighter-rouge">make</code> does the right thing, as does running <code class="language-plaintext highlighter-rouge">make</code> on
Windows. No special configuration required.</p>

<p>If my software is truly limited to Windows, I’m likely still interested in
supporting cross-compilation. A common convention for GNU toolchains is a
<code class="language-plaintext highlighter-rouge">CROSS</code> Makefile macro. For example:</p>

<div class="language-makefile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nv">CROSS</span>   <span class="o">=</span>
<span class="nv">CC</span>      <span class="o">=</span> <span class="nv">$(CROSS)</span>gcc
<span class="nv">CFLAGS</span>  <span class="o">=</span> <span class="nt">-Wall</span> <span class="nt">-Wextra</span> <span class="nt">-Og</span> <span class="nt">-g3</span>
<span class="nv">LDFLAGS</span> <span class="o">=</span>
<span class="nv">LDLIBS</span>  <span class="o">=</span>

<span class="nl">hello.exe</span><span class="o">:</span> <span class="nf">hello.c</span>
    <span class="err">$(CC)</span> <span class="err">$(CFLAGS)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">$@</span> <span class="err">hello.c</span> <span class="err">$(LDLIBS)</span>
</code></pre></div></div>

<p>On Windows I just run <code class="language-plaintext highlighter-rouge">make</code>, but on Linux I’d set <code class="language-plaintext highlighter-rouge">CROSS</code> appropriately.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nv">CROSS</span><span class="o">=</span>x86_64-w64-mingw32-
</code></pre></div></div>

<h4 id="navigating">Navigating</h4>

<p>What happens if you’re working on a larger program and you need to jump to
the definition of a function, macro, or variable? It would be tedious to
use <code class="language-plaintext highlighter-rouge">grep</code> all the time to find definitions. The development kit includes
a solid implementation of <code class="language-plaintext highlighter-rouge">ctags</code> for building a <em>tags database</em> that lists the
locations for various kinds of definitions, and Vim knows how to read this
database. Most often you’ll want to run it recursively like so:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctags <span class="nt">-R</span>
</code></pre></div></div>

<p>You can of course do this from Vim, too: <code class="language-plaintext highlighter-rouge">:!ctags -R</code></p>

<p>With the cursor over an identifier, press <code class="language-plaintext highlighter-rouge">CTRL-]</code> to jump to a definition
for that name. Use <code class="language-plaintext highlighter-rouge">:tn</code> and <code class="language-plaintext highlighter-rouge">:tp</code> to move between different definitions
(e.g. when the name is overloaded). Or if you have a tag in mind rather
than a name listed in the buffer, use the <code class="language-plaintext highlighter-rouge">:tag</code> command to jump by name.
Vim maintains a tag stack and jump list for going back and forth, like the
backward and forward buttons in a browser.</p>

<h4 id="debugging">Debugging</h4>

<p>I had mentioned that the <code class="language-plaintext highlighter-rouge">-g3</code> option embeds extra information in the
executable. This is for debuggers, and the development kit includes the
GNU Debugger, <code class="language-plaintext highlighter-rouge">gdb</code>, to help you debug your programs. To use it, invoke
GDB on your executable:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb hello.exe
</code></pre></div></div>

<p>From here you can set breakpoints and such, then run the program with
<code class="language-plaintext highlighter-rouge">start</code> or <code class="language-plaintext highlighter-rouge">run</code>, then <code class="language-plaintext highlighter-rouge">step</code> through it line by line. See <a href="https://beej.us/guide/bggdb/"><em>Beej’s Quick
Guide to GDB</em></a> for a guide. During development, always run your
program through GDB, and never exit GDB. See also: <a href="/blog/2022/06/26/"><em>Assertions should be
more debugger-oriented</em></a>.</p>
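
<p>One way to make assertions cooperate with an always-running debugger is to trap instead of printing a message and exiting. The following is a minimal sketch, assuming GCC or Clang (both provide <code class="language-plaintext highlighter-rouge">__builtin_trap</code>). The <code class="language-plaintext highlighter-rouge">ASSERT</code> macro and the <code class="language-plaintext highlighter-rouge">average</code> function are hypothetical names chosen for this example.</p>

```c
/* Hypothetical ASSERT macro: a failed check executes a trap instruction,
 * so a debugger stops at the failing line with the program state intact. */
#define ASSERT(c) do { if (!(c)) __builtin_trap(); } while (0)

/* Example use: the average of a non-empty array. */
int average(const int *values, int count)
{
    ASSERT(count > 0);
    long sum = 0;
    for (int i = 0; i < count; i++) {
        sum += values[i];
    }
    return (int)(sum / count);
}
```

<p>When run under GDB, a failed check pauses the program right at the assertion, where you can inspect variables and the call stack.</p>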

<h4 id="learning-c-and-c">Learning C and C++</h4>

<p>So far this guide hasn’t actually assumed any C knowledge. One of the best
ways to learn C is by reading the highly-regarded <a href="https://en.wikipedia.org/wiki/The_C_Programming_Language"><em>The C Programming
Language</em></a> and doing the exercises. Alternatively, cost-free options
are <a href="http://beej.us/guide/bgc/"><em>Beej’s Guide to C Programming</em></a> and <a href="https://modernc.gforge.inria.fr/"><em>Modern C</em></a> (more
advanced). You can use the development kit to go through any of these.</p>

<p>I’ve focused on C, but everything above also applies to C++. To learn C++
<a href="https://www.stroustrup.com/tour2.html"><em>A Tour of C++</em></a> is a safe bet.</p>

<h3 id="demonstration">Demonstration</h3>

<p>To illustrate how much you can do with nothing more than this 76MB
development kit, here’s a taste in the form of a weekend project: an
<a href="https://github.com/skeeto/asteroids-demo">Asteroids Clone for Windows</a>. That’s the game in the video at the
top of this guide.</p>

<p>The development kit doesn’t include Git so you’d need to install it
separately in order to clone the repository, but you could at least skip
that and download a .zip snapshot of the source. It has no third-party
dependencies yet it includes hardware-accelerated graphics, real-time
sound mixing, and gamepad input. Building a larger and more complex game
is much less about tooling and more about time and skill. That’s what I
mean about w64devkit being <a href="/blog/2020/09/25/">(almost) everything you need</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Well-behaved alias commands on Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/02/08/"/>
    <id>urn:uuid:d1c90d96-3696-4183-a52b-b10598a630c7</id>
    <updated>2021-02-08T20:32:45Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="trick"/>
    <content type="html">
      <![CDATA[<p>Since its inception I’ve faced a dilemma with <a href="https://github.com/skeeto/w64devkit">w64devkit</a>, my
<a href="/blog/2020/09/25/">all-in-one</a> Mingw-w64 toolchain and <a href="/blog/2020/05/15/">development environment
distribution for Windows</a>. A major goal of the project is no
installation: unzip anywhere and it’s ready to go as-is. However, full
functionality requires alias commands, particularly for BusyBox applets,
and the usual solutions are neither available nor viable. It seemed that
an installer was needed to assemble this last puzzle piece. This past
weekend I finally discovered a tidy and complete solution that solves this
problem for good.</p>

<p>That solution is a small C source file, <a href="https://github.com/skeeto/w64devkit/blob/master/src/alias.c"><code class="language-plaintext highlighter-rouge">alias.c</code></a>. This article is
about why it’s necessary and how it works.</p>

<h3 id="hard-and-symbolic-links">Hard and symbolic links</h3>

<p>Some alias commands are for convenience, such as a <code class="language-plaintext highlighter-rouge">cc</code> alias for <code class="language-plaintext highlighter-rouge">gcc</code> so
that build systems need not assume any particular C compiler. Others are
essential, such as an <code class="language-plaintext highlighter-rouge">sh</code> alias for “<code class="language-plaintext highlighter-rouge">busybox sh</code>” so that it’s available
as a shell for <code class="language-plaintext highlighter-rouge">make</code>. These aliases are usually created with links, hard
or symbolic. A GCC installation might include (roughly) a symbolic link
created like so:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">ln</span> <span class="nt">-s</span> gcc cc
</code></pre></div></div>

<p>BusyBox looks at its <code class="language-plaintext highlighter-rouge">argv[0]</code> on startup, and if it names an applet
(<code class="language-plaintext highlighter-rouge">ls</code>, <code class="language-plaintext highlighter-rouge">sh</code>, <code class="language-plaintext highlighter-rouge">awk</code>, etc.), it behaves like that applet. Typically BusyBox
aliases are installed as hard links to the original binary, and there’s
even a <code class="language-plaintext highlighter-rouge">busybox --install</code> to set these up. Both kinds of aliases are
cheap and effective.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">ln </span>busybox sh
<span class="nb">ln </span>busybox <span class="nb">ls
ln </span>busybox <span class="nb">awk</span>
</code></pre></div></div>
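
<p>The <code class="language-plaintext highlighter-rouge">argv[0]</code> dispatch can be sketched in a few lines of C. This is only an illustration of the multi-call idea, not BusyBox’s actual code: the applet table and the name <code class="language-plaintext highlighter-rouge">find_applet</code> are hypothetical, and the <code class="language-plaintext highlighter-rouge">.exe</code> check is case-sensitive for brevity.</p>

```c
#include <string.h>

/* Hypothetical applet table. A real multi-call binary would map each
 * name to an entry point function. */
static const char *const applets[] = {"awk", "ls", "sh"};

/* Returns the index of the applet named by argv0, or -1 if none. */
int find_applet(const char *argv0)
{
    /* Skip past the last path separator, if any. */
    const char *base = argv0;
    for (const char *p = argv0; *p; p++) {
        if (*p == '/' || *p == '\\') {
            base = p + 1;
        }
    }

    /* Ignore a trailing ".exe" so that "ls.exe" matches "ls". */
    size_t len = strlen(base);
    if (len > 4 && strcmp(base + len - 4, ".exe") == 0) {
        len -= 4;
    }

    for (int i = 0; i < (int)(sizeof(applets) / sizeof(*applets)); i++) {
        if (strlen(applets[i]) == len && memcmp(applets[i], base, len) == 0) {
            return i;
        }
    }
    return -1;
}
```

<p>A hard link named <code class="language-plaintext highlighter-rouge">ls</code> and the original binary share the same code, so the only difference the program sees is the name it was invoked under.</p>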

<p>Unfortunately links are not supported by .zip files on Windows. They’d
need to be created by a dedicated installer. As a result, I’ve strongly
recommended that users run “<code class="language-plaintext highlighter-rouge">busybox --install</code>” at some point to
establish the BusyBox alias commands. While w64devkit works without them,
it works better with them. Still, that’s an installation step!</p>

<p>An alternative option is to simply include a full copy of the BusyBox
binary for each applet — all 150 of them — simulating hard links. BusyBox
is small, around 4kB per applet on average, but it’s not quite <em>that</em>
small. Since the .zip format doesn’t use block compression — files are
compressed individually — this duplication will appear in the .zip itself.
My 573kB BusyBox build duplicated 150 times would double the distribution
size and increase the installation footprint by 25%. It’s not worth the
cost.</p>

<p>Since .zip is so limited, perhaps I should use a different distribution
format that supports links. However, another w64devkit goal is making no
assumptions about what other tools are installed. Windows natively
supports .zip, even if that support isn’t so great (poor performance, low
composability, missing features, etc.). With nothing more than the
w64devkit .zip on a fresh, offline Windows installation, you can begin
efficiently developing professional, native applications in under a
minute.</p>

<h3 id="scripts-as-aliases">Scripts as aliases</h3>

<p>With links off the table, the next best option is a shell script. On
unix-like systems shell scripts are an effective tool for creating complex
alias commands. Unlike links, they can manipulate the argument list. For
instance, w64devkit includes a <code class="language-plaintext highlighter-rouge">c99</code> alias to invoke the C compiler
configured to use the C99 standard. To do this with a shell script:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nb">exec </span>cc <span class="nt">-std</span><span class="o">=</span>c99 <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
</code></pre></div></div>

<p>This prepends <code class="language-plaintext highlighter-rouge">-std=c99</code> to the argument list and passes through the rest
untouched via the Bourne shell’s special case <code class="language-plaintext highlighter-rouge">"$@"</code>. Because I used
<code class="language-plaintext highlighter-rouge">exec</code>, the shell process <em>becomes</em> the compiler in place. The shell
doesn’t hang around in the background. It’s just gone. This is really quite
elegant and powerful.</p>
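
<p>For comparison, here’s roughly what that two-line script does, written as a C sketch for a POSIX system. The helper names <code class="language-plaintext highlighter-rouge">prepend_arg</code> and <code class="language-plaintext highlighter-rouge">c99_alias</code> are hypothetical. The key call is <code class="language-plaintext highlighter-rouge">execvp</code>, which replaces the current process rather than starting a child.</p>

```c
#include <stdlib.h>
#include <unistd.h>

/* Build a new NULL-terminated argument vector: {cmd, extra, argv[1], ...}.
 * Returns NULL if allocation fails. The caller frees the result. */
char **prepend_arg(char *cmd, char *extra, int argc, char **argv)
{
    char **out = malloc(sizeof(*out) * ((size_t)argc + 3));
    if (!out) {
        return NULL;
    }
    int n = 0;
    out[n++] = cmd;
    out[n++] = extra;
    for (int i = 1; i < argc; i++) {
        out[n++] = argv[i];  /* pass the original arguments through */
    }
    out[n] = NULL;
    return out;
}

/* Replace the current process with "cc -std=c99 <original arguments>".
 * Only returns (with status 127) if the exec fails. */
int c99_alias(int argc, char **argv)
{
    char **args = prepend_arg("cc", "-std=c99", argc, argv);
    if (args) {
        execvp(args[0], args);
    }
    return 127;
}
```

<p>A <code class="language-plaintext highlighter-rouge">main</code> function would simply return <code class="language-plaintext highlighter-rouge">c99_alias(argc, argv)</code>. Note that the arguments stay a structured array the whole way through, with no quoting or re-parsing.</p>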

<p>The closest available on Windows is a .bat batch file. However, like some
other parts of DOS and Windows, the Batch language was designed as though
its designer once glimpsed at someone using a unix shell, perhaps looking
over their shoulder, then copied some of the ideas without understanding
them. As a result, it’s not nearly as useful or powerful. Here’s the Batch
equivalent:</p>

<div class="language-bat highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@cc <span class="na">-std</span><span class="o">=</span><span class="kd">c99</span> <span class="err">%</span><span class="o">*</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">@</code> is necessary because Batch prints its commands by default (Bourne
shell’s <code class="language-plaintext highlighter-rouge">-x</code> option), and <code class="language-plaintext highlighter-rouge">@</code> disables it. Windows lacks the concept of
<code class="language-plaintext highlighter-rouge">exec(3)</code>, so the Batch file interpreter <code class="language-plaintext highlighter-rouge">cmd.exe</code> continues running alongside
the compiler. A little wasteful but that hardly matters. What does matter
though is that <code class="language-plaintext highlighter-rouge">cmd.exe</code> doesn’t behave itself! If you, say, Ctrl+C to
cancel compilation, you will get the infamous “Terminate batch job (Y/N)?”
prompt which interferes with other programs running in the same console.
The so-called “batch” script isn’t a batch job at all: It’s interactive.</p>

<p>I tried to use Batch files for BusyBox applets, but this issue came up
constantly and made this approach impractical. Nearly all BusyBox applets
are non-interactive, and lots of things break when they aren’t. Worst of
all, you can easily end up with layers of <code class="language-plaintext highlighter-rouge">cmd.exe</code> clobbering each other
to ask if they should terminate. It was frustrating.</p>

<p>The prompt is hardcoded in <code class="language-plaintext highlighter-rouge">cmd.exe</code> and cannot be disabled. Since so much
depends on <code class="language-plaintext highlighter-rouge">cmd.exe</code> remaining exactly the way it is, Microsoft will never
alter this behavior either. After all, that’s why they made PowerShell a
new, separate tool.</p>

<p>Speaking of PowerShell, could we use that instead? Unfortunately not:</p>

<ol>
  <li>
    <p>It’s installed by default on Windows, but is not necessarily enabled.
One of my own use cases for w64devkit involves systems where PowerShell
is disabled by policy. A common policy is it can be used interactively
but not run scripts (“Running scripts is disabled on this system”).</p>
  </li>
  <li>
    <p>PowerShell is not a first class citizen on Windows, and will likely
never be. Even under the friendliest policy it’s not normally possible
to put a PowerShell script on the <code class="language-plaintext highlighter-rouge">PATH</code> and run it by name. (I’m sure
there are ways to make this work via system-wide configuration, but
that’s off the table.)</p>
  </li>
  <li>
    <p>Everything in PowerShell is broken. For example, it does not support
input redirection with files, and instead you must use the <code class="language-plaintext highlighter-rouge">cat</code>-like
command, <code class="language-plaintext highlighter-rouge">Get-Content</code>, to pipe file contents. However, <code class="language-plaintext highlighter-rouge">Get-Content</code>
translates its input and quietly damages your data. There is no way to
disable this “feature” in the version of PowerShell that ships with
Windows, meaning it cannot accomplish the simplest of tasks. This is
just one of many ways that PowerShell is broken beyond usefulness.</p>
  </li>
</ol>

<p>Item (2) also affects w64devkit. It has a Bourne shell, but shell scripts
are still not first class citizens since Windows doesn’t know what to do
with them. Fixing this would require system-wide configuration, antithetical to
the philosophy of the project.</p>

<h3 id="solution-compiled-shell-scripts">Solution: compiled shell “scripts”</h3>

<p>My working solution is inspired by an insanely clever hack used by my
favorite media player, <a href="https://mpv.io/">mpv</a>. The Windows build is strange at first
glance, containing two binaries, <code class="language-plaintext highlighter-rouge">mpv.exe</code> (large) and <code class="language-plaintext highlighter-rouge">mpv.com</code> (tiny).
Is that COM as in <a href="/blog/2014/12/09/">an old-school 16-bit DOS binary</a>? No, that’s just
a trick that works around a Windows limitation.</p>

<p>The Windows technology is broken up into subsystems. Console programs run
in the Console subsystem. Graphical programs run in the Windows subsystem.
<a href="/blog/2017/11/30/">The original WSL</a> was a subsystem. Unfortunately this design means
that a program must statically pick a subsystem, hardcoded into the binary
image. The program cannot select a subsystem dynamically. For example,
this is why Java installations have both <code class="language-plaintext highlighter-rouge">java.exe</code> and <code class="language-plaintext highlighter-rouge">javaw.exe</code>, and
Emacs has <code class="language-plaintext highlighter-rouge">emacs.exe</code> and <code class="language-plaintext highlighter-rouge">runemacs.exe</code>. Different binaries for different
subsystems.</p>

<p>On Linux, a program that wants to do graphics just talks to the Xorg
server or Wayland compositor. It can dynamically choose to be a terminal
application or a graphical application. Or even both at once. This is
exactly the behavior of <code class="language-plaintext highlighter-rouge">mpv</code>, and it faces a dilemma on Windows: With
subsystems, how can it be both?</p>

<p>The trick is based on the environment variable <code class="language-plaintext highlighter-rouge">PATHEXT</code> which tells
Windows how to prioritize executables with the same base name but
different file extensions. If I type <code class="language-plaintext highlighter-rouge">mpv</code> and it finds both <code class="language-plaintext highlighter-rouge">mpv.exe</code> and
<code class="language-plaintext highlighter-rouge">mpv.com</code>, which binary will run? It will be the first listed in
<code class="language-plaintext highlighter-rouge">PATHEXT</code>, and by default that starts with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PATHEXT=.COM;.EXE;.BAT;...
</code></pre></div></div>

<p>So it will run <code class="language-plaintext highlighter-rouge">mpv.com</code>, which is actually a plain old <a href="https://wiki.osdev.org/PE">PE+</a> <code class="language-plaintext highlighter-rouge">.exe</code>
in disguise. The Windows subsystem <code class="language-plaintext highlighter-rouge">mpv.exe</code> gets the shortcut and file
associations while Console subsystem <code class="language-plaintext highlighter-rouge">mpv.com</code> catches command line
invocations and serves as console liaison as it invokes the real
<code class="language-plaintext highlighter-rouge">mpv.exe</code>. Ingenious!</p>
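
<p>The priority rule is easy to model. Here’s a small sketch that ranks an extension by its position in a <code class="language-plaintext highlighter-rouge">PATHEXT</code>-style list, where a lower rank wins. It illustrates the ordering only and is not how Windows implements the lookup. The name <code class="language-plaintext highlighter-rouge">pathext_rank</code> is hypothetical.</p>

```c
#include <ctype.h>

/* Returns the zero-based position of ext (e.g. ".exe") within a
 * semicolon-separated PATHEXT-style list, or -1 if it is not listed.
 * The comparison ignores ASCII case. */
int pathext_rank(const char *pathext, const char *ext)
{
    int rank = 0;
    const char *p = pathext;
    while (*p) {
        /* Compare one list entry against ext. */
        const char *e = ext;
        while (*p && *p != ';' && *e &&
               tolower((unsigned char)*p) == tolower((unsigned char)*e)) {
            p++;
            e++;
        }
        if ((*p == ';' || !*p) && !*e) {
            return rank;
        }
        /* No match: skip to the start of the next entry. */
        while (*p && *p != ';') {
            p++;
        }
        if (*p == ';') {
            p++;
        }
        rank++;
    }
    return -1;
}
```

<p>With the default list, <code class="language-plaintext highlighter-rouge">.com</code> ranks ahead of <code class="language-plaintext highlighter-rouge">.exe</code>, which is why typing <code class="language-plaintext highlighter-rouge">mpv</code> picks <code class="language-plaintext highlighter-rouge">mpv.com</code>.</p>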

<p>I realized I can pull a similar trick to create command aliases — not the
<code class="language-plaintext highlighter-rouge">.com</code> trick, but the miniature flagger program. If only I could compile
each of those Batch files to tiny, well-behaved <code class="language-plaintext highlighter-rouge">.exe</code> files so that it
wouldn’t rely on the badly-behaved <code class="language-plaintext highlighter-rouge">cmd.exe</code>…</p>

<h4 id="tiny-c-programs">Tiny C programs</h4>

<p>Years ago <a href="/blog/2016/01/31/">I wrote about tiny, freestanding Windows executables</a>.
That research paid off here since that’s exactly what I want. The alias
command program need only manipulate its command line, invoke another
program, then wait for it to finish. This doesn’t require the C library,
just a handful of <code class="language-plaintext highlighter-rouge">kernel32.dll</code> calls. My alias command programs can be
so small that it would no longer matter that I have 150 of them, and I get
complete control over their behavior.</p>

<p>To compile, I use <code class="language-plaintext highlighter-rouge">-nostdlib</code> and <code class="language-plaintext highlighter-rouge">-ffreestanding</code> to disable all system
libraries, <code class="language-plaintext highlighter-rouge">-lkernel32</code> to pull that one back in, <code class="language-plaintext highlighter-rouge">-Os</code> (optimize for
size), and <code class="language-plaintext highlighter-rouge">-s</code> (strip) all to make the result as small as possible.</p>
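
<p>Without the C library, even basic helpers like <code class="language-plaintext highlighter-rouge">strlen</code> and <code class="language-plaintext highlighter-rouge">memcpy</code> must come from the program itself. Here’s a minimal sketch of what such replacements could look like for wide strings. The signatures are assumptions for illustration and may differ from those in <code class="language-plaintext highlighter-rouge">alias.c</code>.</p>

```c
#include <stddef.h>

/* Length of a wide string, not counting the terminating null. */
size_t xstrlen(const wchar_t *s)
{
    size_t n = 0;
    while (s[n]) {
        n++;
    }
    return n;
}

/* Copy n wide characters from src to dst (non-overlapping). Returns dst. */
wchar_t *xmemcpy(wchar_t *dst, const wchar_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
    return dst;
}
```

<p>The <code class="language-plaintext highlighter-rouge">stddef.h</code> header only supplies types, so it remains available in a freestanding build.</p>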

<p>I don’t want to write a little program for each alias command. Instead
I’ll use a couple of C defines, <code class="language-plaintext highlighter-rouge">EXE</code> and <code class="language-plaintext highlighter-rouge">CMD</code>, to inject the target
command at compile time. So this Batch file:</p>

<div class="language-bat highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@target <span class="kd">arg1</span> <span class="kd">arg2</span> <span class="err">%</span><span class="o">*</span>
</code></pre></div></div>

<p>Is equivalent to this alias compilation:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc <span class="nt">-DEXE</span><span class="o">=</span><span class="s2">"target.exe"</span> <span class="nt">-DCMD</span><span class="o">=</span><span class="s2">"target arg1 arg2"</span> <span class="se">\</span>
    <span class="nt">-s</span> <span class="nt">-Os</span> <span class="nt">-nostdlib</span> <span class="nt">-ffreestanding</span> <span class="nt">-o</span> alias.exe alias.c <span class="nt">-lkernel32</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">EXE</code> string is the actual <em>module</em> name, so the <code class="language-plaintext highlighter-rouge">.exe</code> extension is
required. The <code class="language-plaintext highlighter-rouge">CMD</code> string replaces the first complete token of the
command line string (think <code class="language-plaintext highlighter-rouge">argv[0]</code>) and may contain arbitrary additional
arguments (e.g. <code class="language-plaintext highlighter-rouge">-std=c99</code>). Both are handled as wide strings (<code class="language-plaintext highlighter-rouge">L"..."</code>)
since the alias program uses the wide Win32 API in order to be fully
transparent. Though unfortunately at this time it makes no difference: All
currently aliased programs use the “ANSI” API since the underlying C and
C++ standard libraries only use the ANSI API. (As far as I know, nobody
has ever written fully-functional C and C++ standard libraries for
Windows, not even Microsoft.)</p>
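
<p>Since the shell removes the double quotes in the command above, the macro values reach the compiler as bare tokens, and the source has to turn them into wide string literals. One possible way to do that with the preprocessor is sketched below. The helper macro names are hypothetical and not necessarily what <code class="language-plaintext highlighter-rouge">alias.c</code> uses.</p>

```c
#include <wchar.h>

/* Fallbacks so the sketch compiles on its own. Normally these come from
 * -DEXE=... and -DCMD=... on the compiler command line. */
#ifndef EXE
#  define EXE target.exe
#endif
#ifndef CMD
#  define CMD target arg1 arg2
#endif

/* Two-step expansion: first turn the macro's value into a narrow string
 * literal, then paste an L prefix onto it to make it wide. */
#define STR(x)    #x
#define WIDEN2(s) L ## s
#define WIDEN(s)  WIDEN2(s)
#define LSTR(x)   WIDEN(STR(x))

static const wchar_t exe_name[] = LSTR(EXE);  /* L"target.exe" */
static const wchar_t cmd_text[] = LSTR(CMD);  /* L"target arg1 arg2" */
```

<p>The extra level of indirection matters: it lets <code class="language-plaintext highlighter-rouge">EXE</code> and <code class="language-plaintext highlighter-rouge">CMD</code> expand to their values before being stringified.</p>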

<p>You might wonder why the heck I’m gluing strings together for the
arguments. These will need to be parsed (word split, etc.) by someone
else, so shouldn’t I construct an argv array instead? That’s not how it
works on Windows: Programs receive a flat command string and are expected
to parse it themselves following <a href="https://docs.microsoft.com/en-us/previous-versions/17w5ykft(v=vs.85)">the format specification</a>. When
you write a C program, the C runtime does this for you to provide the
usual argv array.</p>

<p>This is upside down. The caller creating the process already has arguments
split into an argv array — or something like it — but Win32 requires the
caller to encode the argv array as a string following a special format so
that the recipient can immediately decode it. Why marshal rather than
pass structured data in the first place? Why does Win32 only supply a
decoder (<a href="https://docs.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-commandlinetoargvw"><code class="language-plaintext highlighter-rouge">CommandLineToArgvW</code></a>) and not an encoder (e.g. the missing
<code class="language-plaintext highlighter-rouge">ArgvToCommandLine</code>)? Hey, I don’t make the rules; I just have to live
with them.</p>
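
<p>To give a sense of what that missing encoder involves, here’s a sketch that quotes a single argument following the backslash and double-quote rules of the Microsoft C runtime’s parser. The names <code class="language-plaintext highlighter-rouge">quote_arg</code> and <code class="language-plaintext highlighter-rouge">put</code> are hypothetical. It works on narrow strings for brevity and does not handle <code class="language-plaintext highlighter-rouge">cmd.exe</code> metacharacters.</p>

```c
#include <stddef.h>
#include <string.h>

/* Append one character to buf if there is room; always count it. */
static void put(char *buf, size_t cap, size_t *len, char c)
{
    if (*len < cap) {
        buf[*len] = c;
    }
    (*len)++;
}

/* Encode arg as one command line token. Writes up to cap bytes to buf
 * (no null terminator) and returns the full encoded length. */
size_t quote_arg(char *buf, size_t cap, const char *arg)
{
    size_t len = 0;

    /* Simple, non-empty arguments need no quoting at all. */
    if (*arg && !strpbrk(arg, " \t\n\v\"")) {
        for (; *arg; arg++) {
            put(buf, cap, &len, *arg);
        }
        return len;
    }

    put(buf, cap, &len, '"');
    for (;;) {
        size_t slashes = 0;
        while (*arg == '\\') {
            slashes++;
            arg++;
        }
        if (!*arg) {
            /* Trailing backslashes precede the closing quote: double them. */
            for (size_t i = 0; i < slashes * 2; i++) {
                put(buf, cap, &len, '\\');
            }
            break;
        }
        /* Before a quote, double the backslashes and add one to escape
         * the quote. Otherwise the backslashes are literal. */
        size_t count = *arg == '"' ? slashes * 2 + 1 : slashes;
        for (size_t i = 0; i < count; i++) {
            put(buf, cap, &len, '\\');
        }
        put(buf, cap, &len, *arg++);
    }
    put(buf, cap, &len, '"');
    return len;
}
```

<p>A full encoder would call this for each element of the argv array and join the results with spaces.</p>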

<p>You can look at the original source for the details, but the summary is
that I supply my own <code class="language-plaintext highlighter-rouge">xstrlen()</code>, <code class="language-plaintext highlighter-rouge">xmemcpy()</code>, and partial Win32 command
line parser — just enough to identify the first token, even if that token
is quoted. It glues the strings together, calls <code class="language-plaintext highlighter-rouge">CreateProcessW</code>, waits
for it to exit (<code class="language-plaintext highlighter-rouge">WaitForSingleObject</code>), retrieves the exit code
(<code class="language-plaintext highlighter-rouge">GetExitCodeProcess</code>), and exits with the same status. (The stuff that
comes for free with <code class="language-plaintext highlighter-rouge">exec(3)</code>.)</p>
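<p>As a rough sketch of the "first token" step, and not the code from
<code class="language-plaintext highlighter-rouge">alias.c</code>: the function below assumes the program-name rule that double quotes
toggle a quoted state and the token ends at the first space or tab outside
quotes. The name <code class="language-plaintext highlighter-rouge">skip_first_token</code> is hypothetical, and it uses narrow strings
for portability where the real program works on wide strings.</p>

```c
/* Hypothetical helper: return a pointer just past the first token
 * (the program name) of a flat command line string. */
const char *skip_first_token(const char *cmd)
{
    int quoted = 0;
    for (; *cmd; cmd++) {
        if (*cmd == '"') {
            quoted = !quoted;          /* quotes toggle, never nest */
        } else if (!quoted && (*cmd == ' ' || *cmd == '\t')) {
            break;                     /* unquoted whitespace ends it */
        }
    }
    return cmd;
}
```

<p>Everything from the returned pointer onward is the caller's original
argument string, ready to be glued onto the alias target and its fixed
arguments.</p>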

<p>This all compiles to a 4kB executable, mostly padding, which is small
enough for my purposes. These compress to an acceptable 1kB each in the
.zip file. Smaller would be nicer, but this would require at minimum a
custom linker script, and even smaller would require hand-crafted
assembly.</p>

<p>This lingering issue solved, w64devkit now works better than ever. The
<code class="language-plaintext highlighter-rouge">alias.c</code> source is included in the kit in case you need to make any of
your own well-behaved alias commands.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Single-primitive authenticated encryption for fun</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/01/30/"/>
    <id>urn:uuid:92013b12-7f4b-4175-8d19-93520798a919</id>
    <updated>2021-01-30T03:39:10Z</updated>
    <category term="crypto"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Just as a fun exercise, I designed and implemented from scratch a
standalone, authenticated encryption tool, including key derivation with
stretching, using a single cryptographic primitive. Or, more specifically,
<em>half of a primitive</em>. That primitive is the encryption function of the
<a href="https://en.wikipedia.org/wiki/XXTEA">XXTEA block cipher</a>. The goal was to pare both design and
implementation down to the bone without being broken in practice — <em>I
hope</em> — and maybe learn something along the way. This article is the tour
of my design. Everything here will be nearly the opposite of the <a href="https://latacora.micro.blog/2018/04/03/cryptographic-right-answers.html">right
answers</a>.</p>

<p>The <a href="https://github.com/skeeto/xxtea/tree/v0.1">tool itself is named <strong>xxtea</strong></a> (lowercase), and it’s supported
on all unix-like and Windows systems. It’s trivial to compile, <a href="https://github.com/skeeto/w64devkit">even on
the latter</a>. The code should be easy to follow from top to bottom,
with commentary about specific decisions along the way, though I’ll quote
the most important stuff inline here.</p>

<p>The command line options <a href="/blog/2020/08/01/">follow the usual conventions</a>. The two
modes of operation are encrypt (<code class="language-plaintext highlighter-rouge">-E</code>) and decrypt (<code class="language-plaintext highlighter-rouge">-D</code>). It defaults to
using standard input and standard output so it works great in pipelines.
Supplying <code class="language-plaintext highlighter-rouge">-o</code> sends output elsewhere (automatically deleted if something
goes wrong), and the optional positional argument indicates an alternate
input source.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>usage: xxtea &lt;-E|-D&gt; [-h] [-o FILE] [-p PASSWORD] [FILE]

examples:
    $ xxtea -E -o file.txt.xxtea file.txt
    $ xxtea -D -o file.txt file.txt.xxtea
</code></pre></div></div>

<p>If no password is provided (<code class="language-plaintext highlighter-rouge">-p</code>), it prompts for a <a href="/blog/2020/05/04/">UTF-8-encoded
password</a>. Of course it’s not normally a good idea to supply a
password via command line argument, but it’s been useful for testing.</p>

<h3 id="xxtea-block-cipher">XXTEA block cipher</h3>

<p>TEA stands for <em>Tiny Encryption Algorithm</em> and XXTEA is the second attempt
at fixing weaknesses in the cipher — with partial success. The remaining
weaknesses should not be a problem for this particular application. XXTEA
supports a variable block size, but I’ve hardcoded my implementation to a
128-bit block size, along with some unrolling. I’ve also discarded the
unneeded decryption function. There are no data-dependent lookups or
branches so it’s immune to speculation attacks.</p>

<p>XXTEA operates on 32-bit words and has a 128-bit key, meaning both block
and key are four words apiece. My implementation is about a dozen lines
long. Its prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Encrypt a 128-bit block using 128-bit key</span>
<span class="kt">void</span> <span class="nf">xxtea128_encrypt</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">key</span><span class="p">[</span><span class="mi">4</span><span class="p">],</span> <span class="kt">uint32_t</span> <span class="n">block</span><span class="p">[</span><span class="mi">4</span><span class="p">]);</span>
</code></pre></div></div>
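<p>For readers who want to see the shape of it, here is a sketch written from
the public XXTEA reference algorithm with the block size fixed at four
words, which works out to 6 + 52/4 = 19 rounds. It is not copied from xxtea,
whose unrolled version will look different.</p>

```c
#include <stdint.h>

/* Sketch: XXTEA encryption fixed to a 4-word (128-bit) block,
 * following the public reference algorithm. */
void xxtea128_encrypt(const uint32_t key[4], uint32_t block[4])
{
    uint32_t sum = 0;
    uint32_t z = block[3];
    for (int round = 0; round < 19; round++) {
        sum += 0x9e3779b9;
        uint32_t e = (sum >> 2) & 3;
        for (int p = 0; p < 4; p++) {
            /* Each word is mixed with its neighbors, wrapping around. */
            uint32_t y = block[(p + 1) & 3];
            uint32_t mx = (((z >> 5) ^ (y << 2)) + ((y >> 3) ^ (z << 4)))
                        ^ ((sum ^ y) + (key[(p & 3) ^ e] ^ z));
            z = block[p] += mx;
        }
    }
}
```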

<p>All cryptographic operations are built from this function. Another way to
think about it is that it accepts two 128-bit inputs and returns a 128-bit
result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uint128 r = f(uint128 key, uint128 block);
</code></pre></div></div>

<p>Tuck that away in the back of your head since this will be important
later.</p>

<h3 id="encryption">Encryption</h3>

<p>If I tossed the decryption function, how are messages decrypted? I’m sure
many have already guessed: XXTEA will be used in <em>counter mode</em>, or CTR
mode. Rather than encrypt the plaintext directly, encrypt a 128-bit block
counter and treat it like a stream cipher. The message is XORed with the
encrypted counter values for both encryption and decryption.</p>

<ul>
  <li>Only half the cipher is needed.</li>
  <li>No padding scheme is necessary. With other block modes, if message
lengths may not be exactly a multiple of the block size then you need
some scheme for padding the last block.</li>
</ul>

<p>A 128-bit increment with 32-bit limbs is easy:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">increment</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ctr</span><span class="p">[</span><span class="mi">4</span><span class="p">])</span>
<span class="p">{</span>
    <span class="cm">/* 128-bit increment, first word changes fastest */</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!++</span><span class="n">ctr</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">if</span> <span class="p">(</span><span class="o">!++</span><span class="n">ctr</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="k">if</span> <span class="p">(</span><span class="o">!++</span><span class="n">ctr</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="o">++</span><span class="n">ctr</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In xxtea, words are always marshalled in little endian byte order (least
significant byte first). With the first word as the least significant
limb, the entire 128-bit counter is itself little endian.</p>
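<p>A minimal sketch of that marshalling, independent of host byte order. The
name <code class="language-plaintext highlighter-rouge">loadu32</code> appears in the code later in this article, but its body here
is my assumption, and <code class="language-plaintext highlighter-rouge">storeu32</code> is a hypothetical counterpart.</p>

```c
#include <stdint.h>

/* Read a 32-bit word stored least significant byte first. */
uint32_t loadu32(const unsigned char *p)
{
    return (uint32_t)p[0] <<  0 | (uint32_t)p[1] <<  8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Write a 32-bit word least significant byte first. */
void storeu32(unsigned char *p, uint32_t x)
{
    p[0] = (unsigned char)(x >>  0);
    p[1] = (unsigned char)(x >>  8);
    p[2] = (unsigned char)(x >> 16);
    p[3] = (unsigned char)(x >> 24);
}
```

<p>Writing the four counter words in order with <code class="language-plaintext highlighter-rouge">storeu32</code> yields the 16-byte
little endian form of the whole counter.</p>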

<p>The counter doesn’t start at zero, but at some randomly-selected 128-bit
nonce called the <em>initialization vector</em> (IV), wrapping around to zero if
necessary (incredibly unlikely). The IV will be included with the message
in the clear. This nonce allows one key (password) to be used with
multiple messages, as they’ll all be encrypted using different,
randomly-chosen regions of an enormous keystream. It also provides
<em>semantic security</em>: encrypt the same file more than once and the
ciphertext will always be completely different.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="cm">/* ... */</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">cover</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="n">ctr</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">ctr</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">ctr</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">ctr</span><span class="p">[</span><span class="mi">3</span><span class="p">]};</span>
    <span class="n">xxtea128_encrypt</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">cover</span><span class="p">);</span>
    <span class="n">block</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">cover</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="n">block</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">^=</span> <span class="n">cover</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">block</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="p">]</span> <span class="o">^=</span> <span class="n">cover</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
    <span class="n">block</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">3</span><span class="p">]</span> <span class="o">^=</span> <span class="n">cover</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>
    <span class="n">increment</span><span class="p">(</span><span class="n">ctr</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="hash-function">Hash function</h3>

<p>That’s encryption, but there’s still a matter of <em>authentication</em> and <em>key
derivation function</em> (KDF). To deal with both I’ll need to devise a hash
function. Since I’m only using the one primitive, somehow I need to build
a hash function from a block cipher. Fortunately there’s a tool for doing
just that: the <a href="https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction">Merkle–Damgård construction</a>.</p>

<p>Recall that <code class="language-plaintext highlighter-rouge">xxtea128_encrypt</code> accepts two 128-bit inputs and returns a
128-bit result. In other words, it <em>compresses</em> 256 bits into 128 bits: a
compression function. The two 128-bit inputs are cryptographically
combined into one 128-bit result. I can repeat this operation to fold an
arbitrary number of 128-bit inputs into a 128-bit hash result.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">input</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">hash</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>
<span class="n">xxtea128_encrypt</span><span class="p">(</span><span class="n">input</span> <span class="o">+</span>  <span class="mi">0</span><span class="p">,</span> <span class="n">hash</span><span class="p">);</span>
<span class="n">xxtea128_encrypt</span><span class="p">(</span><span class="n">input</span> <span class="o">+</span>  <span class="mi">4</span><span class="p">,</span> <span class="n">hash</span><span class="p">);</span>
<span class="n">xxtea128_encrypt</span><span class="p">(</span><span class="n">input</span> <span class="o">+</span>  <span class="mi">8</span><span class="p">,</span> <span class="n">hash</span><span class="p">);</span>
<span class="n">xxtea128_encrypt</span><span class="p">(</span><span class="n">input</span> <span class="o">+</span> <span class="mi">12</span><span class="p">,</span> <span class="n">hash</span><span class="p">);</span>
<span class="c1">// ...</span>
</code></pre></div></div>

<p>Note how the input is the key, not the block. The hash state is repeatedly
encrypted using the hash inputs as the key, mixing hash state and input.
When the input is exhausted, that block is the result. Sort of.</p>

<p>I used zero for the initial hash state in my example, but it will be more
challenging to attack if the starting state is something random. <a href="/blog/2017/09/15/">Like
Blowfish</a>, in xxtea I chose the first 128 bits of the fractional part
of pi:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xxtea128_hash_init</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ctx</span><span class="p">[</span><span class="mi">4</span><span class="p">])</span>
<span class="p">{</span>
    <span class="cm">/* first 32 hexadecimal digits of pi */</span>
    <span class="n">ctx</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x243f6a88</span><span class="p">;</span> <span class="n">ctx</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x85a308d3</span><span class="p">;</span>
    <span class="n">ctx</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x13198a2e</span><span class="p">;</span> <span class="n">ctx</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x03707344</span><span class="p">;</span>
<span class="p">}</span>

<span class="cm">/* Mix one block into the hash state. */</span>
<span class="kt">void</span>
<span class="nf">xxtea128_hash_update</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ctx</span><span class="p">[</span><span class="mi">4</span><span class="p">],</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">block</span><span class="p">[</span><span class="mi">4</span><span class="p">])</span>
<span class="p">{</span>
    <span class="n">xxtea128_encrypt</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">ctx</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There are still a couple of problems. First, what if the input isn’t a
multiple of the block size? This time I <em>do</em> need a padding scheme to fill
out that last block. In this case I pad it with bytes where each byte is
the number of padding bytes. For instance, <code class="language-plaintext highlighter-rouge">helloworld</code> becomes, roughly
speaking, <code class="language-plaintext highlighter-rouge">helloworld666666</code>.</p>

<p>That creates a different problem: This will have the same hash result as
an input that actually ends with these bytes. So the second rule is that
there is always a padding block, even if that block is 100% padding.</p>
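<p>Both rules can be seen in a small worked sketch. The helper name
<code class="language-plaintext highlighter-rouge">pad_final_block</code> is hypothetical, and it assumes the caller has already
consumed every full 16-byte block so that 0 to 15 bytes remain.</p>

```c
#include <string.h>

/* Hypothetical helper: build the final 16-byte block from the len
 * (0..15) leftover input bytes. Every padding byte holds the number
 * of padding bytes. */
void pad_final_block(unsigned char out[16], const void *tail, int len)
{
    memset(out, 16 - len, 16);        /* fill with the padding count */
    memcpy(out, tail, (size_t)len);   /* then overlay the real bytes */
}
```

<p>With the 10 bytes of <code class="language-plaintext highlighter-rouge">helloworld</code> left over, the block ends in six bytes of
value 6. With nothing left over, the block is sixteen bytes of value 16, which
is the all-padding block required by the second rule.</p>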

<p>Another problem is that the Merkle–Damgård construction is prone to
<em>length-extension attacks</em>. Anyone can take my hash result and continue
appending additional data without knowing what came before. If I’m using
this hash to authenticate the ciphertext, someone could, for example, use
this attack to append arbitrary data to the end of messages.</p>

<p>Some important hash functions, such as the most common forms of SHA-2, are
vulnerable to length-extension attacks. Keeping this issue in mind, I
could address it later using HMAC, but I have an idea for nipping this in
the bud now. Before mixing the padding block into the hash state, I swap
the two middle words:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Append final raw-byte block to hash state. */</span>
<span class="kt">void</span>
<span class="nf">xxtea128_hash_final</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ctx</span><span class="p">[</span><span class="mi">4</span><span class="p">],</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">len</span> <span class="o">&lt;</span> <span class="mi">16</span><span class="p">);</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">tmp</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="n">memset</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span> <span class="mi">16</span><span class="o">-</span><span class="n">len</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">k</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">loadu32</span><span class="p">(</span><span class="n">tmp</span> <span class="o">+</span>  <span class="mi">0</span><span class="p">),</span> <span class="n">loadu32</span><span class="p">(</span><span class="n">tmp</span> <span class="o">+</span>  <span class="mi">4</span><span class="p">),</span>
        <span class="n">loadu32</span><span class="p">(</span><span class="n">tmp</span> <span class="o">+</span>  <span class="mi">8</span><span class="p">),</span> <span class="n">loadu32</span><span class="p">(</span><span class="n">tmp</span> <span class="o">+</span> <span class="mi">12</span><span class="p">),</span>
    <span class="p">};</span>
    <span class="cm">/* swap middle words to break length extension attacks */</span>
    <span class="kt">uint32_t</span> <span class="n">swap</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">ctx</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
    <span class="n">ctx</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">swap</span><span class="p">;</span>
    <span class="n">xxtea128_encrypt</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">ctx</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This operation “ties off” the last block so that the hash can’t be
extended with more input. <em>Or so I hope.</em> This is my own invention, and so
it may not actually work right. Again, this is for fun and learning!</p>

<p><strong>Update</strong>: Aristotle Pagaltzis pointed out that when these two words are
identical the hash result will be unchanged, leaving it vulnerable to
length extension attack. This occurs about once every 2<sup>32</sup>
messages, which is far too small a security margin.</p>

<h4 id="caveats">Caveats</h4>

<p>Despite all that care, there are still two more potential weaknesses.</p>

<p>First, XXTEA was never designed to be used with the Merkle–Damgård
construction. I assume attackers can modify files I will decrypt, and so
the hash input is usually and mostly under control of attackers, meaning
they control the cipher key. Ciphers are normally designed assuming the
key is not under hostile control. This might be vulnerable to related-key
attacks.</p>

<p>As will be discussed below, I use this custom hash function in two ways.
In one the input is not controlled by attackers, so this is a non-issue.
In the second, the hash state is completely unknown to the attacker before
they control the input, which I believe mitigates any issues.</p>

<p>Second, a 128-bit hash state is a bit small these days. For very large
inputs, the chance of <a href="/blog/2019/07/22/">collision via the birthday paradox</a> is a
practical issue.</p>

<p>In xxtea, digests are computed over at most a few megabytes of input at a
time, even when encrypting giant files, so a 128-bit state should be
fine.</p>

<h3 id="key-derivation">Key derivation</h3>

<p>The user will supply a password and somehow I need to turn that into a
128-bit key.</p>

<ol>
  <li>What if the password is shorter than 128 bits?</li>
  <li>What if the password is longer than 128 bits?</li>
  <li>It’s safer for the cipher if the raw password isn’t used directly.</li>
  <li>I’d like offline, brute force attacks to be expensive.</li>
</ol>

<p>The first three can be resolved by running the passphrase through the hash
function, using it as a key derivation function. What about the last item?
Rather than hash the password once, I concatenate it, including null
terminator, repeatedly until it reaches a certain number of bytes
(hardcoded to 64 MiB, see <code class="language-plaintext highlighter-rouge">COST</code>), and hash that. That’s a computational
workload that attackers must repeat when guessing passwords.</p>

<p>To avoid timing attacks based on the password length, I precompute all
possible block arrangements before starting the hash — all the different
ways the password might appear concatenated across 16-byte blocks. Blocks
may be redundantly computed if necessary to make this part constant time.
The hash is fed entirely from these precomputed blocks.</p>
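<p>To make "block arrangements" concrete, here is a sketch that produces any
16-byte block of the repeated password stream. The name <code class="language-plaintext highlighter-rouge">kdf_block</code> is
hypothetical, and this only shows where the bytes come from. It makes no
attempt at the constant-time precomputation described above.</p>

```c
#include <stddef.h>

/* Hypothetical helper: fill out[16] with block number `index` of the
 * endless stream made by repeating the password plus its terminator. */
void kdf_block(unsigned char out[16], const char *password, size_t len,
               unsigned long long index)
{
    size_t period = len + 1;   /* password bytes plus null terminator */
    size_t off = (size_t)((index * 16) % period);
    for (int i = 0; i < 16; i++) {
        out[i] = (unsigned char)password[off];  /* password[len] is 0 */
        off = (off + 1) % period;
    }
}
```

<p>A block is fully determined by its starting offset into the password, so
there are at most <code class="language-plaintext highlighter-rouge">len + 1</code> distinct blocks. For a two-character password the
stream has period 3, and block 3 is identical to block 0.</p>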

<p>To defend against rainbow tables and the like — as well as make it harder
to attack other parts of the message construction — the initialization
vector is used as a salt, fed into the hash before the password
concatenation.</p>

<p>Unfortunately this KDF isn’t <em>memory-hard</em>, and attackers can use economies
of scale to strengthen their attacks (GPUs, custom hardware). A
memory-hard KDF requires lots of memory to compute the key, making memory
an expensive and limiting factor for attackers. Memory-hard KDFs are
complex and difficult to design, and I made the trade-off for simplicity.</p>

<h3 id="authentication">Authentication</h3>

<p>When I say the encryption is <em>authenticated</em> I mean that it should not be
possible for anyone to tamper with the ciphertext undetected without
already knowing the key. This is typically accomplished by computing a
keyed hash digest and appending it to the message: a <em>message authentication
code</em> (MAC). Since it’s keyed, only someone who knows the key can compute
the digest, and so attackers can’t spoof the MAC.</p>

<p>This is where length-extension attacks come into play: With an improperly
constructed MAC, an attacker could append input without knowing the key.
Fortunately my hash function isn’t vulnerable to length-extension attacks!</p>

<p>An alternative is to use an authenticated block mode such as <a href="https://en.wikipedia.org/wiki/Galois/Counter_Mode">GCM</a>,
which is still CTR mode at its core. Unfortunately, this is complicated,
and, unlike plain CTR, it would take me a long time to convince myself I
got it right. So instead I used CTR mode and my hash function in a
straightforward way.</p>

<p>At this point there’s a question of what exactly you input into the hash
function. Do you hash the plaintext or do you hash the ciphertext? It’s
tempting to do the former since it’s (generally) not available to
attackers, and would presumably make it harder to attack. This is a
mistake. Always compute the MAC over the ciphertext, a.k.a. encrypt then
authenticate.</p>

<p>This is called <a href="https://moxie.org/2011/12/13/the-cryptographic-doom-principle.html">the Doom Principle</a>. Computing the MAC on the
plaintext means that recipients must decrypt untrusted ciphertext before
authenticating it. This is bad because messages should be authenticated
before decryption. So that’s exactly what xxtea does. It also happens to
be the simplest option.</p>

<p>We have a hash function, but to compute a MAC we need a keyed hash
function. Again, I do the simplest thing that I believe isn’t broken:
concatenate the key with the ciphertext. Or more specifically:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MAC = hash(key || ctr || ciphertext)
</code></pre></div></div>

<p><strong>Update</strong>: <a href="https://lists.sr.ht/~skeeto/public-inbox/%3C5b3ef28a-c8b7-2835-9a56-6968aca5606c%40gmail.com%3E">Dimitrije Erdeljan explains why this is broken</a> and
how to fix it. Given a valid MAC, attackers can forge arbitrary messages.</p>

<p>The counter is there because xxtea uses chunked authentication with one-megabyte
chunks. It can authenticate a chunk at a time, which allows it to decrypt,
with authentication, arbitrary amounts of ciphertext in a fixed amount of
memory. The worst that can happen is truncation between chunks — an
acceptable trade-off. The counter ensures each chunk MAC is uniquely
keyed and that chunks appear in order.</p>

<p>It’s also important to note that the counter is appended <em>after</em> the key.
The counter is under hostile control — they can choose the IV — and having
the key there first means they have no information about the hash state.</p>

<p>All chunks are one megabyte except for the last chunk, which is always
shorter, signaling the end of the message. It may even be just a MAC and
zero-length ciphertext. This avoids nasty issues with parsing potentially
unauthenticated length fields and whatnot. Just stop successfully at the
first short, authenticated chunk.</p>

<p>Some will likely have spotted it, but a potential weakness is that I’m
using the same key for both encryption and authentication. These are
normally two different keys. This is disastrous in certain cases <a href="https://blog.cryptographyengineering.com/2013/02/15/why-i-hate-cbc-mac/">like
CBC-MAC</a>, but I believe it’s alright here. It would be easy to
compute a separate MAC key, but I opted for simple.</p>

<h3 id="file-format">File format</h3>

<p>In my usual style, encrypted files have no distinguishing headers or
fields. They just look like a random block of data. A file begins with the
16-byte IV, then a sequence of zero or more one megabyte chunks, ending
with a short chunk. It’s indistinguishable from <code class="language-plaintext highlighter-rouge">/dev/random</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[IV][1MiB || MAC][1MiB || MAC][&lt;1 MiB || MAC]
</code></pre></div></div>
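<p>As a sanity check on that layout, the output size can be computed from the
plaintext length. This sketch rests on assumptions of mine: that each MAC is
the full 16-byte hash, that "one megabyte" means 2<sup>20</sup> bytes of ciphertext
not counting the MAC, and that CTR mode adds no padding. The name
<code class="language-plaintext highlighter-rouge">encrypted_size</code> and the constants are hypothetical.</p>

```c
#include <stdint.h>

#define IV_LEN    16u          /* 128-bit IV */
#define MAC_LEN   16u          /* assumed: full 128-bit hash */
#define CHUNK_LEN (1u << 20)   /* assumed: ciphertext bytes per chunk */

/* Hypothetical helper: size of the encrypted file for a given input. */
uint64_t encrypted_size(uint64_t plaintext_len)
{
    uint64_t full = plaintext_len / CHUNK_LEN;   /* full chunks */
    uint64_t last = plaintext_len % CHUNK_LEN;   /* final short chunk */
    return IV_LEN + full * ((uint64_t)CHUNK_LEN + MAC_LEN) + last + MAC_LEN;
}
```

<p>Under these assumptions an empty input encrypts to 32 bytes: the IV
followed by a lone MAC. An input of exactly one megabyte still ends with a
zero-length chunk carrying its own MAC.</p>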

<p>If the user types the incorrect password, it will be discovered when
authenticating the first chunk (read: immediately). This saves on a
dedicated check at the beginning of the file, though it means it’s not
possible to distinguish between a bad password and a modified file.</p>

<p>I know my design has weaknesses as a result of artificial, self-imposed
constraints and deliberate trade-offs, but I’m curious if I’ve made any
glaring mistakes with practical consequences.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>State machines are wonderful tools</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/12/31/"/>
    <id>urn:uuid:c93d7a7b-6ae0-4b7e-afa6-424ef40b9d9c</id>
    <updated>2020-12-31T22:48:13Z</updated>
    <category term="compsci"/><category term="c"/><category term="python"/><category term="lua"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25601821">on Hacker News</a>.</em></p>

<p>I love when my current problem can be solved with a state machine. They’re
fun to design and implement, and I have high confidence about correctness.
They tend to:</p>

<ol>
  <li>Present <a href="/blog/2018/06/10/">minimal, tidy interfaces</a></li>
  <li>Require few, fixed resources</li>
  <li>Hold no opinions about input and output</li>
  <li>Have a compact, concise implementation</li>
  <li>Be easy to reason about</li>
</ol>

<p>State machines are perhaps one of those concepts you heard about in
college but never put into practice. Maybe you use them regularly.
Regardless, you certainly run into them regularly, from <a href="https://swtch.com/~rsc/regexp/">regular
expressions</a> to traffic lights.</p>

<!--more-->

<h3 id="morse-code-decoder-state-machine">Morse code decoder state machine</h3>

<p>Inspired by <a href="https://possiblywrong.wordpress.com/2020/11/21/among-us-morse-code-puzzle/">a puzzle</a>, I came up with this deterministic state
machine for decoding <a href="https://en.wikipedia.org/wiki/Morse_code">Morse code</a>. It accepts a dot (<code class="language-plaintext highlighter-rouge">'.'</code>), dash
(<code class="language-plaintext highlighter-rouge">'-'</code>), or terminator (0) one at a time, advancing through a state
machine step by step:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">morse_decode</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">t</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mh">0x03</span><span class="p">,</span> <span class="mh">0x3f</span><span class="p">,</span> <span class="mh">0x7b</span><span class="p">,</span> <span class="mh">0x4f</span><span class="p">,</span> <span class="mh">0x2f</span><span class="p">,</span> <span class="mh">0x63</span><span class="p">,</span> <span class="mh">0x5f</span><span class="p">,</span> <span class="mh">0x77</span><span class="p">,</span> <span class="mh">0x7f</span><span class="p">,</span> <span class="mh">0x72</span><span class="p">,</span>
        <span class="mh">0x87</span><span class="p">,</span> <span class="mh">0x3b</span><span class="p">,</span> <span class="mh">0x57</span><span class="p">,</span> <span class="mh">0x47</span><span class="p">,</span> <span class="mh">0x67</span><span class="p">,</span> <span class="mh">0x4b</span><span class="p">,</span> <span class="mh">0x81</span><span class="p">,</span> <span class="mh">0x40</span><span class="p">,</span> <span class="mh">0x01</span><span class="p">,</span> <span class="mh">0x58</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x68</span><span class="p">,</span> <span class="mh">0x51</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x88</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x8c</span><span class="p">,</span> <span class="mh">0x92</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x02</span><span class="p">,</span>
        <span class="mh">0x03</span><span class="p">,</span> <span class="mh">0x18</span><span class="p">,</span> <span class="mh">0x14</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x0c</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x08</span><span class="p">,</span> <span class="mh">0x1c</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x24</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x28</span><span class="p">,</span> <span class="mh">0x04</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x30</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x33</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span>
        <span class="mh">0x36</span><span class="p">,</span> <span class="mh">0x37</span><span class="p">,</span> <span class="mh">0x38</span><span class="p">,</span> <span class="mh">0x39</span><span class="p">,</span> <span class="mh">0x41</span><span class="p">,</span> <span class="mh">0x42</span><span class="p">,</span> <span class="mh">0x43</span><span class="p">,</span> <span class="mh">0x44</span><span class="p">,</span> <span class="mh">0x45</span><span class="p">,</span> <span class="mh">0x46</span><span class="p">,</span>
        <span class="mh">0x47</span><span class="p">,</span> <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x49</span><span class="p">,</span> <span class="mh">0x4a</span><span class="p">,</span> <span class="mh">0x4b</span><span class="p">,</span> <span class="mh">0x4c</span><span class="p">,</span> <span class="mh">0x4d</span><span class="p">,</span> <span class="mh">0x4e</span><span class="p">,</span> <span class="mh">0x4f</span><span class="p">,</span> <span class="mh">0x50</span><span class="p">,</span>
        <span class="mh">0x51</span><span class="p">,</span> <span class="mh">0x52</span><span class="p">,</span> <span class="mh">0x53</span><span class="p">,</span> <span class="mh">0x54</span><span class="p">,</span> <span class="mh">0x55</span><span class="p">,</span> <span class="mh">0x56</span><span class="p">,</span> <span class="mh">0x57</span><span class="p">,</span> <span class="mh">0x58</span><span class="p">,</span> <span class="mh">0x59</span><span class="p">,</span> <span class="mh">0x5a</span>
    <span class="p">};</span>
    <span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="o">-</span><span class="n">state</span><span class="p">];</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mh">0x00</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span> <span class="o">?</span> <span class="n">t</span><span class="p">[(</span><span class="n">v</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="mi">63</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="mh">0x2e</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span>  <span class="mi">2</span> <span class="o">?</span> <span class="n">state</span><span class="o">*</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="mh">0x2d</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span>  <span class="mi">1</span> <span class="o">?</span> <span class="n">state</span><span class="o">*</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="nl">default:</span>   <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It typically compiles to under 200 bytes (table included), requires only a
few bytes of memory to operate, and will fit on even the smallest of
microcontrollers. The full source listing, documentation, and
comprehensive test suite:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c">https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c</a></p>

<p>The state machine is trie-shaped, and the 100-byte table <code class="language-plaintext highlighter-rouge">t</code> is the static
<a href="/blog/2016/11/15/">encoding of the Morse code trie</a>:</p>

<p><a href="/img/diagram/morse.dot"><img src="/img/diagram/morse.svg" alt="" /></a></p>

<p>Dots traverse left and dashes traverse right. A terminal emits the
character at the current node (terminal state). Stopping on a red node or
attempting to take an unlisted edge is an error (invalid input).</p>

<p>Each node in the trie is a byte in the table. Dot and dash each have a bit
indicating if their edge exists. The remaining bits index into a 1-based
character table (at the end of <code class="language-plaintext highlighter-rouge">t</code>), and a 0 “index” indicates an empty
(red) node. The nodes themselves are laid out as <a href="https://en.wikipedia.org/wiki/Binary_heap#Heap_implementation">a binary heap in an
array</a>: the left and right children of the node at <code class="language-plaintext highlighter-rouge">i</code> are found at
<code class="language-plaintext highlighter-rouge">i*2+1</code> and <code class="language-plaintext highlighter-rouge">i*2+2</code>. No need to <a href="/blog/2020/10/19/#minimax-costs">waste memory storing edges</a>!</p>
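
<p>As a rough sketch of that layout, the helper below unpacks one node byte
the same way <code class="language-plaintext highlighter-rouge">morse_decode</code> tests it: bit 1 for the dot edge, bit 0 for the
dash edge, and the remaining bits for the character index. The struct and
function names are hypothetical and only for illustration; they are not
part of the original listing.</p>

```c
#include <assert.h>

/* Hypothetical helper type, not part of the original listing. */
struct morse_node {
    int has_dot;    /* bit 1: a dot edge leads to the child at i*2+1 */
    int has_dash;   /* bit 0: a dash edge leads to the child at i*2+2 */
    int char_index; /* remaining bits: 1-based character index, 0 = empty */
};

/* Unpack one node byte using the bit layout that morse_decode tests. */
struct morse_node morse_unpack(unsigned char v)
{
    struct morse_node n;
    n.has_dot    = (v >> 1) & 1;
    n.has_dash   = v & 1;
    n.char_index = v >> 2;
    return n;
}

/* Heap-style child positions for the node at index i. */
int morse_dot_child(int i)  { assert(i >= 0); return i*2 + 1; }
int morse_dash_child(int i) { assert(i >= 0); return i*2 + 2; }
```

<p>For example, the byte <code class="language-plaintext highlighter-rouge">0x3f</code> at index 1 unpacks to both edges present and
character index 15, which selects <code class="language-plaintext highlighter-rouge">t[15 + 63]</code>, the letter E.</p>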

<p>Since C sadly does not have multiple return values, I’m using the sign bit
of the return value to create a kind of sum type. A negative return value
is a state — which is why the state is negated internally before use. A
positive result is a character output. If zero, the input was invalid.
Only the initial state is non-negative (zero), which is fine since it’s,
by definition, not possible to traverse to the initial state. No <code class="language-plaintext highlighter-rouge">c</code> input
will produce a bad state.</p>

<p>In the original problem the terminals were missing. Despite being a <em>state
machine</em>, <code class="language-plaintext highlighter-rouge">morse_decode</code> is a pure function. The caller can save their
position in the trie by saving the state integer and trying different
inputs from that state.</p>
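
<p>To show how a caller might drive it, here is a minimal sketch that
decodes a string of space-separated Morse letters. The function
<code class="language-plaintext highlighter-rouge">morse_decode</code> is copied from above; the driver
<code class="language-plaintext highlighter-rouge">morse_decode_string</code>, its buffer interface, and its −1 error
convention are assumptions made for this example, not part of the linked
source.</p>

```c
#include <assert.h>
#include <stddef.h>

int morse_decode(int state, int c)
{
    static const unsigned char t[] = {
        0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
        0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
        0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
        0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
        0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
        0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
        0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
        0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a
    };
    int v = t[-state];
    switch (c) {
    case 0x00: return v >> 2 ? t[(v >> 2) + 63] : 0;
    case 0x2e: return v &  2 ? state*2 - 1 : 0;
    case 0x2d: return v &  1 ? state*2 - 2 : 0;
    default:   return 0;
    }
}

/* Hypothetical driver: decode space-separated Morse letters from `in`
 * into the NUL-terminated buffer `out` of capacity `cap`. Returns the
 * number of characters written, or -1 on invalid input or lack of space.
 */
int morse_decode_string(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    int state = 0;
    assert(in && out && cap > 0);
    for (;; in++) {
        int end = *in == ' ' || *in == 0;
        if (!end || state != 0) {
            /* A zero input byte asks the current node for its character. */
            state = morse_decode(state, end ? 0 : *in);
            if (state == 0) {
                return -1;  /* invalid input */
            }
            if (state > 0) {
                if (n + 1 >= cap) {
                    return -1;  /* no room for the character and NUL */
                }
                out[n++] = (char)state;
                state = 0;  /* back to the root for the next letter */
            }
        }
        if (*in == 0) {
            break;
        }
    }
    out[n] = 0;
    return (int)n;
}
```

<p>Given the input <code class="language-plaintext highlighter-rouge">"... --- ..."</code> this driver writes <code class="language-plaintext highlighter-rouge">SOS</code> and returns 3.
Because the state is just an integer, the driver holds no other
bookkeeping between characters.</p>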

<h3 id="utf-8-decoder-state-machine">UTF-8 decoder state machine</h3>

<p>The classic UTF-8 decoder state machine is <a href="https://bjoern.hoehrmann.de/utf-8/decoder/dfa/">Bjoern Hoehrmann’s Flexible
and Economical UTF-8 Decoder</a>. It packs the entire state machine into
a relatively small table using clever tricks. It’s easily my favorite
UTF-8 decoder.</p>

<p>I wanted to try my own hand at it, so I re-derived the same canonical
UTF-8 automaton:</p>

<p><a href="/img/diagram/utf8.dot"><img src="/img/diagram/utf8.svg" alt="" /></a></p>

<p>Then I encoded this diagram directly into a much larger (2,064-byte), less
elegant table, too large to display inline here:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c">https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c</a></p>

<p>However, the trade-off is that the executable code is smaller, faster, and
<a href="/blog/2017/10/06/">branchless again</a> (by accident, I swear!):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">utf8_decode</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">cp</span><span class="p">,</span> <span class="kt">int</span> <span class="n">byte</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">signed</span> <span class="kt">char</span> <span class="n">table</span><span class="p">[</span><span class="mi">8</span><span class="p">][</span><span class="mi">256</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">masks</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="kt">int</span> <span class="n">next</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">byte</span><span class="p">];</span>
    <span class="o">*</span><span class="n">cp</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">cp</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">byte</span> <span class="o">&amp;</span> <span class="n">masks</span><span class="p">[</span><span class="o">!</span><span class="n">state</span><span class="p">][</span><span class="n">next</span><span class="o">&amp;</span><span class="mi">7</span><span class="p">]);</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like Bjoern’s decoder, there’s a code point accumulator. The <em>real</em> state
machine has 1,109,950 terminal states, and many more edges and nodes. The
accumulator is an optimization to track exactly which edge was taken to
which node without having to represent such a monstrosity.</p>
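
<p>The accumulator step on its own can be sketched as follows. This is
only the shift-and-mask expression from <code class="language-plaintext highlighter-rouge">utf8_decode</code>, with the mask
passed in by hand rather than looked up. The mask values are the usual
UTF-8 payload widths, chosen for this example; the contents of the elided
<code class="language-plaintext highlighter-rouge">masks</code> table are not shown here, so this makes no claim about them.
The function names are hypothetical, and nothing is validated.</p>

```c
#include <assert.h>

/* One accumulator step: shift in six bits, then OR in the masked byte. */
long utf8_accumulate(long cp, int byte, int mask)
{
    assert(byte >= 0 && byte <= 255);
    return (cp << 6) | (byte & mask);
}

/* Hypothetical demo: accumulate the 3-byte sequence E2 82 AC. */
long utf8_demo_euro(void)
{
    long cp = 0;
    cp = utf8_accumulate(cp, 0xe2, 0x0f);  /* lead byte: 4 payload bits */
    cp = utf8_accumulate(cp, 0x82, 0x3f);  /* continuation: 6 payload bits */
    cp = utf8_accumulate(cp, 0xac, 0x3f);  /* continuation: 6 payload bits */
    return cp;
}
```

<p>The three steps produce 0x2, then 0x82, then 0x20AC, the code point
for the euro sign.</p>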

<p>Despite the huge table I’m pretty happy with it.</p>

<h3 id="word-count-state-machine">Word count state machine</h3>

<p>Here’s another state machine I came up with a while back for counting words
one Unicode code point at a time while accounting for Unicode’s various
kinds of whitespace. If your input is bytes, then plug this into the above
UTF-8 state machine to convert bytes to code points! This one uses a
switch instead of a lookup table since the table would be sparse (i.e.
<a href="/blog/2019/12/09/">let the compiler figure it out</a>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* State machine counting words in a sequence of code points.
 *
 * The current word count is the absolute value of the state, so
 * the initial state is zero. Code points are fed into the state
 * machine one at a time, each call returning the next state.
 */</span>
<span class="kt">long</span> <span class="nf">word_count</span><span class="p">(</span><span class="kt">long</span> <span class="n">state</span><span class="p">,</span> <span class="kt">long</span> <span class="n">codepoint</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">codepoint</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mh">0x0009</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000a</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000b</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000c</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000d</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x0020</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x0085</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x00a0</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x1680</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2000</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2001</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2002</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2003</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2004</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2005</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2006</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2007</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2008</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2009</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x200a</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2028</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2029</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x202f</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x205f</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x3000</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">state</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">?</span> <span class="o">-</span><span class="n">state</span> <span class="o">:</span> <span class="n">state</span><span class="p">;</span>
    <span class="nl">default:</span>
        <span class="k">return</span> <span class="n">state</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">?</span> <span class="n">state</span> <span class="o">:</span> <span class="o">-</span><span class="mi">1</span> <span class="o">-</span> <span class="n">state</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’m particularly happy with the <em>edge-triggered</em> state transition
mechanism. The sign of the state tracks whether the “signal” is “high”
(inside of a word) or “low” (outside of a word), and so it counts rising
edges.</p>

<p><a href="/img/diagram/wordcount.dot"><img src="/img/diagram/wordcount.svg" alt="" /></a></p>

<p>The counter is not <em>technically</em> part of the state machine — though it
eventually overflows for practical reasons, it isn’t really “finite” — but
is rather an external count of the times the state machine transitions
from low to high, which is the actual, useful output.</p>
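
<p>A minimal sketch of a caller, assuming the input is already an array
of code points: <code class="language-plaintext highlighter-rouge">word_count</code> is copied from above, while
<code class="language-plaintext highlighter-rouge">count_words</code> is a hypothetical wrapper added for illustration. It
feeds each code point through the state machine and takes the absolute
value of the final state.</p>

```c
#include <assert.h>
#include <stddef.h>

long word_count(long state, long codepoint)
{
    switch (codepoint) {
    case 0x0009: case 0x000a: case 0x000b: case 0x000c: case 0x000d:
    case 0x0020: case 0x0085: case 0x00a0: case 0x1680: case 0x2000:
    case 0x2001: case 0x2002: case 0x2003: case 0x2004: case 0x2005:
    case 0x2006: case 0x2007: case 0x2008: case 0x2009: case 0x200a:
    case 0x2028: case 0x2029: case 0x202f: case 0x205f: case 0x3000:
        return state < 0 ? -state : state;
    default:
        return state < 0 ? state : -1 - state;
    }
}

/* Hypothetical wrapper: count the words in an array of code points. */
long count_words(const long *cps, size_t len)
{
    long state = 0;
    assert(cps || len == 0);
    for (size_t i = 0; i < len; i++) {
        state = word_count(state, cps[i]);
    }
    /* A negative final state means the input ended inside a word. */
    return state < 0 ? -state : state;
}
```

<p>Since the rising edge is what increments the count, a word is counted
as soon as it starts, so input that ends mid-word needs no special
handling.</p>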

<p><em>Reader challenge</em>: Find a slick, efficient way to encode all those code
points as a table rather than rely on whatever the compiler generates for
the <code class="language-plaintext highlighter-rouge">switch</code> (chain of branches, jump table?).</p>
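
<p>One possible starting point, and certainly not the slickest answer, is
a sorted table of inclusive ranges scanned in order. The table and
function names below are hypothetical; the ranges cover the same 25 code
points as the <code class="language-plaintext highlighter-rouge">switch</code> above.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table: inclusive ranges in ascending order. */
static const unsigned short ws_ranges[][2] = {
    {0x0009, 0x000d}, {0x0020, 0x0020}, {0x0085, 0x0085}, {0x00a0, 0x00a0},
    {0x1680, 0x1680}, {0x2000, 0x200a}, {0x2028, 0x2029}, {0x202f, 0x202f},
    {0x205f, 0x205f}, {0x3000, 0x3000}
};

/* Return 1 if the code point is in the table, otherwise 0. */
int is_whitespace(long codepoint)
{
    size_t n = sizeof(ws_ranges) / sizeof(ws_ranges[0]);
    for (size_t i = 0; i < n; i++) {
        assert(ws_ranges[i][0] <= ws_ranges[i][1]);  /* table sanity */
        if (codepoint < ws_ranges[i][0]) {
            return 0;  /* sorted, so no later range can match */
        }
        if (codepoint <= ws_ranges[i][1]) {
            return 1;
        }
    }
    return 0;
}
```

<p>With only ten ranges a linear scan is probably fine, and ordinary
text exits at the second range since most code points in it are either
whitespace or fall just above U+0020.</p>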

<h3 id="coroutines-and-generators-as-state-machines">Coroutines and generators as state machines</h3>

<p>In languages that support them, state machines can be implemented using
coroutines, including generators. I do particularly like the idea of
<a href="/blog/2018/05/31/">compiler-synthesized coroutines</a> as state machines, though this is a
rare treat. The state is implicit in the coroutine at each yield, so the
programmer doesn’t have to manage it explicitly. (Though often that
explicit control is powerful!)</p>

<p>Unfortunately in practice it always feels clunky. The following implements
the word count state machine (albeit in a rather un-Pythonic way). The
generator returns the current count and is continued by sending it another
code point:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WHITESPACE</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mh">0x0009</span><span class="p">,</span> <span class="mh">0x000a</span><span class="p">,</span> <span class="mh">0x000b</span><span class="p">,</span> <span class="mh">0x000c</span><span class="p">,</span> <span class="mh">0x000d</span><span class="p">,</span>
    <span class="mh">0x0020</span><span class="p">,</span> <span class="mh">0x0085</span><span class="p">,</span> <span class="mh">0x00a0</span><span class="p">,</span> <span class="mh">0x1680</span><span class="p">,</span> <span class="mh">0x2000</span><span class="p">,</span>
    <span class="mh">0x2001</span><span class="p">,</span> <span class="mh">0x2002</span><span class="p">,</span> <span class="mh">0x2003</span><span class="p">,</span> <span class="mh">0x2004</span><span class="p">,</span> <span class="mh">0x2005</span><span class="p">,</span>
    <span class="mh">0x2006</span><span class="p">,</span> <span class="mh">0x2007</span><span class="p">,</span> <span class="mh">0x2008</span><span class="p">,</span> <span class="mh">0x2009</span><span class="p">,</span> <span class="mh">0x200a</span><span class="p">,</span>
    <span class="mh">0x2028</span><span class="p">,</span> <span class="mh">0x2029</span><span class="p">,</span> <span class="mh">0x202f</span><span class="p">,</span> <span class="mh">0x205f</span><span class="p">,</span> <span class="mh">0x3000</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">def</span> <span class="nf">wordcount</span><span class="p">():</span>
    <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># low signal
</span>            <span class="n">codepoint</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">count</span>
            <span class="k">if</span> <span class="n">codepoint</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">WHITESPACE</span><span class="p">:</span>
                <span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
                <span class="k">break</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># high signal
</span>            <span class="n">codepoint</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">count</span>
            <span class="k">if</span> <span class="n">codepoint</span> <span class="ow">in</span> <span class="n">WHITESPACE</span><span class="p">:</span>
                <span class="k">break</span>
</code></pre></div></div>

<p>However, the generator ceremony dominates the interface, so you’d probably
want to wrap it in something nicer — at which point there’s really no
reason to use the generator in the first place:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wc</span> <span class="o">=</span> <span class="n">wordcount</span><span class="p">()</span>
<span class="nb">next</span><span class="p">(</span><span class="n">wc</span><span class="p">)</span>  <span class="c1"># prime the generator
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">'A'</span><span class="p">))</span>  <span class="c1"># =&gt; 1
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span>  <span class="c1"># =&gt; 1
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">'B'</span><span class="p">))</span>  <span class="c1"># =&gt; 2
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span>  <span class="c1"># =&gt; 2
</span></code></pre></div></div>

<p>Same idea in Lua, which famously has full coroutines:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">local</span> <span class="n">WHITESPACE</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">[</span><span class="mh">0x0009</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000a</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000b</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000c</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x000d</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x0020</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x0085</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x00a0</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x1680</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2000</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2001</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2002</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2003</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2004</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2005</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2006</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2007</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2008</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2009</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x200a</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2028</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2029</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x202f</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x205f</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x3000</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span>
<span class="p">}</span>

<span class="k">function</span> <span class="nf">wordcount</span><span class="p">()</span>
    <span class="kd">local</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
        <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
            <span class="c1">-- low signal</span>
            <span class="kd">local</span> <span class="n">codepoint</span> <span class="o">=</span> <span class="nb">coroutine.yield</span><span class="p">(</span><span class="n">count</span><span class="p">)</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">WHITESPACE</span><span class="p">[</span><span class="n">codepoint</span><span class="p">]</span> <span class="k">then</span>
                <span class="n">count</span> <span class="o">=</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span>
                <span class="k">break</span>
            <span class="k">end</span>
        <span class="k">end</span>
        <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
            <span class="c1">-- high signal</span>
            <span class="kd">local</span> <span class="n">codepoint</span> <span class="o">=</span> <span class="nb">coroutine.yield</span><span class="p">(</span><span class="n">count</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">WHITESPACE</span><span class="p">[</span><span class="n">codepoint</span><span class="p">]</span> <span class="k">then</span>
                <span class="k">break</span>
            <span class="k">end</span>
        <span class="k">end</span>
    <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Except for initially priming the coroutine, at least <code class="language-plaintext highlighter-rouge">coroutine.wrap()</code>
hides the fact that it’s a coroutine.</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wc</span> <span class="o">=</span> <span class="nb">coroutine.wrap</span><span class="p">(</span><span class="n">wordcount</span><span class="p">)</span>
<span class="n">wc</span><span class="p">()</span>  <span class="c1">-- prime the coroutine</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">'A'</span><span class="p">))</span>  <span class="c1">-- =&gt; 1</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>  <span class="c1">-- =&gt; 1</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">'B'</span><span class="p">))</span>  <span class="c1">-- =&gt; 2</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>  <span class="c1">-- =&gt; 2</span>
</code></pre></div></div>

<h3 id="extra-examples">Extra examples</h3>

<p>Finally, a couple more examples not worth describing in detail here. First
a Unicode case folding state machine:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/misc/casefold.c">https://github.com/skeeto/scratch/blob/master/misc/casefold.c</a></p>

<p>It’s just an interface to do a lookup into the <a href="https://www.unicode.org/Public/13.0.0/ucd/CaseFolding.txt">official case folding
table</a>. It was an experiment, and I <em>probably</em> wouldn’t use it in a
real program.</p>

<p>Second, I’ve mentioned <a href="https://github.com/skeeto/utf-7">my UTF-7 encoder and decoder</a> before. It’s
not obvious from the interface, but internally it’s just a state machine
for both encoder and decoder, which is what allows it to “pause”
between any pair of input/output bytes.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>You might not need machine learning</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/11/24/"/>
    <id>urn:uuid:91aa121d-c796-4c11-99d4-41c707637672</id>
    <updated>2020-11-24T04:04:36Z</updated>
    <category term="ai"/><category term="c"/><category term="media"/><category term="compsci"/><category term="video"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25196574">on Hacker News</a>.</em></p>

<p>Machine learning is a trendy topic, so naturally it’s often used for
inappropriate purposes where a simpler, more efficient, and more reliable
solution suffices. The other day I saw an illustrative and fun example of
this: <a href="https://www.youtube.com/watch?v=-sg-GgoFCP0">Neural Network Cars and Genetic Algorithms</a>. The video
demonstrates 2D cars driven by a neural network with weights determined by
a genetic algorithm. However, the entire scheme can be replaced by a
first-degree polynomial without any loss in capability. The machine
learning part is overkill.</p>

<p><a href="https://nullprogram.com/video/?v=racetrack"><img src="/img/screenshot/racetrack.jpg" alt="" /></a></p>

<!--more-->

<p>The video linked above demonstrates my implementation using a polynomial to drive the cars.
My wife drew the background. There’s no path-finding; these cars are just
feeling their way along the track, “following the rails” so to speak.</p>

<p>My intention is not to pick on this project in particular. The likely
motivation in the first place was a desire to apply a neural network to
<em>something</em>. Many of my own projects are little more than a vehicle to try
something new, so I can sympathize. A professional setting is different,
though: there, machine learning should be viewed with a more skeptical
eye than it’s usually given. For instance, don’t use active learning to
select sample distribution when a <a href="http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/">quasirandom sequence</a> will do.</p>

<p>In the video, the car has a limited turn radius, and minimum and maximum
speeds. (I’ve retained these constraints in my own simulation.) There are
five sensors — forward, forward-diagonals, and sides — each sensing the
distance to the nearest wall. These are fed into a 3-layer neural network,
and the outputs determine throttle and steering. Sounds pretty cool!</p>

<p><img src="/img/diagram/racecar.svg" alt="" /></p>

<p>A key feature of neural networks is that the outputs are a nonlinear
function of the inputs. However, steering a 2D car is simple enough that
<strong>a linear function is more than sufficient</strong>, and neural networks are
unnecessary. Here are my equations:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>steering = C0*input1 - C0*input3
throttle = C1*input2
</code></pre></div></div>

<p>I only need three of the original inputs — forward for throttle, and
diagonals for steering — and the driver has just two parameters, <code class="language-plaintext highlighter-rouge">C0</code> and
<code class="language-plaintext highlighter-rouge">C1</code>, the polynomial coefficients. Optimal values depend on the track
layout and car configuration, but for my simulation, most values above 0
and below 1 are good enough in most cases. It’s less a matter of crashing
and more about navigating the course quickly.</p>
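
<p>As a minimal sketch, assuming the three sensor readings are already
available as distances, the whole driver fits in one small C function. The
struct and function names are illustrative, not the ones used in the actual
program:</p>

<pre><code class="language-c">/* Illustrative only: names and layout are assumptions. */
struct controls {
    float steering;  /* sign selects the turn direction */
    float throttle;
};

/* input1, input3: diagonal sensor distances; input2: forward distance.
 * c0, c1: the two polynomial coefficients. */
struct controls
drive(float c0, float c1, float input1, float input2, float input3)
{
    struct controls out;
    out.steering = c0*input1 - c0*input3;
    out.throttle = c1*input2;
    return out;
}
</code></pre>

<p>Equal diagonal readings cancel, so the car steers straight when it is
centered between the walls, and throttle grows with the amount of clear road
ahead. A simulation would presumably still clamp both outputs to the car’s
turn radius and speed limits.</p>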

<p>The lengths of the red lines below are the driver’s three inputs:</p>

<video src="/vid/racecar.mp4" width="530" height="330" loop="" muted="" autoplay="" controls="">
</video>

<p>These polynomials are obviously much faster than a neural network, but
they’re also easy to understand and debug. I can confidently reason about
the entire range of possible inputs rather than worry about a trained
neural network <a href="https://arxiv.org/abs/1903.06638">responding strangely</a> to untested inputs.</p>

<p>Instead of doing anything fancy, my program generates the coefficients at
random to explore the space. If I wanted to generate a good driver for a
course, I’d run a few thousand of these and pick the coefficients that
complete the course in the shortest time. For instance, these coefficients
make for a fast, capable driver for the course featured at the top of the
article:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C0 = 0.896336973, C1 = 0.0354805067
</code></pre></div></div>

<p>Many constants can complete the track, but some will be faster than
others. If I were developing a racing game using this as the AI, I’d not
just pick constants that successfully complete the track, but the ones
that do it quickly. Here’s what the spread can look like:</p>

<video src="/vid/racecars.mp4" width="530" height="330" loop="" muted="" autoplay="" controls="">
</video>

<p>If you want to play around with this yourself, here’s my C source code
that implements this driving AI and <a href="/blog/2017/11/03/">generates the videos and images
above</a>:</p>

<p><strong><a href="https://github.com/skeeto/scratch/blob/master/aidrivers/aidrivers.c">aidrivers.c</a></strong></p>

<p>Racetracks are just images drawn in your favorite image editing program
using the colors documented in the source header.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Improving on QBasic's Random Number Generator</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/11/17/"/>
    <id>urn:uuid:9aba5382-01e4-41fc-bc27-b996b3c17f07</id>
    <updated>2020-11-17T02:51:23Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25120083">on Hacker News</a>.</em></p>

<p><a href="https://www.pixelships.com/">Pixelmusement</a> produces videos about <a href="/blog/2020/10/19/">MS-DOS games</a> and software.
Each video ends with a short, randomly-selected listing of financial
backers. In <a href="https://www.youtube.com/watch?v=YVV9bkbpaPY">ADG Filler #57</a>, Kris revealed the selection process,
and it absolutely fits the channel’s core theme: a <a href="https://en.wikipedia.org/wiki/QBasic">QBasic</a> program.
His program relies on QBasic’s built-in pseudorandom number generator
(PRNG). Even accounting for the platform’s limitations, the PRNG is much
poorer quality than it could be. Let’s discuss these weaknesses and figure
out how to make the selection more fair.</p>

<!--more-->

<p>Kris’s program seeds the PRNG with the system clock (<code class="language-plaintext highlighter-rouge">RANDOMIZE TIMER</code>, a
QBasic idiom), populates an array with the backers represented as integers
(indices), continuously shuffles the list until the user presses a key, then
finally prints out a random selection from the array. Here’s a simplified
version of the program (note: QBasic comments start with apostrophe <code class="language-plaintext highlighter-rouge">'</code>):</p>

<pre><code class="language-qbasic">CONST ntickets = 203  ' input parameter
CONST nresults = 12

RANDOMIZE TIMER

DIM tickets(0 TO ntickets - 1) AS LONG
FOR i = 0 TO ntickets - 1
    tickets(i) = i
NEXT

CLS
PRINT "Press any key to stop shuffling..."
DO
    i = INT(RND * ntickets)
    j = INT(RND * ntickets)
    SWAP tickets(i), tickets(j)
LOOP WHILE INKEY$ = ""

FOR i = 0 TO nresults - 1
    PRINT tickets(i)
NEXT
</code></pre>

<p>This should be readable even if you don’t know QBasic. Note: In the real
program, backers at higher tiers get multiple tickets in order to weight
the results. This is accounted for in the final loop such that nobody
appears more than once. It’s mostly irrelevant to the discussion here, so
I’ve omitted it.</p>

<p>The final result is ultimately a function of just three inputs:</p>

<ol>
  <li>The system clock (<code class="language-plaintext highlighter-rouge">TIMER</code>)</li>
  <li>The total number of tickets</li>
  <li>The number of loop iterations until a key press</li>
</ol>

<p>The second item has the nice property that by becoming a backer you influence
the result.</p>

<h3 id="qbasic-rnd">QBasic RND</h3>

<p>QBasic’s PRNG is this 24-bit <a href="https://en.wikipedia.org/wiki/Linear_congruential_generator">Linear Congruential Generator</a> (LCG):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">rnd24</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The result is the entire 24-bit state. <code class="language-plaintext highlighter-rouge">RND</code> divides this by 2^24 and
returns it as a single precision float so that the caller receives a value
between 0 and 1 (exclusive).</p>

<p>Needless to say, this is a very poor PRNG. The <a href="/blog/2019/11/19/">LCG constants are
<em>reasonable</em></a>, but the choice to limit the state to 24 bits is
strange. According to the <a href="https://www.qb64.org/forum/index.php?topic=1414.0">QBasic 16-bit assembly</a> (note: the LCG
constants listed here <a href="http://www.qb64.net/forum/index_topic_10727-0/">are wrong</a>), the implementation is a full
32-bit multiply using 16-bit limbs, and it allocates and writes a full 32
bits when storing the state. As expected for the 8086, there was nothing
gained by using only the lower 24 bits.</p>

<p>To illustrate how poor it is, here’s a <a href="https://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html">randogram</a> for this PRNG,
which shows obvious structure. (This is a small slice of a 4096x4096
randogram where each of the 2^23 24-bit samples is plotted as two 12-bit
coordinates.)</p>

<p><img src="/img/qbasic/rnd-thumb.png" alt="" /></p>

<p>Admittedly this far <a href="https://www.pcg-random.org/paper.html"><em>overtaxes</em></a> the PRNG. With a 24-bit state, it’s
only good for 4,096 (2^12) outputs, after which it no longer follows the
<a href="/blog/2019/07/22/">birthday paradox</a>: No outputs are repeated even though we should
start seeing some. However, as I’ll soon show, this doesn’t actually
matter.</p>

<p>Instead of discarding the high 8 bits — the highest quality output bits —
QBasic’s designers should have discarded the <em>low</em> 8 bits for the output,
turning it into a <em>truncated 32-bit LCG</em>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">rnd32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This LCG would have the same performance, but significantly better
quality. Here’s the randogram for this PRNG, and it is <em>also</em> heavily
overtaxed (more than 65,536, or 2^16, outputs).</p>

<p><img src="/img/qbasic/rnd32-thumb.png" alt="" /></p>

<p>It’s a solid upgrade, <em>completely for free</em>!</p>

<h3 id="qbasic-randomize">QBasic RANDOMIZE</h3>

<p>That’s not the end of our troubles. The <code class="language-plaintext highlighter-rouge">RANDOMIZE</code> statement accepts a
double precision (i.e. 64-bit) seed. The high 16 bits of its IEEE 754
binary representation are XORed with the next highest 16 bits. The high 16
bits of the 24-bit PRNG state are set to this result. The lowest 8 bits are
preserved.</p>

<p>To make this clearer, here’s a C implementation, verified against QBasic
7.1:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="n">s</span><span class="p">;</span>

<span class="kt">void</span>
<span class="nf">randomize</span><span class="p">(</span><span class="kt">double</span> <span class="n">seed</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">seed</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">24</span> <span class="o">^</span> <span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">40</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffff00</span> <span class="o">|</span> <span class="p">(</span><span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0xff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In other words, <strong><code class="language-plaintext highlighter-rouge">RANDOMIZE</code> only sets the PRNG to one of 65,536 possible
states</strong>.</p>
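
<p>To see the collapse concretely, the same bit manipulation can be written
as a pure function and fed nearby seeds. This is a sketch that assumes a
64-bit IEEE 754 <code class="language-plaintext highlighter-rouge">double</code>; the name <code class="language-plaintext highlighter-rouge">randomize_state</code> is hypothetical:</p>

<pre><code class="language-c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Hypothetical pure variant of randomize(): returns the new state
 * instead of writing a global. Assumes double is IEEE 754 binary64. */
uint32_t
randomize_state(uint32_t s, double seed)
{
    uint64_t x;
    memcpy(&amp;x, &amp;seed, 8);
    /* Only the top 32 bits of the seed reach bits 8..23 of the state. */
    return (uint32_t)(((x&gt;&gt;24 ^ x&gt;&gt;40) &amp; 0xffff00) | (s &amp; 0xff));
}
</code></pre>

<p>The seeds 1.0 and 1.0000001 differ only in the low 32 bits of their
representation, so starting from a zero state both produce <code class="language-plaintext highlighter-rouge">0x3ff000</code>.
The low byte of the previous state passes through untouched.</p>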

<p>As the final piece, here’s how <code class="language-plaintext highlighter-rouge">RND</code> is implemented, also verified against
QBasic 7.1:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">rnd</span><span class="p">(</span><span class="kt">float</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">arg</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">arg</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">s</span> <span class="o">/</span> <span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="mh">0x1000000</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="system-clock-seed">System clock seed</h3>

<p>The <a href="https://www.qb64.org/wiki/TIMER"><code class="language-plaintext highlighter-rouge">TIMER</code> function</a> returns the single precision number of
seconds since midnight with ~55ms precision (i.e. the 18.2Hz timer
interrupt counter). This is strictly time of day, and the current date is
not part of the result, unlike, say, the Unix epoch.</p>

<p>This means there are only 1,572,480 distinct values returned by <code class="language-plaintext highlighter-rouge">TIMER</code>.
That’s small even before considering that these map onto only 65,536
possible seeds with <code class="language-plaintext highlighter-rouge">RANDOMIZE</code> — all of which <em>are</em> fortunately
realizable via <code class="language-plaintext highlighter-rouge">TIMER</code>.</p>

<p>Of the three inputs to random selection, this first one is looking pretty
bad.</p>

<h3 id="loop-iterations">Loop iterations</h3>

<p>Kris’s idea of continuously mixing the array until he presses a key makes
up for many of the QBasic PRNG’s weaknesses. He lets it run for over 200,000
array swaps — traversing over 2% of the PRNG’s period — and the array
itself acts like an extended PRNG state, supplementing the 24-bit <code class="language-plaintext highlighter-rouge">RND</code>
state.</p>

<p>Since iterations fly by quickly, the exact number of iterations becomes
another <a href="/blog/2019/04/30/">source of entropy</a>. The results will be quite different if it
runs 214,600 iterations versus 273,500 iterations.</p>

<p>Possible improvement: Only exit the loop when a certain key is pressed. If
any other key is pressed then that input and the <code class="language-plaintext highlighter-rouge">TIMER</code> are mixed into
the PRNG state. Mashing the keyboard during the loop introduces more
entropy.</p>

<h3 id="replacing-the-prng">Replacing the PRNG</h3>

<p>Since the built-in PRNG is so poor, we could improve the situation by
implementing a <a href="/blog/2017/09/21/">new one</a> in QBasic itself. The challenge is that
QBasic has no unsigned integers, not even unsigned integer operators (e.g.
Java and JavaScript’s <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code>), and signed overflow is a run-time error. We
can’t even re-implement QBasic’s own LCG without doing long multiplication
in software, since the intermediate result overflows its 32-bit <code class="language-plaintext highlighter-rouge">LONG</code>.</p>
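
<p>To give a sense of what that software long multiplication involves, here
is a C sketch that steps the 24-bit LCG using only signed 32-bit arithmetic
that never overflows, which is the constraint <code class="language-plaintext highlighter-rouge">LONG</code> imposes. The limb
split (8 bits high, 16 bits low) and the function name are choices made for
this illustration:</p>

<pre><code class="language-c">#include &lt;stdint.h&gt;

/* One step of the 24-bit LCG where every intermediate fits in a signed
 * 32-bit integer. The state must be in the range 0..0xffffff. */
int32_t
rnd24_signed(int32_t s)
{
    int32_t slo = s % 65536, shi = s / 65536;  /* 16-bit and 8-bit limbs */
    int32_t alo = 0x43fd,    ahi = 0xfd;       /* 0xfd43fd, split alike */
    int32_t lo  = slo*alo % 0x1000000;         /* product fits in 31 bits */
    int32_t mid = (shi*alo + slo*ahi) % 256;   /* only 8 bits survive */
    return (lo + mid*65536 + 0xc39ec3) % 0x1000000;
}
</code></pre>

<p>The <code class="language-plaintext highlighter-rouge">shi*ahi</code> term is dropped entirely since it only affects bits 32
and up. In QBasic the same shape would use integer division (<code class="language-plaintext highlighter-rouge">\</code>) and
<code class="language-plaintext highlighter-rouge">MOD</code> on <code class="language-plaintext highlighter-rouge">LONG</code> values.</p>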

<p>Popular choices under these constraints are the <a href="https://en.wikipedia.org/wiki/Lehmer_random_number_generator">Park–Miller generator</a> (as
we saw <a href="/blog/2018/12/25/">in Bash</a>) or a <a href="https://en.wikipedia.org/wiki/Lagged_Fibonacci_generator">lagged Fibonacci generator</a> (as used by
Emacs, which was for a long time constrained to 29-bit integers).</p>

<p>However, I have a better idea: a PRNG based on <a href="https://en.wikipedia.org/wiki/RC4">RC4</a>. Specifically,
my own design called <a href="https://github.com/skeeto/scratch/tree/master/sp4"><strong>Sponge4</strong></a>, a <a href="https://en.wikipedia.org/wiki/Sponge_function">sponge construction</a>
built atop RC4. In short: Mixing in more input is just a matter of running
the key schedule again. Implementing this PRNG requires just two simple
operations: modular addition over 2^8, and array swap. QBasic has a <code class="language-plaintext highlighter-rouge">SWAP</code>
statement, so it’s a natural fit!</p>

<p>Sponge4 (RC4) has much higher quality output than the 24-bit LCG, and I
can mix in more sources of entropy. With its 1,700-bit state, it can
absorb quite a bit of entropy without loss.</p>

<h4 id="learning-qbasic">Learning QBasic</h4>

<p>Until this past weekend, I had not touched QBasic for about 23 years and
had to learn it essentially from scratch. Though within a couple of hours
I probably already understood it better than I ever had. That’s in large
part because I’m far more experienced, but also probably because QBasic
tutorials are universally awful. Not surprisingly they’re written for
beginners, but they also seem to be all written <em>by</em> beginners, too. I
soon got the impression that the QBasic community has usually been another
case of <a href="/blog/2019/09/25/">the blind leading the blind</a>.</p>

<p>There’s little direct information for experienced programmers, and even
the official documentation tends to be thin in important places. I wanted
documentation that started with the core language semantics:</p>

<ul>
  <li>
    <p>The basic types are INTEGER (int16), LONG (int32), SINGLE (float32),
DOUBLE (float64), and two flavors of STRING, fixed-width and
variable-width. Late versions also had incomplete support for a 64-bit,
10,000x fixed-point CURRENCY type.</p>
  </li>
  <li>
    <p>Variables are SINGLE by default and do not need to be declared ahead of
time. Arrays have 11 elements by default.</p>
  </li>
  <li>
    <p>Variables, constants, and functions may have a suffix if their type is
not SINGLE: INTEGER <code class="language-plaintext highlighter-rouge">%</code>, LONG <code class="language-plaintext highlighter-rouge">&amp;</code>, SINGLE <code class="language-plaintext highlighter-rouge">!</code>, DOUBLE <code class="language-plaintext highlighter-rouge">#</code>, STRING <code class="language-plaintext highlighter-rouge">$</code>,
and CURRENCY <code class="language-plaintext highlighter-rouge">@</code>. For functions, this is the return type.</p>
  </li>
  <li>
    <p>Each variable type has its own namespace, i.e. <code class="language-plaintext highlighter-rouge">i%</code> is distinct from
<code class="language-plaintext highlighter-rouge">i&amp;</code>. Arrays are also their own namespace, i.e. <code class="language-plaintext highlighter-rouge">i%</code> is distinct from
<code class="language-plaintext highlighter-rouge">i%(0)</code> is distinct from <code class="language-plaintext highlighter-rouge">i&amp;(0)</code>.</p>
  </li>
  <li>
    <p>Variables may be declared explicitly with <code class="language-plaintext highlighter-rouge">DIM</code>. Declaring a variable
with <code class="language-plaintext highlighter-rouge">DIM</code> allows the suffix to be omitted. It also locks that name out
of the other type namespaces, i.e. <code class="language-plaintext highlighter-rouge">DIM i AS LONG</code> makes any use of <code class="language-plaintext highlighter-rouge">i%</code>
invalid in that scope. Though arrays and scalars can still have the same
name even with <code class="language-plaintext highlighter-rouge">DIM</code> declarations.</p>
  </li>
  <li>
    <p>Numeric operations with mixed types implicitly promote like C.</p>
  </li>
  <li>
    <p>Functions and subroutines have a single, common namespace regardless of
function suffix. As a result, the suffix can (usually) be omitted at
function call sites. Built-in functions are special in this case.</p>
  </li>
  <li>
    <p>Despite initial appearances, QBasic is statically-typed.</p>
  </li>
  <li>
    <p>The default is pass-by-reference. Use <code class="language-plaintext highlighter-rouge">BYVAL</code> to pass by value.</p>
  </li>
  <li>
    <p>In array declarations, the parameter is not the <em>size</em> but the largest
index. Multidimensional arrays are supported. Arrays need not be indexed
starting at zero (e.g. <code class="language-plaintext highlighter-rouge">(x TO y)</code>), though this is the default.</p>
  </li>
  <li>
    <p>Strings are not arrays, but their own special thing with special
accessor statements and functions.</p>
  </li>
  <li>
    <p>Scopes are module, subroutine, and function. “Global” variables must be
declared with <code class="language-plaintext highlighter-rouge">SHARED</code>.</p>
  </li>
  <li>
    <p>Users can define custom structures with <code class="language-plaintext highlighter-rouge">TYPE</code>. Functions cannot return
user-defined types and instead rely on pass-by-reference.</p>
  </li>
  <li>
    <p>A crude kind of dynamic allocation is supported with <code class="language-plaintext highlighter-rouge">REDIM</code> to resize
<code class="language-plaintext highlighter-rouge">$DYNAMIC</code> arrays at run-time. <code class="language-plaintext highlighter-rouge">ERASE</code> frees allocations.</p>
  </li>
</ul>

<p><em>These</em> are the semantics I wanted to know getting started. Throw in some
illustrative examples, and then it’s a tutorial for experienced
developers. (Future article perhaps?) Anyway, that’s enough to follow
along below.</p>

<h4 id="implementing-sponge4">Implementing Sponge4</h4>

<p>Like RC4, I need a 256-element byte array, and two 1-byte indices, <code class="language-plaintext highlighter-rouge">i</code> and
<code class="language-plaintext highlighter-rouge">j</code>. Sponge4 also keeps a third 1-byte counter, <code class="language-plaintext highlighter-rouge">k</code>, to count input.</p>

<pre><code class="language-qbasic">TYPE sponge4
    i AS INTEGER
    j AS INTEGER
    k AS INTEGER
    s(0 TO 255) AS INTEGER
END TYPE
</code></pre>

<p>QBasic doesn’t have a “byte” type. A fixed-size 256-byte string would
normally be a good match here, but since they’re not arrays, strings are
not compatible with <code class="language-plaintext highlighter-rouge">SWAP</code> and are not indexed efficiently. So instead I
accept some wasted space and use 16-bit integers for everything.</p>

<p>There are four “methods” for this structure. Three are subroutines since
they don’t return a value, but mutate the sponge. The last, <code class="language-plaintext highlighter-rouge">squeeze</code>,
returns the next byte as an INTEGER (<code class="language-plaintext highlighter-rouge">%</code>).</p>

<pre><code class="language-qbasic">DECLARE SUB init (r AS sponge4)
DECLARE SUB absorb (r AS sponge4, b AS INTEGER)
DECLARE SUB absorbstop (r AS sponge4)
DECLARE FUNCTION squeeze% (r AS sponge4)
</code></pre>

<p>Initialization follows RC4:</p>

<pre><code class="language-qbasic">SUB init (r AS sponge4)
    r.i = 0
    r.j = 0
    r.k = 0
    FOR i% = 0 TO 255
        r.s(i%) = i%
    NEXT
END SUB
</code></pre>

<p>Absorbing a byte means running the RC4 key schedule one step. Absorbing a
“stop” symbol, for separating inputs, transforms the state in a way that
absorbing a byte cannot.</p>

<pre><code class="language-qbasic">SUB absorb (r AS sponge4, b AS INTEGER)
    r.j = (r.j + r.s(r.i) + b) MOD 256
    SWAP r.s(r.i), r.s(r.j)
    r.i = (r.i + 1) MOD 256
    r.k = (r.k + 1) MOD 256
END SUB

SUB absorbstop (r AS sponge4)
    r.j = (r.j + 1) MOD 256
END SUB
</code></pre>

<p>Squeezing a byte may involve mixing the state first, then it runs the RC4
generator normally.</p>

<pre><code class="language-qbasic">FUNCTION squeeze% (r AS sponge4)
    IF r.k &gt; 0 THEN
        absorbstop r
        DO WHILE r.k &gt; 0
            absorb r, r.k
        LOOP
    END IF

    r.j = (r.j + r.i) MOD 256
    r.i = (r.i + 1) MOD 256
    SWAP r.s(r.i), r.s(r.j)
    squeeze% = r.s((r.s(r.i) + r.s(r.j)) MOD 256)
END FUNCTION
</code></pre>

<p>That’s the entire generator in QBasic! A couple more helper functions will
be useful, though. One absorbs entire strings, and the second emits 24-bit
results.</p>

<pre><code class="language-qbasic">SUB absorbstr (r AS sponge4, s AS STRING)
    FOR i% = 1 TO LEN(s)
        absorb r, ASC(MID$(s, i%))
    NEXT
END SUB

FUNCTION squeeze24&amp; (r AS sponge4)
    b0&amp; = squeeze%(r)
    b1&amp; = squeeze%(r)
    b2&amp; = squeeze%(r)
    squeeze24&amp; = b2&amp; * &amp;H10000 + b1&amp; * &amp;H100 + b0&amp;
END FUNCTION
</code></pre>

<p>QBasic doesn’t have bit-shift operations, so we must make do with
multiplication. The <code class="language-plaintext highlighter-rouge">&amp;H</code> is hexadecimal notation.</p>
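
<p>For readers who think more easily in C, the listings above transliterate
almost line for line. This is a sketch made for comparison, with
hypothetical <code class="language-plaintext highlighter-rouge">sp_</code> names, and is not the C implementation from the
Sponge4 repository:</p>

<pre><code class="language-c">struct sponge4 {
    int i, j, k;
    int s[256];
};

void
sp_init(struct sponge4 *r)
{
    r-&gt;i = r-&gt;j = r-&gt;k = 0;
    for (int i = 0; i &lt; 256; i++) {
        r-&gt;s[i] = i;
    }
}

/* Absorb one byte (0..255): a single RC4 key schedule step. */
void
sp_absorb(struct sponge4 *r, int b)
{
    r-&gt;j = (r-&gt;j + r-&gt;s[r-&gt;i] + b) % 256;
    int t = r-&gt;s[r-&gt;i]; r-&gt;s[r-&gt;i] = r-&gt;s[r-&gt;j]; r-&gt;s[r-&gt;j] = t;
    r-&gt;i = (r-&gt;i + 1) % 256;
    r-&gt;k = (r-&gt;k + 1) % 256;
}

void
sp_absorbstop(struct sponge4 *r)
{
    r-&gt;j = (r-&gt;j + 1) % 256;
}

/* Return the next output byte, mixing first if input is pending. */
int
sp_squeeze(struct sponge4 *r)
{
    if (r-&gt;k &gt; 0) {
        sp_absorbstop(r);
        while (r-&gt;k &gt; 0) {
            sp_absorb(r, r-&gt;k);  /* runs until k wraps around to 0 */
        }
    }
    r-&gt;j = (r-&gt;j + r-&gt;i) % 256;
    r-&gt;i = (r-&gt;i + 1) % 256;
    int t = r-&gt;s[r-&gt;i]; r-&gt;s[r-&gt;i] = r-&gt;s[r-&gt;j]; r-&gt;s[r-&gt;j] = t;
    return r-&gt;s[(r-&gt;s[r-&gt;i] + r-&gt;s[r-&gt;j]) % 256];
}

/* Three output bytes assembled little-endian into 24 bits. */
long
sp_squeeze24(struct sponge4 *r)
{
    long b0 = sp_squeeze(r);
    long b1 = sp_squeeze(r);
    long b2 = sp_squeeze(r);
    return b2&lt;&lt;16 | b1&lt;&lt;8 | b0;
}
</code></pre>

<p>With real shift operators, <code class="language-plaintext highlighter-rouge">sp_squeeze24</code> is a pair of shifts and ORs,
which is what the QBasic multiplications emulate.</p>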

<h4 id="putting-the-sponge-to-use">Putting the sponge to use</h4>

<p>One of the problems with the original program is that only the time of day
was a seed. Even were it mixed better, if we run the program at exactly
the same instant on two different days, we get the same seed. The <code class="language-plaintext highlighter-rouge">DATE$</code>
function returns the current date, which we can absorb into the sponge to
make the whole date part of the input.</p>

<pre><code class="language-qbasic">DIM sponge AS sponge4
init sponge
absorbstr sponge, DATE$
absorbstr sponge, MKS$(TIMER)
absorbstr sponge, MKI$(ntickets)
</code></pre>

<p>I follow this up with the timer. It’s converted to a string with <code class="language-plaintext highlighter-rouge">MKS$</code>,
which returns the little-endian, single precision binary representation as
a 4-byte string. <code class="language-plaintext highlighter-rouge">MKI$</code> does the same for INTEGER, as a 2-byte string.</p>

<p>One of the problems with the original program was bias: Multiplying <code class="language-plaintext highlighter-rouge">RND</code>
by a constant, then truncating the result to an integer is not uniform in
most cases. Some numbers are selected slightly more often than others
because 2^24 inputs cannot map uniformly onto, say, 10 outputs. With all
the shuffling in the original it probably doesn’t make a practical
difference, but I’d like to avoid it.</p>

<p>In my program I account for it by generating another number if it happens
to fall into that extra “tail” part of the input distribution (very
unlikely for small <code class="language-plaintext highlighter-rouge">ntickets</code>). The <code class="language-plaintext highlighter-rouge">squeezen</code> function uniformly
generates a number in 0 to N (exclusive).</p>

<pre><code class="language-qbasic">FUNCTION squeezen% (r AS sponge4, n AS INTEGER)
    DO
        ' MOD binds tighter than "-": subtract the size of the uneven tail
        x&amp; = squeeze24&amp;(r) - &amp;H1000000 MOD n
    LOOP WHILE x&amp; &lt; 0
    squeezen% = x&amp; MOD n
END FUNCTION
</code></pre>
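
<p>The same rejection step may be easier to follow in C with the tail
computed explicitly. This sketch assumes a caller-supplied source of uniform
24-bit values; the function-pointer interface and the names are hypothetical
stand-ins for <code class="language-plaintext highlighter-rouge">squeeze24</code>, and the toy source simply replays the 24-bit
LCG from earlier:</p>

<pre><code class="language-c">#include &lt;stdint.h&gt;

/* Return a uniform value in [0, n) given a source of uniform 24-bit
 * values. Requires 0 &lt; n &lt;= 0x1000000. */
int32_t
bounded24(int32_t (*next24)(void *), void *ctx, int32_t n)
{
    int32_t tail = 0x1000000 % n;  /* size of the uneven remainder */
    int32_t x;
    do {
        x = next24(ctx) - tail;    /* values below tail go negative */
    } while (x &lt; 0);               /* ...and are rejected */
    return x % n;
}

/* Toy 24-bit source for demonstration; ctx points to a uint32_t state. */
int32_t
lcg24(void *ctx)
{
    uint32_t *s = ctx;
    *s = (*s*0xfd43fd + 0xc39ec3) &amp; 0xffffff;
    return (int32_t)*s;
}
</code></pre>

<p>After the subtraction, the accepted values span a range whose size is an
exact multiple of <code class="language-plaintext highlighter-rouge">n</code>, so the final remainder is uniform. For
<code class="language-plaintext highlighter-rouge">n = 10</code> the tail is just 6 values out of 2^24, which is why rejections are
so rare.</p>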

<p>Finally a Fisher–Yates shuffle, then print the first N elements:</p>

<pre><code class="language-qbasic">FOR i% = ntickets - 1 TO 1 STEP -1
    j% = squeezen%(sponge, i% + 1)
    SWAP tickets(i%), tickets(j%)
NEXT

FOR i% = 1 TO nresults
    PRINT tickets(i%)
NEXT
</code></pre>

<p>Though if you really love Kris’s loop idea:</p>

<pre><code class="language-qbasic">PRINT "Press Esc to finish, any other key for entropy..."
DO
    c&amp; = c&amp; + 1
    LOCATE 2, 1
    PRINT "cycles ="; c&amp;; "; keys ="; k%

    FOR i% = ntickets - 1 TO 1 STEP -1
        j% = squeezen%(sponge, i% + 1)
        SWAP tickets(i%), tickets(j%)
    NEXT

    k$ = INKEY$
    IF k$ = CHR$(27) THEN
        EXIT DO
    ELSEIF k$ &lt;&gt; "" THEN
        k% = k% + 1
        absorbstr sponge, k$
    END IF
    absorbstr sponge, MKS$(TIMER)
LOOP
</code></pre>

<p>If you want to try it out for yourself in, say, DOSBox, here’s the full
source: <a href="https://github.com/skeeto/scratch/blob/master/sp4/sponge4.bas"><strong><code class="language-plaintext highlighter-rouge">sponge4.bas</code></strong></a></p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>I Solved British Square</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/10/19/"/>
    <id>urn:uuid:c500b91a-046f-4320-8eff-9bc8f8443ef3</id>
    <updated>2020-10-19T19:32:52Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update</em>: I <a href="/blog/2022/10/12/">solved another game</a> using essentially the same
technique.</p>

<p><a href="https://boardgamegeek.com/boardgame/3719/british-square">British Square</a> is a 1978 abstract strategy board game which I
recently discovered <a href="https://www.youtube.com/watch?v=PChKZbut3lM&amp;t=10m">from a YouTube video</a>. It’s well-suited to play
by pencil-and-paper, so my wife and I played a few rounds to try it out.
Curious about strategies, I searched online for analysis and found
nothing whatsoever, meaning I’d have to discover strategies for myself.
This is <em>exactly</em> the sort of problem that <a href="https://xkcd.com/356/">nerd snipes</a> me, and so I
sank a couple of evenings into building an analysis engine in C, enough to
fully solve the game and play <em>perfectly</em>.</p>

<p><strong>Repository</strong>: <a href="https://github.com/skeeto/british-square"><strong>British Square Analysis Engine</strong></a>
(and <a href="https://github.com/skeeto/british-square/releases">prebuilt binaries</a>)</p>

<p><a href="/img/british-square/british-square.jpg"><img src="/img/british-square/british-square-thumb.jpg" alt="" /></a>
<!-- Photo credit: Kelsey Wellons --></p>

<!--more-->

<p>The game is played on a 5-by-5 grid with two players taking turns
placing pieces of their color. Pieces may not be placed on tiles
4-adjacent to an opposing piece, and as a special rule, the first player
may not play the center tile on the first turn. Players pass when they
have no legal moves, and the game ends when both players pass. The score
is the difference between the piece counts for each player.</p>
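
<p>These placement rules map neatly onto bit operations. As a sketch, and
assuming a bitboard with one bit per tile (bit <code class="language-plaintext highlighter-rouge">y*5+x</code>), which may or may
not match how the engine represents the board, legal moves could be computed
like this. All names are illustrative:</p>

<pre><code class="language-c">#include &lt;stdint.h&gt;

#define BS_ALL    0x1ffffffu  /* all 25 tiles */
#define BS_COL0   0x0108421u  /* leftmost column */
#define BS_COL4   0x1084210u  /* rightmost column */
#define BS_CENTER (1u &lt;&lt; 12)

/* Tiles 4-adjacent to any tile in m. The column masks keep horizontal
 * shifts from wrapping onto a neighboring row. */
uint32_t
bs_neighbors(uint32_t m)
{
    return ((m &amp; ~BS_COL4) &lt;&lt; 1 | (m &amp; ~BS_COL0) &gt;&gt; 1 |
            m &lt;&lt; 5 | m &gt;&gt; 5) &amp; BS_ALL;
}

/* Tiles where the player to move may place a piece.
 * own, opp: current pieces; first: nonzero on the very first turn. */
uint32_t
bs_legal(uint32_t own, uint32_t opp, int first)
{
    uint32_t m = BS_ALL &amp; ~(own | opp) &amp; ~bs_neighbors(opp);
    return first ? m &amp; ~BS_CENTER : m;
}
</code></pre>

<p>A player must pass exactly when <code class="language-plaintext highlighter-rouge">bs_legal</code> returns zero, and the game
is over once that holds for both sides.</p>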

<p>In the default configuration, my engine takes a few seconds to explore
the full game tree, then presents the <a href="https://en.wikipedia.org/wiki/Minimax">minimax</a> values for the
current game state along with the list of perfect moves. The UI allows
manually exploring down the game tree. It’s intended for analysis, but
there’s enough UI present to “play” against the AI should you so wish.
For some of my analysis I made small modifications to the program to
print or count game states matching certain conditions.</p>

<h3 id="game-analysis">Game analysis</h3>

<p>Not accounting for symmetries, there are 4,233,789,642,926,592 possible
playouts. In these playouts, the first player wins 2,179,847,574,830,592
(~51%), the second player wins 1,174,071,341,606,400 (~28%), and the
remaining 879,870,726,489,600 (~21%) are ties. It’s immediately obvious
the first player has a huge advantage.</p>

<p>Accounting for symmetries, there are 8,659,987 total game states. Of
these, 6,955 are terminal states, of which the first player wins 3,599
(~52%) and the second player wins 2,506 (~36%). This small number of
states is what allows the engine to fully explore the game tree in a few
seconds.</p>

<p>Most importantly: <strong>The first player can always win by two points.</strong> In
other words, it’s <em>not</em> like Tic-Tac-Toe where perfect play by both
players results in a tie. Due to the two-point margin, the first player
also has more room for mistakes and usually wins even without perfect
play. There are fewer opportunities to blunder, and a single blunder
usually results in a lower win score. The second player has a narrow
lane of perfect play, making it easy to blunder.</p>

<p>Below is the minimax analysis for the first player’s options. The number
is the first player’s score given perfect play from that point — i.e.
perfect play starts on the tiles marked “2”, and the tiles marked “0”
are blunders that lead to ties.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10-01
12021
11111
</code></pre></div></div>

<p>The special center rule probably exists to reduce the first player’s
obvious advantage, but in practice it makes little difference. Without
the rule, the first player has an additional (fifth) branch for a win by
two points:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10201
12021
11111
</code></pre></div></div>

<p>Improved alternative special rule: <strong>Bias the score by two in favor of
the second player.</strong> This fully eliminates the first player’s advantage,
perfect play by both sides results in a tie, and both players have a
narrow lane of perfect play.</p>

<p>The four tie openers are interesting because the reasoning does not
require computer assistance. If the first player opens on any of those
tiles, the second player can mirror each of the first player’s moves,
guaranteeing a tie. Note: The first player can still make mistakes that
result in a second player win <em>if</em> the second player knows when to stop
mirroring.</p>

<p>One of my goals was to develop a heuristic so that even human players
can play perfectly from memory, as in Tic-Tac-Toe. Unfortunately I was
not able to develop any such heuristic, though I <em>was</em> able to prove
that <strong>a greedy heuristic — always claim as much territory as possible —
is often incorrect</strong> and, in some cases, leads to blunders.</p>

<h3 id="engine-implementation">Engine implementation</h3>

<p>As <a href="/blog/2017/04/27/">I’ve done before</a>, my engine represents the game using
<a href="https://www.chessprogramming.org/Bitboards">bitboards</a>. Each player has a 25-bit bitboard representing their
pieces. To make move validation more efficient, it also sometimes tracks
a “mask” bitboard where invalid moves have been masked. Updating all
bitboards is cheap (<code class="language-plaintext highlighter-rouge">place()</code>, <code class="language-plaintext highlighter-rouge">mask()</code>), as is validating moves
against the mask (<code class="language-plaintext highlighter-rouge">valid()</code>).</p>
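
<p>To make the adjacency rule concrete, here is a minimal sketch of move
validation on a single 25-bit board. It assumes tile <code class="language-plaintext highlighter-rouge">i</code> lives at bit
<code class="language-plaintext highlighter-rouge">i</code> in row-major order. The names <code class="language-plaintext highlighter-rouge">neighbors()</code> and
<code class="language-plaintext highlighter-rouge">legal_moves()</code> are hypothetical, not the engine’s own <code class="language-plaintext highlighter-rouge">mask()</code> and
<code class="language-plaintext highlighter-rouge">valid()</code>, and the first-turn center rule is left out.</p>

```c
#include <stdint.h>

#define BOARD 0x1ffffffu  // all 25 tiles
#define COL0  0x0108421u  // leftmost column: bits 0, 5, 10, 15, 20
#define COL4  0x1084210u  // rightmost column: bits 4, 9, 14, 19, 24

// Tiles 4-adjacent to any piece in b. The column masks keep horizontal
// shifts from wrapping a piece around to the neighboring row.
static uint32_t
neighbors(uint32_t b)
{
    uint32_t up    = b >> 5;
    uint32_t down  = b << 5;
    uint32_t left  = (b & ~COL0) >> 1;
    uint32_t right = (b & ~COL4) << 1;
    return (up | down | left | right) & BOARD;
}

// Tiles where a player may place a piece: unoccupied and not 4-adjacent
// to any opposing piece.
static uint32_t
legal_moves(uint32_t mine, uint32_t theirs)
{
    return BOARD & ~(mine | theirs | neighbors(theirs));
}
```

<p>With an opposing piece on tile 6, for example, <code class="language-plaintext highlighter-rouge">legal_moves()</code> excludes
tiles 1, 5, 6, 7, and 11. A player with no bits set in the result must
pass.</p>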

<p>The longest possible game is 32 moves. This would <em>just</em> fit in 5 bits,
except that I needed a special “invalid” turn, making it a total of 33
values. So I use 6 bits to store the turn counter.</p>

<p>Besides generally being unnecessary, the validation masks can be derived
from the main bitboards, so I don’t need to store them in the game tree.
That means I need 25 bits per player, and 6 bits for the counter: <strong>56
bits total</strong>. I pack these into a 64-bit integer. The first player’s
bitboard goes in the bottom 25 bits, the second player in the next 25
bits, and the turn counter in the topmost 6 bits. The turn counter
starts at 1, so an all zero state is invalid. I exploit this in the hash
table so that zeroed slots are empty (more on this later).</p>

<p>In other words, the <em>empty</em> state is <code class="language-plaintext highlighter-rouge">0x4000000000000</code> (<code class="language-plaintext highlighter-rouge">INIT</code>) and zero
is the null (invalid) state.</p>
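
<p>As a rough illustration of this layout, the following sketch packs and
unpacks the three fields. The accessor names are hypothetical and chosen
for this example. Only the bit positions and the value of <code class="language-plaintext highlighter-rouge">INIT</code> come
from the description above.</p>

```c
#include <assert.h>
#include <stdint.h>

#define BOARD_BITS 25
#define BOARD_MASK ((UINT64_C(1) << BOARD_BITS) - 1)
#define TURN_SHIFT (2 * BOARD_BITS)             // turn counter starts at bit 50
#define INIT       (UINT64_C(1) << TURN_SHIFT)  // 0x4000000000000: empty board, turn 1

// Hypothetical helpers for illustration, not the engine's actual names.
static uint64_t
state_pack(uint32_t p1, uint32_t p2, int turn)
{
    assert(turn >= 0 && turn < 64);  // must fit in 6 bits
    return ((uint64_t)p1 & BOARD_MASK)
         | (((uint64_t)p2 & BOARD_MASK) << BOARD_BITS)
         | ((uint64_t)turn << TURN_SHIFT);
}

static uint32_t state_p1(uint64_t s)   { return (uint32_t)(s & BOARD_MASK); }
static uint32_t state_p2(uint64_t s)   { return (uint32_t)((s >> BOARD_BITS) & BOARD_MASK); }
static int      state_turn(uint64_t s) { return (int)((s >> TURN_SHIFT) & 0x3f); }
```

<p>Here <code class="language-plaintext highlighter-rouge">state_pack(0, 0, 1)</code> equals <code class="language-plaintext highlighter-rouge">INIT</code>, and no valid state is ever
zero because the turn field is always at least 1.</p>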

<p>Since the state is so small, rather than passing a pointer to a state to
be acted upon, bitboard functions return a new bitboard with the
requested changes… functional style.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// Compute bitboard+mask where first play is tile 6</span>
    <span class="c1">// -----</span>
    <span class="c1">// -X---</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="kt">uint64_t</span> <span class="n">b</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">place</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">mask</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
</code></pre></div></div>

<h4 id="minimax-costs">Minimax costs</h4>

<p>The engine uses minimax to propagate information up the tree. Since the
search extends to the very bottom of the tree, the minimax “heuristic”
evaluation function is the actual score, not an approximation, which is
why it’s able to play perfectly.</p>

<p>When <a href="/blog/2010/10/17/">I’ve used minimax before</a>, I built an actual tree data
structure in memory, linking states by pointer / reference. In this
engine there is no such linkage, and instead the links are computed
dynamically via the validation masks. Storing the pointers is more
expensive than computing their equivalents on the fly, <em>so I don’t store
them</em>. Therefore my game tree only requires 56 bits per node — or 64
bits in practice since I’m using a 64-bit integer. With only 8,659,987
nodes to store, that’s a mere 66MiB of memory! This analysis could have
easily been done on commodity hardware two decades ago.</p>

<p>What about the minimax values? Game scores range from -10 to 11: 22
distinct values. (That the first player can score up to 11 and the
second player at most 10 is another advantage to going first.) That’s 5
bits of information. However, I didn’t have this information up front,
and so I assumed a range from -25 to 25, which requires 6 bits.</p>

<p>There are still 8 spare bits left in the 64-bit integer, so I use 6 of
them for the minimax score. Rather than worry about two’s complement, I
bias the score to eliminate negative values before storing it. So the
minimax score rides along for free above the state bits.</p>
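
<p>A minimal sketch of the biasing, assuming the score sits in the 6 bits
starting at bit 56 and the bias is 25, so that -25..25 maps to 0..50. The
helper names, the exact bit position, and the bias constant are my
assumptions for this example.</p>

```c
#include <assert.h>
#include <stdint.h>

#define STATE_MASK  UINT64_C(0xffffffffffffff)  // low 56 bits: boards + turn
#define SCORE_SHIFT 56
#define SCORE_BIAS  25                          // assumed: maps -25..25 to 0..50

// Attach a biased score above the 56 state bits.
static uint64_t
score_store(uint64_t state, int score)
{
    assert(score >= -SCORE_BIAS && score <= SCORE_BIAS);
    return (state & STATE_MASK)
         | ((uint64_t)(score + SCORE_BIAS) << SCORE_SHIFT);
}

// Recover the signed score from a stored entry.
static int
score_load(uint64_t entry)
{
    return (int)((entry >> SCORE_SHIFT) & 0x3f) - SCORE_BIAS;
}
```

<p>One consequence of this particular bias: a score of -25 is stored as
zero bits, which is harmless since the state bits themselves are never
zero.</p>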

<h4 id="hash-table-memoization">Hash table (memoization)</h4>

<p>The vast majority of game tree branches are redundant. Even without
taking symmetries into account, nearly all states are reachable from
multiple branches. Exploring all these redundant branches would take
centuries. If I run into a state I’ve seen before, I don’t want to
recompute it.</p>

<p>Once I’ve computed a result, I store it in a hash table so that I can
find it later. Since the state is just a 64-bit integer, I use <a href="/blog/2018/07/31/">an
integer hash function</a> to compute a starting index from which to
linearly probe an open addressing hash table. The <em>entire</em> hash table
implementation is literally a dozen lines of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="o">*</span>
<span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">bitboard</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">table</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="mh">0xffffffffffffff</span><span class="p">;</span> <span class="c1">// sans minimax</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">bitboard</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">*=</span> <span class="mh">0xcca1cee435c5048f</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">^=</span> <span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span> <span class="o">%</span> <span class="n">N</span><span class="p">;</span> <span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="n">bitboard</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the bitboard is not found, it returns a pointer to the (zero-valued)
slot where it should go so that the caller can fill it in.</p>
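
<p>Here is a sketch of how a caller might use that returned pointer. The
<code class="language-plaintext highlighter-rouge">lookup()</code> body is repeated so the example stands alone, with the masked
comparison parenthesized because <code class="language-plaintext highlighter-rouge">==</code> binds tighter than <code class="language-plaintext highlighter-rouge">&amp;</code> in C. The
table size <code class="language-plaintext highlighter-rouge">N</code>, the <code class="language-plaintext highlighter-rouge">remember()</code> helper, and the score encoding (bias of
25 at bit 56) are assumptions made for this demo.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define N (1 << 10)  // assumed demo capacity; a real table must hold every state

static uint64_t *
lookup(uint64_t bitboard)
{
    static uint64_t table[N];
    uint64_t mask = 0xffffffffffffff; // sans minimax
    uint64_t hash = bitboard;
    hash *= 0xcca1cee435c5048f;
    hash ^= hash >> 32;
    for (size_t i = hash % N; ; i = (i + 1) % N) {
        if (!table[i] || (table[i] & mask) == bitboard) {
            return &table[i];
        }
    }
}

// Hypothetical caller: store a score for a state unless one is already
// there. Returns 1 if the slot was empty and is now filled, 0 otherwise.
// The state must be nonzero and fit in the low 56 bits.
static int
remember(uint64_t state, int score)
{
    uint64_t *slot = lookup(state);
    if (*slot) {
        return 0;  // already memoized
    }
    *slot = state | ((uint64_t)(score + 25) << 56);
    return 1;
}
```

<p>Since probing only stops at a match or an empty slot, the table must
never fill completely, or a lookup for a missing state would loop
forever.</p>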

<h4 id="canonicalization">Canonicalization</h4>

<p>Memoization eliminates nearly all redundancy, but there’s still a major
optimization left. Many states are equivalent by rotation or reflection.
Taking that into account, about 7/8ths of the remaining work can still be
eliminated.</p>

<p>Multiple different states that are identical by symmetry must be
somehow “folded” into a single, <em>canonical</em> state to represent them all.
I do this by visiting all 8 rotations and reflections and choosing the
one with the smallest 64-bit integer representation.</p>

<p>I only need two operations to visit all 8 symmetries, and I chose
transpose (flip around the diagonal) and vertical flip. Alternating
between these operations visits each symmetry. Since they’re bitboards,
transforms can be implemented using <a href="https://www.chessprogramming.org/Flipping_Mirroring_and_Rotating">fancy bit-twiddling hacks</a>.
Chess boards, with their power-of-two dimensions, have useful properties
which these British Square boards lack, so this is the best I could come
up with:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Transpose a board or mask (flip along the diagonal).</span>
<span class="kt">uint64_t</span>
<span class="nf">transpose</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000020000010</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000410000208</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00008208004104</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00104104082082</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfe082083041041</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x01041040820820</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00820800410400</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00410000208000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00200000100000</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// Flip a board or mask vertically.</span>
<span class="kt">uint64_t</span>
<span class="nf">flipv</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0000003e00001f</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x000007c00003e0</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfc00f800007c00</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x001f00000f8000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x03e00001f00000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These transform both players’ bitboards in parallel while leaving the
turn counter intact. The logic here is quite simple: Shift the bitboard
a little bit at a time while using a mask to deposit bits in their new
home once they’re lined up. It’s like a coin sorter. Vertical flip is
analogous to byte-swapping, though with 5-bit “bytes”.</p>
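
<p>Mask constants like these are easy to get wrong, so a slow reference
version is handy for cross-checking. The sketch below rebuilds a state one
tile at a time, assuming row-major tile numbering. Like the masks above,
it keeps only the turn counter (bits 50 to 55) from the upper bits. The
function names are hypothetical and not part of the engine.</p>

```c
#include <stdint.h>

#define TURN_BITS (UINT64_C(0x3f) << 50)

// Rebuild a state tile by tile. If transposed is nonzero, tile (r, c)
// moves to (c, r); otherwise the board is flipped vertically, moving
// tile (r, c) to (4 - r, c). Both players' boards are handled.
static uint64_t
remap_ref(uint64_t b, int transposed)
{
    uint64_t out = b & TURN_BITS;
    for (int p = 0; p < 2; p++) {
        for (int r = 0; r < 5; r++) {
            for (int c = 0; c < 5; c++) {
                int src = 25*p + 5*r + c;
                int dst = transposed ? 25*p + 5*c + r
                                     : 25*p + 5*(4 - r) + c;
                if ((b >> src) & 1) {
                    out |= UINT64_C(1) << dst;
                }
            }
        }
    }
    return out;
}

static uint64_t transpose_ref(uint64_t b) { return remap_ref(b, 1); }
static uint64_t flipv_ref(uint64_t b)     { return remap_ref(b, 0); }
```

<p>Comparing the fast and slow versions over random 56-bit inputs is a
cheap way to catch a mistyped mask.</p>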

<p>Canonicalizing a bitboard now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">canonicalize</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">c</span> <span class="o">=</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Callers need only use <code class="language-plaintext highlighter-rouge">canonicalize()</code> on values they pass to <code class="language-plaintext highlighter-rouge">lookup()</code>
or store in the table (via the returned pointer).</p>

<h3 id="developing-a-heuristic">Developing a heuristic</h3>

<p>If you can come up with a perfect play heuristic, especially one that
can be reasonably performed by humans, I’d like to hear it. My engine
has a built-in heuristic tester, so I can test it against perfect play
at all possible game positions to check that it actually works. It’s
currently programmed to test the greedy heuristic and print out the
millions of cases where it fails. Even a heuristic that fails in only a
small number of cases would be pretty reasonable.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>w64devkit: (Almost) Everything You Need</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/09/25/"/>
    <id>urn:uuid:e594c82d-a2e1-4035-8527-1b998045ceeb</id>
    <updated>2020-09-25T00:04:11Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="rant"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=24586556">on Hacker News</a>.</em></p>

<p><a href="/blog/2020/05/15/">This past May</a> I put together my own C and C++ development
distribution for Windows called <a href="https://github.com/skeeto/w64devkit"><strong>w64devkit</strong></a>. The <em>entire</em>
release weighs under 80MB and requires no installation. Unzip and run it
in-place anywhere. It’s also entirely offline. It will never
automatically update, or even touch the network. In mere seconds any
Windows system can become a reliable development machine. (To further
increase reliability, <a href="https://jacquesmattheij.com/why-johnny-wont-upgrade/">disconnect it from the internet</a>.) Despite
its simple nature and small packaging, w64devkit is <em>almost</em> everything
you need to develop <em>any</em> professional desktop application, from a
command line utility to a AAA game.</p>

<!--more-->

<p>I don’t mean this in some <a href="/blog/2016/04/30/">useless Turing-complete sense</a>, but in
a practical, <em>get-stuff-done</em> sense. It’s much more a matter of
<em>know-how</em> than of tools or libraries. So then what is this “almost”
about?</p>

<ul>
  <li>
    <p>The distribution does not have WinAPI documentation. It’s notoriously
<a href="http://laurencejackson.com/win32/">difficult to obtain</a> and, besides, unfriendly to redistribution.
It’s essential for interfacing with the operating system and difficult
to work without. Even a dead tree reference book would suffice.</p>
  </li>
  <li>
    <p>Depending on what you’re building, you may still need specialized
tools. For instance, game development requires <a href="https://www.blender.org/">tools for editing art
assets</a>.</p>
  </li>
  <li>
    <p>There is no formal source control system. Git is excluded per the
issues noted in the announcement, and my next option, <a href="https://wiki.debian.org/UsingQuilt">Quilt</a>,
has similar limitations. However, <code class="language-plaintext highlighter-rouge">diff</code> and <code class="language-plaintext highlighter-rouge">patch</code> <em>are</em> included,
and are sufficient for a kind of old-school, patch-based source
control. I’ve used it successfully when dogfooding w64devkit in a
fresh Windows installation.</p>
  </li>
</ul>

<h3 id="everything-else">Everything else</h3>

<p>As I said in my announcement, w64devkit includes a powerful text editor
that fulfills all text editing needs, from code to documentation. The
editor includes a tutorial (<code class="language-plaintext highlighter-rouge">vimtutor</code>) and complete, built-in manual
(<code class="language-plaintext highlighter-rouge">:help</code>) in case you’re not yet familiar with it.</p>

<p>What about navigation? Use the included <a href="https://github.com/universal-ctags/ctags">ctags</a> to generate a
tags database (<code class="language-plaintext highlighter-rouge">ctags -R</code>), then <a href="http://vimdoc.sourceforge.net/htmldoc/tagsrch.html#tagsrch.txt">jump instantly</a> to any
definition at any time. No need for <a href="https://old.reddit.com/r/vim/comments/b3yzq4/a_lsp_client_maintainers_view_of_the_lsp_protocol/">that Language Server Protocol
rubbish</a>. This does not mean you must laboriously type identifiers
as you work. Use <a href="https://georgebrock.github.io/talks/vim-completion/">built-in completion</a>!</p>

<p>Build system? That’s also covered, via a Windows-aware unix-like
environment that includes <code class="language-plaintext highlighter-rouge">make</code>. <a href="/blog/2017/08/20/">Learning how to use it</a> is a
breeze. Software is by its nature unavoidably complicated, so <a href="/blog/2017/03/30/">don’t
make it more complicated than necessary</a>.</p>

<p>What about debugging? Use the debugger, GDB. Performance problems? Use
the profiler, gprof. Inspect compiler output either by asking for it
(<code class="language-plaintext highlighter-rouge">-S</code>) or via the disassembler (<code class="language-plaintext highlighter-rouge">objdump -d</code>). No need to go online for
the <a href="https://godbolt.org/">Godbolt Compiler Explorer</a>, as slick as it is. If the compiler
output is insufficient, use <a href="/blog/2015/07/10/">SIMD intrinsics</a>. In the worst case
there are two different assemblers available. Real time graphics? Use an
operating system API like OpenGL, DirectX, or Vulkan.</p>

<p>w64devkit <em>really is</em> nearly everything you need in a <a href="https://www.youtube.com/watch?v=W3ml7cO96F0&amp;t=1h25m50s">single, no
nonsense, fully-<em>offline</em> package</a>! It’s difficult to emphasize this
point as much as I’d like. When interacting with the broader software
ecosystem, I often despair that <a href="https://www.youtube.com/watch?v=ZSRHeXYDLko">software development has lost its
way</a>. This distribution is my way of carving out an escape from some
of the insanity. As a C and C++ toolchain, w64devkit by default produces
lean, sane, trivially-distributable, offline-friendly artifacts. All
runtime components in the distribution are <a href="https://drewdevault.com/dynlib">static link only</a>,
so no need to distribute DLLs with your application either.</p>

<h3 id="customize-the-distribution-own-the-toolchain">Customize the distribution, own the toolchain</h3>

<p>While most users would likely stick to my published releases, building
w64devkit is a two-step process with a single build dependency, Docker.
Anyone can easily customize it for their own needs. Don’t care about
C++? Toss it to shave 20% off the distribution. Need to tune the runtime
for a specific microarchitecture? Tweak the compiler flags.</p>

<p>One of the intended strengths of open source is users can modify
software to suit their needs. With w64devkit, you <em>own the toolchain</em>
itself. It is <a href="https://research.swtch.com/deps">one of your dependencies</a> after all. Unfortunately
the build initially requires an internet connection even when working
from source tarballs, but at least it’s a one-time event.</p>

<p>If you choose to <a href="https://github.com/nothings/stb">take on dependencies</a>, and you build those
dependencies using w64devkit, all the better! You can tweak them to your
needs and choose precisely how they’re built. You won’t be relying on
the goodwill of internet randos nor the generosity of a free package
registry.</p>

<h3 id="customization-examples">Customization examples</h3>

<p>Building existing software using w64devkit is probably easier than
expected, particularly since much of it has already been “ported” to
MinGW and Mingw-w64. Just don’t bother with GNU Autoconf configure
scripts. They never work in w64devkit despite having everything they
technically need. That aside, here’s a demonstration of building
some popular software.</p>

<p>One of <a href="/blog/2016/09/02/">my coworkers</a> uses his own version of <a href="https://www.chiark.greenend.org.uk/~sgtatham/putty/">PuTTY</a>
patched to play more nicely with Emacs. If you wanted to do the same,
grab the source tarball, unpack it using the provided tools, then in the
unpacked source:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make -C windows -f Makefile.mgw
</code></pre></div></div>

<p>You’ll have a custom-built putty.exe, as well as the other tools. If you
have any patches, apply those first!</p>

<p>Would you like to embed an extension language in your application? Lua
is a solid choice, in part because it’s such a well-behaved dependency.
After unpacking the source tarball:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make PLAT=mingw
</code></pre></div></div>

<p>This produces a complete Lua compiler, runtime, and library. It’s not
even necessary to use the Makefile, as it’s nearly as simple as “<code class="language-plaintext highlighter-rouge">cc
*.c</code>” — painless to integrate or embed into any project.</p>

<p>Do you enjoy NetHack? Perhaps you’d like to <a href="https://bilious.alt.org/">try a few of the custom
patches</a>. This one is a little more complicated, but I was able to
build NetHack 3.6.6 like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sys/winnt/nhsetup.bat
$ make -C src -f Makefile.gcc cc="cc -fcommon" link="cc"
</code></pre></div></div>

<p>NetHack has <a href="https://wiki.gentoo.org/wiki/Gcc_10_porting_notes/fno_common">a bug necessitating <code class="language-plaintext highlighter-rouge">-fcommon</code></a>. If you have any
patches, apply them with <code class="language-plaintext highlighter-rouge">patch</code> before the last step. I won’t belabor it
here, but with just a little more effort I was also able to produce a
NetHack binary with curses support via <a href="https://pdcurses.org/">PDCurses</a> — statically-linked
of course.</p>

<p>How about my archive encryption tool, <a href="https://github.com/skeeto/enchive">Enchive</a>? The one that
<a href="/blog/2018/04/13/">even works with 16-bit DOS compilers</a>. It requires nothing special
at all!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make
</code></pre></div></div>

<p>w64devkit can also host parts of itself: Universal Ctags, Vim, and NASM.
This means you can modify and recompile these tools without going
through the Docker build. Sadly <a href="https://frippery.org/busybox/">busybox-w32</a> cannot host itself,
though it’s close. I’d <em>love</em> it if w64devkit could fully host itself, and
so Docker — and therefore an internet connection and such — would only
be needed to bootstrap, but unfortunately that’s not realistic given the
state of the GNU components.</p>

<h3 id="offline-and-reliable">Offline and reliable</h3>

<p>Software development has increasingly become <a href="https://deftly.net/posts/2017-06-01-measuring-the-weight-of-an-electron.html">dependent on a constant
internet connection</a>. Robust, offline tooling and development are
undervalued.</p>

<p>Consider: Does your current project depend on an external service? Do
you pay for this service to ensure that it remains up? If you pull your
dependencies from a repository, how much do you trust those who maintain
the packages? <a href="https://drewdevault.com/2020/02/06/Dependencies-and-maintainers.html">Do you even know their names?</a> What would be your
project’s fate if that service went down permanently? It will someday,
though hopefully only after your project is dead and forgotten. If you
have the ability to work permanently offline, then you already have
happy answers to all these questions.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Asynchronously Opening and Closing Files in Asyncio</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/09/04/"/>
    <id>urn:uuid:ae94da45-f65d-4c72-a10e-9e421ea843ec</id>
    <updated>2020-09-04T01:36:20Z</updated>
    <category term="c"/><category term="linux"/><category term="python"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p>Python <a href="https://docs.python.org/3/library/asyncio.html">asyncio</a> has support for asynchronous networking,
subprocesses, and interprocess communication. However, it has nothing
for asynchronous file operations — opening, reading, writing, or
closing. This is likely in part because operating systems themselves
also lack these facilities. If a file operation takes a long time,
perhaps because the file is on a network mount, then the entire Python
process will hang. It’s possible to work around this, so let’s build a
utility that can asynchronously open and close files.</p>

<p>The usual way to work around the lack of operating system support for a
particular asynchronous operation is to <a href="http://docs.libuv.org/en/v1.x/design.html#file-i-o">dedicate threads to waiting on
those operations</a>. By using a thread pool, we can even avoid the
overhead of spawning threads when we need them. Plus asyncio is designed
to play nicely with thread pools anyway.</p>

<h3 id="test-setup">Test setup</h3>

<p>Before we get started, we’ll need some way to test that it’s working. We
need a slow file system. One thought is to <a href="/blog/2018/06/23/">use ptrace to intercept the
relevant system calls</a>, though this isn’t quite so simple. The
other threads need to continue running while the thread waiting on
<code class="language-plaintext highlighter-rouge">open(2)</code> is paused, but ptrace pauses the whole process. Fortunately
there’s a simpler solution anyway: <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code>.</p>

<p>Setting the <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code> environment variable to the name of a shared
object will cause the loader to load this shared object ahead of
everything else, allowing that shared object to override other
libraries. I’m on x86-64 Linux (Debian), and so I’m looking to override
<code class="language-plaintext highlighter-rouge">open64(2)</code> in glibc. Here’s my <code class="language-plaintext highlighter-rouge">open64.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include</span> <span class="cpf">&lt;dlfcn.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span>
<span class="nf">open64</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">int</span> <span class="n">mode</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">"/tmp/"</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">)</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">RTLD_NEXT</span><span class="p">,</span> <span class="s">"open64"</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">f</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">mode</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now Python must go through my C function when it opens files. If the
file resides under <code class="language-plaintext highlighter-rouge">/tmp/</code>, opening the file will be delayed by 3
seconds. Since I still want to actually open a file, I use <code class="language-plaintext highlighter-rouge">dlsym()</code> to
access the <em>real</em> <code class="language-plaintext highlighter-rouge">open64()</code> in glibc. I build it like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -fPIC -o open64.so open64.c -ldl
</code></pre></div></div>

<p>And to test that it works with Python, let’s time how long it takes to
open <code class="language-plaintext highlighter-rouge">/tmp/x</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ touch /tmp/x
$ time LD_PRELOAD=./open64.so python3 -c 'open("/tmp/x")'

real    0m3.021s
user    0m0.014s
sys     0m0.005s
</code></pre></div></div>

<p>Perfect! (Note: It’s a little strange putting <code class="language-plaintext highlighter-rouge">time</code> <em>before</em> setting the
environment variable, but that works because I’m using Bash, where <code class="language-plaintext highlighter-rouge">time</code> is
a shell keyword rather than a regular command.)</p>

<h3 id="thread-pools">Thread pools</h3>

<p>Python’s standard <code class="language-plaintext highlighter-rouge">open()</code> is most commonly used as a <em>context manager</em>
so that the file is automatically closed no matter what happens.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'output.txt'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'hello world'</span><span class="p">,</span> <span class="nb">file</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
</code></pre></div></div>

<p>I’d like my asynchronous open to follow this pattern using <a href="https://www.python.org/dev/peps/pep-0492/"><code class="language-plaintext highlighter-rouge">async
with</code></a>. It’s like <code class="language-plaintext highlighter-rouge">with</code>, but the context manager is acquired and
released asynchronously. I’ll call my version <code class="language-plaintext highlighter-rouge">aopen()</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">with</span> <span class="n">aopen</span><span class="p">(</span><span class="s">'output.txt'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
    <span class="p">...</span>
</code></pre></div></div>

<p>So <code class="language-plaintext highlighter-rouge">aopen()</code> will need to return an <em>asynchronous context manager</em>, an
object with methods <code class="language-plaintext highlighter-rouge">__aenter__</code> and <code class="language-plaintext highlighter-rouge">__aexit__</code> that both return
<a href="https://docs.python.org/3/glossary.html#term-awaitable"><em>awaitables</em></a>. Usually this is by virtue of these methods being
<a href="https://docs.python.org/3/glossary.html#term-coroutine-function"><em>coroutine functions</em></a>, but a normal function that directly returns
an awaitable also works, which is what I’ll be doing for <code class="language-plaintext highlighter-rouge">__aenter__</code>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">_AsyncOpen</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">):</span>
        <span class="p">...</span>

    <span class="k">def</span> <span class="nf">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="p">...</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exc_type</span><span class="p">,</span> <span class="n">exc</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="p">...</span>
</code></pre></div></div>

<p>Ultimately we have to call <code class="language-plaintext highlighter-rouge">open()</code>. The arguments for <code class="language-plaintext highlighter-rouge">open()</code> will be
given to the constructor to be used later. This will make more sense
when you see the definition for <code class="language-plaintext highlighter-rouge">aopen()</code>.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_args</span> <span class="o">=</span> <span class="n">args</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_kwargs</span> <span class="o">=</span> <span class="n">kwargs</span>
</code></pre></div></div>

<p>When it’s time to actually open the file, Python will call <code class="language-plaintext highlighter-rouge">__aenter__</code>.
We can’t call <code class="language-plaintext highlighter-rouge">open()</code> directly since that will block, so we’ll use a
thread pool to wait on it. Rather than create a thread pool, we’ll use
the one that comes with the current event loop. The <code class="language-plaintext highlighter-rouge">run_in_executor()</code>
method runs a function in a thread pool — where <code class="language-plaintext highlighter-rouge">None</code> means use the
default pool — returning an asyncio future representing the future
result, in this case the opened file object.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">def</span> <span class="nf">thread_open</span><span class="p">():</span>
            <span class="k">return</span> <span class="nb">open</span><span class="p">(</span><span class="o">*</span><span class="bp">self</span><span class="p">.</span><span class="n">_args</span><span class="p">,</span> <span class="o">**</span><span class="bp">self</span><span class="p">.</span><span class="n">_kwargs</span><span class="p">)</span>
        <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_future</span> <span class="o">=</span> <span class="n">loop</span><span class="p">.</span><span class="n">run_in_executor</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">thread_open</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_future</span>
</code></pre></div></div>

<p>Since this <code class="language-plaintext highlighter-rouge">__aenter__</code> is not a coroutine function, it returns the
future directly as its awaitable result. The caller will await it.</p>

<p>The default thread pool is limited in size. Since Python 3.8 the limit
is roughly the core count plus four, capped at 32. That’s a sensible
general-purpose default, but it can be too small for heavily I/O-bound
work. In a real program we may want to use a larger thread pool.</p>
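<p>As a minimal sketch of that idea, a <code class="language-plaintext highlighter-rouge">concurrent.futures.ThreadPoolExecutor</code> can be passed to <code class="language-plaintext highlighter-rouge">run_in_executor()</code> in place of <code class="language-plaintext highlighter-rouge">None</code>. The names <code class="language-plaintext highlighter-rouge">open_in_pool()</code>, <code class="language-plaintext highlighter-rouge">demo()</code>, and the pool size of 64 are assumptions chosen for illustration, not part of the utility built here.</p>

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool size; tune it to the expected number of concurrent opens.
IO_THREADS = 64

async def open_in_pool(pool, *args, **kwargs):
    """Open a file on a thread from the given pool instead of the default."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, lambda: open(*args, **kwargs))

async def demo(path):
    """Open path through a dedicated pool, write a line, then close it."""
    with ThreadPoolExecutor(max_workers=IO_THREADS) as pool:
        f = await open_in_pool(pool, path, 'w')
        try:
            print('hello world', file=f)
        finally:
            f.close()
```

<p>A long-lived program would more likely create the pool once at startup and share it, rather than build one per call as this demo does.</p>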

<p>Closing a file may block, so we’ll do that in a thread pool as well.
First pull the file object <a href="/blog/2020/07/30/">from the future</a>, then close it in the
thread pool, waiting until the file has actually closed:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">async</span> <span class="k">def</span> <span class="nf">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exc_type</span><span class="p">,</span> <span class="n">exc</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="nb">file</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="p">.</span><span class="n">_future</span>
        <span class="k">def</span> <span class="nf">thread_close</span><span class="p">():</span>
            <span class="nb">file</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
        <span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">get_event_loop</span><span class="p">()</span>
        <span class="k">await</span> <span class="n">loop</span><span class="p">.</span><span class="n">run_in_executor</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">thread_close</span><span class="p">)</span>
</code></pre></div></div>

<p>The open and close are paired in this context manager, but it may be
concurrent with an arbitrary number of other <code class="language-plaintext highlighter-rouge">_AsyncOpen</code> context
managers. There will be some upper limit to the number of open files, so
<strong>we need to be careful not to use too many of these things
concurrently</strong>, something <a href="/blog/2020/05/24/">which easily happens when using unbounded
queues</a>. Lacking back pressure, all it takes is for tasks to be
opening files slightly faster than they close them.</p>
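<p>One possible way to add that back pressure is an <code class="language-plaintext highlighter-rouge">asyncio.Semaphore</code> that caps how many files are open at once. The sketch below calls <code class="language-plaintext highlighter-rouge">run_in_executor()</code> directly as a stand-in for <code class="language-plaintext highlighter-rouge">aopen()</code> so that it is self-contained. The names <code class="language-plaintext highlighter-rouge">bounded_write()</code>, <code class="language-plaintext highlighter-rouge">write_all()</code>, and <code class="language-plaintext highlighter-rouge">MAX_OPEN</code> are hypothetical.</p>

```python
import asyncio

# Hypothetical cap on simultaneously open files.
MAX_OPEN = 2

async def bounded_write(limit, path, text):
    """Write text to path while holding one slot of the semaphore."""
    async with limit:
        loop = asyncio.get_running_loop()
        # Open and close on the default thread pool, as aopen() would.
        f = await loop.run_in_executor(None, open, path, 'w')
        try:
            f.write(text)
        finally:
            await loop.run_in_executor(None, f.close)

async def write_all(directory, count):
    """Start `count` writers, of which at most MAX_OPEN hold a file at once."""
    limit = asyncio.Semaphore(MAX_OPEN)
    jobs = [bounded_write(limit, f'{directory}/{i}', f'{i}\n')
            for i in range(count)]
    await asyncio.gather(*jobs)
```

<p>Tasks beyond the cap wait at the semaphore instead of piling up open file descriptors.</p>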

<p>With all the hard work done, the definition for <code class="language-plaintext highlighter-rouge">aopen()</code> is trivial:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">aopen</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">_AsyncOpen</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">)</span>
</code></pre></div></div>

<p>That’s it! Let’s try it out with the <code class="language-plaintext highlighter-rouge">LD_PRELOAD</code> test.</p>

<h3 id="a-test-drive">A test drive</h3>

<p>First define a “heartbeat” task that will tell us the asyncio loop is
still chugging away while we wait on opening the file.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">heartbeat</span><span class="p">():</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">'HEARTBEAT'</span><span class="p">)</span>
</code></pre></div></div>

<p>Here’s a test function for <code class="language-plaintext highlighter-rouge">aopen()</code> that asynchronously opens a file
under <code class="language-plaintext highlighter-rouge">/tmp/</code> named by an integer, (synchronously) writes that integer
to the file, then asynchronously closes it.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">write</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">aopen</span><span class="p">(</span><span class="sa">f</span><span class="s">'/tmp/</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="nb">file</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">main()</code> function creates the heartbeat task and opens 4 files
concurrently through the intercepted file opening routine:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">beat</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">heartbeat</span><span class="p">())</span>
    <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">asyncio</span><span class="p">.</span><span class="n">create_task</span><span class="p">(</span><span class="n">write</span><span class="p">(</span><span class="n">i</span><span class="p">))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span>
    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
    <span class="n">beat</span><span class="p">.</span><span class="n">cancel</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>The result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ LD_PRELOAD=./open64.so python3 aopen.py
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
$ cat /tmp/{0,1,2,3}
0
1
2
3
</code></pre></div></div>

<p>As expected, there are 6 heartbeats, corresponding to the 3 seconds that all 4 tasks
spent concurrently waiting on the intercepted <code class="language-plaintext highlighter-rouge">open()</code>. Here’s the full
source if you want to try it out for yourself:</p>

<p><a href="https://gist.github.com/skeeto/89af673a0a0d24de32ad19ee505c8dbd">https://gist.github.com/skeeto/89af673a0a0d24de32ad19ee505c8dbd</a></p>

<h3 id="caveat-no-asynchronous-reads-and-writes">Caveat: no asynchronous reads and writes</h3>

<p><em>Only</em> opening and closing the file is asynchronous. Reads and writes are
unchanged, still fully synchronous and blocking, so this is only a half
solution. A full solution is not nearly as simple because asyncio is
built on async/await. Asynchronous reads and writes would require all new APIs
<a href="https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/">with different coloring</a>. You’d need an <code class="language-plaintext highlighter-rouge">aprint()</code> to complement
<code class="language-plaintext highlighter-rouge">print()</code>, and so on, each returning an <code class="language-plaintext highlighter-rouge">awaitable</code> to be awaited.</p>
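<p>To make the coloring problem concrete, here is a rough sketch of what a hypothetical <code class="language-plaintext highlighter-rouge">aprint()</code> could look like if it pushed the blocking call onto the default thread pool. It is only an illustration of the shape such an API would take, not a complete answer to asynchronous file I/O.</p>

```python
import asyncio
import functools

async def aprint(*args, **kwargs):
    """Run print() on the default thread pool and wait for it to finish."""
    loop = asyncio.get_running_loop()
    # run_in_executor() only forwards positional arguments, so bind the
    # keyword arguments (such as file=) with functools.partial first.
    call = functools.partial(print, *args, **kwargs)
    await loop.run_in_executor(None, call)
```

<p>Every caller must now be a coroutine that awaits <code class="language-plaintext highlighter-rouge">aprint()</code>, and the same would be true of each read and write wrapper.</p>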

<p>This is one of the unfortunate downsides of async/await. I strongly
prefer conventional, preemptive concurrency, <em>but</em> we don’t always have
that luxury.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Conventions for Command Line Options</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/08/01/"/>
    <id>urn:uuid:9be2ce0e-298e-4085-8789-49674aecfeeb</id>
    <updated>2020-08-01T00:34:23Z</updated>
    <category term="tutorial"/><category term="posix"/><category term="c"/><category term="python"/><category term="go"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=24020952">on Hacker News</a> and critiqued <a href="https://utcc.utoronto.ca/~cks/space/blog/unix/MyOptionsConventions">on
Wandering Thoughts</a> (<a href="https://utcc.utoronto.ca/~cks/space/blog/unix/UnixOptionsConventions">2</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/python/ArgparseSomeUnixNotes">3</a>).</em></p>

<p>Command line interfaces have varied throughout their brief history but
have largely converged to some common, sound conventions. The core
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap12.html">originates from unix</a>, and the Linux ecosystem extended it,
particularly via the GNU project. Unfortunately some tools initially
<em>appear</em> to follow the conventions, but subtly get them wrong, usually
for no practical benefit. I believe in many cases the authors simply
didn’t know any better, so I’d like to review the conventions.</p>

<!--more-->

<h3 id="short-options">Short Options</h3>

<p>The simplest case is the <em>short option</em> flag. An option is a hyphen —
specifically HYPHEN-MINUS U+002D — followed by one alphanumeric
character. Capital letters are acceptable. The letters themselves <a href="http://www.catb.org/~esr/writings/taoup/html/ch10s05.html">have
conventional meanings</a> and are worth following if possible.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -c
</code></pre></div></div>

<p>Flags can be grouped together into one program argument. This is both
convenient and unambiguous. It’s also one of those often missed details
when programs use hand-coded argument parsers, and the lack of support
irritates me.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -abc
program -acb
</code></pre></div></div>

<p>The next simplest case is a short option that takes an argument. The
argument follows the option.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -i input.txt -o output.txt
</code></pre></div></div>

<p>The space is optional, so the option and argument can be packed together
into one program argument. Since the argument is required, this is still
unambiguous. This is another often-missed feature in hand-coded parsers.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -iinput.txt -ooutput.txt
</code></pre></div></div>

<p>This does not prohibit grouping. When grouped, the option accepting an
argument must be last.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -abco output.txt
program -abcooutput.txt
</code></pre></div></div>
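<p>As a quick illustration, Python’s standard <code class="language-plaintext highlighter-rouge">getopt</code> module handles all of these forms. The option string <code class="language-plaintext highlighter-rouge">'abco:'</code> is an assumption mirroring the examples above: three flags plus <code class="language-plaintext highlighter-rouge">-o</code>, which requires an argument.</p>

```python
import getopt

def parse(argv):
    """Parse flags -a, -b, -c and -o ARG; return (options, non_options)."""
    return getopt.getopt(argv, 'abco:')

# Each spelling produces the same list of (option, argument) pairs.
separate = parse(['-a', '-b', '-c', '-o', 'output.txt'])
grouped = parse(['-abco', 'output.txt'])
packed = parse(['-abcooutput.txt'])
```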

<p>This technique is used to create another category, <em>optional option
arguments</em>. The option’s argument can be optional but still unambiguous
so long as the space is always omitted when the argument is present.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -c       # omitted
program -cblue   # provided
program -c blue  # omitted (blue is a new argument)

program -c -x   # two separate flags
program -c-x    # -c with argument "-x"
</code></pre></div></div>
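<p>The rule is small enough to sketch by hand. The following is a simplified, hypothetical parser (the name <code class="language-plaintext highlighter-rouge">parse_short()</code> and its parameters are made up for this example) in which an option listed in <code class="language-plaintext highlighter-rouge">optional</code> takes whatever is attached to it as its argument, and nothing otherwise.</p>

```python
def parse_short(argv, flags, optional):
    """Parse short options; letters in `optional` take an attached,
    optional argument. Returns (options, non_options)."""
    options = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == '--':
            i += 1  # skip the delimiter; the rest are non-options
            break
        if len(arg) < 2 or not arg.startswith('-'):
            break  # the first non-option ends option parsing
        j = 1
        while j < len(arg):
            letter = arg[j]
            if letter in optional:
                # Whatever is attached is the argument; None if nothing is.
                options.append(('-' + letter, arg[j + 1:] or None))
                break
            if letter not in flags:
                raise ValueError('unknown option -' + letter)
            options.append(('-' + letter, None))
            j += 1
        i += 1
    return options, argv[i:]
```

<p>With <code class="language-plaintext highlighter-rouge">flags='x'</code> and <code class="language-plaintext highlighter-rouge">optional='c'</code>, the five command lines above parse as described: only the attached forms give <code class="language-plaintext highlighter-rouge">-c</code> an argument.</p>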

<p>Optional option arguments should be used judiciously since they can be
surprising, but they have their uses.</p>

<p>Options can typically appear in any order — something parsers often
achieve via <em>permutation</em> — but non-options typically follow options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b foo bar
program -b -a foo bar
</code></pre></div></div>

<p>GNU-style programs usually allow options and non-options to be mixed,
though I don’t consider this to be essential.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a foo -b bar
program foo -a -b bar
program foo bar -a -b
</code></pre></div></div>

<p>If a non-option looks like an option because it starts with a hyphen,
use <code class="language-plaintext highlighter-rouge">--</code> to demarcate options from non-options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -- -x foo bar
</code></pre></div></div>

<p>An advantage of requiring that non-options follow options is that the
first non-option demarcates the two groups, so <code class="language-plaintext highlighter-rouge">--</code> is less often
needed.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># note: without argument permutation
program -a -b foo -x bar  # 2 options, 3 non-options
</code></pre></div></div>
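<p>Python’s <code class="language-plaintext highlighter-rouge">getopt</code> module offers both behaviors, which makes the difference easy to see. In this sketch the option string <code class="language-plaintext highlighter-rouge">'abx'</code> and the function names are assumptions for illustration.</p>

```python
import getopt

ARGV = ['-a', '-b', 'foo', '-x', 'bar']

def parse_posix(argv):
    """Stop at the first non-option; everything after it is a non-option."""
    return getopt.getopt(argv, 'abx')

def parse_gnu(argv):
    """Permute: options are collected wherever they appear.

    Note: gnu_getopt() falls back to the non-permuting behavior when the
    POSIXLY_CORRECT environment variable is set.
    """
    return getopt.gnu_getopt(argv, 'abx')
```

<p>Without permutation, <code class="language-plaintext highlighter-rouge">-x</code> lands among the non-options. With permutation it is parsed as an option unless a <code class="language-plaintext highlighter-rouge">--</code> precedes it.</p>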

<h3 id="long-options">Long options</h3>

<p>Since short options can be cryptic, and there are such a limited number
of them, more complex programs support long options. A long option
starts with two hyphens followed by one or more alphanumeric, lowercase
words. Hyphens separate words. Using two hyphens prevents long options
from being confused for grouped short options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --reverse --ignore-backups
</code></pre></div></div>

<p>Occasionally flags are paired with a mutually exclusive inverse flag
that begins with <code class="language-plaintext highlighter-rouge">--no-</code>. This avoids a future <em>flag day</em> where the
default is changed in the release that also adds the flag implementing
the original behavior.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --sort
program --no-sort
</code></pre></div></div>

<p>Long options can similarly accept arguments.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --output output.txt --block-size 1024
</code></pre></div></div>

<p>These may optionally be connected to the argument with an equals sign
<code class="language-plaintext highlighter-rouge">=</code>, much like omitting the space for a short option argument.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --output=output.txt --block-size=1024
</code></pre></div></div>

<p>Like before, this opens up the doors for optional option arguments. Due
to the required <code class="language-plaintext highlighter-rouge">=</code> this is still unambiguous.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --color --reverse
program --color=never --reverse
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">--</code> retains its original behavior of disambiguating option-like
non-option arguments:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program --reverse -- --foo bar
</code></pre></div></div>
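<p>For a concrete reference point, here is how the required-argument forms and <code class="language-plaintext highlighter-rouge">--</code> look with Python’s <code class="language-plaintext highlighter-rouge">getopt</code> module. The option names are taken from the examples above. A trailing <code class="language-plaintext highlighter-rouge">=</code> in the declaration marks an option that requires an argument.</p>

```python
import getopt

# A trailing '=' declares that the long option requires an argument.
LONGOPTS = ['reverse', 'output=', 'block-size=']

def parse(argv):
    """Parse long options only; return (options, non_options)."""
    return getopt.getopt(argv, '', LONGOPTS)

spaced = parse(['--output', 'output.txt', '--block-size', '1024'])
joined = parse(['--output=output.txt', '--block-size=1024'])
delimited = parse(['--reverse', '--', '--foo', 'bar'])
```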

<h3 id="subcommands">Subcommands</h3>

<p>Some programs, such as Git, have subcommands each with their own
options. The main program itself may still have its own options distinct
from subcommand options. The program’s options come before the
subcommand and subcommand options follow the subcommand. Options are
never permuted around the subcommand.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program -a -b -c subcommand -x -y -z
program -abc subcommand -xyz
</code></pre></div></div>

<p>Above, the <code class="language-plaintext highlighter-rouge">-a</code>, <code class="language-plaintext highlighter-rouge">-b</code>, and <code class="language-plaintext highlighter-rouge">-c</code> options are for <code class="language-plaintext highlighter-rouge">program</code>, and the
others are for <code class="language-plaintext highlighter-rouge">subcommand</code>. So, really, the subcommand is another
command line of its own.</p>
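<p>That observation translates almost directly into code: parse the program’s options with a parser that stops at the first non-option, treat that non-option as the subcommand, then parse the remainder as a fresh command line. A minimal sketch, with the option letters assumed from the example above:</p>

```python
import getopt

def parse(argv):
    """Split argv into program options, subcommand, and subcommand options."""
    # getopt.getopt() does not permute, so it stops at the subcommand.
    prog_opts, rest = getopt.getopt(argv, 'abc')
    if not rest:
        raise SystemExit('missing subcommand')
    subcommand, sub_argv = rest[0], rest[1:]
    # The subcommand's arguments are parsed as their own command line.
    sub_opts, sub_args = getopt.getopt(sub_argv, 'xyz')
    return prog_opts, subcommand, sub_opts, sub_args
```

<p>Because the first pass never looks past the subcommand, a <code class="language-plaintext highlighter-rouge">-x</code> after it can never be mistaken for a program option.</p>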

<h3 id="option-parsing-libraries">Option parsing libraries</h3>

<p>There’s little excuse for not getting these conventions right assuming
you’re interested in following the conventions. Short options can be
parsed correctly in <a href="https://github.com/skeeto/getopt">just ~60 lines of C code</a>. Long options are
<a href="https://github.com/skeeto/optparse">just slightly more complex</a>.</p>

<p>GNU’s <code class="language-plaintext highlighter-rouge">getopt_long()</code> supports long option abbreviation — with no way to
disable it (!) — but <a href="https://utcc.utoronto.ca/~cks/space/blog/python/ArgparseAbbreviatedOptions">this should be avoided</a>.</p>
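<p>Python’s <code class="language-plaintext highlighter-rouge">getopt</code> module also accepts unique prefixes of long options, so it can illustrate one hazard of abbreviation: a prefix that works today may become ambiguous once a later release adds another option sharing it. The option names below are hypothetical.</p>

```python
import getopt

def parse(argv, longopts):
    """Return the parsed (option, argument) pairs."""
    opts, _ = getopt.getopt(argv, '', longopts)
    return opts

def is_rejected(argv, longopts):
    """Report whether parsing fails, e.g. due to an ambiguous prefix."""
    try:
        parse(argv, longopts)
    except getopt.GetoptError:
        return True
    return False

# Version 1 of a hypothetical program: --out is a unique prefix of --output.
v1 = ['output=']
# Version 2 adds --outline, and the same command line stops working.
v2 = ['output=', 'outline']
```

<p>Here <code class="language-plaintext highlighter-rouge">parse(['--out=x.txt'], v1)</code> resolves to <code class="language-plaintext highlighter-rouge">--output</code>, while the same arguments against <code class="language-plaintext highlighter-rouge">v2</code> raise <code class="language-plaintext highlighter-rouge">GetoptError</code>.</p>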

<p>Go’s <a href="https://golang.org/pkg/flag/">flag package</a> intentionally deviates from the conventions.
It only supports long option semantics, via a single hyphen. This makes
it impossible to support grouping even if all options are only one
letter. Also, the only way to combine option and argument into a single
command line argument is with <code class="language-plaintext highlighter-rouge">=</code>. It’s sound, but I miss both features
every time I write programs in Go. That’s why I <a href="https://github.com/skeeto/optparse-go">wrote my own argument
parser</a>. Not only does it have a nicer feature set, I like the API a
lot more, too.</p>

<p>Python’s primary option parsing library is <code class="language-plaintext highlighter-rouge">argparse</code>, and I just can’t
stand it. Despite appearing to follow convention, it actually breaks
convention <em>and</em> its behavior is unsound. For instance, the following
program has two options, <code class="language-plaintext highlighter-rouge">--foo</code> and <code class="language-plaintext highlighter-rouge">--bar</code>. The <code class="language-plaintext highlighter-rouge">--foo</code> option accepts
an optional argument, and the <code class="language-plaintext highlighter-rouge">--bar</code> option is a simple flag.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">sys</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--foo'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">nargs</span><span class="o">=</span><span class="s">'?'</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s">'X'</span><span class="p">)</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--bar'</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s">'store_true'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">:]))</span>
</code></pre></div></div>

<p>Here are some example runs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py
Namespace(bar=False, foo='X')

$ python parse.py --foo
Namespace(bar=False, foo=None)

$ python parse.py --foo=arg
Namespace(bar=False, foo='arg')

$ python parse.py --bar --foo
Namespace(bar=True, foo=None)

$ python parse.py --foo arg
Namespace(bar=False, foo='arg')
</code></pre></div></div>

<p>Everything looks good except the last. If the <code class="language-plaintext highlighter-rouge">--foo</code> argument is
optional then why did it consume <code class="language-plaintext highlighter-rouge">arg</code>? What happens if I follow it with
<code class="language-plaintext highlighter-rouge">--bar</code>? Will it consume it as the argument?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py --foo --bar
Namespace(bar=True, foo=None)
</code></pre></div></div>

<p>Nope! Unlike <code class="language-plaintext highlighter-rouge">arg</code>, it left <code class="language-plaintext highlighter-rouge">--bar</code> alone, so instead of following the
unambiguous conventions, it has its own ambiguous semantics and attempts
to remedy them with a “smart” heuristic: “If an optional argument <em>looks
like</em> an option, then it must be an option!” Non-option arguments can
never follow an option with an optional argument, which makes that
feature pretty useless. Since <code class="language-plaintext highlighter-rouge">argparse</code> does not properly support <code class="language-plaintext highlighter-rouge">--</code>,
that does not help.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python parse.py --foo -- arg
usage: parse.py [-h] [--foo [FOO]] [--bar]
parse.py: error: unrecognized arguments: -- arg
</code></pre></div></div>
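
<p>For contrast, here is a minimal sketch in C of the conventional,
unambiguous handling of the same two options. It is not taken from any
particular library: <code class="language-plaintext highlighter-rouge">struct opts</code> and <code class="language-plaintext highlighter-rouge">parse_opts</code> are hypothetical names,
and it assumes parsing stops at the first non-option argument. An
optional argument is recognized only when attached with <code class="language-plaintext highlighter-rouge">=</code>, and <code class="language-plaintext highlighter-rouge">--</code>
ends option parsing:</p>

```c
#include <string.h>

struct opts {
    const char *foo;  /* "X" by default, NULL if --foo had no argument */
    int bar;          /* 1 if --bar was given */
    int first;        /* index of the first non-option argument */
};

/* Returns 0 on success, -1 on an unrecognized option. */
int parse_opts(int argc, char **argv, struct opts *o)
{
    int i;
    o->foo = "X";
    o->bar = 0;
    for (i = 1; i < argc; i++) {
        const char *arg = argv[i];
        if (!strcmp(arg, "--")) {
            i++;  /* skip the terminator; the rest are non-options */
            break;
        } else if (arg[0] != '-' || !arg[1]) {
            break;  /* first non-option argument (a lone "-" counts) */
        } else if (!strcmp(arg, "--bar")) {
            o->bar = 1;
        } else if (!strcmp(arg, "--foo")) {
            o->foo = 0;  /* optional argument omitted */
        } else if (!strncmp(arg, "--foo=", 6)) {
            o->foo = arg + 6;  /* argument attached with "=" */
        } else {
            return -1;
        }
    }
    o->first = i;
    return 0;
}
```

<p>Under these rules <code class="language-plaintext highlighter-rouge">--foo arg</code> never consumes <code class="language-plaintext highlighter-rouge">arg</code>, so no heuristic is
needed, and <code class="language-plaintext highlighter-rouge">--foo -- arg</code> parses cleanly with <code class="language-plaintext highlighter-rouge">arg</code> left as a non-option
argument.</p>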

<p>Please, stick to the conventions unless you have <em>really</em> good reasons
to break them!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Netpbm Animation Showcase</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/06/29/"/>
    <id>urn:uuid:282d487d-5840-4c30-9aa8-3d0d0f07bef2</id>
    <updated>2020-06-29T21:03:02Z</updated>
    <category term="c"/><category term="media"/>
    <content type="html">
      <![CDATA[<p>Ever since I worked out <a href="/blog/2017/11/03/">how to render video from scratch</a> some
years ago, it’s been an indispensable tool in my software development
toolbelt. It’s the first place I reach when I need to display some
graphics, even if it means having to do the rendering myself. I’ve used
it often in throwaway projects in a disposable sort of way. More
recently, though, I’ve kept better track of these animations since some
of them <em>are</em> pretty cool, and I’d like to look at them again. This post
is a showcase of some of these projects.</p>

<p>Each project is ready to run: compile it, then run it with the output
piped into a media player or video encoder. The header includes the
exact commands you need. Since that’s probably inconvenient for
most readers, I’ve included a pre-recorded sample of each. Though in a
few cases, especially those displaying random data, video encoding
really takes something away from the final result, and it may be worth
running yourself.</p>
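
<p>As a rough illustration of the general shape of these programs, here
is a minimal sketch that assumes each frame is written as a binary PPM
(P6) image, one after another, on a single output stream. The names
<code class="language-plaintext highlighter-rouge">frame_render</code> and <code class="language-plaintext highlighter-rouge">frame_write</code>, the frame size, and the gradient are
hypothetical and not taken from any of the projects below:</p>

```c
#include <stdio.h>

enum { WIDTH = 64, HEIGHT = 48 };

/* Write one frame as a binary PPM (P6) image. Returns 1 on success. */
int frame_write(FILE *f, const unsigned char *rgb)
{
    size_t n = (size_t)WIDTH * HEIGHT;
    if (fprintf(f, "P6\n%d %d\n255\n", WIDTH, HEIGHT) < 0) {
        return 0;
    }
    return fwrite(rgb, 3, n, f) == n;
}

/* Fill a frame with a gradient that shifts with the frame number. */
void frame_render(unsigned char *rgb, int frame)
{
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            unsigned char *p = rgb + 3*(y*WIDTH + x);
            p[0] = (unsigned char)(x*4 + frame);  /* wraps modulo 256 */
            p[1] = (unsigned char)(y*4);
            p[2] = 128;
        }
    }
}
```

<p>A <code class="language-plaintext highlighter-rouge">main</code> function would call <code class="language-plaintext highlighter-rouge">frame_render</code> then <code class="language-plaintext highlighter-rouge">frame_write(stdout, ...)</code>
in a loop, and the resulting stream is what gets piped into the player
or encoder.</p>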

<p>The projects are not in any particular order.</p>

<h3 id="randu">RANDU</h3>

<p><a href="https://nullprogram.com/video/?v=randu"><img src="/img/showcase/randu.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/randu.c">randu.c</a></p>

<p>This is a little demonstration of the poor quality of the <a href="https://en.wikipedia.org/wiki/RANDU">RANDU
pseudorandom number generator</a>. Note how the source embeds a
monospace font so that it can render the text in the corner. For the 3D
effect, it includes an orthographic projection function. This function
will appear again later since I tend to cannibalize my own projects.</p>
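
<p>For reference, the generator itself fits in one line. The sketch
below is my own minimal rendition rather than an excerpt from
<code class="language-plaintext highlighter-rouge">randu.c</code>, and the function names are hypothetical. Since the multiplier
is 65539 = 2<sup>16</sup> + 3, any three consecutive outputs satisfy
x<sub>k+2</sub> = 6x<sub>k+1</sub> - 9x<sub>k</sub> (mod 2<sup>31</sup>), which is why
points built from consecutive triples collapse onto a small number of
planes:</p>

```c
#include <stdint.h>

/* One step of RANDU: next = 65539 * x mod 2^31. Seeds should be odd. */
uint32_t randu(uint32_t x)
{
    /* Unsigned 32-bit multiplication wraps modulo 2^32, and masking
     * then reduces the product modulo 2^31. */
    return (x * UINT32_C(65539)) & UINT32_C(0x7fffffff);
}

/* Check the linear relation between three consecutive outputs:
 * c == 6*b - 9*a (mod 2^31). Returns 1 if it holds. */
int randu_related(uint32_t a, uint32_t b, uint32_t c)
{
    return ((6*b - 9*a) & UINT32_C(0x7fffffff)) == c;
}
```

<p>Starting from a seed of 1, the first outputs are 65539, 393225, and
1769499, and <code class="language-plaintext highlighter-rouge">randu_related</code> holds for every consecutive triple.</p>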

<h3 id="color-sorting">Color sorting</h3>

<p><a href="https://nullprogram.com/video/?v=colors-odd-even"><img src="/img/showcase/colorsort.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/colorsort.c">colorsort.c</a></p>

<p>The original idea came from <a href="https://old.reddit.com/r/woahdude/comments/73oz1x/from_chaos_to_order/">an old reddit post</a>.</p>

<h3 id="kruskal-maze-generator">Kruskal maze generator</h3>

<p><a href="https://nullprogram.com/video/?v=kruskal"><img src="/img/showcase/animaze.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animaze/animaze.c">animaze.c</a></p>

<p>This effect was invented by my current <a href="/blog/2016/09/02/">mentee student</a> while
working on maze / dungeon generation late last year. This particular
animation is my own implementation. It outputs Netpbm by default, but,
for both fun and practice, also includes an entire implementation <a href="/blog/2015/06/06/">in
OpenGL</a>. It’s enabled at compile time with <code class="language-plaintext highlighter-rouge">-DENABLE_GL</code> so long
as you have GLFW and GLEW (even on Windows!).</p>

<h3 id="sliding-rooks-puzzle">Sliding rooks puzzle</h3>

<p><a href="https://nullprogram.com/video/?v=rooks"><img src="/img/showcase/rooks.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/rooks.c">rooks.c</a></p>

<p>I wanted to watch an animated solution to <a href="https://possiblywrong.wordpress.com/2020/05/20/sliding-rooks-and-queens/">the sliding rooks
puzzle</a>. This program solves the puzzle using a bitboard, then
animates the solution. The rook images are embedded in the program,
compressed using a custom run-length encoding (RLE) scheme with a tiny
palette.</p>

<h3 id="glaubers-dynamics">Glauber’s dynamics</h3>

<p><a href="https://nullprogram.com/video/?v=magnet"><img src="/img/showcase/magnet.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/magnet.c">magnet.c</a></p>

<p>My own animation of <a href="http://bit-player.org/2019/glaubers-dynamics">Glauber’s dynamics</a> using a totally
unoriginal color palette.</p>

<h3 id="fire">Fire</h3>

<p><a href="https://nullprogram.com/video/?v=fire"><img src="/img/showcase/fire.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/fire.c">fire.c</a></p>

<p>This is the <a href="https://fabiensanglard.net/doom_fire_psx/">classic Doom fire animation</a>. I later <a href="/blog/2020/04/30/">implemented it
in WebGL</a> with a modified algorithm.</p>

<h3 id="mersenne-twister">Mersenne Twister</h3>

<p><a href="https://nullprogram.com/video/?v=mt19937-shuffle"><img src="/img/showcase/mt.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/mtvisualize.c">mtvisualize.c</a></p>

<p>A visualization of the Mersenne Twister pseudorandom number generator.
Not terribly interesting, so I almost didn’t include it.</p>

<h3 id="pixel-sorting">Pixel sorting</h3>

<p><a href="https://nullprogram.com/video/?v=pixelsort"><img src="/img/showcase/pixelsort.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/pixelsort.c">pixelsort.c</a></p>

<p>Another animation <a href="https://old.reddit.com/r/generative/comments/9o1plu/generative_pixel_sorting_variant/">inspired by a reddit post</a>. Starting from
the top-left corner, swap the current pixel to the one most like its
neighbors.</p>

<h3 id="random-walk-2d">Random walk (2D)</h3>

<p><a href="https://nullprogram.com/video/?v=walk2d"><img src="/img/showcase/walkers.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/walkers.c">walkers.c</a></p>

<p>Another reproduction of <a href="https://old.reddit.com/r/proceduralgeneration/comments/g49qwk/random_walkers_abstract_art/">a reddit post</a>. This is recent enough
that I’m using a <a href="/blog/2019/11/19/">disposable LCG</a>.</p>

<h3 id="manhattan-distance-voronoi-diagram">Manhattan distance Voronoi diagram</h3>

<p><a href="https://nullprogram.com/video/?v=voronoi"><img src="/img/showcase/voronoi.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/voronoi.c">voronoi.c</a></p>

<p>Another <a href="https://old.reddit.com/r/proceduralgeneration/comments/fuy6tk/voronoi_with_manhattan_distance_in_c/">reddit post</a>, though I think my version looks a lot
nicer. I like to play this one on repeat with different
seeds.</p>
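
<p>The core of such a diagram is a nearest-seed lookup using Manhattan
distance, |dx| + |dy|, in place of Euclidean distance. Here is a
minimal sketch of that lookup. It is not lifted from <code class="language-plaintext highlighter-rouge">voronoi.c</code>: the
<code class="language-plaintext highlighter-rouge">struct point</code> type, the <code class="language-plaintext highlighter-rouge">nearest</code> function, and the tie-breaking rule
are all assumptions made for illustration:</p>

```c
#include <stdlib.h>

struct point { int x, y; };

/* Return the index of the seed nearest to (x, y) by Manhattan
 * distance, or -1 if there are no seeds. Ties go to the lowest index. */
int nearest(const struct point *seeds, int n, int x, int y)
{
    int best = -1;
    long bestd = 0;
    for (int i = 0; i < n; i++) {
        long d = labs((long)x - seeds[i].x) + labs((long)y - seeds[i].y);
        if (best < 0 || d < bestd) {
            best = i;
            bestd = d;
        }
    }
    return best;
}
```

<p>Coloring each pixel by the index this returns produces the
characteristic diamond-edged cells.</p>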

<h3 id="random-walk-3d">Random walk (3D)</h3>

<p><a href="https://nullprogram.com/video/?v=walk3d"><img src="/img/showcase/walk3d.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/walk3d.c">walk3d.c</a></p>

<p>Another <del>stolen idea</del> personal take <a href="https://old.reddit.com/r/proceduralgeneration/comments/geka1q/random_walking_in_3d/">on a reddit post</a>. This
features the orthographic projection function from the RANDU animation.
Video encoding makes a real mess of this one, and I couldn’t work out
encoding options to make it look nice, so this one looks a lot better
“in person.”</p>

<h3 id="lorenz-system">Lorenz system</h3>

<p><a href="https://nullprogram.com/video/?v=lorenz"><img src="/img/showcase/lorenz.jpg" alt="" /></a><br />
<strong>Source</strong>:  <a href="https://github.com/skeeto/scratch/blob/master/animation/lorenz.c">lorenz.c</a></p>

<p>A 3D animation I adapted from the 3D random walk above, meaning it uses
the same orthographic projection. I have <a href="/blog/2018/02/15/">a WebGL version of this
one</a>, but I like that I could do this in such a small
amount of code and without an existing rendering engine. Like before,
this is really damaged by video encoding and is best seen live.</p>

<p>Bonus: I made <a href="https://gist.github.com/skeeto/45d825c01b00c10452634933d03e766d">an obfuscated version</a> just to show how
small this can get!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>w64devkit: a Portable C and C++ Development Kit for Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/05/15/"/>
    <id>urn:uuid:d600d846-3692-474f-adbf-45db63079581</id>
    <updated>2020-05-15T03:43:04Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=23292161">on Hacker News</a>.</em></p>

<p>As a computer engineer, my job is to use computers to solve important
problems. Ideally my solutions will be efficient, and typically that
means making the best use of the resources at hand. Quite often these
resources are machines running Windows and, despite my misgivings about
the platform, there is much to be gained by properly and effectively
leveraging it.</p>

<p>Sometimes <a href="/blog/2018/11/15/">targeting Windows while working from another platform</a>
is sufficient, but other times I must work on the platform itself. There
<a href="/blog/2016/06/13/">are various options available</a> for C development, and I’ve
finally formalized my own development kit: <a href="https://github.com/skeeto/w64devkit"><strong>w64devkit</strong></a>.</p>

<!--more-->

<p>For most users, the value is in the <strong>78MiB .zip</strong> available in the
“Releases” on GitHub. This (relatively) small package includes a
state-of-the-art C and C++ compiler (<a href="http://mingw-w64.org/">latest GCC</a>), a <a href="https://www.vim.org/">powerful
text editor</a>, <a href="https://www.gnu.org/software/gdb/">debugger</a>, a <a href="https://www.nasm.us/">complete x86 assembler</a>,
and <a href="https://frippery.org/busybox/">miniature unix environment</a>. It’s “portable” in that there’s no
installation. Just unzip it and start using it in place. With w64devkit,
it literally takes a few seconds on any Windows to get up and running
with a fully-featured, fully-equipped, first-class <a href="https://sanctum.geek.nz/arabesque/unix-as-ide-introduction/">development
environment</a>.</p>

<p>The development kit is cross-compiled entirely from source using Docker,
though Docker is not needed to actually use it. The repository is just a
Dockerfile and some documentation. The only build dependency is Docker
itself. It’s also easy to customize it for your own personal use, or to
audit and build your own if, for whatever reason, you didn’t trust my
distribution. This is in stark contrast to Windows builds of most open
source software where the build process is typically undocumented,
under-documented, obtuse, or very complicated.</p>

<h3 id="from-script-to-docker">From script to Docker</h3>

<p>Publishing this is not necessarily a commitment to always keep w64devkit
up to date, but this Dockerfile <em>is</em> derived from (and replaces) a shell
script I’ve been using continuously <a href="/blog/2018/04/13/#a-better-alternative">for over two years now</a>. In
this period, every time GCC has made a release, I’ve built myself a new
development kit, so I’m already in the habit.</p>

<p>I’ve been using Docker on and off for about 18 months now. It’s an
oddball in that it’s something I learned on the job rather than my own
time. I formed an early impression that still basically holds: <strong>The
main purpose of Docker is to contain and isolate misbehaved software to
improve its reliability</strong>. Well-behaved, well-designed software benefits
little from containers.</p>

<p>My unusual application of Docker here is no exception. <a href="/blog/2017/03/30/">Most software
builds are needlessly complicated and fragile</a>, especially
Autoconf-based builds. Ironically, the worst configure scripts I’ve
dealt with come from GNU projects. They waste time on superfluous checks
(“Does your compiler define <code class="language-plaintext highlighter-rouge">size_t</code>?”) then produce a build that
doesn’t work anyway because you’re doing something slightly unusual.
Worst of all, despite my best efforts, the build will be contaminated by
the state of the system doing the build.</p>

<p>My original build script was fragile by extension. It would work on one
system, but not another due to some subtle environment change — a
slightly different system header that reveals a build system bug
(<a href="https://gcc.gnu.org/legacy-ml/gcc/2017-05/msg00219.html">example in GCC</a>), or the system doesn’t have a file at a certain
hard-coded absolute path that shouldn’t be hard-coded. Converting my
script to a Dockerfile locks these problems in place and makes builds
much more reliable and repeatable. The misbehavior is contained and
isolated by Docker.</p>

<p>Unfortunately it’s not <em>completely</em> contained. In each case I use make’s
<code class="language-plaintext highlighter-rouge">-j</code> option to parallelize the build since otherwise it would take
hours. Some of the builds have subtle race conditions, and some bad luck
in timing can cause a build to fail. Docker is good about picking up
where it left off, so it’s just a matter of trying again.</p>

<p>In one case a build failed because Bison and flex were not installed
even though they’re not normally needed. Some dependency isn’t expressed
correctly, and unlucky ordering leads to an unused <code class="language-plaintext highlighter-rouge">.y</code> file having the
wrong timestamp. Ugh. I’ve had this happen a lot more in Docker than
out, probably because file system operations are slow inside Docker and
it creates greater timing variance.</p>

<h3 id="other-tools">Other tools</h3>

<p>The README explains some of my decisions, but I’ll summarize a few here:</p>

<ul>
  <li>
    <p>Git. Important and useful, so I’d love to have it. But it has a weird
installation (many <a href="https://github.com/skeeto/w64devkit/issues/1">.zip-unfriendly symlinks</a>) tightly-coupled
with msys2, and its build system does not support cross-compilation.
I’d love to see a clean, straightforward rewrite of Git in a single,
appropriate implementation language. Imagine installing the latest Git
with <code class="language-plaintext highlighter-rouge">go get git-scm.com/git</code>. (<em>Update</em>: <a href="https://github.com/libgit2/libgit2/pull/5507">libgit2 is working on
it</a>!)</p>
  </li>
  <li>
    <p>Bash. It’s a much nicer interactive shell than BusyBox-w32 <code class="language-plaintext highlighter-rouge">ash</code>. But
the build system doesn’t support cross-compilation, and I’m not sure
it supports Windows without some sort of compatibility layer anyway.</p>
  </li>
  <li>
    <p>Emacs. Another powerful editor. But the build system doesn’t support
cross-compilation. It’s also <em>way</em> too big.</p>
  </li>
  <li>
    <p>Go. Tempting to toss it in, but <a href="/blog/2020/01/21/">Go already does this all correctly
and effectively</a>. It simply doesn’t require a specialized
distribution. It’s trivial to manage a complete Go toolchain with
nothing but Go itself on any system. People may say its language
design comes from the 1970s, but the tooling is decades ahead of
everyone else.</p>
  </li>
</ul>

<h3 id="alternatives">Alternatives</h3>

<p>For a long, long time Cygwin filled this role for me. However, I never
liked its bulky nature, the complete opposite of portable. Cygwin
processes always felt second-class on Windows, particularly in that it
has its own view of the file system compared to other Windows processes.
They could never fully cooperate. I also don’t like that there’s no
toolchain for cross-compiling with Cygwin as a target — e.g. compile
Cygwin binaries from Linux. Finally <a href="/blog/2017/11/30/">it’s been essentially obsoleted by
WSL</a> which matches or surpasses it on every front.</p>

<p>There’s msys and <a href="https://www.msys2.org/">msys2</a>, which are a bit lighter. However, I’m
still in an isolated, second-class environment with weird path
translation issues. These tools <em>do</em> have important uses, and they’re the
only way to compile most open source software natively on Windows. For
those builds that don’t support cross-compilation, they’re <em>the</em> only path
for producing Windows builds. It’s just not what I’m looking for when
developing my own software.</p>

<p><em>Update</em>: <a href="https://github.com/mstorsjo/llvm-mingw">llvm-mingw</a> is an eerily similar project using Docker
the same way, but instead builds LLVM.</p>

<h3 id="using-docker-for-other-builds">Using Docker for other builds</h3>

<p>I also <a href="https://github.com/skeeto/gnupg-windows-build">converted my GnuPG build script</a> to a Dockerfile. Of
course I don’t plan to actually <em>use</em> GnuPG on Windows. I just need it
<a href="/blog/2019/07/10/">for passphrase2pgp</a>, which I test against GnuPG. This tests the
Windows build.</p>

<p>In the future I may extend this idea to a few other tools I don’t intend
to include with w64devkit. If you have something in mind, you could use
my Dockerfiles as a kind of starter template.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to Read UTF-8 Passwords on the Windows Console</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/05/04/"/>
    <id>urn:uuid:338ca754-e19e-4ae0-add8-639d69967c22</id>
    <updated>2020-05-04T02:14:34Z</updated>
    <category term="win32"/><category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=23064864">on Hacker News</a>.</em></p>

<p>Suppose you’re writing a command line program that <a href="/blog/2017/03/12/">prompts the user for
a password or passphrase</a>, and Windows is one of the supported
platforms (<a href="/blog/2018/04/13/">even very old versions</a>). This program uses <a href="/blog/2019/05/29/">UTF-8
for its string representation</a>, <a href="http://utf8everywhere.org/">as it should</a>, and so
ideally it receives the password from the user encoded as UTF-8. On most
platforms this is, for the most part, automatic. However, on Windows
finding the correct answer to this problem is a maze where all the signs
lead towards dead ends. I recently navigated this maze and found the way
out.</p>

<!--more-->

<p>I knew it was possible because <a href="/blog/2019/07/10/">my passphrase2pgp tool</a> has been
using the <a href="https://pkg.go.dev/golang.org/x/crypto/ssh/terminal">golang.org/x/crypto/ssh/terminal</a> package, which gets it
very nearly perfect. Though they were still fixing subtle bugs <a href="https://github.com/golang/crypto/commit/6d4e4cb37c7d6416dfea8472e751c7b6615267a6">as
recently as 6 months ago</a>.</p>

<p>The first step is to ignore just about everything you find online, because
it’s either wrong or it’s solving a slightly different problem. I’ll
discuss the dead ends later and focus on the solution first. Ultimately
I want to implement this on Windows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Display prompt then read zero-terminated, UTF-8 password.</span>
<span class="c1">// Return password length with terminator, or zero on error.</span>
<span class="kt">int</span> <span class="nf">read_password</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prompt</span><span class="p">);</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">int</code> for the length rather than <code class="language-plaintext highlighter-rouge">size_t</code> because it’s a
password and should not even approach <code class="language-plaintext highlighter-rouge">INT_MAX</code>.</p>

<h3 id="the-correct-way">The correct way</h3>

<p>For the impatient:
<a href="https://github.com/skeeto/scratch/blob/master/misc/read-password-w32.c" class="download"><strong>complete, working, ready-to-use example</strong></a></p>

<p>On a unix-like system, the program would:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">open(2)</code> the special <code class="language-plaintext highlighter-rouge">/dev/tty</code> file for reading and writing</li>
  <li><code class="language-plaintext highlighter-rouge">write(2)</code> the prompt</li>
  <li><code class="language-plaintext highlighter-rouge">tcgetattr(3)</code> and <code class="language-plaintext highlighter-rouge">tcsetattr(3)</code> to disable <code class="language-plaintext highlighter-rouge">ECHO</code></li>
  <li><code class="language-plaintext highlighter-rouge">read(2)</code> a line of input</li>
  <li>Restore the old terminal attributes with <code class="language-plaintext highlighter-rouge">tcsetattr(3)</code></li>
  <li><code class="language-plaintext highlighter-rouge">close(2)</code> the file</li>
</ol>
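
<p>Those steps can be sketched roughly as follows. This is a minimal
illustration, not the code from the complete example: it treats an
over-long line as an error, and it additionally sets <code class="language-plaintext highlighter-rouge">ECHONL</code> (my own
choice, not one of the steps above) so that the newline is still echoed
while the password is hidden:</p>

```c
#include <fcntl.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

/* Display prompt then read a zero-terminated password from the
 * controlling terminal. Returns the password length including the
 * terminator, or zero on error. */
int read_password(char *buf, int len, const char *prompt)
{
    if (len < 1) {
        return 0;
    }

    int fd = open("/dev/tty", O_RDWR);
    if (fd == -1) {
        return 0;
    }

    int r = 0;
    struct termios orig;
    if (write(fd, prompt, strlen(prompt)) != -1 &&
        tcgetattr(fd, &orig) == 0) {
        struct termios noecho = orig;
        noecho.c_lflag &= ~(tcflag_t)ECHO;  /* hide typed characters */
        noecho.c_lflag |= ECHONL;           /* but still echo the newline */
        if (tcsetattr(fd, TCSAFLUSH, &noecho) == 0) {
            ssize_t n = read(fd, buf, (size_t)len);
            if (n > 0 && buf[n-1] == '\n') {
                buf[n-1] = 0;  /* replace the newline with a terminator */
                r = (int)n;
            }
            tcsetattr(fd, TCSAFLUSH, &orig);  /* restore old attributes */
        }
    }
    close(fd);
    return r;
}
```

<p>Because it opens <code class="language-plaintext highlighter-rouge">/dev/tty</code> directly, the sketch returns zero when the
process has no controlling terminal.</p>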

<p>A great advantage of this approach is that it doesn’t depend on standard
input and standard output. Either or both can be redirected elsewhere,
and this function still interacts with the user’s terminal. The Windows
version will have the same advantage.</p>

<p>Despite some tempting shortcuts that don’t work, the steps on Windows
are basically the same but with different names. There are a couple
subtleties and extra steps. I’ll be ignoring errors in my code snippets
below, but the complete example has full error handling.</p>

<h4 id="create-console-handles">Create console handles</h4>

<p>Instead of <code class="language-plaintext highlighter-rouge">/dev/tty</code>, the program opens two files: <code class="language-plaintext highlighter-rouge">CONIN$</code> and
<code class="language-plaintext highlighter-rouge">CONOUT$</code> using <a href="https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilea"><code class="language-plaintext highlighter-rouge">CreateFileA()</code></a>. Note: The “A” stands for ANSI,
as opposed to “W” for wide (Unicode). This refers to the encoding of the
file name, not to how the file contents are encoded. <code class="language-plaintext highlighter-rouge">CONIN$</code> is opened
for both reading and writing because write permissions are needed to
change the console’s mode.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span> <span class="n">hi</span> <span class="o">=</span> <span class="n">CreateFileA</span><span class="p">(</span>
    <span class="s">"CONIN$"</span><span class="p">,</span>
    <span class="n">GENERIC_READ</span> <span class="o">|</span> <span class="n">GENERIC_WRITE</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="n">OPEN_EXISTING</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="mi">0</span>
<span class="p">);</span>
<span class="n">HANDLE</span> <span class="n">ho</span> <span class="o">=</span> <span class="n">CreateFileA</span><span class="p">(</span>
    <span class="s">"CONOUT$"</span><span class="p">,</span>
    <span class="n">GENERIC_WRITE</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="n">OPEN_EXISTING</span><span class="p">,</span>
    <span class="mi">0</span><span class="p">,</span>
    <span class="mi">0</span>
<span class="p">);</span>
</code></pre></div></div>

<h4 id="print-the-prompt">Print the prompt</h4>

<p>To write the prompt, call <a href="https://docs.microsoft.com/en-us/windows/console/writeconsole"><code class="language-plaintext highlighter-rouge">WriteConsoleA()</code></a> on the output handle.
On its own, this assumes the prompt is plain ASCII (e.g. <code class="language-plaintext highlighter-rouge">"password:
"</code>), not UTF-8 (e.g. <code class="language-plaintext highlighter-rouge">"contraseña: "</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WriteConsoleA</span><span class="p">(</span><span class="n">ho</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">prompt</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>If the prompt may contain UTF-8 data, perhaps because it displays a
username or isn’t in English, you have two options:</p>

<ul>
  <li>Convert the prompt to UTF-16 and call <code class="language-plaintext highlighter-rouge">WriteConsoleW()</code> instead.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">SetConsoleOutputCP()</code> with <code class="language-plaintext highlighter-rouge">CP_UTF8</code> (65001). This is a global
(to the console) setting and should be restored when done.</li>
</ul>

<h4 id="disable-echo">Disable echo</h4>

<p>Next use <a href="https://docs.microsoft.com/en-us/windows/console/getconsolemode"><code class="language-plaintext highlighter-rouge">GetConsoleMode()</code></a> and <a href="https://docs.microsoft.com/en-us/windows/console/setconsolemode"><code class="language-plaintext highlighter-rouge">SetConsoleMode()</code></a> to
disable echo. The console usually has <code class="language-plaintext highlighter-rouge">ENABLE_PROCESSED_INPUT</code> already
set, which tells the console to handle CTRL-C and such, but I set it
explicitly just in case. I also set <code class="language-plaintext highlighter-rouge">ENABLE_LINE_INPUT</code> so that the user
can use backspace and so that the entire line is delivered at once.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DWORD</span> <span class="n">orig</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">GetConsoleMode</span><span class="p">(</span><span class="n">hi</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">orig</span><span class="p">);</span>

<span class="n">DWORD</span> <span class="n">mode</span> <span class="o">=</span> <span class="n">orig</span><span class="p">;</span>
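<span class="n">mode</span> <span class="o">|=</span> <span class="n">ENABLE_PROCESSED_INPUT</span><span class="p">;</span>
<span class="n">mode</span> <span class="o">|=</span> <span class="n">ENABLE_LINE_INPUT</span><span class="p">;</span>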
<span class="n">mode</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="n">ENABLE_ECHO_INPUT</span><span class="p">;</span>
<span class="n">SetConsoleMode</span><span class="p">(</span><span class="n">hi</span><span class="p">,</span> <span class="n">mode</span><span class="p">);</span>
</code></pre></div></div>

<p>There are reports that <code class="language-plaintext highlighter-rouge">ENABLE_LINE_INPUT</code> limits reads to 254 bytes,
but I was unable to reproduce it. My full example can read huge
passwords without trouble.</p>

<p>The old mode is saved in <code class="language-plaintext highlighter-rouge">orig</code> so that it can be restored later.</p>

<h4 id="read-the-password">Read the password</h4>

<p>Here’s where you have to pay the piper. As of the date of this article,
<strong>the Windows API offers no method for reading UTF-8 input from the
console</strong>. Give up on that hope now. If you use the “ANSI” functions to
read input under any configuration, they will do the usual Windows thing
of <em>silently mangling your input</em>.</p>

<p>So you <em>must</em> use the UTF-16 API, <a href="https://docs.microsoft.com/en-us/windows/console/readconsole"><code class="language-plaintext highlighter-rouge">ReadConsoleW()</code></a>, and then
<a href="/blog/2017/10/06/">encode it</a> yourself. Fortunately Win32 provides a UTF-8 encoder,
<a href="https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte"><code class="language-plaintext highlighter-rouge">WideCharToMultiByte()</code></a>, which will even handle surrogate pairs
for all those people who like putting <code class="language-plaintext highlighter-rouge">PILE OF POO</code> (<code class="language-plaintext highlighter-rouge">U+1F4A9</code>) in their
passwords:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SIZE_T</span> <span class="n">wbuf_len</span> <span class="o">=</span> <span class="p">(</span><span class="n">len</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="n">WCHAR</span><span class="p">);</span>
<span class="n">WCHAR</span> <span class="o">*</span><span class="n">wbuf</span> <span class="o">=</span> <span class="n">HeapAlloc</span><span class="p">(</span><span class="n">GetProcessHeap</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">wbuf_len</span><span class="p">);</span>
<span class="n">DWORD</span> <span class="n">nread</span><span class="p">;</span>
<span class="n">ReadConsoleW</span><span class="p">(</span><span class="n">hi</span><span class="p">,</span> <span class="n">wbuf</span><span class="p">,</span> <span class="n">len</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">nread</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">wbuf</span><span class="p">[</span><span class="n">nread</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// truncate "\r\n"</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">WideCharToMultiByte</span><span class="p">(</span><span class="n">CP_UTF8</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">wbuf</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">SecureZeroMemory</span><span class="p">(</span><span class="n">wbuf</span><span class="p">,</span> <span class="n">wbuf_len</span><span class="p">);</span>
<span class="n">HeapFree</span><span class="p">(</span><span class="n">GetProcessHeap</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">wbuf</span><span class="p">);</span>
</code></pre></div></div>

<p>I use <a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-rtlsecurezeromemory"><code class="language-plaintext highlighter-rouge">SecureZeroMemory()</code></a> to erase the UTF-16 version of the
password before freeing the buffer. The <code class="language-plaintext highlighter-rouge">+ 2</code> in the allocation is for
the CRLF line ending that will later be chopped off. The error-handling
version checks that the input did indeed end with CRLF. If it didn’t, the
input was too long and was truncated.</p>
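
<p>That CRLF check might look something like the following. This is only a
sketch, not code taken from the full example: <code class="language-plaintext highlighter-rouge">chop_crlf()</code> is a
hypothetical helper, and <code class="language-plaintext highlighter-rouge">uint16_t</code> stands in for <code class="language-plaintext highlighter-rouge">WCHAR</code> so that it
compiles anywhere.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical helper: if the first nread UTF-16 code units end with
 * "\r\n", overwrite the '\r' with a null terminator and return 1.
 * Otherwise the line was cut short (or nothing was read), so return 0
 * and leave the buffer alone. */
static int
chop_crlf(uint16_t *wbuf, size_t nread)
{
    if (nread &lt; 2 || wbuf[nread-2] != '\r' || wbuf[nread-1] != '\n') {
        return 0;
    }
    wbuf[nread-2] = 0;
    return 1;
}
</code></pre></div></div>

<p>Used in place of the unconditional <code class="language-plaintext highlighter-rouge">wbuf[nread-2] = 0</code>, a check like this
also covers the case where fewer than two code units were read.</p>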

<h4 id="clean-up">Clean up</h4>

<p>Finally print a newline since the user-typed one wasn’t echoed, restore
the old console mode, close the console handles, and return the final
encoded length:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WriteConsoleA(ho, "\n", 1, 0, 0);
SetConsoleMode(hi, orig);
CloseHandle(ho);
CloseHandle(hi);
return r;
</code></pre></div></div>

<p>The error checking version doesn’t check for errors from any of these
functions since either they cannot fail, or there’s nothing reasonable
to do in the event of an error.</p>

<h3 id="dead-ends">Dead ends</h3>

<p>If you look around the Win32 API you might notice <code class="language-plaintext highlighter-rouge">SetConsoleCP()</code>. A
reasonable person might think that setting the “code page” to UTF-8
(<code class="language-plaintext highlighter-rouge">CP_UTF8</code>) might configure the console to encode input in UTF-8. The
good news is Windows will no longer mangle your input as before. The bad
news is that it will be mangled differently.</p>

<p>You might think you can use the CRT function <code class="language-plaintext highlighter-rouge">_setmode()</code> with
<code class="language-plaintext highlighter-rouge">_O_U8TEXT</code> on the <code class="language-plaintext highlighter-rouge">FILE *</code> connected to the console. This does nothing
useful. (The only use for <code class="language-plaintext highlighter-rouge">_setmode()</code> is with <code class="language-plaintext highlighter-rouge">_O_BINARY</code>, to disable
braindead character translation on standard input and output.) The best
you’ll be able to do with the CRT is the same sort of wide character
read using non-standard functions, followed by conversion to UTF-8.</p>

<p><a href="https://docs.microsoft.com/en-us/windows/win32/api/wincred/nf-wincred-creduicmdlinepromptforcredentialsa"><code class="language-plaintext highlighter-rouge">CredUICmdLinePromptForCredentials()</code></a> promises to be both a
mouthful of a function name, and a prepacked solution to this problem.
It only delivers on the first. This function seems to have broken some
time ago and nobody at Microsoft noticed — probably because <em>nobody has
ever used this function</em>. I couldn’t find a working example, nor a use
in any real application. When I tried to use it, I got a nonsense error
code, and it never worked. There’s a GUI version of this function that <em>does</em>
work, and it’s a viable alternative for certain situations, though not
mine.</p>

<p>At my most desperate, I hoped <code class="language-plaintext highlighter-rouge">ENABLE_VIRTUAL_TERMINAL_PROCESSING</code> would
be a magical switch. On Windows 10 it magically enables some ANSI escape
sequences. The documentation in no way suggests it <em>would</em> work, and I
confirmed by experimentation that it does not. Pity.</p>

<p>I spent a lot of time chasing down these dead ends before finally
settling on <code class="language-plaintext highlighter-rouge">ReadConsoleW()</code> above. I hoped it would be more
automatic, but I’m glad I have at least <em>some</em> solution figured out.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>When Parallel: Pull, Don't Push</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/04/30/"/>
    <id>urn:uuid:ac12ef1d-299f-4edb-9eb1-5ed4dac1219c</id>
    <updated>2020-04-30T22:35:51Z</updated>
    <category term="optimization"/><category term="interactive"/><category term="javascript"/><category term="opengl"/><category term="media"/><category term="webgl"/><category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=23089729">on Hacker News</a>.</em></p>

<p>I’ve noticed a small pattern across a few of my projects where I had
vectorized and parallelized some code. The original algorithm took a
“push” approach, while the optimized version instead took a “pull” approach.
In this article I’ll describe what I mean, though it’s mostly just so I
can show off some pretty videos, pictures, and demos.</p>

<!--more-->

<h3 id="sandpiles">Sandpiles</h3>

<p>A good place to start is the <a href="https://en.wikipedia.org/wiki/Abelian_sandpile_model">Abelian sandpile model</a>, which, like
many before me, completely <a href="https://xkcd.com/356/">captured</a> my attention for a while.
It’s a cellular automaton where each cell is a pile of grains of sand —
a sandpile. At each step, any sandpile with four or more grains of
sand spills one grain into each of its four 4-connected neighbors, regardless of
the number of grains in those neighboring cells. Cells at the edge spill
their grains into oblivion, and those grains no longer exist.</p>

<p>With excess sand falling over the edge, the model eventually hits a
stable state where all piles have three or fewer grains. However, until
it reaches stability, all sorts of interesting patterns ripple through
the cellular automaton. In certain cases, the final pattern itself is
beautiful and interesting.</p>

<p>Numberphile has a great video describing how to <a href="https://www.youtube.com/watch?v=1MtEUErz7Gg">form a group over
recurrent configurations</a> (<a href="https://www.youtube.com/watch?v=hBdJB-BzudU">also</a>). In short, for any given grid
size, there’s a stable <em>identity</em> configuration that, when “added” to
any other element in the group, will stabilize back to that element. The
identity configuration is a fractal itself, and has been a focus of
study on its own.</p>

<p>Computing the identity configuration is really just about running the
simulation to completion a couple times from certain starting
configurations. Here’s an animation of the process for computing the
64x64 identity configuration:</p>

<p><a href="https://nullprogram.com/video/?v=sandpiles-64"><img src="/img/identity-64-thumb.png" alt="" /></a></p>

<p>As a fractal, the larger the grid, the more self-similar patterns there
are to observe. There are lots of samples online, and the biggest I
could find was <a href="https://commons.wikimedia.org/wiki/File:Sandpile_group_identity_on_3000x3000_grid.png">this 3000x3000 on Wikimedia Commons</a>. But I wanted
to see one <em>that’s even bigger, damnit</em>! So, skipping to the end, I
eventually computed this 10000x10000 identity configuration:</p>

<p><a href="/img/identity-10000.png"><img src="/img/identity-10000-thumb.png" alt="" /></a></p>

<p>This took 10 days to compute using my optimized implementation:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c">https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c</a></p>

<p>I picked an algorithm described <a href="https://codegolf.stackexchange.com/a/106990">in a code golf challenge</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(ones(n)*6 - f(ones(n)*6))
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">f()</code> is the function that runs the simulation to a stable state.</p>
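
<p>To make the formula concrete, here’s a small, unoptimized sketch that
applies it to a 3x3 grid. It is not the code from <code class="language-plaintext highlighter-rouge">sandpiles.c</code>. The names
<code class="language-plaintext highlighter-rouge">N</code>, <code class="language-plaintext highlighter-rouge">W</code>, <code class="language-plaintext highlighter-rouge">stabilize()</code>, and <code class="language-plaintext highlighter-rouge">identity()</code> are made up for illustration,
and the grid is assumed to be stored with a one-cell border of zeros so
that edge cells need no special handling.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;string.h&gt;

#define N 3        /* grid size, kept tiny for illustration */
#define W (N + 2)  /* padded with a border of cells that stay zero */

/* f(): run the simulation until every pile has three or fewer grains. */
static void
stabilize(int g[W][W])
{
    for (;;) {
        int next[W][W] = {{0}};
        int toppled = 0;
        for (int y = 1; y &lt;= N; y++) {
            for (int x = 1; x &lt;= N; x++) {
                int v = g[y][x];
                if (v &gt;= 4) {
                    v -= 4;
                    toppled = 1;
                }
                v += g[y-1][x] &gt;= 4;
                v += g[y+1][x] &gt;= 4;
                v += g[y][x-1] &gt;= 4;
                v += g[y][x+1] &gt;= 4;
                next[y][x] = v;
            }
        }
        if (!toppled) {
            return;
        }
        memcpy(g, next, sizeof(next));
    }
}

/* f(ones(n)*6 - f(ones(n)*6)) */
static void
identity(int out[W][W])
{
    int a[W][W] = {{0}};
    for (int y = 1; y &lt;= N; y++) {
        for (int x = 1; x &lt;= N; x++) {
            a[y][x] = 6;
        }
    }
    stabilize(a);
    memset(out, 0, sizeof(a));
    for (int y = 1; y &lt;= N; y++) {
        for (int x = 1; x &lt;= N; x++) {
            out[y][x] = 6 - a[y][x];
        }
    }
    stabilize(out);
}
</code></pre></div></div>

<p>For the 3x3 grid the result has 2 in each corner, 1 on each edge, and 0
in the center. Larger grids only require changing <code class="language-plaintext highlighter-rouge">N</code>, though a loop this
simple gets slow quickly.</p>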

<p>I used <a href="/blog/2015/07/10/">OpenMP to parallelize across cores, and SIMD to parallelize
within a thread</a>. Each thread operates on 32 sandpiles at a time.
To compute the identity sandpile, each sandpile only needs 3 bits of
state, so this could potentially be increased to 85 sandpiles at a time
on the same hardware. The output format is my old mainstay, Netpbm,
<a href="/blog/2017/11/03/">including the video output</a>.</p>

<h4 id="sandpile-push-and-pull">Sandpile push and pull</h4>

<p>So, what do I mean about pushing and pulling? The naive approach to
simulating sandpiles looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each i in sandpiles {
    output[i] = input[i]
}
for each i in sandpiles {
    if input[i] &gt;= 4 {
        output[i] = output[i] - 4
        for each j in neighbors {
            output[j] = output[j] + 1
        }
    }
}
</code></pre></div></div>

<p>As the algorithm examines each cell, it <em>pushes</em> results into
neighboring cells. If we’re using concurrency, that means multiple
threads of execution may be mutating the same cell, which requires
synchronization — locks, <a href="/blog/2014/09/02/">atomics</a>, etc. That much
synchronization is the death knell of performance. The threads will
spend all their time contending for the same resources, even if it’s
just false sharing.</p>

<p>The solution is to <em>pull</em> grains from neighbors:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each i in sandpiles {
    if input[i] &lt; 4 {
        output[i] = input[i]
    } else {
        output[i] = input[i] - 4
    }
    for each j in neighbors {
        if input[j] &gt;= 4 {
            output[i] = output[i] + 1
        }
    }
}
</code></pre></div></div>
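
<p>Here are both approaches side by side as a plain C sketch, which
makes the difference in write patterns easier to see. The function names
and the padded <code class="language-plaintext highlighter-rouge">W</code>x<code class="language-plaintext highlighter-rouge">W</code> layout are assumptions for this example, and
the border of the input is expected to be all zeros.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;string.h&gt;

#define N 4        /* grid size, chosen arbitrarily */
#define W (N + 2)  /* plus a one-cell border standing in for oblivion */

/* Push: every toppling cell writes into its neighbors. Two toppling
 * cells that share a neighbor both write to it, so spreading this loop
 * across threads would require locks or atomics. */
static void
step_push(int in[W][W], int out[W][W])
{
    memcpy(out, in, sizeof(int[W][W]));
    for (int y = 1; y &lt;= N; y++) {
        for (int x = 1; x &lt;= N; x++) {
            if (in[y][x] &gt;= 4) {
                out[y][x] -= 4;
                out[y-1][x]++;
                out[y+1][x]++;
                out[y][x-1]++;
                out[y][x+1]++;
            }
        }
    }
    for (int i = 0; i &lt; W; i++) {  /* discard grains that fell off */
        out[0][i] = out[W-1][i] = out[i][0] = out[i][W-1] = 0;
    }
}

/* Pull: every cell reads its neighbors but writes only to itself, so
 * each iteration is independent of all the others. */
static void
step_pull(int in[W][W], int out[W][W])
{
    memset(out, 0, sizeof(int[W][W]));
    for (int y = 1; y &lt;= N; y++) {
        for (int x = 1; x &lt;= N; x++) {
            int v = in[y][x];
            if (v &gt;= 4) {
                v -= 4;
            }
            v += in[y-1][x] &gt;= 4;
            v += in[y+1][x] &gt;= 4;
            v += in[y][x-1] &gt;= 4;
            v += in[y][x+1] &gt;= 4;
            out[y][x] = v;
        }
    }
}
</code></pre></div></div>

<p>Given the same input, the two functions produce the same output. Only
the pull version can have its loops handed to separate threads without
further thought.</p>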

<p>Each thread only modifies one cell — the cell it’s in charge of updating
— so no synchronization is necessary. It’s shader-friendly and should
sound familiar if you’ve seen <a href="/blog/2014/06/10/">my WebGL implementation of Conway’s Game
of Life</a>. It’s essentially the same algorithm. If you chase down
the various Abelian sandpile references online, you’ll eventually come
across a 2017 paper by Cameron Fish about <a href="http://people.reed.edu/~davidp/homepage/students/fish.pdf">running sandpile simulations
on GPUs</a>. He cites my WebGL Game of Life article, bringing
everything full circle. We had spoken by email at the time, and he
<a href="https://people.reed.edu/~davidp/web_sandpiles/">shared his <strong>interactive simulation</strong> with me</a>.</p>

<p>Vectorizing this algorithm is straightforward: Load multiple piles at
once, one per SIMD channel, and use masks to implement the branches. In
my code I’ve also unrolled the loop. To avoid bounds checking in the
SIMD code, I pad the state data structure with zeros so that the edge
cells have static neighbors and are no longer special.</p>
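
<p>The mask trick can be shown without any intrinsics. Below is a scalar
sketch of one row of the pull step with the branch replaced by a mask. A
real SIMD version would do the same thing on a whole register of piles
at once. The function name and the row layout are assumptions for
illustration, not taken from <code class="language-plaintext highlighter-rouge">sandpiles.c</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Update piles 1..n of the row "mid" using the rows above and below.
 * Index 0 and index n+1 of each row are zero padding. There are no
 * branches in the loop body: the comparison result (0 or 1) is
 * stretched into a mask of all zeros or all ones, and the mask selects
 * whether 4 is subtracted. */
static void
row_step(const uint8_t *up, const uint8_t *mid, const uint8_t *down,
         uint8_t *out, int n)
{
    for (int x = 1; x &lt;= n; x++) {
        uint8_t mask = (uint8_t)-(mid[x] &gt;= 4);  /* 0xff if toppling */
        int v = mid[x] - (4 &amp; mask);
        v += up[x]    &gt;= 4;
        v += down[x]  &gt;= 4;
        v += mid[x-1] &gt;= 4;
        v += mid[x+1] &gt;= 4;
        out[x] = (uint8_t)v;
    }
}
</code></pre></div></div>

<p>Thanks to the padding, the first and last piles in the row go through
exactly the same code as the rest.</p>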

<h3 id="webgl-fire">WebGL Fire</h3>

<p>Back in the old days, one of the <a href="http://fabiensanglard.net/doom_fire_psx/">cool graphics tricks was fire
animations</a>. It was so easy to implement on limited hardware. In
fact, the most obvious way to compute it was directly in the
framebuffer, such as in <a href="/blog/2014/12/09/">the VGA buffer</a>, with no outside state.</p>

<p>There’s a heat source at the bottom of the screen, and the algorithm
runs from bottom up, propagating that heat upwards randomly. Here’s the
algorithm using traditional screen coordinates (top-left corner origin):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func rand(min, max) // random integer in [min, max]

for each x, y from bottom {
    buf[y-1][x+rand(-1, 1)] = buf[y][x] - rand(0, 1)
}
</code></pre></div></div>

<p>As a <em>push</em> algorithm it works fine with a single thread, but
it doesn’t translate well to modern video hardware. So convert it to a
<em>pull</em> algorithm!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each x, y {
    sx = x + rand(-1, 1)
    sy = y + rand(1, 2)
    output[y][x] = input[sy][sx] - rand(0, 1)
}
</code></pre></div></div>

<p>Cells pull the fire upward from the bottom. Though this time there’s a
catch: <em>This algorithm will have subtly different results.</em></p>

<ul>
  <li>
    <p>In the original, there’s a single state buffer and so a flame could
propagate upwards multiple times in a single pass. I’ve compensated
here by allowing a flame to propagate further at once.</p>
  </li>
  <li>
    <p>In the original, a flame only propagates to one other cell. In this
version, two cells might pull from the same flame, cloning it.</p>
  </li>
</ul>

<p>In the end it’s hard to tell the difference, so this works out.</p>

<p><a href="https://nullprogram.com/webgl-fire/"><img src="/img/fire-thumb.png" alt="" /></a></p>

<p><a href="https://github.com/skeeto/webgl-fire/">source code and instructions</a></p>

<p>There’s still potentially contention in that <code class="language-plaintext highlighter-rouge">rand()</code> function, but this
can be resolved <a href="https://www.shadertoy.com/view/WttXWX">with a hash function</a> that takes <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> as
inputs.</p>
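
<p>As a rough sketch of that idea in C rather than GLSL, the random
numbers below come from hashing the cell position and a frame number, so
no state is shared between cells. The hash is a generic multiply-xorshift
integer hash, not necessarily the one in the linked shader. The buffer
size, <code class="language-plaintext highlighter-rouge">cell_rand()</code>, the <code class="language-plaintext highlighter-rouge">salt</code> parameter, and the clamping at the edges
are all assumptions made for this example.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

#define FW 64  /* buffer width and height, arbitrary */
#define FH 32

/* A multiply-xorshift integer hash. Any decent 32-bit hash would do. */
static uint32_t
hash32(uint32_t x)
{
    x ^= x &gt;&gt; 16; x *= 0x7feb352dU;
    x ^= x &gt;&gt; 15; x *= 0x846ca68bU;
    x ^= x &gt;&gt; 16;
    return x;
}

/* Random integer in [min, max] that depends only on its arguments.
 * The salt separates the three draws made for each cell. */
static int
cell_rand(int x, int y, uint32_t frame, uint32_t salt, int min, int max)
{
    uint32_t h = hash32((uint32_t)x ^ 0x9e3779b9U);
    h = hash32(h ^ (uint32_t)y);
    h = hash32(h ^ frame);
    h = hash32(h ^ salt);
    return min + (int)(h % (uint32_t)(max - min + 1));
}

/* One pull step. Row FH-1 is the heat source and is copied unchanged. */
static void
fire_step(uint8_t in[FH][FW], uint8_t out[FH][FW], uint32_t frame)
{
    for (int y = 0; y &lt; FH - 1; y++) {
        for (int x = 0; x &lt; FW; x++) {
            int sx = x + cell_rand(x, y, frame, 0, -1, 1);
            int sy = y + cell_rand(x, y, frame, 1, 1, 2);
            if (sx &lt; 0) sx = 0;
            if (sx &gt; FW - 1) sx = FW - 1;
            if (sy &gt; FH - 1) sy = FH - 1;
            int v = in[sy][sx] - cell_rand(x, y, frame, 2, 0, 1);
            out[y][x] = (uint8_t)(v &lt; 0 ? 0 : v);
        }
    }
    for (int x = 0; x &lt; FW; x++) {
        out[FH-1][x] = in[FH-1][x];
    }
}
</code></pre></div></div>

<p>Since every output cell is a pure function of the input buffer, its
own coordinates, and the frame number, running the same step twice
gives identical results, and the cells can be computed in any order.</p>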

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Purgeable Memory Allocations for Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/29/"/>
    <id>urn:uuid:50300bbe-0939-4bcf-96ff-8fb96a9b12d5</id>
    <updated>2019-12-29T00:25:49Z</updated>
    <category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I saw (part of) a video, <a href="https://www.youtube.com/watch?v=9l0nWEUpg7s">OS hacking: Purgeable memory</a>, by
Andreas Kling who’s writing an operating system called <a href="https://github.com/SerenityOS/serenity">Serenity</a>
and recording videos of his progress. In the video he implements
<em>purgeable memory</em> as <a href="https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/CachingandPurgeableMemory.html">found on some Apple platforms</a> by adding
special support in the kernel. A process tells the kernel that a
particular range of memory isn’t important, and so the kernel can
reclaim it if the system is under memory pressure: the memory is
purgeable.</p>

<p>Linux has a mechanism like this, <a href="http://man7.org/linux/man-pages/man2/madvise.2.html"><code class="language-plaintext highlighter-rouge">madvise(2)</code></a>, that allows
processes to provide hints to the kernel on how memory is expected to be
used. The flag of interest is <code class="language-plaintext highlighter-rouge">MADV_FREE</code>:</p>

<blockquote>
  <p>The application no longer requires the pages in the range specified by
<code class="language-plaintext highlighter-rouge">addr</code> and <code class="language-plaintext highlighter-rouge">len</code>. The kernel can thus free these pages, but the
freeing could be delayed until memory pressure occurs. For each of the
pages that has been marked to be freed but has not yet been freed, the
free operation will be canceled if the caller writes into the page.</p>
</blockquote>

<p>So, given this, I built a proof of concept / toy on top of <code class="language-plaintext highlighter-rouge">MADV_FREE</code>
that provides this functionality for Linux:</p>

<p><strong><a href="https://github.com/skeeto/purgeable">https://github.com/skeeto/purgeable</a></strong></p>

<p>It <a href="/blog/2018/11/15/">allocates anonymous pages</a> using <code class="language-plaintext highlighter-rouge">mmap(2)</code>. When the allocation
is “unlocked” — i.e. the process isn’t actively using it — its pages are
marked with <code class="language-plaintext highlighter-rouge">MADV_FREE</code> so that the kernel can reclaim them at any time.
To lock the allocation so that the process can safely make use of them,
the <code class="language-plaintext highlighter-rouge">MADV_FREE</code> is canceled. This is all a little trickier than it sounds,
and that’s the subject of this article.</p>

<p>Note: There’s also <code class="language-plaintext highlighter-rouge">MADV_DONTNEED</code> which seems like it would fit the
bill, but <a href="https://www.youtube.com/watch?v=bg6-LVCHmGM#t=58m23s">it’s implemented incorrectly in Linux</a>. It <em>immediately</em>
frees the pages, and so it’s useless for implementing purgeable memory.</p>
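
<p>Stripped of everything else, the two system calls involved look roughly
like this. The helper names are hypothetical and this isn’t the
library’s code. It assumes Linux 4.5 or later, where <code class="language-plaintext highlighter-rouge">MADV_FREE</code> was
introduced, and a libc that exposes <code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> by default.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;sys/mman.h&gt;

/* Map len bytes of zeroed anonymous memory, or return a null pointer. */
static void *
anon_alloc(size_t len)
{
    void *p = mmap(0, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? 0 : p;
}

/* Tell the kernel it may reclaim these pages whenever it likes. A later
 * write to a page cancels the hint for that page. Returns 0 on success,
 * or -1 if the hint was rejected or MADV_FREE isn't available. */
static int
mark_purgeable(void *p, size_t len)
{
#ifdef MADV_FREE
    return madvise(p, len, MADV_FREE);
#else
    (void)p;
    (void)len;
    return -1;
#endif
}
</code></pre></div></div>

<p>The mapping would eventually be released with <code class="language-plaintext highlighter-rouge">munmap(2)</code>. Everything
interesting happens between marking the pages and touching them again.</p>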

<h3 id="purgeable-api">Purgeable API</h3>

<p>Before diving into the implementation, here’s the API. It’s <a href="/blog/2018/06/10/">just four
functions</a> with no structure definitions. The pointer used by the
API is the memory allocation itself. All the bookkeeping <a href="/blog/2017/01/08/">associated
with that pointer</a> is hidden away, out of sight from the API’s
consumer. The full documentation is in <a href="https://github.com/skeeto/purgeable/blob/master/purgeable.h"><code class="language-plaintext highlighter-rouge">purgeable.h</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">purgeable_alloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">purgeable_unlock</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">purgeable_lock</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">purgeable_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The semantics are much like a C++ <code class="language-plaintext highlighter-rouge">weak_ptr</code> in that locking both
validates that the allocation is still available and creates a “strong”
reference to it that prevents it from being purged. Though unlike a weak
reference, the allocation is stickier. It will remain until the system is
actually under pressure, not just when the garbage collector happens to
run or the last strong reference is gone.</p>

<p>Here’s how it might be used to, say, store decoded PNG data that can
be decompressed again if needed:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">texture</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">png</span> <span class="o">*</span><span class="n">png</span> <span class="o">=</span> <span class="n">png_load</span><span class="p">(</span><span class="s">"texture.png"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">png</span><span class="p">)</span> <span class="n">die</span><span class="p">();</span>

<span class="cm">/* ... */</span>

<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">texture</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">texture</span> <span class="o">=</span> <span class="n">purgeable_alloc</span><span class="p">(</span><span class="n">png</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">*</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">*</span> <span class="mi">4</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">texture</span><span class="p">)</span> <span class="n">die</span><span class="p">();</span>
        <span class="n">png_decode_rgba</span><span class="p">(</span><span class="n">png</span><span class="p">,</span> <span class="n">texture</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">purgeable_lock</span><span class="p">(</span><span class="n">texture</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">purgeable_free</span><span class="p">(</span><span class="n">texture</span><span class="p">);</span>
        <span class="n">texture</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">continue</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">glTexImage2D</span><span class="p">(</span>
        <span class="n">GL_TEXTURE_2D</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">GL_RGBA</span><span class="p">,</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">width</span><span class="p">,</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">height</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">GL_RGBA</span><span class="p">,</span> <span class="n">GL_UNSIGNED_BYTE</span><span class="p">,</span> <span class="n">texture</span>
    <span class="p">);</span>
    <span class="n">purgeable_unlock</span><span class="p">(</span><span class="n">texture</span><span class="p">);</span>
    <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Memory is allocated in a locked state since it’s very likely to be
immediately filled with data. The application should unlock it before
moving on with other tasks. The purgeable memory must always be freed
using <code class="language-plaintext highlighter-rouge">purgeable_free()</code>, even if <code class="language-plaintext highlighter-rouge">purgeable_lock()</code> failed. This not only
frees the bookkeeping, but also releases the now-zero pages and the
mapping itself. Originally I had <code class="language-plaintext highlighter-rouge">purgeable_lock()</code> free the purgeable
memory on failure, but I felt this was clearer. There’s no technical
reason it couldn’t, though.</p>

<h3 id="purgeable-implementation">Purgeable Implementation</h3>

<p>The main challenge is that the kernel doesn’t necessarily treat the
<code class="language-plaintext highlighter-rouge">MADV_FREE</code> range contiguously. It might reclaim just some pages, and do
so in an arbitrary order. In order to lock the region, each page must be
handled individually. Per the man page quoted above, reversing
<code class="language-plaintext highlighter-rouge">MADV_FREE</code> requires a write to each page — to either trigger a page
fault or set <a href="https://en.wikipedia.org/wiki/Dirty_bit">a dirty bit</a>.</p>

<p>The only way to tell if a page has been purged is to check if it’s been
filled with zeros. That’s easy if we’re sure a particular byte in the
page should be zero, but, since this is a library, the caller might just
store <em>anything</em> on these pages.</p>

<p>So here’s my solution: To unlock a page, look at the first byte on the
page. Remember whether or not it’s zero. If it’s zero, write a 1 into
that byte. Once this has been done for all pages, use <code class="language-plaintext highlighter-rouge">madvise(2)</code> to
mark them all <code class="language-plaintext highlighter-rouge">MADV_FREE</code>.</p>

<p>With this approach, the library only needs to track one bit of information
per page regardless of the page’s contents. Assuming 4kB pages, each 32kB
of allocation has 1 byte of overhead (amortized) — or ~0.003% overhead.
Not too bad!</p>

<p>Locking purgeable memory is a little trickier. Again, each page must be
visited in turn, and if any page was purged, then the whole allocation is
considered lost. If the first byte was non-zero when unlocking, the
library checks that it’s still non-zero. If the first byte was zero when
unlocking, then it prepares to write a zero back into that byte, which
must currently be non-zero.</p>

<p>In either case, the <code class="language-plaintext highlighter-rouge">MADV_FREE</code> needs to be canceled using a write, so
the library <a href="/blog/2014/09/02/">does an atomic compare-and-swap</a> (CAS) to write the
correct byte into the page, <em>even if it’s the same value</em> in the
non-zero case. The atomic CAS is essential because <strong>it ensures the page
wasn’t purged between the check and the write, as both are done
together, atomically</strong>. If every page has the expected first byte, and
every CAS succeeded, then the purgeable memory has been successfully
locked.</p>
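
<p>The unlock and lock passes described above can be sketched as follows.
This is an illustration of the scheme, not the library’s source: the
function names and parameters are hypothetical, the <code class="language-plaintext highlighter-rouge">madvise(2)</code> call is
left out so the logic can be exercised on any buffer, and the CAS uses
the GCC/Clang <code class="language-plaintext highlighter-rouge">__atomic_compare_exchange_n</code> builtin as one possible way to
do it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/* Prepare pages for MADV_FREE. For each page, record one bit saying
 * whether its first byte was zero, and if so plant a 1 there so that a
 * purged (zero-filled) page can be told apart later. */
static void
prepare_unlock(unsigned char *mem, size_t npages, size_t pagesize,
               unsigned char *bits)
{
    for (size_t i = 0; i &lt; npages; i++) {
        unsigned char *first = mem + i*pagesize;
        unsigned char bit = (unsigned char)(1u &lt;&lt; (i % 8));
        if (*first) {
            bits[i/8] &amp;= (unsigned char)~bit;
        } else {
            bits[i/8] |= bit;
            *first = 1;
        }
    }
    /* A real version would now madvise() the range with MADV_FREE. */
}

/* Try to take the pages back. Returns 1 if every page survived, or 0
 * if any page was found purged, in which case the contents are lost. */
static int
try_lock(unsigned char *mem, size_t npages, size_t pagesize,
         const unsigned char *bits)
{
    for (size_t i = 0; i &lt; npages; i++) {
        unsigned char *first = mem + i*pagesize;
        int was_zero = bits[i/8] &gt;&gt; (i % 8) &amp; 1;
        unsigned char expect  = was_zero ? 1 : *first;
        unsigned char desired = was_zero ? 0 : expect;
        if (!expect) {
            return 0;  /* a non-zero byte has become zero: purged */
        }
        /* Check and write in a single atomic step. */
        if (!__atomic_compare_exchange_n(first, &amp;expect, desired, 0,
                                         __ATOMIC_SEQ_CST,
                                         __ATOMIC_SEQ_CST)) {
            return 0;  /* purged between the read and the write */
        }
    }
    return 1;
}
</code></pre></div></div>

<p>In practice <code class="language-plaintext highlighter-rouge">pagesize</code> would come from <code class="language-plaintext highlighter-rouge">sysconf(_SC_PAGESIZE)</code>, and <code class="language-plaintext highlighter-rouge">bits</code>
needs room for one bit per page. A purge can be simulated in a test by
zeroing a page between the two calls.</p>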

<p>As an optimization, the library could consider more than just the first
byte, and look at, say, the first <code class="language-plaintext highlighter-rouge">long int</code> on each page. The library
does less work when the page contains a non-zero value, and the chance of
an arbitrary 8-byte value being zero is much lower. However, I wanted to
avoid <a href="/blog/2018/07/20/#strict-aliasing">potential aliasing issues</a>, especially if this library were
to be embedded, so I passed on the idea.</p>

<h4 id="bookkeeping">Bookkeeping</h4>

<p>The bookkeeping data is stored just before the buffer returned as the
purgeable memory, and it’s never marked with <code class="language-plaintext highlighter-rouge">MADV_FREE</code>. Assuming 4kB
pages, for each 128MB of purgeable memory the library allocates one extra
anonymous page to track it. The number of pages in the allocation is
stored just before the purgeable memory as a <code class="language-plaintext highlighter-rouge">size_t</code>, and the rest is the
per-page bit table described above.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">purgeable_alloc</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">);</span>
<span class="kt">size_t</span> <span class="n">numpages</span> <span class="o">=</span> <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
</code></pre></div></div>

<p>So the library can immediately find it starting from the purgeable memory
address. Here’s an illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      ,--- p
      |
      v
----------------------------------------------
|...Z|    |    |    |    |    |    |    |    |
----------------------------------------------
 ^  ^
 |  |
 |  `--- size_t numpages
 |
 `--- bit table
</code></pre></div></div>
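
<p>The size of that bookkeeping region follows from a little arithmetic,
sketched here. The function is hypothetical and only mirrors the layout
as described (a <code class="language-plaintext highlighter-rouge">size_t</code> plus one bit per page, rounded up to whole
pages). The library’s actual code may compute it differently.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/* Number of extra pages needed in front of an allocation of npages
 * pages: room for the size_t page count plus one bit per page. */
static size_t
bookkeeping_pages(size_t npages, size_t pagesize)
{
    size_t table = (npages + 7) / 8;           /* bit table, in bytes */
    size_t total = sizeof(size_t) + table;
    return (total + pagesize - 1) / pagesize;  /* round up to pages */
}
</code></pre></div></div>

<p>For the 16kB allocation in the snippet above (four 4kB pages) this
gives a single bookkeeping page, and under these assumptions it stays at
one page until the allocation gets close to 128MB.</p>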

<p>The downside is that buffer underflows in the application would easily
trample the <code class="language-plaintext highlighter-rouge">numpages</code> value because it’s located immediately adjacent. It
would be safer to move it to the <em>beginning</em> of the first page before the
purgeable memory, but this would have made bit table access more
complicated. While the region is locked, the contents of the bit table
don’t matter, so it won’t be damaged by an underflow. Another idea: put a
checksum alongside <code class="language-plaintext highlighter-rouge">numpages</code>. It could just be a simple <a href="/blog/2018/07/31/">integer
hash</a>.</p>

<p>This makes for a really slick API since the consumer doesn’t need to track
anything more than a single pointer, the address of the purgeable memory
allocation itself.</p>

<h3 id="worth-using">Worth using?</h3>

<p>I’m not quite sure how often I’d actually use purgeable memory in real
programs, especially in software intended to be portable. Each operating
system needs its own implementation, and this library is not portable
since it relies on interfaces and behaviors specific to Linux.</p>

<p>It also has a not-so-unlikely pathological case: Imagine a program that
makes two purgeable memory allocations, and they’re large enough that one
always evicts the other. The program would thrash back and forth
fighting itself as it used each allocation. Detecting this situation
might be difficult, especially as the number of purgeable memory
allocations increases.</p>

<p>Regardless, it’s another tool for my software toolbelt.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Chunking Optimizations: Let the Knife Do the Work</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/09/"/>
    <id>urn:uuid:961086fa-46af-42d4-bd69-6f4a326a1505</id>
    <updated>2019-12-09T22:37:55Z</updated>
    <category term="c"/><category term="cpp"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>There’s an old saying, <a href="https://www.youtube.com/watch?v=bTee6dKpDB0"><em>let the knife do the work</em></a>. Whether
preparing food in the kitchen or whittling a piece of wood, don’t push
your weight into the knife. Not only is it tiring, you’re much more
likely to hurt yourself. Use the tool properly and little force will be
required.</p>

<p>The same advice also often applies to compilers.</p>

<p>Suppose you need to XOR two non-overlapping 64-byte (512-bit) blocks of
data. The simplest approach would be to do it a byte at a time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* XOR src into dst */</span>
<span class="kt">void</span>
<span class="nf">xor512a</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">64</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Maybe you benchmark it or you look at the assembly output, and the
results are disappointing. Your compiler did <em>exactly</em> what you asked
of it and produced code that performs 64 single-byte XOR operations
(GCC 9.2.0, x86-64, <code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512a:</span>
        <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nl">.L0:</span>    <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="nb">rax</span><span class="p">]</span>
        <span class="nf">xor</span>    <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="nb">rax</span><span class="p">],</span> <span class="nb">cl</span>
        <span class="nf">inc</span>    <span class="nb">rax</span>
        <span class="nf">cmp</span>    <span class="nb">rax</span><span class="p">,</span> <span class="mi">64</span>
        <span class="nf">jne</span>    <span class="nv">.L0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The target architecture has wide registers so it could be doing <em>at
least</em> 8 bytes at a time. Since your compiler isn’t doing it, you
decide to chunk the work into 8-byte blocks yourself in an attempt to
manually implement a <em>chunking operation</em>. Here’s some <a href="https://old.reddit.com/r/C_Programming/comments/e83jzk/strange_gcc_compiler_bug_when_using_o2_or_higher/">real world
code</a> that does so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* WARNING: Broken, do not use! */</span>
<span class="kt">void</span>
<span class="nf">xor512b</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You check the assembly output of this function, and it looks much
better. It’s now processing 8 bytes at a time, so it should be about 8
times faster than before.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512b:</span>
        <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nl">.L0:</span>    <span class="nf">mov</span>    <span class="nb">rcx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="nb">rax</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
        <span class="nf">xor</span>    <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="nb">rax</span><span class="o">*</span><span class="mi">8</span><span class="p">],</span> <span class="nb">rcx</span>
        <span class="nf">inc</span>    <span class="nb">rax</span>
        <span class="nf">cmp</span>    <span class="nb">rax</span><span class="p">,</span> <span class="mi">8</span>
        <span class="nf">jne</span>    <span class="nv">.L0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Still, this machine has 16-byte wide registers (SSE2 <code class="language-plaintext highlighter-rouge">xmm</code>), so there
could be another doubling in speed. Oh well, this is good enough, so you
plug it into your program. But something strange happens: <strong>The output
is now wrong!</strong></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">dst</span><span class="p">[</span><span class="mi">32</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span>
        <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">16</span>
    <span class="p">};</span>
    <span class="kt">uint32_t</span> <span class="n">src</span><span class="p">[</span><span class="mi">32</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">25</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">49</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span>
        <span class="mi">81</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="mi">144</span><span class="p">,</span> <span class="mi">169</span><span class="p">,</span> <span class="mi">196</span><span class="p">,</span> <span class="mi">225</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span>
    <span class="p">};</span>
    <span class="n">xor512b</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="n">src</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">dst</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Your program prints 1..16 as if <code class="language-plaintext highlighter-rouge">xor512b()</code> was never called. You check
over everything a dozen times, and you can’t find anything wrong. Even
crazier, if you disable optimizations then the bug goes away. It must be
some kind of compiler bug!</p>

<p>Investigating a bit more, you learn that the <code class="language-plaintext highlighter-rouge">-fno-strict-aliasing</code>
option also fixes the bug. That’s because this program violates C strict
aliasing rules. An array of <code class="language-plaintext highlighter-rouge">uint32_t</code> was accessed as a <code class="language-plaintext highlighter-rouge">uint64_t</code>. As
an <a href="/blog/2018/07/20/#strict-aliasing">important optimization</a>, compilers are allowed to assume such
variables do not alias and generate code accordingly. Otherwise every
memory store could potentially modify any variable, which limits the
compiler’s ability to produce decent code.</p>

<p>The original version is fine because <code class="language-plaintext highlighter-rouge">char *</code>, including both <code class="language-plaintext highlighter-rouge">signed</code>
and <code class="language-plaintext highlighter-rouge">unsigned</code>, has a special exemption and may alias with anything. For
the same reason, using <code class="language-plaintext highlighter-rouge">char *</code> unnecessarily can also make your
programs slower.</p>
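
<p>As a rough sketch of what that exemption costs, consider a function
that reads an <code class="language-plaintext highlighter-rouge">int</code>, stores through a <code class="language-plaintext highlighter-rouge">char *</code>, then reads the
<code class="language-plaintext highlighter-rouge">int</code> again. The name <code class="language-plaintext highlighter-rouge">read_store_read</code> is made up for this
illustration. Because the store may legally alias the <code class="language-plaintext highlighter-rouge">int</code>, the
compiler has to perform the second load rather than reuse the first.</p>

```c
/* Hypothetical helper, not from the original program: reads *x,
 * stores one byte through p, then reads *x again. A char pointer may
 * alias any object, so the second read cannot be assumed to equal the
 * first and must be reloaded from memory. */
int
read_store_read(int *x, char *p)
{
    int before = *x;
    *p = 0;             /* may overwrite one byte of *x */
    int after = *x;     /* reload is required */
    return before != after;
}
```

<p>Had <code class="language-plaintext highlighter-rouge">p</code> been, say, a <code class="language-plaintext highlighter-rouge">short *</code>, the compiler would be free to
assume the store leaves <code class="language-plaintext highlighter-rouge">*x</code> untouched and skip the reload.</p>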

<p>What could you do to keep the chunking operation while not running afoul
of strict aliasing? Counter-intuitively, you could use <code class="language-plaintext highlighter-rouge">memcpy()</code>. Copy
the chunks into legitimate, local <code class="language-plaintext highlighter-rouge">uint64_t</code> variables, do the work, and
copy the result back out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xor512c</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">uint64_t</span> <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">dst</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">src</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">memcpy</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">dst</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">memcpy()</code> is a built-in function, your compiler knows its
semantics and can ultimately elide all that copying. The assembly
listing for <code class="language-plaintext highlighter-rouge">xor512c</code> is identical to <code class="language-plaintext highlighter-rouge">xor512b</code>, but it won’t go haywire
when integrated into a real program.</p>

<p>It works and it’s correct, but you can still do much better than this!</p>

<h3 id="letting-your-compiler-do-the-work">Letting your compiler do the work</h3>

<p>The problem is you’re forcing the knife and not letting it do the work.
There’s a constraint on your compiler that hasn’t been considered: It
must work correctly for overlapping inputs.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">65</span><span class="p">]</span> <span class="o">=</span> <span class="p">{...};</span>
<span class="n">xor512a</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>In this situation, the byte-by-byte and chunked versions of the function
will have different results. That’s exactly why your compiler can’t do
the chunking operation itself. However, <em>you don’t care about this
situation</em> because the inputs never overlap.</p>
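
<p>To see the difference concretely, here is a minimal sketch, assuming a
destination that starts one byte after the source, so the offset is
smaller than the chunk size. The helper names are invented for this
example. The byte loop reads results it stored on earlier iterations,
while the chunked loop loads eight source bytes before storing any of
them, so the two disagree.</p>

```c
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time XOR, the same loop as xor512a(). */
static void
xor_bytes(void *dst, const void *src)
{
    unsigned char *pd = dst;
    const unsigned char *ps = src;
    for (int i = 0; i < 64; i++) {
        pd[i] ^= ps[i];
    }
}

/* 8-byte chunks through memcpy(), the same idea as xor512c(). */
static void
xor_chunks(void *dst, const void *src)
{
    for (int i = 0; i < 8; i++) {
        uint64_t d, s;
        memcpy(&d, (char *)dst + i*8, 8);
        memcpy(&s, (const char *)src + i*8, 8);
        d ^= s;
        memcpy((char *)dst + i*8, &d, 8);
    }
}

/* Returns 1 if the two strategies produce different buffers when the
 * destination overlaps the source at an offset of one byte. */
int
overlap_differs(void)
{
    unsigned char a[65], b[65];
    for (int i = 0; i < 65; i++) {
        a[i] = b[i] = (unsigned char)(i + 1);
    }
    xor_bytes(a + 1, a);
    xor_chunks(b + 1, b);
    return memcmp(a, b, sizeof(a)) != 0;
}
```

<p>With inputs that don’t overlap, both helpers produce identical
output, which is the only case the caller cares about.</p>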

<p>Let’s revisit the first, simple implementation, but this time being
smarter about it. The <code class="language-plaintext highlighter-rouge">restrict</code> keyword indicates that the inputs
will not overlap, freeing your compiler of this unwanted concern.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xor512d</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">64</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(Side note: Adding <code class="language-plaintext highlighter-rouge">restrict</code> to the manually chunked function,
<code class="language-plaintext highlighter-rouge">xor512b()</code>, will not fix it. Using <code class="language-plaintext highlighter-rouge">restrict</code> can never make an
incorrect program correct.)</p>

<p>Compiled with GCC 9.2.0 and <code class="language-plaintext highlighter-rouge">-O3</code>, the resulting unrolled code
processes 16-byte chunks at a time (<code class="language-plaintext highlighter-rouge">pxor</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512d:</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm2</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm3</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm4</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">]</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm2</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm3</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm4</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Compiled with Clang 9.0.0 with AVX-512 enabled in the target
(<code class="language-plaintext highlighter-rouge">-mavx512bw</code>), <em>it does the entire operation in a single, big chunk!</em></p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512d:</span>
        <span class="nf">vmovdqu64</span>   <span class="nv">zmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">vpxorq</span>      <span class="nv">zmm0</span><span class="p">,</span> <span class="nv">zmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
        <span class="nf">vmovdqu64</span>   <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">zmm0</span>
        <span class="nf">vzeroupper</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>“Letting the knife do the work” means writing a correct program and
lifting unnecessary constraints so that the compiler can use whatever
chunk size is appropriate for the target.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>On-the-fly Linear Congruential Generator Using Emacs Calc</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/11/19/"/>
    <id>urn:uuid:13e56720-ef3a-4fa4-a4ff-0a6fef914504</id>
    <updated>2019-11-19T01:17:50Z</updated>
    <category term="emacs"/><category term="crypto"/><category term="optimization"/><category term="c"/><category term="java"/><category term="javascript"/>
    <content type="html">
      <![CDATA[<p>I regularly make throwaway “projects” and do a surprising amount of
programming in <code class="language-plaintext highlighter-rouge">/tmp</code>. For Emacs Lisp, the equivalent is the
<code class="language-plaintext highlighter-rouge">*scratch*</code> buffer. These are places where I can make a mess, and the
mess usually gets cleaned up before it becomes a problem. A lot of my
established projects (<a href="/blog/2019/03/22/">ex</a>.) start out in volatile storage and
only graduate to more permanent storage once the concept has proven
itself.</p>

<p>Throughout my whole career, this sort of throwaway experimentation has
been an important part of my personal growth, and I try to <a href="/blog/2016/09/02/">encourage it
in others</a>. Even if the idea I’m trying doesn’t pan out, I usually
learn something new, and occasionally it translates into an article here.</p>

<p>I also enjoy small programming challenges. One of the most abused
tools in my mental toolbox is the Monte Carlo method, and I readily
apply it to solve toy problems. Even beyond this, random number
generators are frequently a useful tool (<a href="/blog/2017/04/27/">1</a>, <a href="/blog/2019/07/22/">2</a>), so I
find myself reaching for one all the time.</p>

<p>Nearly every programming language comes with a pseudo-random number
generation function or library. Unfortunately the language’s standard
PRNG is usually a poor choice (C, <a href="https://arvid.io/2018/06/30/on-cxx-random-number-generator-quality/">C++</a>, <a href="https://lowleveldesign.org/2018/08/15/randomness-in-net/">C#</a>, <a href="https://grokbase.com/t/gg/golang-nuts/155f6kbb7a/go-nuts-why-are-high-bits-used-by-math-rand-helpers-instead-of-low-ones">Go</a>).
It’s probably mediocre quality, <a href="/blog/2018/05/27/">slower than it needs to be</a>
(<a href="https://grokbase.com/t/gg/golang-nuts/155f6kbb7a/go-nuts-why-are-high-bits-used-by-math-rand-helpers-instead-of-low-ones">also</a>), <a href="https://lists.freebsd.org/pipermail/svn-src-head/2013-July/049068.html">lacks reliable semantics or behavior between
implementations</a>, or is missing some other property I want. So I’ve
long been a fan of <em>BYOPRNG:</em> Bring Your Own Pseudo-random Number
Generator. Just embed a generator with the desired properties directly
into the program. The <a href="/blog/2017/09/21/">best non-cryptographic PRNGs today</a> are
tiny and exceptionally friendly to embedding. Though, depending on what
you’re doing, you might <a href="/blog/2019/04/30/">need to be creative about seeding</a>.</p>

<h3 id="crafting-a-prng">Crafting a PRNG</h3>

<p>On occasion I don’t have an established, embeddable PRNG in reach, and
I have yet to commit xoshiro256** to memory. Or maybe I want to use
a totally unique PRNG for a particular project. In these cases I make
one up. With just a bit of know-how it’s not too difficult.</p>

<p>Probably the easiest decent PRNG to code from scratch is the venerable
<a href="https://en.wikipedia.org/wiki/Linear_congruential_generator">Linear Congruential Generator</a> (LCG). It’s a simple recurrence
relation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x[1] = (x[0] * A + C) % M
</code></pre></div></div>

<p>That’s trivial to remember once you know the details. You only need to
choose appropriate values for <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">C</code>, and <code class="language-plaintext highlighter-rouge">M</code>. Done correctly, it
will be a <em>full-period</em> generator: one that visits every number
between 0 and <code class="language-plaintext highlighter-rouge">M - 1</code> exactly once, in some permuted order, before
repeating. The seed, the value of <code class="language-plaintext highlighter-rouge">x[0]</code>, chooses a starting position
in this (looping) permutation.</p>

<p><code class="language-plaintext highlighter-rouge">M</code> has a natural, obvious choice: a power of two matching the range of
operands, such as 2^32 or 2^64. With this the modulo operation is free
as a natural side effect of the computer architecture.</p>
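
<p>A small sketch of that “free” modulo, using 32-bit state. The
constants and function names here are placeholders chosen for
illustration, not vetted parameters. Unsigned overflow in C wraps, so
the first function gets the reduction modulo 2^32 without asking for it;
the second spells the same reduction out with wider arithmetic.</p>

```c
#include <stdint.h>

/* Illustrative constants only (assumptions, not tested for quality). */
#define LCG32_A UINT32_C(0x915f77f5)
#define LCG32_C UINT32_C(1)

/* Relies on unsigned wraparound: the result is implicitly mod 2^32. */
uint32_t
lcg32_implicit(uint32_t x)
{
    return x*LCG32_A + LCG32_C;
}

/* The same step with the reduction modulo 2^32 written explicitly. */
uint32_t
lcg32_explicit(uint32_t x)
{
    uint64_t wide = (uint64_t)x*LCG32_A + LCG32_C;
    return (uint32_t)(wide % (UINT64_C(1) << 32));
}
```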

<p>Choosing <code class="language-plaintext highlighter-rouge">C</code> also isn’t difficult. It must be coprime with <code class="language-plaintext highlighter-rouge">M</code>, and
since <code class="language-plaintext highlighter-rouge">M</code> is a power of two, any odd number is valid. Even 1. In
theory choosing a small value like 1 is faster since the compiler
won’t need to embed a large integer in the code, but this difference
doesn’t show up in any micro-benchmarks I tried. If you want a cool,
unique generator, then choose a large random integer. More on that
below.</p>

<p>The tricky value is <code class="language-plaintext highlighter-rouge">A</code>, and getting it right is the linchpin of the
whole LCG. It must be coprime with <code class="language-plaintext highlighter-rouge">M</code> (i.e. not even), and, for a
full-period generator, <code class="language-plaintext highlighter-rouge">A-1</code> must be divisible by four. For better
results, <code class="language-plaintext highlighter-rouge">A-1</code> should not be divisible by 8. A good choice is a prime
number that satisfies these properties.</p>
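
<p>These conditions are easy to check by brute force on a small
generator. The following sketch, with a made-up function name, measures
the cycle length of a 16-bit LCG starting from state 0. Under the rules
above, only an odd <code class="language-plaintext highlighter-rouge">C</code> combined with an <code class="language-plaintext highlighter-rouge">A</code> where <code class="language-plaintext highlighter-rouge">A-1</code> is divisible
by four should reach the full period of 65536.</p>

```c
#include <stdint.h>

/* Cycle length of x -> (x*a + c) mod 2^16 starting from x = 0.
 * Returns 0 if state 0 is not revisited within 2^16 steps, which can
 * only happen when a is even and the map is not a permutation. */
uint32_t
lcg16_period(uint32_t a, uint32_t c)
{
    uint32_t s = 0;
    for (uint32_t n = 1; n <= 65536; n++) {
        s = (s*a + c) & 0xffff;   /* unsigned math, reduced mod 2^16 */
        if (s == 0) {
            return n;
        }
    }
    return 0;
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">lcg16_period(5, 1)</code> covers all 65536 states, while
<code class="language-plaintext highlighter-rouge">lcg16_period(3, 1)</code> and <code class="language-plaintext highlighter-rouge">lcg16_period(5, 2)</code> fall short.</p>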

<p>If your operands are 64-bit integers, or larger, how are you going to
generate a prime number?</p>

<h4 id="primes-from-emacs-calc">Primes from Emacs Calc</h4>

<p>Emacs Calc can solve this problem. I’ve <a href="/blog/2009/06/23/">noted before</a> how
featureful it is. It has arbitrary precision, random number
generation, and primality testing. It’s everything we need to choose
<code class="language-plaintext highlighter-rouge">A</code>. (In fact, this is nearly identical to <a href="/blog/2015/10/30/">the process I used to
implement RSA</a>.) For this example I’m going to generate a 64-bit
LCG for the C programming language, but it’s easy to use whatever
width you like and mostly whatever language you like. If you wanted a
<a href="http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html">minimal standard 128-bit LCG</a>, this will still work.</p>

<p>Start by opening up Calc with <code class="language-plaintext highlighter-rouge">M-x calc</code>, then:</p>

<ol>
  <li>Push <code class="language-plaintext highlighter-rouge">2</code> on the stack</li>
  <li>Push <code class="language-plaintext highlighter-rouge">64</code> on the stack</li>
  <li>Press <code class="language-plaintext highlighter-rouge">^</code>, computing 2^64 and pushing it on the stack</li>
  <li>Press <code class="language-plaintext highlighter-rouge">k r</code> to generate a random number in this range</li>
  <li>Press <code class="language-plaintext highlighter-rouge">d r 16</code> to switch to hexadecimal display</li>
  <li>Press <code class="language-plaintext highlighter-rouge">k n</code> to find the next prime following the random value</li>
  <li>Repeat step 6 until you get a number that ends with <code class="language-plaintext highlighter-rouge">5</code> or <code class="language-plaintext highlighter-rouge">D</code></li>
  <li>Press <code class="language-plaintext highlighter-rouge">k p</code> a few times to avoid false positives.</li>
</ol>

<p>What’s left on the stack is your <code class="language-plaintext highlighter-rouge">A</code>! If you want a random value for
<code class="language-plaintext highlighter-rouge">C</code>, you can follow a similar process. Heck, make it prime, too!</p>

<p>The reason for using hexadecimal (step 5) and looking for <code class="language-plaintext highlighter-rouge">5</code> or <code class="language-plaintext highlighter-rouge">D</code>
(step 7) is that such numbers satisfy both of the important properties
for <code class="language-plaintext highlighter-rouge">A-1</code>.</p>
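
<p>Spelled out: a final hex digit of <code class="language-plaintext highlighter-rouge">5</code> (binary 0101) or <code class="language-plaintext highlighter-rouge">D</code> (binary
1101) means <code class="language-plaintext highlighter-rouge">A</code> is 5 modulo 8, so <code class="language-plaintext highlighter-rouge">A-1</code> is 4 modulo 8: divisible by
four but not by eight. A tiny sketch of the same check in C, with a
hypothetical function name:</p>

```c
#include <stdint.h>

/* Returns 1 when a-1 is divisible by 4 but not by 8, which is the
 * same as a % 8 == 5. In hexadecimal, such values end in 5 or D. */
int
multiplier_ok(uint64_t a)
{
    return (a & 7) == 5;
}
```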

<p>Calc doesn’t try to factor your random integer. Instead it uses the
<a href="https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test">Miller–Rabin primality test</a>, a probabilistic test that, itself,
requires random numbers. It has false positives but no false negatives.
The false positives can be mitigated by repeating the test multiple
times, hence step 8.</p>
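
<p>For the curious, here is a rough sketch of the Miller–Rabin test for
64-bit inputs. It is not Calc’s implementation, and it swaps the random
witnesses for a fixed set: the first twelve primes, which are known to
be enough to make the test exact for every 64-bit integer. It also
assumes the <code class="language-plaintext highlighter-rouge">unsigned __int128</code> extension found in GCC and Clang for
the modular multiplication.</p>

```c
#include <stdint.h>

/* (a * b) % m without overflow; assumes the GCC/Clang __int128 type. */
static uint64_t
mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)((unsigned __int128)a * b % m);
}

/* (b ** e) % m by repeated squaring. */
static uint64_t
powmod(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    b %= m;
    while (e) {
        if (e & 1) {
            r = mulmod(r, b, m);
        }
        b = mulmod(b, b, m);
        e >>= 1;
    }
    return r;
}

/* Miller-Rabin with fixed witnesses instead of random ones. */
int
is_prime64(uint64_t n)
{
    static const uint64_t bases[] = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37
    };
    if (n < 2) {
        return 0;
    }
    for (int i = 0; i < 12; i++) {
        if (n % bases[i] == 0) {
            return n == bases[i];
        }
    }

    /* Write n-1 as d * 2^r with d odd. */
    uint64_t d = n - 1;
    int r = 0;
    while (!(d & 1)) {
        d >>= 1;
        r++;
    }

    for (int i = 0; i < 12; i++) {
        uint64_t x = powmod(bases[i], d, n);
        if (x == 1 || x == n - 1) {
            continue;   /* this witness finds nothing wrong */
        }
        int composite = 1;
        for (int j = 1; j < r; j++) {
            x = mulmod(x, x, n);
            if (x == n - 1) {
                composite = 0;
                break;
            }
        }
        if (composite) {
            return 0;   /* definitely composite */
        }
    }
    return 1;
}
```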

<p>Trying this all out right now, I got this implementation (in C):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">lcg1</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, we can still do a little better. Outputting the entire state
gives poor results because the low-order bits of a power-of-two LCG have
very short periods (the lowest bit simply alternates), so instead it’s
better to create a <em>truncated</em> LCG and only return some portion of the
most significant bits.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">lcg2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This won’t quite pass <a href="http://simul.iro.umontreal.ca/testu01/tu01.html">BigCrush</a> in 64-bit form, but the results
are pretty reasonable for most purposes.</p>
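<p>One way to see why the full state makes poor output: with a power-of-two
modulus, the low <em>k</em> bits of the state depend only on the low <em>k</em> bits of the
previous state, so they cycle with a period of just 2<sup>k</sup>. The sketch below
measures that using the same constants as above. The helpers <code class="language-plaintext highlighter-rouge">lcg_step</code> and
<code class="language-plaintext highlighter-rouge">low_bits_period</code> are hypothetical names for this illustration.</p>

```c
#include <stdint.h>

/* One LCG step using the multiplier and increment from lcg1(). */
uint64_t lcg_step(uint64_t s)
{
    return s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
}

/* Count steps until the low k bits (1 <= k <= 63) of the state return
 * to their starting value of zero. */
uint64_t low_bits_period(int k)
{
    uint64_t mask = (UINT64_C(1) << k) - 1;
    uint64_t s = 0;
    uint64_t n = 0;
    do {
        s = lcg_step(s);
        n++;
    } while ((s & mask) != 0);
    return n;
}
```

<p>The lowest bit simply alternates, and the low byte repeats every 256
outputs. Keeping only the top 32 bits, as <code class="language-plaintext highlighter-rouge">lcg2</code> does, discards these
short-period bits.</p>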

<p>But we can still do better without needing to remember much more than
this.</p>

<h3 id="appending-permutation">Appending permutation</h3>

<p>A <a href="http://www.pcg-random.org/">Permuted Congruential Generator</a> (PCG) is really just a
truncated LCG with a permutation applied to its output. Like LCGs
themselves, there are arbitrarily many variations. The “official”
implementation has a <a href="/blog/2018/02/07/">data-dependent shift</a>, for which I can
never remember the details. Fortunately a couple of simple,
easy-to-remember transformations are sufficient: basically anything I used
<a href="/blog/2018/07/31/">while prospecting for hash functions</a>. I love xorshifts, so
let’s add one of those:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">pcg1</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is a big improvement, but it still fails one BigCrush test. As
they say, when xorshift isn’t enough, use xorshift-multiply! Below I
generated a 32-bit prime for the multiply, but any odd integer is a
valid permutation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">pcg2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">*=</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0x60857ba9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This passes BigCrush, and I can reliably build a new one entirely from
scratch using Calc any time I need it.</p>
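<p>The claim that any odd multiplier is a valid permutation can be checked by
undoing it. An odd integer has a multiplicative inverse modulo 2<sup>32</sup>, and a
few Newton iterations are enough to find it. A 16-bit xorshift on a 32-bit
value is its own inverse. The following is a sketch, with <code class="language-plaintext highlighter-rouge">inverse32</code>,
<code class="language-plaintext highlighter-rouge">permute32</code>, and <code class="language-plaintext highlighter-rouge">unpermute32</code> as hypothetical names, and assuming
<code class="language-plaintext highlighter-rouge">uint32_t</code> is not narrower than <code class="language-plaintext highlighter-rouge">int</code>:</p>

```c
#include <stdint.h>

/* Multiplicative inverse of odd a modulo 2^32 by Newton's iteration.
 * x = a is already correct to 3 bits since a*a = 1 (mod 8) for odd a,
 * and each step doubles the number of correct bits: 6, 12, 24, 48. */
uint32_t inverse32(uint32_t a)
{
    uint32_t x = a;
    x *= 2 - a*x;
    x *= 2 - a*x;
    x *= 2 - a*x;
    x *= 2 - a*x;
    return x;
}

/* The output permutation from pcg2(): xorshift, then odd multiply. */
uint32_t permute32(uint32_t r)
{
    r ^= r >> 16;
    r *= UINT32_C(0x60857ba9);
    return r;
}

/* Undo permute32() by applying the inverse steps in reverse order. */
uint32_t unpermute32(uint32_t r)
{
    r *= inverse32(UINT32_C(0x60857ba9));
    r ^= r >> 16;
    return r;
}
```

<p>Since every output maps back to exactly one input, no two inputs collide,
which is all “permutation” means here.</p>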

<h3 id="bonus-adapting-to-other-languages">Bonus: Adapting to other languages</h3>

<p>Sometimes it’s not so straightforward to adapt this technique to other
languages. For example, JavaScript has limited support for 32-bit
integer operations (enough for a poor 32-bit LCG) and no 64-bit
integer operations. Though <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt">BigInt</a> is now a thing, and should
make a great 96- or 128-bit LCG easy to build.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">lcg</span><span class="p">(</span><span class="nx">seed</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">let</span> <span class="nx">s</span> <span class="o">=</span> <span class="nx">BigInt</span><span class="p">(</span><span class="nx">seed</span><span class="p">);</span>
    <span class="k">return</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
        <span class="nx">s</span> <span class="o">*=</span> <span class="mh">0xef725caa331524261b9646cd</span><span class="nx">n</span><span class="p">;</span>
        <span class="nx">s</span> <span class="o">+=</span> <span class="mh">0x213734f2c0c27c292d814385</span><span class="nx">n</span><span class="p">;</span>
        <span class="nx">s</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffffffffffffffffff</span><span class="nx">n</span><span class="p">;</span>
        <span class="k">return</span> <span class="nb">Number</span><span class="p">(</span><span class="nx">s</span> <span class="o">&gt;&gt;</span> <span class="mi">64</span><span class="nx">n</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Java doesn’t have unsigned integers, so how could you build the above
PCG in Java? Easy! First, remember that Java has two’s complement
semantics, including wrap around, and that two’s complement doesn’t
care about unsigned or signed for multiplication (or addition, or
subtraction). The result is identical. Second, the oft-forgotten <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code>
operator does an unsigned right shift. With these two tips:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>

<span class="kt">int</span> <span class="nf">pcg2</span><span class="o">()</span> <span class="o">{</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="mh">0x7c3c3267d015ceb5</span><span class="no">L</span> <span class="o">+</span> <span class="mh">0x24bd2d95276253a9</span><span class="no">L</span><span class="o">;</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">(</span><span class="kt">int</span><span class="o">)(</span><span class="n">s</span> <span class="o">&gt;&gt;&gt;</span> <span class="mi">32</span><span class="o">);</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;&gt;</span> <span class="mi">16</span><span class="o">;</span>
    <span class="n">r</span> <span class="o">*=</span> <span class="mh">0x60857ba9</span><span class="o">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>

<p>So, in addition to the Calc step list above, you may need to know some
of the finer details of your target language.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Infectious Executable Stacks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/11/15/"/>
    <id>urn:uuid:7266b2ea-f39e-4b9a-87c8-e4480374af41</id>
    <updated>2019-11-15T03:29:37Z</updated>
    <category term="c"/><category term="netsec"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=21553882">on Hacker News</a></em>.</p>

<p>In software development there are many concepts that at first glance
seem useful and sound, but, after considering the consequences of their
implementation and use, are actually horrifying. Examples include
<a href="https://lwn.net/Articles/683118/">thread cancellation</a>, <a href="/blog/2019/10/27/">variable length arrays</a>, and <a href="/blog/2018/07/20/#strict-aliasing">memory
aliasing</a>. GCC’s closure extension to C is another, and this
little feature compromises the entire GNU toolchain.</p>

<!--more-->

<h3 id="gnu-c-nested-functions">GNU C nested functions</h3>

<p>GCC has its own dialect of C called GNU C. One feature unique to GNU C
is <em>nested functions</em>, which allow C programs to define functions inside
other functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">intsort1</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">return</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span> <span class="o">-</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">base</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">base</span><span class="p">),</span> <span class="n">cmp</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The nested function above is straightforward and harmless. It’s nothing
groundbreaking, and it is trivial for the compiler to implement. The
<code class="language-plaintext highlighter-rouge">cmp</code> function is really just a static function whose scope is limited
to the containing function, no different than a local static variable.</p>
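<p>For comparison, here’s roughly what that amounts to in standard C. This is a
sketch, not GCC’s actual output, and the names <code class="language-plaintext highlighter-rouge">intsort0</code> and
<code class="language-plaintext highlighter-rouge">intsort0_cmp</code> are made up for illustration. The only real difference is that
the comparator’s name is visible to the rest of the file.</p>

```c
#include <stdlib.h>

/* File-scope stand-in for the nested cmp: it captures nothing, so it
 * needs no environment. Same subtraction comparator as the nested
 * version, which can overflow for widely separated values. */
static int intsort0_cmp(const void *a, const void *b)
{
    return *(int *)a - *(int *)b;
}

void intsort0(int *base, size_t nmemb)
{
    qsort(base, nmemb, sizeof(*base), intsort0_cmp);
}
```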

<p>With one slight variation the nested function turns into a closure. This
is where things get interesting:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">intsort2</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">_Bool</span> <span class="n">invert</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span> <span class="o">-</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
        <span class="k">return</span> <span class="n">invert</span> <span class="o">?</span> <span class="o">-</span><span class="n">r</span> <span class="o">:</span> <span class="n">r</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">base</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">base</span><span class="p">),</span> <span class="n">cmp</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">invert</code> variable from the outer scope is accessed from the inner
scope. This has <a href="/blog/2019/09/25/">clean, proper closure semantics</a> and works
correctly just as you’d expect. It fits quite well with traditional C
semantics. The closure itself is re-entrant and thread-safe. It’s
automatically (read: stack) allocated, and so it’s automatically freed
when the function returns, including when the stack is unwound via
<code class="language-plaintext highlighter-rouge">longjmp()</code>. It’s a natural progression to support closures like this
via nested functions. The eventual caller, <code class="language-plaintext highlighter-rouge">qsort</code>, doesn’t even know
it’s calling a closure!</p>

<p>While this seems so useful and easy, its implementation has serious
consequences that, in general, outweigh its benefits. In fact, in order
to make this work, the whole GNU toolchain has been specially rigged!</p>

<p>How does it work? The function pointer, <code class="language-plaintext highlighter-rouge">cmp</code>, passed to <code class="language-plaintext highlighter-rouge">qsort</code> must
somehow be associated with its lexical environment, specifically the
<code class="language-plaintext highlighter-rouge">invert</code> variable. A static address won’t do. When I <a href="/blog/2017/01/08/">implemented
closures as a toy library</a>, I talked about the function address for
each closure instance somehow needing to be unique.</p>

<p>GCC accomplishes this by constructing a trampoline on the stack. That
trampoline has access to the local variables stored adjacent to it, also
on the stack. GCC also generates a normal <code class="language-plaintext highlighter-rouge">cmp</code> function, like the
simple nested function before, that accepts <code class="language-plaintext highlighter-rouge">invert</code> as an additional
argument. The trampoline calls this function, passing the local variable
as this additional argument.</p>

<h3 id="trampoline-illustration">Trampoline illustration</h3>

<p>To illustrate this, I’ve manually implemented <code class="language-plaintext highlighter-rouge">intsort2()</code> below for
x86-64 (<a href="https://wiki.osdev.org/System_V_ABI">System V ABI</a>) without using GCC’s nested function
extension:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">_Bool</span> <span class="n">invert</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">a</span> <span class="o">-</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">invert</span> <span class="o">?</span> <span class="o">-</span><span class="n">r</span> <span class="o">:</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">intsort3</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">_Bool</span> <span class="n">invert</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">fp</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="n">cmp</span><span class="p">;</span>
    <span class="k">volatile</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="c1">// mov  edx, invert</span>
        <span class="mh">0xba</span><span class="p">,</span> <span class="n">invert</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="c1">// mov  rax, cmp</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0xb8</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">,</span>
                    <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">40</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">48</span><span class="p">,</span> <span class="n">fp</span> <span class="o">&gt;&gt;</span> <span class="mi">56</span><span class="p">,</span>
        <span class="c1">// jmp  rax</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xe0</span>
    <span class="p">};</span>
    <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">trampoline</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">base</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">base</span><span class="p">),</span> <span class="n">trampoline</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here’s a complete example you can try yourself on nearly any x86-64
unix-like system: <a href="/download/trampoline.c"><strong>trampoline.c</strong></a>. It even works with Clang. The
two notable systems where stack trampolines won’t work are
<a href="https://marc.info/?l=openbsd-cvs&amp;m=149606868308439&amp;w=2">OpenBSD</a> and <a href="https://github.com/microsoft/WSL/issues/286">WSL</a>.</p>

<p>(Note: The <code class="language-plaintext highlighter-rouge">volatile</code> is necessary because C compilers rightfully do
not see the contents of <code class="language-plaintext highlighter-rouge">buf</code> as being consumed. Execution of the
contents isn’t considered.)</p>

<p>In case you hadn’t already caught it, there’s a catch. The linker needs
to link a binary that asks the loader for an executable stack (<code class="language-plaintext highlighter-rouge">-z
execstack</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -std=c99 -Os -Wl,-z,execstack trampoline.c
</code></pre></div></div>

<p>That’s because <code class="language-plaintext highlighter-rouge">buf</code> contains x86 code implementing the trampoline:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>  <span class="nb">edx</span><span class="p">,</span> <span class="nv">invert</span>    <span class="c1">; assign third argument</span>
<span class="nf">mov</span>  <span class="nb">rax</span><span class="p">,</span> <span class="nv">cmp</span>       <span class="c1">; store cmp address in RAX register</span>
<span class="nf">jmp</span>  <span class="nb">rax</span>            <span class="c1">; jump to cmp</span>
</code></pre></div></div>

<p>(Note: The absolute jump through a 64-bit register is necessary because
the trampoline on the stack and the jump target will be very far apart.
Further, these days the program will likely be compiled as a Position
Independent Executable (PIE), so <code class="language-plaintext highlighter-rouge">cmp</code> <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">might itself have a high
address</a> rather than load into the lowest 32 bits of the address
space.)</p>

<p>However, executable stacks were phased out ~15 years ago because they
make buffer overflows so much more dangerous! Attackers can inject
and execute whatever code they like, typically <em>shellcode</em>. That’s why
we need this unusual linker option.</p>

<p>You can see that the stack will be executable using our old friend,
<code class="language-plaintext highlighter-rouge">readelf</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -l a.out
...
  GNU_STACK  0x00000000 0x00000000 0x00000000
             0x00000000 0x00000000 RWE   0x10
...
</code></pre></div></div>

<p>Note the “RWE” at the bottom right, meaning read-write-execute. This is
a really bad sign in a real binary. Do any binaries installed on your
system right now have an executable stack? <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944817">I found one on mine</a>.
(Update: <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944971">A major one was found in the comments by Walter Misar</a>.)</p>

<p>When compiling the original version using a nested function there’s no
need for that special linker option. That’s because GCC saw that it
would need an executable stack and used this option automatically.</p>

<p>Or, more specifically, GCC <em>stopped</em> requesting a non-executable stack
in the object file it produced. For the GNU Binutils linker, <strong>the
default is an executable stack.</strong></p>

<h3 id="fail-open-design">Fail open design</h3>

<p>Since this is the default, the only way to get a non-executable stack is
if <em>every</em> object file input to the linker explicitly declares that it
does not need an executable stack. To request a non-executable stack, an
object file <a href="https://www.airs.com/blog/archives/518">must contain the (empty) section <code class="language-plaintext highlighter-rouge">.note.GNU-stack</code></a>.
If even a single object file fails to do this, then the final program
gets an executable stack.</p>

<p>Not only does one contaminated object file infect the binary, everything
dynamically linked with it <em>also</em> gets an executable stack. Entire
processes are infected! This occurs even via <code class="language-plaintext highlighter-rouge">dlopen()</code>, where the stack
is dynamically made executable to accommodate the new shared object.</p>

<p>I’ve been bit myself. In <a href="/blog/2016/11/15/"><em>Baking Data with Serialization</em></a> I did
it completely by accident, and I didn’t notice my mistake until three
years later. The GNU linker outputs object files without the special
note by default even though the object file only contains data.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo hello world &gt;hello.txt
$ ld -r -b binary -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
$
</code></pre></div></div>

<p>This is fixed with <code class="language-plaintext highlighter-rouge">-z noexecstack</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ld -r -b binary -z noexecstack -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
  [ 2] .note.GNU-stack  PROGBITS  00000000  0000004c
$
</code></pre></div></div>

<p>This may happen any time you link object files not produced by GCC, such
as output <a href="/blog/2015/04/19/">from the NASM assembler</a> or <a href="/blog/2016/11/17/">hand-crafted object
files</a>.</p>

<p>Nested C closures are super slick, but they’re just not worth the risk
of an executable stack, and they’re certainly not worth an entire
toolchain being fail open about it.</p>
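<p>For completeness, the usual way to get the same effect without a trampoline
is to pass the environment explicitly through a context pointer. Standard
<code class="language-plaintext highlighter-rouge">qsort</code> has no such parameter, so this sketch substitutes a small insertion
sort. Everything here (<code class="language-plaintext highlighter-rouge">cmp_ctx_fn</code>, <code class="language-plaintext highlighter-rouge">sort_ctx</code>, <code class="language-plaintext highlighter-rouge">cmp_invert</code>,
<code class="language-plaintext highlighter-rouge">intsort4</code>) is hypothetical and only illustrates the shape of the interface.</p>

```c
#include <stddef.h>

/* Comparator that receives its environment as an explicit argument. */
typedef int (*cmp_ctx_fn)(const int *a, const int *b, void *ctx);

static int cmp_invert(const int *a, const int *b, void *ctx)
{
    _Bool invert = *(_Bool *)ctx;
    int r = (*a > *b) - (*a < *b);
    return invert ? -r : r;
}

/* Hypothetical insertion sort that threads ctx through to cmp. */
void sort_ctx(int *base, size_t nmemb, cmp_ctx_fn cmp, void *ctx)
{
    for (size_t i = 1; i < nmemb; i++) {
        int v = base[i];
        size_t j = i;
        while (j > 0 && cmp(&base[j-1], &v, ctx) > 0) {
            base[j] = base[j-1];
            j--;
        }
        base[j] = v;
    }
}

/* Same behavior as intsort2(), but no code is generated on the stack. */
void intsort4(int *base, size_t nmemb, _Bool invert)
{
    sort_ctx(base, nmemb, cmp_invert, &invert);
}
```

<p>The cost is that the callee has to cooperate by accepting the extra
argument, which is exactly what the trampoline trick avoids.</p>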

<p>Update: A <a href="http://verisimilitudes.net/2019-11-21">rebuttal</a>. My short response is that the issue
discussed in my article isn’t really about C the language but rather
about an egregious issue with one particular toolchain. The problem
doesn’t even arise if you use only C, but instead when linking in object
files specifically <em>not</em> derived from C code.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Legitimate-ish Use of alloca()</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/10/28/"/>
    <id>urn:uuid:ce906d6f-b228-4dc6-bd02-34b845d3c5e2</id>
    <updated>2019-10-28T00:42:23Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=21374863">on Hacker News</a></em>.</p>

<p>Yesterday <a href="/blog/2019/10/27/">I wrote about a legitimate use for variable length
arrays</a>. While recently discussing this topic with <a href="/blog/2016/09/02/">a
co-worker</a>, I also thought of a semi-legitimate use for
<a href="http://man7.org/linux/man-pages/man3/alloca.3.html"><code class="language-plaintext highlighter-rouge">alloca()</code></a>, a non-standard “function” for dynamically allocating
memory on the stack.</p>

<!--more-->

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">alloca</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>I say “function” in quotes because it’s not truly a function and cannot
be implemented as a function or by a library. It’s implemented in the
compiler and is essentially part of the language itself. It’s a tool
allowing a function to manipulate its own stack frame.</p>

<p>Like VLAs, it has the problem that if you’re able to use <code class="language-plaintext highlighter-rouge">alloca()</code>
safely, then you really don’t need it in the first place. Allocation
failures are undetectable and once they happen it’s already too late.</p>

<h3 id="opaque-structs">Opaque structs</h3>

<p>To set the scene, let’s talk about opaque structs. Suppose you’re
writing a C library with <a href="/blog/2018/06/10/">a clean interface</a>. It’s set up so that
changing your struct fields won’t break the Application Binary Interface
(ABI), and callers are largely unable to depend on implementation
details, even by accident. To achieve this, it’s likely you’re making
use of <em>opaque structs</em> in your interface. Callers only ever receive
pointers to library structures, which are handed back into the interface
when they’re used. The internal details are hidden away.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* opaque float stack API */</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span>          <span class="nf">stack_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">int</span>           <span class="nf">stack_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">,</span> <span class="kt">float</span> <span class="n">v</span><span class="p">);</span>
<span class="kt">float</span>         <span class="nf">stack_pop</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Callers can use the API above without ever knowing the layout or even
the size of <code class="language-plaintext highlighter-rouge">struct stack</code>. Only a pointer to the struct is ever needed.
However, in order for this to work, the library must allocate the struct
itself. If this is a concern, then the library will typically allow the
caller to supply an allocator via function pointers. To see a really
slick version of this in practice, check out <a href="https://www.lua.org/manual/5.3/manual.html#lua_Alloc">Lua’s <code class="language-plaintext highlighter-rouge">lua_Alloc</code></a>, a
single function allocator API.</p>
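<p>As a rough sketch of what that might look like for the float stack, here’s a
simplified single-function allocator loosely in the style of <code class="language-plaintext highlighter-rouge">lua_Alloc</code>. It
is not Lua’s actual signature. The struct layout, <code class="language-plaintext highlighter-rouge">stack_alloc_fn</code>,
<code class="language-plaintext highlighter-rouge">stack_create_with</code>, and <code class="language-plaintext highlighter-rouge">counting_alloc</code> are all assumptions made up for
this example.</p>

```c
#include <stdlib.h>

/* Hypothetical allocator callback: nsize == 0 frees ptr, anything
 * else (re)allocates to nsize bytes. ud is caller-defined context. */
typedef void *(*stack_alloc_fn)(void *ud, void *ptr, size_t nsize);

/* Assumed layout, private to the library. */
struct stack {
    stack_alloc_fn alloc;
    void *ud;
    size_t len;
    float values[64];
};

struct stack *stack_create_with(stack_alloc_fn alloc, void *ud)
{
    struct stack *s = alloc(ud, NULL, sizeof(*s));
    if (s) {
        s->alloc = alloc;
        s->ud = ud;
        s->len = 0;
    }
    return s;
}

void stack_destroy(struct stack *s)
{
    if (s) {
        s->alloc(s->ud, s, 0);
    }
}

/* Example caller-side allocator: wraps realloc/free and tracks the
 * number of live allocations through ud. */
void *counting_alloc(void *ud, void *ptr, size_t nsize)
{
    int *live = ud;
    if (nsize == 0) {
        if (ptr) (*live)--;
        free(ptr);
        return NULL;
    }
    void *p = realloc(ptr, nsize);
    if (p && !ptr) (*live)++;
    return p;
}
```

<p>The caller still never sees the struct’s size or layout. It only decides
where the bytes come from.</p>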

<p>Suppose we wanted to support something simpler: The library will
advertise the size of the struct so the caller can allocate it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* API additions */</span>
<span class="kt">size_t</span> <span class="nf">stack_sizeof</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span>   <span class="nf">stack_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// like stack_create()</span>
<span class="kt">void</span>   <span class="nf">stack_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// like stack_destroy()</span>
</code></pre></div></div>

<p>The implementation of <code class="language-plaintext highlighter-rouge">stack_sizeof()</code> would literally just be <code class="language-plaintext highlighter-rouge">return
sizeof(struct stack)</code>. The caller might use it like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
    <span class="n">stack_free</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, that’s still a heap allocation. If this wasn’t an opaque
struct, the caller could very naturally use automatic (i.e. stack)
allocation, which is likely even preferred in this case. Is this still
possible? Idea: Allocate it via a generic <code class="language-plaintext highlighter-rouge">char</code> array (VLA in this
case).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>However, this is technically undefined behavior. While a <code class="language-plaintext highlighter-rouge">char</code> pointer
is special and permitted to alias with anything, the inverse isn’t true.
Pointers to other types don’t get a free pass to alias with a <code class="language-plaintext highlighter-rouge">char</code>
array. Accessing a <code class="language-plaintext highlighter-rouge">char</code> value as if it were a different type just
isn’t allowed. Why? Because the standard says so. If you want one of the
practical reasons: the alignment might be incorrect.</p>
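
<p>To make the alignment concern concrete, here’s a small sketch. <code class="language-plaintext highlighter-rouge">struct example</code> and the three helper functions are stand-ins invented for illustration, since the real <code class="language-plaintext highlighter-rouge">struct stack</code> is opaque. It assumes a C11 compiler for <code class="language-plaintext highlighter-rouge">_Alignof</code>.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an opaque struct; the real layout is
 * unknown to the caller. */
struct example {
    double value;
    void  *next;
};

/* A char array only has to satisfy the alignment of char, which is 1. */
size_t
char_alignment(void)
{
    return _Alignof(char);
}

/* A struct is at least as strictly aligned as its strictest member. */
size_t
example_alignment(void)
{
    return _Alignof(struct example);
}

/* Nonzero if the address p is suitably aligned for struct example. */
int
aligned_for_example(const void *p)
{
    return (uintptr_t)p % _Alignof(struct example) == 0;
}
```

<p>On x86-64 the struct’s alignment is commonly 8, so a <code class="language-plaintext highlighter-rouge">char</code> buffer that happens to start at an odd address would be misaligned for it. Note that forcing the buffer’s alignment only addresses this practical problem. The aliasing rule still applies.</p>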

<p>Hmmm, is there another option? Maybe with <code class="language-plaintext highlighter-rouge">alloca()</code>!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">len</code> is expected to be small, it’s not any less safe than the
non-opaque alternative. It doesn’t undermine the type system, either,
since <code class="language-plaintext highlighter-rouge">alloca()</code> has the same semantics as <code class="language-plaintext highlighter-rouge">malloc()</code>. The downsides
are:</p>

<ul>
  <li>It’s not portable: <code class="language-plaintext highlighter-rouge">alloca()</code> is only a common extension, never
standardized, and for good reason.</li>
  <li>This is still a dynamic stack allocation, so, like I showed in the
last article, the function making this allocation becomes more
complex. It must manage its own stack frame dynamically.</li>
</ul>

<h3 id="optimizing-out-alloca">Optimizing out <code class="language-plaintext highlighter-rouge">alloca()</code>?</h3>

<p>The second issue can possibly be resolved if the size is available as a
compile time constant. This starts to break the abstraction provided by
opaque structs, but they’re still <em>mostly</em> opaque. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* API additions */</span>
<span class="cp">#define STACK_SIZE 24
</span>
<span class="cm">/* In practice, this would likely be horrific #ifdef spaghetti! */</span>
</code></pre></div></div>

<p>The caller might use it like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">STACK_SIZE</span><span class="p">);</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>Now the compiler can see the allocation size, and potentially optimize
away the <code class="language-plaintext highlighter-rouge">alloca()</code>. As of this writing, Clang (all versions) can
optimize these fixed-size <code class="language-plaintext highlighter-rouge">alloca()</code> usages, but GCC (9.2) still does
not. Here’s a simple example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;alloca.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="cp">#ifdef ALLOCA
</span>    <span class="k">volatile</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="mi">64</span><span class="p">);</span>
<span class="cp">#else
</span>    <span class="k">volatile</span> <span class="kt">char</span> <span class="n">s</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="cp">#endif
</span>    <span class="n">s</span><span class="p">[</span><span class="mi">63</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With the <code class="language-plaintext highlighter-rouge">char</code> array version, both GCC and Clang produce optimal code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;foo&gt;:
   0:	c6 44 24 f8 00       	mov    BYTE PTR [rsp-0x8],0x0
   5:	c3                   	ret
</code></pre></div></div>

<p>Side note: This is on x86-64 Linux, which uses the System V ABI. The
entire array falls within the <a href="https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/">red zone</a>, so it doesn’t need to be
explicitly allocated.</p>

<p>With <code class="language-plaintext highlighter-rouge">-DALLOCA</code>, Clang does the same, but GCC does the allocation
inefficiently as if it were dynamic:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;foo&gt;:
   0:	55                   	push   rbp
   1:	48 89 e5             	mov    rbp,rsp
   4:	48 83 ec 50          	sub    rsp,0x50
   8:	48 8d 44 24 0f       	lea    rax,[rsp+0xf]
   d:	48 83 e0 f0          	and    rax,0xfffffffffffffff0
  11:	c6 40 3f 00          	mov    BYTE PTR [rax+0x3f],0x0
  15:	c9                   	leave
  16:	c3                   	ret
</code></pre></div></div>

<p>It would make a slightly better case for <code class="language-plaintext highlighter-rouge">alloca()</code> here if GCC were
better about optimizing it. Regardless, this is another neat little
trick that I probably wouldn’t use in practice.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Legitimate Use of Variable Length Arrays</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/10/27/"/>
    <id>urn:uuid:acf6af69-f18c-49a6-b3ae-a23ae537da6d</id>
    <updated>2019-10-27T19:58:00Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=21375580">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/dz1fau/variable_length_arrays_in_c_are_nearly_always_the/">on reddit</a></em>.</p>

<p>The C99 (ISO/IEC 9899:1999) standard of C introduced a new, powerful
feature called Variable Length Arrays (VLAs). The size of an array with
automatic storage duration (i.e. stack allocated) can be determined at
run time. Each instance of the array may even have a different length.
Unlike <code class="language-plaintext highlighter-rouge">alloca()</code>, they’re a sanctioned form of dynamic stack
allocation.</p>

<!--more-->

<p>At first glance, VLAs seem convenient, useful, and efficient. Heap
allocations have a small cost because the allocator needs to do some
work to find or request some free memory, and typically the operation
must be synchronized since there may be other threads also making
allocations. Stack allocations are trivial and fast by comparison:
Allocation is a matter of bumping the stack pointer, and no
synchronization is needed.</p>

<p>For example, here’s a function that non-destructively finds the median
of a buffer of floats:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* note: nmemb must be non-zero */</span>
<span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It uses a VLA, <code class="language-plaintext highlighter-rouge">copy</code>, as a temporary copy of the input for sorting. The
function doesn’t know at compile time how big the input will be, so it
cannot just use a fixed size. With a VLA, it efficiently allocates
exactly as much memory as needed on the stack.</p>
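
<p>The function relies on a comparator, <code class="language-plaintext highlighter-rouge">floatcmp</code>, that isn’t shown. Here’s one plausible definition, written as an assumption rather than as the original helper, together with the headers the function needs and the same <code class="language-plaintext highlighter-rouge">median</code> so the sketch compiles on its own:</p>

```c
#include <stdlib.h>  /* qsort */
#include <string.h>  /* memcpy */

/* Assumed comparator for qsort(): orders floats ascending. NaN compares
 * equal to everything here, so inputs are assumed to be NaN-free. */
static int
floatcmp(const void *pa, const void *pb)
{
    float a = *(const float *)pa;
    float b = *(const float *)pb;
    return (a > b) - (a < b);
}

/* note: nmemb must be non-zero */
float
median(const float *a, size_t nmemb)
{
    float copy[nmemb];  /* VLA sized at run time */
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}
```

<p>Since the result is <code class="language-plaintext highlighter-rouge">copy[nmemb / 2]</code>, an even-length input yields the upper of its two middle elements rather than their average.</p>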

<p>Well, sort of. If <code class="language-plaintext highlighter-rouge">nmemb</code> is too large, then the VLA will <em>silently</em>
overflow the stack. By silent I mean that the program has no way to
detect it and avoid it. In practice, it can be a lot louder, from a
segmentation fault in the best case, to an exploitable vulnerability in
the worst case: <a href="/blog/2017/06/21/"><strong>stack clashing</strong></a>. If an attacker can control
<code class="language-plaintext highlighter-rouge">nmemb</code>, they might choose a value that causes <code class="language-plaintext highlighter-rouge">copy</code> to overlap with
other allocations, giving them control over those values as well.</p>

<p>If there’s any risk that <code class="language-plaintext highlighter-rouge">nmemb</code> is too large, it must be guarded.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define COPY_MAX 4096
</span>
<span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>  <span class="cm">/* or whatever */</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, if <code class="language-plaintext highlighter-rouge">median</code> is expected to safely accommodate <code class="language-plaintext highlighter-rouge">COPY_MAX</code>
elements, it may as well <em>always</em> allocate an array of this size. If it
can’t, then that’s not a safe maximum.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">COPY_MAX</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And rather than abort, you might still want to support arbitrary input
sizes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">buf</span><span class="p">[</span><span class="n">COPY_MAX</span><span class="p">];</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">copy</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
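    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">copy</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">copy</span><span class="p">)</span>
            <span class="n">abort</span><span class="p">();</span>  <span class="cm">/* or report the error */</span>
    <span class="p">}</span>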
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">result</span> <span class="o">=</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">copy</span> <span class="o">!=</span> <span class="n">buf</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then small inputs are fast, but large inputs still work correctly. This
is called <a href="/blog/2016/10/07/"><strong>small size optimization</strong></a>.</p>
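
<p>A variant that reports failure to the caller instead of aborting might look like the following sketch. The name <code class="language-plaintext highlighter-rouge">median_checked</code>, its status-code convention, and the <code class="language-plaintext highlighter-rouge">floatcmp</code> definition are my own assumptions for illustration.</p>

```c
#include <stdint.h>  /* SIZE_MAX */
#include <stdlib.h>  /* malloc, free, qsort */
#include <string.h>  /* memcpy */

#define COPY_MAX 4096

/* Assumed comparator: orders floats ascending (inputs assumed NaN-free). */
static int
floatcmp(const void *pa, const void *pb)
{
    float a = *(const float *)pa;
    float b = *(const float *)pb;
    return (a > b) - (a < b);
}

/* Hypothetical variant: returns 1 and stores the median in *out on
 * success. Returns 0 if nmemb is zero or the heap fallback fails. */
int
median_checked(const float *a, size_t nmemb, float *out)
{
    float buf[COPY_MAX];
    float *copy = buf;

    if (nmemb == 0)
        return 0;
    if (nmemb > COPY_MAX) {
        if (nmemb > SIZE_MAX / sizeof(a[0]))
            return 0;  /* byte count would overflow size_t */
        copy = malloc(sizeof(a[0]) * nmemb);
        if (!copy)
            return 0;
    }

    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    *out = copy[nmemb / 2];

    if (copy != buf)
        free(copy);
    return 1;
}
```

<p>Inputs of up to <code class="language-plaintext highlighter-rouge">COPY_MAX</code> elements never touch the heap, and larger ones either succeed through <code class="language-plaintext highlighter-rouge">malloc()</code> or fail cleanly.</p>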

<p>If the correct solution ultimately didn’t use a VLA, then what good are
they? In general, VLAs are not useful. They’re <a href="https://www.phoronix.com/scan.php?page=news_item&amp;px=Linux-Kills-The-VLA">time bombs</a>. <strong>VLAs
are nearly always the wrong choice.</strong> You must be careful to check that
they don’t exceed some safe maximum, and there’s no reason not to always
use the maximum. This problem was realized for the C11 standard (ISO/IEC
9899:2011) where VLAs were made optional. A program containing a VLA
will not necessarily compile on a C11 compiler.</p>

<p>Some purists also object to a special exception required for VLAs: The
<code class="language-plaintext highlighter-rouge">sizeof</code> operator may evaluate its operand, and so it does not always
evaluate to a compile-time constant. If the operand contains a VLA, then
the result depends on a run-time value.</p>
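
<p>A tiny sketch of that exception, using two helper functions I made up for the purpose (<code class="language-plaintext highlighter-rouge">vla_bytes</code> and <code class="language-plaintext highlighter-rouge">fixed_bytes</code>):</p>

```c
#include <stddef.h>

/* sizeof applied to a VLA is computed at run time from n.
 * n must be positive. */
size_t
vla_bytes(int n)
{
    float buf[n];
    return sizeof(buf);
}

/* sizeof applied to a fixed-length array is a compile-time constant. */
size_t
fixed_bytes(void)
{
    float buf[8];
    return sizeof(buf);
}
```

<p>The result of <code class="language-plaintext highlighter-rouge">vla_bytes()</code> changes with its argument, which is impossible for <code class="language-plaintext highlighter-rouge">sizeof</code> on any non-VLA type.</p>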

<p>Because they’re optional, it’s best to avoid even <em>trivial</em> VLAs like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">max</span> <span class="o">=</span> <span class="mi">4096</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">max</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">max</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s easy to prove that the array length is always 4096, but technically
this is still a VLA. That would still be true even if <code class="language-plaintext highlighter-rouge">max</code> were <code class="language-plaintext highlighter-rouge">const
int</code>, because the array length still isn’t an integer constant
expression.</p>
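
<p>One way to get a named length without a macro and without a VLA is an enumeration constant. This is a sketch with names of my own choosing (<code class="language-plaintext highlighter-rouge">SCRATCH_MAX</code>, <code class="language-plaintext highlighter-rouge">scratch</code>, <code class="language-plaintext highlighter-rouge">scratch_len</code>):</p>

```c
#include <stddef.h>

/* An enumeration constant is an integer constant expression, so arrays
 * sized with it are ordinary fixed-length arrays. */
enum { SCRATCH_MAX = 4096 };

/* Arrays at file scope cannot be VLAs, so this declaration relies on
 * SCRATCH_MAX being a true constant. */
static float scratch[SCRATCH_MAX];

size_t
scratch_len(void)
{
    return sizeof(scratch) / sizeof(scratch[0]);
}
```

<p>Swapping the enumeration for a <code class="language-plaintext highlighter-rouge">const int</code> here should draw at least a diagnostic from a conforming compiler, which makes file scope a handy place to check whether a length really is constant.</p>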

<h3 id="vla-overhead">VLA overhead</h3>

<p>Finally, there’s also the problem that VLAs just aren’t as efficient as
you might hope. A function that does dynamic stack allocation requires
additional stack management. It must track additional memory addresses
and will require extra instructions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">fixed</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">];</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">dynamic</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled with <code class="language-plaintext highlighter-rouge">gcc -Os</code> and viewed with <code class="language-plaintext highlighter-rouge">objdump -d -Mintel</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;fixed&gt;:
   0:	81 ff 00 40 00 00    	cmp    edi,0x4000
   6:	7f 19                	jg     21 &lt;fixed+0x21&gt;
   8:	ff cf                	dec    edi
   a:	48 81 ec 88 3f 00 00 	sub    rsp,0x3f88
  11:	48 63 ff             	movsxd rdi,edi
  14:	c6 44 3c 88 00       	mov    BYTE PTR [rsp+rdi*1-0x78],0x0
  19:	48 81 c4 88 3f 00 00 	add    rsp,0x3f88
  20:	c3                   	ret    
  21:	c3                   	ret    

0000000000000022 &lt;dynamic&gt;:
  22:	81 ff 00 40 00 00    	cmp    edi,0x4000
  28:	7f 23                	jg     4d &lt;dynamic+0x2b&gt;
  2a:	55                   	push   rbp
  2b:	48 63 c7             	movsxd rax,edi
  2e:	ff cf                	dec    edi
  30:	48 83 c0 0f          	add    rax,0xf
  34:	48 63 ff             	movsxd rdi,edi
  37:	48 83 e0 f0          	and    rax,0xfffffffffffffff0
  3b:	48 89 e5             	mov    rbp,rsp
  3e:	48 89 e2             	mov    rdx,rsp
  41:	48 29 c4             	sub    rsp,rax
  44:	c6 04 3c 00          	mov    BYTE PTR [rsp+rdi*1],0x0
  48:	48 89 d4             	mov    rsp,rdx
  4b:	c9                   	leave  
  4c:	c3                   	ret    
  4d:	c3                   	ret    
</code></pre></div></div>

<p>Note the use of a base pointer, <code class="language-plaintext highlighter-rouge">rbp</code> and <code class="language-plaintext highlighter-rouge">leave</code>, in the second
function in order to dynamically track the stack frame. (Hmm, in both
cases GCC could easily shave off the extra <code class="language-plaintext highlighter-rouge">ret</code> at the end of each
function. Missed optimization?)</p>

<p>The story is even worse when stack clash protection is enabled
(<code class="language-plaintext highlighter-rouge">-fstack-clash-protection</code>). The compiler generates extra code to probe
every page of allocation in case one of those pages is a guard page.
That’s also more complex when the allocation is dynamic. The VLA version
more than doubles in size (from 44 bytes to 101 bytes)!</p>

<h3 id="safe-and-useful-variable-length-arrays">Safe and useful variable length arrays</h3>

<p>There is one convenient, useful, and safe form of VLAs: a pointer to a
VLA. It’s convenient and useful because it makes some expressions
simpler. It’s safe because there’s no arbitrary stack allocation.</p>

<p>Pointers to arrays are a rare sight in C code, whether variable length
or not. That’s because, the vast majority of the time, C programmers
implicitly rely on <em>array decay</em>: arrays quietly “decay” into pointers
to their first element the moment you do almost anything with them. Also
because they’re really awkward to use.</p>

<p>For example, the function <code class="language-plaintext highlighter-rouge">sum3</code> takes a pointer to an array of exactly
three elements.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">sum3</span><span class="p">(</span><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">3</span><span class="p">])</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The parentheses are necessary because, without them, <code class="language-plaintext highlighter-rouge">array</code> would be an
array of pointers — a type far more common than a pointer to an array.
To index into the array, first the pointer to the array must be
dereferenced to the array value itself, then this intermediate array is
indexed, triggering array decay. Conceptually there’s quite a bit to it,
but, in practice, it’s all as efficient as the conventional approach to
<code class="language-plaintext highlighter-rouge">sum3</code> that accepts a plain <code class="language-plaintext highlighter-rouge">int *</code>.</p>

<p>The caller must take the address of an array of exactly the right
length:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">buf</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">};</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">sum3</span><span class="p">(</span><span class="o">&amp;</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>Or if dynamically allocating the array:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">));</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">sum3</span><span class="p">(</span><span class="n">array</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">array</span><span class="p">);</span>
</code></pre></div></div>

<p>The mandatory parentheses and strict type requirements make this awkward
and rarely useful. However, with VLAs perhaps it’s worth the trouble!
Consider an NxN matrix expressed using a pointer to a VLA:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="cm">/* run-time value */</span><span class="p">;</span>
<span class="cm">/* TODO: Check for integer overflow. See note. */</span>
<span class="kt">float</span> <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">n</span><span class="p">][</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">y</span><span class="p">][</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When indexing, the parentheses are weird, but the indices have the
convenient <code class="language-plaintext highlighter-rouge">[y][x]</code> format. The non-VLA alternative is to compute a 1D
index manually from 2D indices (<code class="language-plaintext highlighter-rouge">y*n+x</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="cm">/* run-time value */</span><span class="p">;</span>
<span class="cm">/* TODO: Check for integer overflow. */</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">identity</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)</span> <span class="o">*</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">identity</span><span class="p">[</span><span class="n">y</span><span class="o">*</span><span class="n">n</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note: What’s the behavior in the VLA version when <code class="language-plaintext highlighter-rouge">n</code> is so large that
<code class="language-plaintext highlighter-rouge">sizeof(*identity)</code> doesn’t fit in a <code class="language-plaintext highlighter-rouge">size_t</code>? I couldn’t find anything
in the standard about it, though I bet it’s undefined behavior. Neither
GCC nor Clang checks for overflow and, when it occurs, the overflow is
silent. Neither the undefined behavior sanitizer nor address sanitizer
complains when this happens.</p>
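<p>One way to sidestep the question is to compute the allocation size by
hand, with an explicit check, before any VLA type is involved. The following
is only a sketch: <code class="language-plaintext highlighter-rouge">matrix_size</code> is a hypothetical helper name, not something
from the standard library, and it assumes a square matrix.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compute the byte size of an n-by-n matrix of
 * elemsize-byte elements. Returns 1 and stores the result in *size on
 * success. Returns 0 if n is negative, elemsize is zero, or the product
 * would not fit in a size_t. */
int matrix_size(int n, size_t elemsize, size_t *size)
{
    if (n < 0 || elemsize == 0) {
        return 0;
    }
    size_t count = (size_t)n;
    /* Divide instead of multiplying so the check itself cannot overflow. */
    if (count != 0 && count > SIZE_MAX / elemsize / count) {
        return 0;
    }
    *size = count * count * elemsize;
    return 1;
}
```

<p>The caller would then pass the checked size to <code class="language-plaintext highlighter-rouge">malloc</code> and assign the
result to the pointer-to-VLA as before, rather than relying on
<code class="language-plaintext highlighter-rouge">sizeof(*identity)</code>.</p>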

<p><strong>Update</strong>: <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCAP-ht1CQKVByZt1EXOb3J7TF%3DMcCKi%3DEtzjEH+CaEsPtvY5%3Djg%40mail.gmail.com%3E">bru del pointed out</a> that these multi-dimensional
VLAs can be simplified such that the parentheses may be omitted when
indexing. The trick is to omit the first dimension from the VLA
expression:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)</span> <span class="o">*</span> <span class="n">n</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">identity</span><span class="p">[</span><span class="n">y</span><span class="p">][</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So VLAs <em>might</em> be worth the trouble when using pointers to
multi-dimensional, dynamically-allocated arrays. However, I’m still
judicious about their use due to reduced portability. As a practical
example, MSVC famously does not, and likely never will, support
VLAs.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Go Slices are Fat Pointers</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/06/30/"/>
    <id>urn:uuid:5ba40d47-11e4-4f82-b805-f5e7825df44c</id>
    <updated>2019-06-30T21:27:19Z</updated>
    <category term="c"/><category term="go"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=20321116">on Hacker News</a>.</em></p>

<p>One of the frequent challenges in C is that pointers are nothing but a
memory address. A callee who is passed a pointer doesn’t truly know
anything other than the type of object being pointed at, which says some
things about alignment and how that pointer can be used… maybe. If it’s
a pointer to void (<code class="language-plaintext highlighter-rouge">void *</code>) then not even that much is known.</p>

<!--more-->

<p>The number of consecutive elements being pointed at is also not known.
It could be as few as zero, so dereferencing would be illegal. This can
be true even when the pointer is not null. A pointer can go one past the
end of an array, at which point it points to zero elements. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">void</span> <span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">array</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">foo</span><span class="p">(</span><span class="n">array</span> <span class="o">+</span> <span class="mi">4</span><span class="p">);</span>  <span class="c1">// pointer one past the end</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In some situations, the number of elements <em>is</em> known, at least to the
programmer. For example, the function might have a contract that says it
must be passed <em>at least</em> N elements, or <em>exactly</em> N elements. This
could be communicated through documentation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/** Foo accepts 4 int values. */</span>
<span class="kt">void</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Or it could be implied by the function’s prototype. Despite the
following function appearing to accept an array, that’s actually a
pointer, and the “4” isn’t relevant to the prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span><span class="p">[</span><span class="mi">4</span><span class="p">]);</span>
</code></pre></div></div>

<p>C99 introduced a feature to make this a formal part of the prototype,
though, unfortunately, I’ve never seen a compiler actually use this
information.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span><span class="p">[</span><span class="k">static</span> <span class="mi">4</span><span class="p">]);</span>  <span class="c1">// &gt;= 4 elements, cannot be null</span>
</code></pre></div></div>

<p>Another common pattern is for the callee to accept a count parameter.
For example, the POSIX <code class="language-plaintext highlighter-rouge">write()</code> function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">);</span>
</code></pre></div></div>

<p>The necessary information describing the buffer is split across two
arguments. That can become tedious, and it’s also a source of serious
bugs if the two parameters aren’t in agreement (buffer overflow,
<a href="/blog/2017/07/19/">information disclosure</a>, etc.). Wouldn’t it be nice if this
information was packed into the pointer itself? That’s essentially the
definition of a <em>fat pointer</em>.</p>

<h3 id="fat-pointers-via-bit-hacks">Fat pointers via bit hacks</h3>

<p>If we assume some things about the target platform, we can encode fat
pointers inside a plain pointer with <a href="/blog/2016/05/30/">some dirty pointer
tricks</a>, exploiting unused bits in the pointer value. For
example, currently on x86-64, only the lower 48 bits of a pointer are
actually used. The other 16 bits could carefully be used for other
information, like communicating the number of elements or bytes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// NOTE: x86-64 only!</span>
<span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1000</span><span class="p">];</span>
<span class="kt">uintptr_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">buf</span> <span class="o">&amp;</span> <span class="mh">0xffffffffffff</span><span class="p">;</span>
<span class="kt">uintptr_t</span> <span class="n">pack</span> <span class="o">=</span> <span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span><span class="p">)</span> <span class="o">|</span> <span class="n">addr</span><span class="p">;</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">fatptr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">pack</span><span class="p">;</span>
</code></pre></div></div>

<p>The other side can unpack this to get the components back out. Obviously
16 bits for the count will often be insufficient, so this would more
likely be used for <a href="https://www.usenix.org/legacy/event/sec09/tech/full_papers/akritidis.pdf">baggy bounds checks</a>.</p>
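<p>For illustration, here is roughly what both sides might look like. The
function names are made up for this sketch, and it carries the same
assumptions as above: x86-64, a 64-bit <code class="language-plaintext highlighter-rouge">uintptr_t</code>, and a user-space
address whose upper 16 bits are zero.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// NOTE: x86-64 only! Assumes user-space addresses fit in the low 48 bits.
#define ADDR_MASK ((uintptr_t)0xffffffffffff)

// Pack a length (must fit in 16 bits) into the unused upper bits.
void *fat_pack(void *ptr, size_t len)
{
    assert(len < 65536);
    uintptr_t addr = (uintptr_t)ptr & ADDR_MASK;
    return (void *)(((uintptr_t)len << 48) | addr);
}

// Recover the plain pointer by clearing the upper 16 bits.
void *fat_ptr(void *fatptr)
{
    return (void *)((uintptr_t)fatptr & ADDR_MASK);
}

// Recover the length from the upper 16 bits.
size_t fat_len(void *fatptr)
{
    return (size_t)((uintptr_t)fatptr >> 48);
}
```

<p>The packed value must never be dereferenced directly. It has to pass
through <code class="language-plaintext highlighter-rouge">fat_ptr()</code> first.</p>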

<p>Further, if we know something about the alignment — say, that it’s
16-byte aligned — then we can also encode information in the least
significant bits, such as a type tag.</p>
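<p>A sketch of the low-bit variant, again with hypothetical names. It
assumes the pointer really is 16-byte aligned, which leaves the bottom
four bits free for a tag between 0 and 15:</p>

```c
#include <assert.h>
#include <stdint.h>

// Low 4 bits are free when the pointer is 16-byte aligned.
#define TAG_MASK ((uintptr_t)15)

// Store a small tag in the low bits of an aligned pointer.
void *tag_set(void *ptr, unsigned tag)
{
    uintptr_t bits = (uintptr_t)ptr;
    assert((bits & TAG_MASK) == 0);  // must be 16-byte aligned
    assert(tag <= TAG_MASK);
    return (void *)(bits | tag);
}

// Extract the tag.
unsigned tag_get(void *tagged)
{
    return (unsigned)((uintptr_t)tagged & TAG_MASK);
}

// Strip the tag to get back a pointer that is safe to dereference.
void *tag_clear(void *tagged)
{
    return (void *)((uintptr_t)tagged & ~TAG_MASK);
}
```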

<h3 id="fat-pointers-via-a-struct">Fat pointers via a struct</h3>

<p>That’s all fragile, non-portable, and rather limited. A more robust
approach is to lift pointers up into a richer, heavier type, like a
structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">fatptr</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Functions accepting these fat pointers no longer need to accept a count
parameter, and they’d generally accept the fat pointer by value.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">fatptr_write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">struct</span> <span class="n">fatptr</span><span class="p">);</span>
</code></pre></div></div>

<p>In typical C implementations, the structure fields would be passed
practically, if not exactly, the same way as the individual parameters would
have been passed, so it’s really no less efficient. (<strong>Update June 2024</strong>:
Pengji Zhang pointed out that this <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCANOCUiz9ZjRi06pvSDmKsXcHcTiWfAJCeKQUn3EYCh7Tv0poVA@mail.gmail.com%3E">applies only to the 2-element <code class="language-plaintext highlighter-rouge">struct
fatptr</code></a>, and not to 3-element slice headers discussed below.)</p>
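<p>As a sketch, such a function could be a thin wrapper over the POSIX
call. The return type and body here are my own guess at a reasonable
definition, mirroring <code class="language-plaintext highlighter-rouge">write()</code>:</p>

```c
#include <stddef.h>
#include <unistd.h>

struct fatptr {
    void *ptr;
    size_t len;
};

/* Write the whole buffer described by the fat pointer. Returns whatever
 * write() returns: the number of bytes written, or -1 on error. */
ssize_t fatptr_write(int fd, struct fatptr buf)
{
    return write(fd, buf.ptr, buf.len);
}
```

<p>The pointer and length travel together, so a caller cannot accidentally
pair one buffer with another buffer’s length.</p>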

<p>To help keep this straight, we might employ some macros:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define COUNTOF(array) \
    (sizeof(array) / sizeof(array[0]))
</span>
<span class="cp">#define FATPTR(ptr, count) \
    (struct fatptr){ptr, count}
</span>
<span class="cp">#define ARRAYPTR(array) \
    FATPTR(array, COUNTOF(array))
</span>
<span class="cm">/* ... */</span>

<span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">40</span><span class="p">];</span>
<span class="n">fatptr_write</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">ARRAYPTR</span><span class="p">(</span><span class="n">buf</span><span class="p">));</span>
</code></pre></div></div>

<p>There are obvious disadvantages of this approach, like type confusion
due to that void pointer, the inability to use <code class="language-plaintext highlighter-rouge">const</code>, and just being
weird for C. I wouldn’t use it in a real program, but bear with me.</p>

<p>Before I move on, I want to add one more field to that fat pointer
struct: capacity.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">fatptr</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">cap</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This communicates not how many elements are present (<code class="language-plaintext highlighter-rouge">len</code>), but the
total size of the buffer. This allows callees to know how much room is
left for, say, appending new elements.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Fill the remainder of an int buffer with a value.</span>
<span class="kt">void</span>
<span class="nf">fill</span><span class="p">(</span><span class="k">struct</span> <span class="n">fatptr</span> <span class="n">ptr</span><span class="p">,</span> <span class="kt">int</span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">.</span><span class="n">ptr</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ptr</span><span class="p">.</span><span class="n">cap</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the callee modifies the fat pointer, it should be returned:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">fatptr</span>
<span class="nf">fill</span><span class="p">(</span><span class="k">struct</span> <span class="n">fatptr</span> <span class="n">ptr</span><span class="p">,</span> <span class="kt">int</span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">.</span><span class="n">ptr</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ptr</span><span class="p">.</span><span class="n">cap</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">ptr</span><span class="p">.</span><span class="n">len</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">.</span><span class="n">cap</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">ptr</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Congratulations, you’ve got slices! Except that in Go they’re a proper
part of the language and so don’t rely on hazardous hacks or tedious
bookkeeping. The <code class="language-plaintext highlighter-rouge">fatptr_write()</code> function above is nearly functionally
equivalent to the <code class="language-plaintext highlighter-rouge">Writer.Write()</code> method in Go, which accepts a slice:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">Writer</span> <span class="k">interface</span> <span class="p">{</span>
	<span class="n">Write</span><span class="p">(</span><span class="n">p</span> <span class="p">[]</span><span class="kt">byte</span><span class="p">)</span> <span class="p">(</span><span class="n">n</span> <span class="kt">int</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">buf</code> and <code class="language-plaintext highlighter-rouge">count</code> parameters are packed together as a slice, and the
<code class="language-plaintext highlighter-rouge">fd</code> parameter is instead the <em>receiver</em> (the object being acted upon by
the method).</p>

<h3 id="go-slices">Go slices</h3>

<p>Go famously has pointers, including <em>internal pointers</em>, but not pointer
arithmetic. You can take the address of (<a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoAddressableValues">nearly</a>) anything, but
you can’t make that pointer point at anything else, even if you took the
address of an array element. Pointer arithmetic would undermine Go’s
type safety, so it can only be done through special mechanisms in the
<code class="language-plaintext highlighter-rouge">unsafe</code> package.</p>

<p>But pointer arithmetic is really useful! It’s handy to take an address
of an array element, pass it to a function, and allow that function to
modify a <em>slice</em> (wink, wink) of the array. <strong>Slices are pointers that
support exactly this sort of pointer arithmetic, but safely.</strong> Unlike
the <code class="language-plaintext highlighter-rouge">&amp;</code> operator which creates a simple pointer, the slice operator
derives a fat pointer.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">fill</span><span class="p">([]</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">)</span> <span class="p">[]</span><span class="kt">int</span>

<span class="k">var</span> <span class="n">array</span> <span class="p">[</span><span class="m">8</span><span class="p">]</span><span class="kt">int</span>

<span class="c">// len == 0, cap == 8, like &amp;array[0]</span>
<span class="n">fill</span><span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="o">:</span><span class="m">0</span><span class="p">],</span> <span class="m">1</span><span class="p">)</span>
<span class="c">// array is [1, 1, 1, 1, 1, 1, 1, 1]</span>

<span class="c">// len == 0, cap == 4, like &amp;array[4]</span>
<span class="n">fill</span><span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="m">4</span><span class="o">:</span><span class="m">4</span><span class="p">],</span> <span class="m">2</span><span class="p">)</span>
<span class="c">// array is [1, 1, 1, 1, 2, 2, 2, 2]</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">fill</code> function could take a slice of the slice, effectively moving
the pointer around with pointer arithmetic, but without violating memory
safety due to the additional “fat pointer” information. In other words,
fat pointers can be derived from other fat pointers.</p>
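<p>To connect this back to the C fat pointer from earlier, here is a sketch
of what a checked slice operation might look like for a buffer of <code class="language-plaintext highlighter-rouge">int</code>.
The name <code class="language-plaintext highlighter-rouge">fatptr_slice</code> is hypothetical, and the bounds rule is modeled
on Go’s <code class="language-plaintext highlighter-rouge">p[lo:hi]</code>, with an assertion standing in for Go’s run-time
panic:</p>

```c
#include <assert.h>
#include <stddef.h>

struct fatptr {
    void *ptr;
    size_t len;
    size_t cap;
};

/* Derive a new fat pointer covering elements [lo, hi) of an int buffer.
 * The bounds are checked against the capacity, so the result can never
 * describe memory outside the original buffer. */
struct fatptr fatptr_slice(struct fatptr p, size_t lo, size_t hi)
{
    assert(lo <= hi && hi <= p.cap);
    struct fatptr r;
    r.ptr = (int *)p.ptr + lo;  /* the "pointer arithmetic" step */
    r.len = hi - lo;
    r.cap = p.cap - lo;         /* room remaining from the new start */
    return r;
}
```

<p>Slicing an 8-element buffer at <code class="language-plaintext highlighter-rouge">[4, 4)</code> yields a fat pointer with
length 0 and capacity 4, the same shape as <code class="language-plaintext highlighter-rouge">array[4:4]</code> in the Go
example above.</p>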

<p>Slices aren’t as universal as pointers, at least at the moment. You can
take the address of any variable using <code class="language-plaintext highlighter-rouge">&amp;</code>, but you can’t take a <em>slice</em>
of any variable, even if it would be logically sound.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">var</span> <span class="n">foo</span> <span class="kt">int</span>

<span class="c">// attempt to make len = 1, cap = 1 slice backed by foo</span>
<span class="k">var</span> <span class="n">fooslice</span> <span class="p">[]</span><span class="kt">int</span> <span class="o">=</span> <span class="n">foo</span><span class="p">[</span><span class="o">:</span><span class="p">]</span>   <span class="c">// compile-time error!</span>
</code></pre></div></div>

<p>That wouldn’t be very useful anyway. However, if you <em>really</em> wanted to
do this, the <code class="language-plaintext highlighter-rouge">unsafe</code> package can accomplish it. I believe the resulting
slice would be perfectly safe to use:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// Convert to one-element array, then slice</span>
<span class="n">fooslice</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="kt">int</span><span class="p">)(</span><span class="n">unsafe</span><span class="o">.</span><span class="n">Pointer</span><span class="p">(</span><span class="o">&amp;</span><span class="n">foo</span><span class="p">))[</span><span class="o">:</span><span class="p">]</span>
</code></pre></div></div>

<p>Update: <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/GoVariableToArrayConversion">Chris Siebenmann speculated about why this requires
<code class="language-plaintext highlighter-rouge">unsafe</code></a>.</p>

<p>Of course, slices are super flexible and have many more uses that look
less like fat pointers, but this is still how I tend to reason about
slices when I write Go.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Looking for Entropy in All the Wrong Places</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/04/30/"/>
    <id>urn:uuid:67da1a72-1103-4e12-a646-8a57443619eb</id>
    <updated>2019-04-30T22:50:09Z</updated>
    <category term="c"/><category term="lua"/><category term="crypto"/>
    <content type="html">
      <![CDATA[<p>Imagine we’re writing a C program and we need some random numbers. Maybe
it’s for a game, or for a Monte Carlo simulation, or for cryptography.
The standard library has a <code class="language-plaintext highlighter-rouge">rand()</code> function for some of these purposes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">rand</span><span class="p">();</span>
</code></pre></div></div>

<p>There are some problems with this. Typically the implementation is a
rather poor PRNG, and <a href="/blog/2017/09/21/">we can do much better</a>. It’s a poor choice
for Monte Carlo simulations, and outright dangerous for cryptography.
Furthermore, it’s usually a dynamic function call, which <a href="/blog/2018/05/27/">has a high
overhead</a> compared to how little the function actually does. In
glibc, it’s also synchronized, adding even more overhead.</p>

<p>But, more importantly, this function returns the same sequence of
values each time the program runs. If we want different numbers each
time the program runs, it needs to be seeded — but seeded with <em>what</em>?
Regardless of what PRNG we ultimately use, we need inputs unique to this
particular execution.</p>

<h3 id="the-right-places">The right places</h3>

<p>On any modern unix-like system, the classical approach is to open
<code class="language-plaintext highlighter-rouge">/dev/urandom</code> and read some bytes. It’s not part of POSIX but it is a
<em>de facto</em> standard. These random bits are seeded from the physical
world by the operating system, making them highly unpredictable and
uncorrelated. They’re suitable for keying a CSPRNG and, from
there, <a href="https://blog.cr.yp.to/20140205-entropy.html">generating all the secure random bits you will ever
need</a> (perhaps with <a href="https://blog.cr.yp.to/20170723-random.html">fast-key-erasure</a>). Why not
<code class="language-plaintext highlighter-rouge">/dev/random</code>? Because on Linux <a href="https://www.2uo.de/myths-about-urandom/">it’s pointlessly
superstitious</a>, which has basically ruined that path for
everyone.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Returns zero on failure. */</span>
<span class="kt">int</span>
<span class="nf">getbits</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">FILE</span> <span class="o">*</span><span class="n">f</span> <span class="o">=</span> <span class="n">fopen</span><span class="p">(</span><span class="s">"/dev/urandom"</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">fread</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
        <span class="n">fclose</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">seed</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">getbits</span><span class="p">(</span><span class="o">&amp;</span><span class="n">seed</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seed</span><span class="p">)))</span> <span class="p">{</span>
        <span class="n">srand</span><span class="p">(</span><span class="n">seed</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">die</span><span class="p">();</span>
    <span class="p">}</span>

    <span class="cm">/* ... */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note how there are two different places <code class="language-plaintext highlighter-rouge">getbits()</code> could fail, with
multiple potential causes.</p>

<ul>
  <li>
    <p>It could fail to open the file. Perhaps the program isn’t running on a
modern unix-like system. Perhaps it’s running in a chroot and
<code class="language-plaintext highlighter-rouge">/dev/urandom</code> wasn’t created. Perhaps there are too many file
descriptors already open. Perhaps there isn’t enough memory available
to open a file. Perhaps the file permissions disallow it or it’s
blocked by Mandatory Access Control (MAC).</p>
  </li>
  <li>
    <p>It could fail to read the file. This essentially can’t happen unless
the system is severely misconfigured, in which case a successful
read would be suspect anyway. Even so, it’s probably still a
good idea to check the result.</p>
  </li>
</ul>

<p>The need to create a file descriptor is a serious issue for libraries.
Libraries that quietly create and close file descriptors can interfere
with the main program, especially if it’s asynchronous. The main program
might rely on file descriptors being consecutive, predictable, or
monotonic (<a href="https://www.freedesktop.org/software/systemd/man/sd_listen_fds.html">example</a>). File descriptors are also a limited resource,
so it may exhaust a file descriptor slot needed for the main program.
For a network service, a remote attacker could perhaps open enough
sockets to deny a file descriptor to <code class="language-plaintext highlighter-rouge">getbits()</code>, blocking the program
from gathering entropy.</p>

<p><code class="language-plaintext highlighter-rouge">/dev/urandom</code> is simple, but it’s not an ideal API.</p>

<h4 id="getentropy2">getentropy(2)</h4>

<p>Wouldn’t it be nicer if our program could just directly ask the
operating system to fill a buffer with random bits? That’s what the
OpenBSD folks thought, so they introduced a <a href="https://man.openbsd.org/getentropy.2"><code class="language-plaintext highlighter-rouge">getentropy(2)</code></a>
system call. When called correctly <em>it cannot fail</em>!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">getentropy</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">buflen</span><span class="p">);</span>
</code></pre></div></div>

<p>Other operating systems followed suit, <a href="https://lwn.net/Articles/711013/">including Linux</a>, though
on Linux <code class="language-plaintext highlighter-rouge">getentropy(2)</code> is a library function implemented using
<a href="http://man7.org/linux/man-pages/man2/getrandom.2.html"><code class="language-plaintext highlighter-rouge">getrandom(2)</code></a>, the actual system call. It’s been in the Linux
kernel since version 3.17 (October 2014), but the libc wrapper didn’t
appear in glibc until version 2.25 (February 2017). So as of this
writing, there are many systems where it’s still not practical
to use even if their kernel is new enough.</p>

<p>For now on Linux you may still want to check, and have a strategy in
place, for an <code class="language-plaintext highlighter-rouge">ENOSYS</code> result. Some systems are still running kernels
that are 5 years old, or older.</p>
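<p>A minimal sketch of one such strategy, assuming Linux with glibc 2.25 or
later (or another libc that declares <code class="language-plaintext highlighter-rouge">getentropy()</code> in
<code class="language-plaintext highlighter-rouge">&lt;unistd.h&gt;</code>): try <code class="language-plaintext highlighter-rouge">getentropy(2)</code> first, and fall back to
<code class="language-plaintext highlighter-rouge">/dev/urandom</code> only when the kernel reports <code class="language-plaintext highlighter-rouge">ENOSYS</code>. The name
<code class="language-plaintext highlighter-rouge">fill_random()</code> is hypothetical, chosen just for this illustration.</p>

```c
#define _DEFAULT_SOURCE  /* assumption: exposes getentropy() in <unistd.h> */
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: fill buf with len random bytes (len <= 256).
 * Returns 0 on success, -1 on failure. */
int
fill_random(void *buf, size_t len)
{
    if (getentropy(buf, len) == 0)
        return 0;
    if (errno != ENOSYS)
        return -1;  /* a real error, e.g. EIO when len > 256 */

    /* The kernel lacks getrandom(2): fall back to /dev/urandom. */
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f)
        return -1;
    size_t ok = fread(buf, len, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}
```

<p>The fallback inherits every failure mode of the <code class="language-plaintext highlighter-rouge">/dev/urandom</code> approach,
so the caller still has to check the result.</p>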

<p>OpenBSD also has another trick up its trick-filled sleeves: the
<a href="https://github.com/openbsd/src/blob/master/libexec/ld.so/SPECS.randomdata"><code class="language-plaintext highlighter-rouge">.openbsd.randomdata</code></a> section. Just as the <code class="language-plaintext highlighter-rouge">.bss</code> section is
filled with zeros, the <code class="language-plaintext highlighter-rouge">.openbsd.randomdata</code> section is filled with
securely-generated random bits. You could put your PRNG state in this
section and it will be seeded as part of loading the program. Cool!</p>

<h4 id="rtlgenrandom">RtlGenRandom()</h4>

<p>Windows doesn’t have <code class="language-plaintext highlighter-rouge">/dev/urandom</code>. Instead it has:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CryptGenRandom()</code></li>
  <li><code class="language-plaintext highlighter-rouge">CryptAcquireContext()</code></li>
  <li><code class="language-plaintext highlighter-rouge">CryptReleaseContext()</code></li>
</ul>

<p>Though in typical Win32 fashion, the API is ugly, overly-complicated,
and has multiple possible failure points. It’s essentially impossible
to use without referencing documentation. Ugh.</p>

<p>However, <a href="/blog/2018/04/13/">Windows 98 and later</a> has <a href="https://docs.microsoft.com/en-us/windows/desktop/api/ntsecapi/nf-ntsecapi-rtlgenrandom"><code class="language-plaintext highlighter-rouge">RtlGenRandom()</code></a>,
which has a much more reasonable interface. Looks an awful lot like
<code class="language-plaintext highlighter-rouge">getentropy(2)</code>, eh?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">BOOLEAN</span> <span class="nf">RtlGenRandom</span><span class="p">(</span>
  <span class="n">PVOID</span> <span class="n">RandomBuffer</span><span class="p">,</span>
  <span class="n">ULONG</span> <span class="n">RandomBufferLength</span>
<span class="p">);</span>
</code></pre></div></div>

<p>The problem is that it’s not quite an official API, and no promises
are made about it. In practice, so much software now depends on
it that the API is unlikely to ever break. Despite the prototype
above, this function is <em>actually</em> named <code class="language-plaintext highlighter-rouge">SystemFunction036()</code>, and
you have to supply your own prototype. Here’s my little drop-in
snippet that turns it nearly into <code class="language-plaintext highlighter-rouge">getentropy(2)</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _WIN32
#  define WIN32_LEAN_AND_MEAN
#  include &lt;windows.h&gt;
#  pragma comment(lib, "advapi32.lib")
</span>   <span class="n">BOOLEAN</span> <span class="n">NTAPI</span> <span class="nf">SystemFunction036</span><span class="p">(</span><span class="n">PVOID</span><span class="p">,</span> <span class="n">ULONG</span><span class="p">);</span>
<span class="cp">#  define getentropy(buf, len) (SystemFunction036(buf, len) ? 0 : -1)
#endif
</span></code></pre></div></div>

<p>It works in Wine, too, where, at least in my version, it reads from
<code class="language-plaintext highlighter-rouge">/dev/urandom</code>.</p>

<h3 id="the-wrong-places">The wrong places</h3>

<p>That’s all well and good, but suppose we’re masochists. We want our
program to be <a href="/blog/2017/03/30/">maximally portable</a> so we’re sticking strictly to
functionality found in the standard C library. That means no
<code class="language-plaintext highlighter-rouge">getentropy(2)</code> and no <code class="language-plaintext highlighter-rouge">RtlGenRandom()</code>. We can still try to open
<code class="language-plaintext highlighter-rouge">/dev/urandom</code>, but it might fail, or it might not actually be useful,
so we’ll want a backup.</p>

<p>The usual approach found in a thousand tutorials is <code class="language-plaintext highlighter-rouge">time(3)</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">srand</span><span class="p">(</span><span class="n">time</span><span class="p">(</span><span class="nb">NULL</span><span class="p">));</span>
</code></pre></div></div>

<p>It would be better to <a href="/blog/2018/07/31/">use an integer hash function</a> to mix up the
result from <code class="language-plaintext highlighter-rouge">time(0)</code> before using it as a seed. Otherwise two programs
started close in time may have similar initial sequences.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">srand</span><span class="p">(</span><span class="n">triple32</span><span class="p">(</span><span class="n">time</span><span class="p">(</span><span class="nb">NULL</span><span class="p">)));</span>
</code></pre></div></div>
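<p>Note that <code class="language-plaintext highlighter-rouge">triple32()</code> isn’t defined in this article. Here is a sketch,
assuming it consists of the same xorshift-multiply steps that the
<code class="language-plaintext highlighter-rouge">hash32s()</code> function later in this article uses as its finalizer:</p>

```c
#include <stdint.h>

/* Integer hash sketch: the same steps as the hash32s() finalizer.
 * Each step is reversible, so distinct inputs give distinct outputs.
 * Zero maps to zero. */
uint32_t
triple32(uint32_t x)
{
    x ^= x >> 17;
    x *= UINT32_C(0xed5ad4bb);
    x ^= x >> 11;
    x *= UINT32_C(0xac4c1b51);
    x ^= x >> 15;
    x *= UINT32_C(0x31848bab);
    x ^= x >> 14;
    return x;
}
```

<p>With this in place, two timestamps one second apart produce seeds that
differ in many bits rather than just the lowest one.</p>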

<p>The more pressing issue is that <code class="language-plaintext highlighter-rouge">time(3)</code> has a resolution of one
second. If the program is run twice inside of a second, both runs will
have the same sequence of numbers. It would be better to use a higher
resolution clock, but, <strong>standard C doesn’t provide a clock with greater
than one second resolution</strong>. That normally requires calling into POSIX
or Win32.</p>

<p>So, we need to find some other sources of entropy unique to each
execution of the program.</p>

<h4 id="quick-and-dirty-string-hash-function">Quick and dirty “string” hash function</h4>

<p>Before we get into that, we need a way to mix these different sources
together. Here’s a <a href="/blog/2018/06/10/">small</a>, 32-bit “string” hash function. The loop
is the same algorithm as Java’s <code class="language-plaintext highlighter-rouge">hashCode()</code>, and I appended <a href="/blog/2018/07/31/">my own
integer hash</a> as a finalizer for much better diffusion.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">hash32s</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">h</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="o">*</span> <span class="mi">31</span> <span class="o">+</span> <span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="n">h</span> <span class="o">^=</span> <span class="n">h</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
    <span class="n">h</span> <span class="o">*=</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0xed5ad4bb</span><span class="p">);</span>
    <span class="n">h</span> <span class="o">^=</span> <span class="n">h</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">;</span>
    <span class="n">h</span> <span class="o">*=</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0xac4c1b51</span><span class="p">);</span>
    <span class="n">h</span> <span class="o">^=</span> <span class="n">h</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">h</span> <span class="o">*=</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0x31848bab</span><span class="p">);</span>
    <span class="n">h</span> <span class="o">^=</span> <span class="n">h</span> <span class="o">&gt;&gt;</span> <span class="mi">14</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It accepts a starting hash value, which is essentially a “context” for
the digest that allows different inputs to be appended together. The
finalizer acts as an implicit “stop” symbol in between inputs.</p>

<p>I used fixed-width integers, but it could be written nearly as concisely
using only <code class="language-plaintext highlighter-rouge">unsigned long</code> and some masking to truncate to 32 bits. I
leave this as an exercise to the reader.</p>
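<p>For the impatient, here’s one possible answer, as a sketch rather than a
canonical version. The name <code class="language-plaintext highlighter-rouge">hash32s_ul()</code> is made up for this example. It
relies only on <code class="language-plaintext highlighter-rouge">unsigned long</code> being at least 32 bits wide, which standard
C guarantees, and masks after every step that could carry into higher
bits.</p>

```c
#include <stddef.h>

/* Same algorithm as hash32s(), written without fixed-width integers.
 * The result always fits in 32 bits. */
unsigned long
hash32s_ul(const void *buf, size_t len, unsigned long h)
{
    const unsigned long mask = 0xffffffffUL;
    const unsigned char *p = buf;
    size_t i;

    h &= mask;  /* discard any high bits in the starting value */
    for (i = 0; i < len; i++)
        h = (h * 31 + p[i]) & mask;
    h ^= h >> 17;
    h = (h * 0xed5ad4bbUL) & mask;
    h ^= h >> 11;
    h = (h * 0xac4c1b51UL) & mask;
    h ^= h >> 15;
    h = (h * 0x31848babUL) & mask;
    h ^= h >> 14;
    return h;
}
```

<p>The right shifts need no mask because <code class="language-plaintext highlighter-rouge">h</code> is already below 2<sup>32</sup> at
those points, and XOR with a smaller value can’t set higher bits.</p>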

<p>Some of the values to be mixed in will be pointers themselves. These
could instead be cast to integers and passed through an integer hash
function, but using string hash avoids <a href="/blog/2016/05/30/">various caveats</a>. Besides,
one of the inputs will be a string, so we’ll need this function anyway.</p>

<h4 id="randomized-pointers-aslr-random-stack-gap-etc">Randomized pointers (ASLR, random stack gap, etc.)</h4>

<p>Attackers can use predictability to their advantage, so modern systems
use unpredictability to improve security. Memory addresses for various
objects and executable code are randomized since some attacks require
an attacker to know their addresses. We can skim entropy from these
pointers to seed our PRNG.</p>

<p>Address Space Layout Randomization (ASLR) is when executable code and
its associated data are loaded at a random offset by the loader. Code
designed for this is called Position Independent Code (PIC). This has
long been used when loading dynamic libraries so that all of the
libraries on a system don’t have to coordinate with each other to
avoid overlapping.</p>

<p>To improve security, it has more recently been extended to programs
themselves. On both modern unix-like systems and Windows,
position-independent executables (PIE) are now the default.</p>

<p>To skim entropy from ASLR, we just need the address of one of our
functions. All the functions in our program will have the same relative
offset, so there’s no reason to use more than one. An obvious choice is
<code class="language-plaintext highlighter-rouge">main()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">uint32_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="cm">/* initial hash value */</span>
    <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">mainptr</span><span class="p">)()</span> <span class="o">=</span> <span class="n">main</span><span class="p">;</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mainptr</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">mainptr</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>Notice I had to store the address of <code class="language-plaintext highlighter-rouge">main()</code> in a variable, and then
treat <em>the pointer itself</em> as a buffer for the hash function? It’s not
hashing the machine code behind <code class="language-plaintext highlighter-rouge">main</code>, just its address. The symbol
<code class="language-plaintext highlighter-rouge">main</code> doesn’t store an address, so it can’t be given to the hash
function to represent its address. This is analogous to an array
versus a pointer.</p>

<p>On a typical x86-64 Linux system, and when this is a PIE, that’s about
3 bytes worth of entropy. On 32-bit systems, virtual memory is so
tight that it’s worth a lot less. We might want more entropy than
that, and we want to cover the case where the program isn’t compiled
as a PIE.</p>

<p>On unix-like systems, programs are typically dynamically linked against
the C library, libc. Each shared object gets its own ASLR offset, so we
can skim more entropy from each shared object by picking a function or
variable from each. Let’s do <code class="language-plaintext highlighter-rouge">malloc(3)</code> for libc ASLR:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">void</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">mallocptr</span><span class="p">)()</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">;</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mallocptr</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">mallocptr</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>Allocators themselves often randomize the addresses they return so that
data objects are stored at unpredictable addresses. In particular, glibc
uses different strategies for small (<code class="language-plaintext highlighter-rouge">brk(2)</code>) versus big (<code class="language-plaintext highlighter-rouge">mmap(2)</code>)
allocations. That’s two different sources of entropy:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">void</span> <span class="o">*</span><span class="n">small</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>        <span class="cm">/* 1 byte */</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">small</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">small</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">small</span><span class="p">);</span>

    <span class="kt">void</span> <span class="o">*</span><span class="n">big</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1UL</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">);</span>  <span class="cm">/* 1 MB */</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">big</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">big</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">big</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally the stack itself is often mapped at a random address, or at
least started with a random gap, so that local variable addresses are
also randomized.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">ptr</span><span class="p">;</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ptr</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">ptr</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<h4 id="time-sources">Time sources</h4>

<p>We haven’t used <code class="language-plaintext highlighter-rouge">time(3)</code> yet! Let’s still do that, using the full
width of <code class="language-plaintext highlighter-rouge">time_t</code> this time around:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">time_t</span> <span class="n">t</span> <span class="o">=</span> <span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">t</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">t</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>We do have another time source to consider: <code class="language-plaintext highlighter-rouge">clock(3)</code>. It returns an
approximation of the processor time used by the program. There’s a
tiny bit of noise and inconsistency between repeated calls. We can use
this to extract a little bit of entropy over many repeated calls.</p>

<p>Naively we might try to use it like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="cm">/* Note: don't use this */</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">clock_t</span> <span class="n">c</span> <span class="o">=</span> <span class="n">clock</span><span class="p">();</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">c</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The problem is that the resolution for <code class="language-plaintext highlighter-rouge">clock()</code> is typically rough
enough that modern computers can execute multiple instructions between
ticks. On Windows, where <code class="language-plaintext highlighter-rouge">CLOCKS_PER_SEC</code> is low, that entire loop
will typically complete before the result from <code class="language-plaintext highlighter-rouge">clock()</code> increments
even once. With that arrangement we’re hardly getting anything from
it! So here’s a better version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">counter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="kt">clock_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">clock</span><span class="p">();</span>
        <span class="k">while</span> <span class="p">(</span><span class="n">clock</span><span class="p">()</span> <span class="o">==</span> <span class="n">start</span><span class="p">)</span>
            <span class="n">counter</span><span class="o">++</span><span class="p">;</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">start</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">start</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="o">&amp;</span><span class="n">counter</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">counter</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The counter makes the resolution of the clock no longer important. If
it’s low resolution, then we’ll get lots of noise from the counter. If
it’s high resolution, then we get noise from the clock value itself.
Running the hash function an extra time between overall <code class="language-plaintext highlighter-rouge">clock(3)</code>
samples also helps with noise.</p>

<h4 id="a-legitimate-use-of-tmpnam3">A legitimate use of tmpnam(3)</h4>

<p>We’ve got one more source of entropy available: <code class="language-plaintext highlighter-rouge">tmpnam(3)</code>. This
function generates a unique, temporary file name. It’s dangerous to
use as intended because it doesn’t actually create the file. There’s a
race between generating the name for the file and actually creating
it.</p>

<p>Fortunately we don’t actually care about the name as a filename. We’re
using this to sample entropy not directly available to us. In an attempt to
get a unique name, the standard C library draws on its own sources of
entropy.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">L_tmpnam</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="n">tmpnam</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buf</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>The rather unfortunate downside is that lots of modern systems produce
a <em>linker</em> warning when they see <code class="language-plaintext highlighter-rouge">tmpnam(3)</code> being linked, even though in
this case it’s completely harmless.</p>

<p>So what goes into a temporary filename? It depends on the
implementation.</p>

<h5 id="glibc-and-musl">glibc and musl</h5>

<p>Both get a high resolution timestamp and generate the filename directly
from the timestamp (no hashing, etc.). Unfortunately glibc does a very
poor job of also mixing <code class="language-plaintext highlighter-rouge">getpid(2)</code> into the timestamp before using it,
and probably makes things worse by doing so.</p>

<p>On these platforms, this is a way to sample a high resolution
timestamp without calling anything non-standard.</p>

<h5 id="dietlibc">dietlibc</h5>

<p>In the latest release as of this writing it uses <code class="language-plaintext highlighter-rouge">rand(3)</code>, which makes
this useless. It’s also a bug since the C library isn’t allowed to
affect the state of <code class="language-plaintext highlighter-rouge">rand(3)</code> outside of <code class="language-plaintext highlighter-rouge">rand(3)</code> and <code class="language-plaintext highlighter-rouge">srand(3)</code>. I
submitted a bug report and this has <a href="https://github.com/ensc/dietlibc/commit/8c8df9579962dc7449fe1f3205fd19eec461aa23">since been fixed</a>.</p>

<p>In the next release it will use a generator seeded by the <a href="https://lwn.net/Articles/301798/">ELF
<code class="language-plaintext highlighter-rouge">AT_RANDOM</code></a> value if available, or ASLR otherwise. This makes
it moderately useful.</p>

<h5 id="libiberty">libiberty</h5>

<p>Generated from <code class="language-plaintext highlighter-rouge">getpid(2)</code> alone, with a counter to handle multiple
calls. It’s basically a way to sample the process ID without actually
calling <code class="language-plaintext highlighter-rouge">getpid(2)</code>.</p>

<h5 id="bsd-libc--bionic-android">BSD libc / Bionic (Android)</h5>

<p>Actually gathers real entropy from the operating system (via
<code class="language-plaintext highlighter-rouge">arc4random(3)</code>), which means we’re getting a lot of mileage out of this
one.</p>

<h5 id="uclibc">uclibc</h5>

<p>Its implementation is obviously forked from glibc. However, it first
tries to read entropy from <code class="language-plaintext highlighter-rouge">/dev/urandom</code>, and only if that fails does
it fall back to glibc’s original high resolution clock XOR <code class="language-plaintext highlighter-rouge">getpid(2)</code>
method (still not hashing it).</p>

<h4 id="finishing-touches">Finishing touches</h4>

<p>Finally, still use <code class="language-plaintext highlighter-rouge">/dev/urandom</code> if it’s available. This doesn’t
require us to trust that the output is anything useful since it’s just
being mixed into the other inputs.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="n">rnd</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="kt">FILE</span> <span class="o">*</span><span class="n">f</span> <span class="o">=</span> <span class="n">fopen</span><span class="p">(</span><span class="s">"/dev/urandom"</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">fread</span><span class="p">(</span><span class="n">rnd</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">rnd</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">f</span><span class="p">))</span>
            <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="n">rnd</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">rnd</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
        <span class="n">fclose</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>When we’re all done gathering entropy, set the seed from the result.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">srand</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>   <span class="cm">/* or whatever you're seeding */</span>
</code></pre></div></div>

<p>That’s bound to find <em>some</em> entropy on just about any host. Though
definitely don’t rely on the results for cryptography.</p>
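
<p>Pulling the steps together, here’s a minimal sketch of a seed-gathering
function. Treat it as an illustration under assumptions rather than a
definitive implementation: <code class="language-plaintext highlighter-rouge">mix32()</code> is a hypothetical stand-in mixer
(using the same multiply-and-shift steps as the Lua version later in this
article), and <code class="language-plaintext highlighter-rouge">gather_seed()</code> is a name made up for this example.</p>

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in mixer: folds n bytes into the running hash h. */
static uint32_t
mix32(const void *buf, size_t n, uint32_t h)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < n; i++)
        h = h * 31 + p[i];
    h ^= h >> 17; h *= 0xed5ad4bbU;
    h ^= h >> 11; h *= 0xac4c1b51U;
    h ^= h >> 15; h *= 0x31848babU;
    h ^= h >> 14;
    return h;
}

/* Mix together whatever the host offers and return a 32-bit seed. */
static uint32_t
gather_seed(void)
{
    uint32_t h = 0;

    time_t t = time(0);              /* wall clock */
    h = mix32(&t, sizeof(t), h);
    clock_t c = clock();             /* CPU time used so far */
    h = mix32(&c, sizeof(c), h);

    void *heap = malloc(1);          /* allocator placement, possibly null */
    h = mix32(&heap, sizeof(heap), h);
    free(heap);
    void *stack = &h;                /* stack placement */
    h = mix32(&stack, sizeof(stack), h);

    unsigned char rnd[4];
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {                         /* optional: absent on some hosts */
        if (fread(rnd, sizeof(rnd), 1, f))
            h = mix32(rnd, sizeof(rnd), h);
        fclose(f);
    }
    return h;
}
```

<p>A caller would then write <code class="language-plaintext highlighter-rouge">srand(gather_seed())</code>. Every source is optional
in the sense that a constant value does no harm. It just contributes nothing.</p>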

<h3 id="lua">Lua</h3>

<p>I recently tackled this problem in Lua. It has a no-batteries-included
design, demanding very little of its host platform: nothing more than an
ANSI C implementation. Because of this, a Lua program has even fewer
options for gathering entropy than C. But it’s still not impossible!</p>

<p>To further complicate things, Lua code is often run in a sandbox with
some features removed. For example, Lua has <code class="language-plaintext highlighter-rouge">os.time()</code> and <code class="language-plaintext highlighter-rouge">os.clock()</code>
wrapping the C equivalents, allowing for the same sorts of entropy
sampling. When run in a sandbox, <code class="language-plaintext highlighter-rouge">os</code> might not be available. Similarly,
<code class="language-plaintext highlighter-rouge">io</code> might not be available for accessing <code class="language-plaintext highlighter-rouge">/dev/urandom</code>.</p>

<p>Have you ever printed a table, though? Or a function? It evaluates to
a string containing the object’s address.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lua -e 'print(math)'
table: 0x559577668a30
$ lua -e 'print(math)'
table: 0x55e4a3679a30
</code></pre></div></div>

<p>Since the raw pointer values are leaked to Lua, we can skim allocator
entropy like before. Here’s the same hash function in Lua 5.3:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">local</span> <span class="k">function</span> <span class="nf">hash32s</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="o">#</span><span class="n">buf</span> <span class="k">do</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="o">*</span> <span class="mi">31</span> <span class="o">+</span> <span class="n">buf</span><span class="p">:</span><span class="n">byte</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="k">end</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="err">&amp;</span> <span class="mh">0xffffffff</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="err">~</span> <span class="p">(</span><span class="n">h</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">)</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="o">*</span> <span class="mh">0xed5ad4bb</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="err">&amp;</span> <span class="mh">0xffffffff</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="err">~</span> <span class="p">(</span><span class="n">h</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">)</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="o">*</span> <span class="mh">0xac4c1b51</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="err">&amp;</span> <span class="mh">0xffffffff</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="err">~</span> <span class="p">(</span><span class="n">h</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">)</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="o">*</span> <span class="mh">0x31848bab</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="err">&amp;</span> <span class="mh">0xffffffff</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">h</span> <span class="err">~</span> <span class="p">(</span><span class="n">h</span> <span class="o">&gt;&gt;</span> <span class="mi">14</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">h</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Now hash a bunch of pointers in the global environment:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">local</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="nb">tostring</span><span class="p">({}),</span> <span class="mi">0</span><span class="p">)</span>  <span class="c1">-- hash a new table</span>
<span class="k">for</span> <span class="n">varname</span><span class="p">,</span> <span class="n">value</span> <span class="k">in</span> <span class="nb">pairs</span><span class="p">(</span><span class="n">_G</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="n">varname</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="nb">tostring</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">h</span><span class="p">)</span>
    <span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">value</span><span class="p">)</span> <span class="o">==</span> <span class="s1">'table'</span> <span class="k">then</span>
        <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="k">in</span> <span class="nb">pairs</span><span class="p">(</span><span class="n">value</span><span class="p">)</span> <span class="k">do</span>
            <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="nb">tostring</span><span class="p">(</span><span class="n">k</span><span class="p">),</span> <span class="n">h</span><span class="p">)</span>
            <span class="n">h</span> <span class="o">=</span> <span class="n">hash32s</span><span class="p">(</span><span class="nb">tostring</span><span class="p">(</span><span class="n">v</span><span class="p">),</span> <span class="n">h</span><span class="p">)</span>
        <span class="k">end</span>
    <span class="k">end</span>
<span class="k">end</span>

<span class="nb">math.randomseed</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
</code></pre></div></div>

<p>Unfortunately this doesn’t work well on one platform I tested: Cygwin.
It has few security features, notably lacking ASLR, and its allocator is
largely deterministic, so the pointer values barely change between runs.</p>

<h3 id="when-to-use-it">When to use it</h3>

<p>In practice it’s not really necessary to use these sorts of tricks of
gathering entropy from odd places. It’s something that comes up more
in coding challenges and exercises than in real programs. I’m probably
already making platform-specific calls in programs substantial enough
to need it anyway.</p>

<p>On a few occasions I have thought about these things when debugging.
ASLR makes return pointers on the stack slightly randomized on each
run, which can change the behavior of some kinds of bugs. Allocator
and stack randomization do similar things to most of your pointers.
GDB tries to disable some of these features during debugging, but it
doesn’t get everything.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Fibers: the Most Elegant Windows API</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/03/28/"/>
    <id>urn:uuid:abad2340-99e5-4d72-857c-848e37b4af73</id>
    <updated>2019-03-28T22:26:05Z</updated>
    <category term="win32"/><category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=19520078">on Hacker News</a>.</em></p>

<p>The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly,
and lacking good taste. Microsoft has done a pretty commendable job with
backwards compatibility, but the trade-off is that the API is filled to
the brim with historical cruft. Every hasty, poor design over the
decades is carried forward forever, and, in many cases, even built upon,
which essentially doubles down on past mistakes. POSIX certainly has its
own ugly corners, but those are the exceptions. In the Windows API,
elegance is the exception.</p>

<!--more-->

<p>That’s why, when I recently revisited the <a href="https://docs.microsoft.com/en-us/windows/desktop/procthread/fibers">Fibers API</a>, I was
pleasantly surprised. It’s one of the exceptions — much cleaner than the
optional, deprecated, and now obsolete <a href="/blog/2017/06/21/#coroutines">POSIX equivalent</a>. It’s
not quite an apples-to-apples comparison since the POSIX version is
slightly more powerful, and more complicated as a result. I’ll cover the
difference in this article.</p>

<p>For the last part of this article, I’ll walk through an async/await
framework built on top of fibers. The framework allows coroutines in C
programs to await on arbitrary kernel objects.</p>

<p><a href="https://github.com/skeeto/fiber-await"><strong>Fiber Async/await Demo</strong></a></p>

<h3 id="fibers">Fibers</h3>

<p>Windows fibers are really just <a href="https://blog.varunramesh.net/posts/stackless-vs-stackful-coroutines/">stackful</a>, symmetric coroutines.
From a different point of view, they’re cooperatively scheduled threads,
which is the source of the analogous name, <em>fibers</em>. They’re symmetric
because all fibers are equal, and no fiber is the “main” fiber. If <em>any</em>
fiber returns from its start routine, the program exits. (Older versions
of Wine will crash when this happens, but it was recently fixed.) It’s
equivalent to the process’ main thread returning from <code class="language-plaintext highlighter-rouge">main()</code>. The
initial fiber is free to create a second fiber, yield to it, then the
second fiber destroys the first.</p>

<p>For now I’m going to focus on the core set of fiber functions. There are
some additional capabilities I’m going to ignore, including support for
<em>fiber local storage</em>. The important functions are just these five:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">CreateFiber</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">stack_size</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">proc</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">SwitchToFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span><span class="p">);</span>
<span class="n">bool</span>  <span class="nf">ConvertFiberToThread</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">ConvertThreadToFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">DeleteFiber</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span><span class="p">);</span>
</code></pre></div></div>

<p>To emphasize its simplicity, I’ve shown them here with more standard
prototypes than seen in their formal documentation. That documentation
uses the clunky Windows API typedefs still burdened with its 16-bit
heritage — e.g. <code class="language-plaintext highlighter-rouge">LPVOID</code> being a “long pointer” from the segmented memory
of the 8086:</p>

<ul>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-createfiber">CreateFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-switchtofiber">SwitchToFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-convertfibertothread">ConvertFiberToThread</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-convertthreadtofiber">ConvertThreadToFiber</a></li>
  <li><a href="https://docs.microsoft.com/en-us/windows/desktop/api/WinBase/nf-winbase-deletefiber">DeleteFiber</a></li>
</ul>

<p>Fibers are represented using opaque, void pointers. Maybe that’s a little
<em>too</em> simple since it’s easy to misuse in C, but I like it. The return
values for <code class="language-plaintext highlighter-rouge">CreateFiber()</code> and <code class="language-plaintext highlighter-rouge">ConvertThreadToFiber()</code> are void pointers
since these both create fibers.</p>

<p>The fiber start routine returns nothing and takes a void “user pointer”.
That’s nearly what I’d expect, except that it would probably make more
sense for a fiber to return <code class="language-plaintext highlighter-rouge">int</code>, which is <a href="/blog/2016/01/31/">more in line with</a>
<code class="language-plaintext highlighter-rouge">main</code> / <code class="language-plaintext highlighter-rouge">WinMain</code> / <code class="language-plaintext highlighter-rouge">mainCRTStartup</code> / <code class="language-plaintext highlighter-rouge">WinMainCRTStartup</code>. As I said,
when any fiber returns from its start routine, it’s like returning from
the main function, so it should probably have returned an integer.</p>

<p>A fiber may delete itself, which is the same as exiting the thread.
However, a fiber cannot yield (i.e. <code class="language-plaintext highlighter-rouge">SwitchToFiber()</code>) to itself. That’s
undefined behavior.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">coup</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">king</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"Long live the king!"</span><span class="p">);</span>
    <span class="n">DeleteFiber</span><span class="p">(</span><span class="n">king</span><span class="p">);</span>
    <span class="n">ConvertFiberToThread</span><span class="p">();</span> <span class="cm">/* seize the main thread */</span>
    <span class="cm">/* ... */</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">king</span> <span class="o">=</span> <span class="n">ConvertThreadToFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">pretender</span> <span class="o">=</span> <span class="n">CreateFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">coup</span><span class="p">,</span> <span class="n">king</span><span class="p">);</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">pretender</span><span class="p">);</span>
    <span class="n">abort</span><span class="p">();</span> <span class="cm">/* unreachable */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Only fibers can yield to fibers, but when the program starts up, there
are no fibers. At least one thread must first convert itself into a
fiber using <code class="language-plaintext highlighter-rouge">ConvertThreadToFiber()</code>, which returns the fiber object
that represents itself. It takes one argument analogous to the last
argument of <code class="language-plaintext highlighter-rouge">CreateFiber()</code>, except that there’s no start routine to
accept it. The process is reversed with <code class="language-plaintext highlighter-rouge">ConvertFiberToThread()</code>.</p>

<p>Fibers don’t belong to any particular thread and can be scheduled on any
thread <em>if</em> properly synchronized. Obviously one should never yield to the
same fiber in two different threads at the same time.</p>

<h3 id="contrast-with-posix">Contrast with POSIX</h3>

<p>The equivalent on POSIX systems was context switching. It’s also stackful
and symmetric, but it has just three important functions:
<a href="http://man7.org/linux/man-pages/man3/setcontext.3.html"><code class="language-plaintext highlighter-rouge">getcontext(3)</code></a>, <a href="http://man7.org/linux/man-pages/man3/makecontext.3.html"><code class="language-plaintext highlighter-rouge">makecontext(3)</code></a>, and
<a href="http://man7.org/linux/man-pages/man3/makecontext.3.html"><code class="language-plaintext highlighter-rouge">swapcontext(3)</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">getcontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">makecontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(),</span> <span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="p">...);</span>
<span class="kt">int</span>  <span class="nf">swapcontext</span><span class="p">(</span><span class="n">ucontext_t</span> <span class="o">*</span><span class="n">oucp</span><span class="p">,</span> <span class="k">const</span> <span class="n">ucontext_t</span> <span class="o">*</span><span class="n">ucp</span><span class="p">);</span>
</code></pre></div></div>

<p>These are roughly equivalent to <a href="https://docs.microsoft.com/en-us/windows/desktop/api/winnt/nf-winnt-getcurrentfiber"><code class="language-plaintext highlighter-rouge">GetCurrentFiber()</code></a>,
<code class="language-plaintext highlighter-rouge">CreateFiber()</code>, and <code class="language-plaintext highlighter-rouge">SwitchToFiber()</code>. There is no need for
<code class="language-plaintext highlighter-rouge">ConvertThreadToFiber()</code> since threads can context switch without
preparation. There’s also no <code class="language-plaintext highlighter-rouge">DeleteFiber()</code> because the resources are
managed by the program itself. That’s where POSIX contexts are a little
bit more powerful.</p>

<p>The first argument to <code class="language-plaintext highlighter-rouge">CreateFiber()</code> is the desired stack size, with
zero indicating the default stack size. The stack is allocated and freed
by the operating system. The downside is that the caller doesn’t have a
choice in managing the lifetime of this stack and how it’s allocated. If
you’re frequently creating and destroying coroutines, those stacks are
constantly being allocated and freed.</p>

<p>In <code class="language-plaintext highlighter-rouge">makecontext(3)</code>, the caller allocates and supplies the stack. Freeing
that stack is equivalent to destroying the context. A program that
frequently creates and destroys contexts can maintain a stack pool or
otherwise more efficiently manage their allocation. This makes it more
powerful, but it also makes it a little more complicated. It would be hard
to remember how to do all this without a careful reading of the
documentation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Create a context */</span>
<span class="n">ucontext_t</span> <span class="n">ctx</span><span class="p">;</span>
<span class="n">getcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">);</span>  <span class="cm">/* initialize before setting fields */</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">SIGSTKSZ</span><span class="p">);</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_size</span> <span class="o">=</span> <span class="n">SIGSTKSZ</span><span class="p">;</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">uc_link</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">makecontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">proc</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="cm">/* Destroy a context */</span>
<span class="n">free</span><span class="p">(</span><span class="n">ctx</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span><span class="p">);</span>
</code></pre></div></div>
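
<p>For a fuller picture, here’s a minimal sketch of one context being created,
switched into twice, and destroyed. It calls <code class="language-plaintext highlighter-rouge">getcontext(3)</code> before filling in
the stack fields, as the man page example does. The names <code class="language-plaintext highlighter-rouge">run_demo()</code>,
<code class="language-plaintext highlighter-rouge">co_proc()</code>, and <code class="language-plaintext highlighter-rouge">trace</code>, and the 64 KiB stack size, are assumptions made up
for this example.</p>

```c
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)   /* assumed; generous for a tiny routine */

static ucontext_t main_ctx, co_ctx;
static int trace[8];             /* records the order of execution */
static int ntrace;

static void
co_proc(void)
{
    trace[ntrace++] = 1;
    swapcontext(&co_ctx, &main_ctx);   /* yield back to the caller */
    trace[ntrace++] = 3;
    /* returning resumes uc_link, which is main_ctx */
}

/* Run the coroutine to completion. Returns 0 on success, -1 on failure. */
static int
run_demo(void)
{
    ntrace = 0;
    void *stack = malloc(STACK_SIZE);
    if (!stack)
        return -1;
    if (getcontext(&co_ctx) != 0) {
        free(stack);
        return -1;
    }
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = STACK_SIZE;
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, co_proc, 0);

    swapcontext(&main_ctx, &co_ctx);   /* runs until the first yield */
    trace[ntrace++] = 2;
    swapcontext(&main_ctx, &co_ctx);   /* resumes; co_proc() returns */
    trace[ntrace++] = 4;

    free(stack);                       /* this is the "destroy" step */
    return 0;
}
```

<p>After <code class="language-plaintext highlighter-rouge">run_demo()</code> returns, <code class="language-plaintext highlighter-rouge">trace</code> holds 1, 2, 3, 4, showing control
bouncing between the two contexts.</p>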

<p>Note how <code class="language-plaintext highlighter-rouge">makecontext(3)</code> is variadic (<code class="language-plaintext highlighter-rouge">...</code>), passing its arguments on
to the start routine of the context. This seems like it might be better
than a user pointer. Unfortunately it’s not, since those arguments are
strictly limited to <em>integers</em>.</p>
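
<p>The usual workaround, described in the glibc man page, is to split a
pointer across two <code class="language-plaintext highlighter-rouge">int</code>-sized arguments and reassemble it on the other
side. Here’s a minimal sketch of that trick, assuming pointers fit in 64
bits and <code class="language-plaintext highlighter-rouge">unsigned</code> is 32 bits wide. The names <code class="language-plaintext highlighter-rouge">entry()</code> and
<code class="language-plaintext highlighter-rouge">call_with_pointer()</code> are made up for this example.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

#define ARG_STACK_SIZE (64 * 1024)   /* assumed stack size */

static ucontext_t caller_ctx, callee_ctx;

/* Start routine: rebuild the pointer from its two 32-bit halves. */
static void
entry(unsigned hi, unsigned lo)
{
    uint64_t bits = (uint64_t)hi << 32 | lo;
    int *out = (int *)(uintptr_t)bits;
    *out = 42;   /* prove the pointer survived the trip */
}

/* Run entry() on its own stack with arg as its "user pointer".
 * Returns 0 on success, -1 on failure. */
static int
call_with_pointer(int *arg)
{
    uint64_t bits = (uintptr_t)arg;
    void *stack = malloc(ARG_STACK_SIZE);
    if (!stack)
        return -1;
    if (getcontext(&callee_ctx) != 0) {
        free(stack);
        return -1;
    }
    callee_ctx.uc_stack.ss_sp = stack;
    callee_ctx.uc_stack.ss_size = ARG_STACK_SIZE;
    callee_ctx.uc_link = &caller_ctx;   /* resume caller when entry() returns */

    /* The cast is the documented makecontext() idiom, not strict ISO C. */
    makecontext(&callee_ctx, (void (*)(void))entry, 2,
                (unsigned)(bits >> 32), (unsigned)(bits & 0xffffffffu));

    int r = swapcontext(&caller_ctx, &callee_ctx);
    free(stack);
    return r == 0 ? 0 : -1;
}
```

<p>It works, but compare it to simply handing <code class="language-plaintext highlighter-rouge">CreateFiber()</code> a void pointer.</p>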

<p>Ultimately I like the fiber API better. The first time I tried it out, I
could guess my way through it without looking closely at the
documentation.</p>

<h3 id="async--await-with-fibers">Async / await with fibers</h3>

<p>Why was I looking at the Fiber API? I’ve known about coroutines for
years but I didn’t understand how they could be useful. Sure, the
function can yield, but what other coroutine should it yield to? It
wasn’t until I was <a href="/blog/2019/03/10/">recently bitten by the async/await bug</a> that I
finally saw a “killer feature” that justified their use. Generators come
pretty close, though.</p>

<p>Windows fibers are a coroutine primitive suitable for async/await in C
programs, where <a href="/blog/2019/03/22/">it can also be useful</a>. To prove that it’s
possible, I built async/await on top of fibers in <a href="https://github.com/skeeto/fiber-await/blob/master/async.c">95 lines of code</a>.</p>

<p>The alternatives are to use a <a href="https://www.gnu.org/software/pth/">third-party coroutine library</a> or to
do it myself <a href="/blog/2015/05/15/">with some assembly programming</a>. However, having it
built into the operating system is quite convenient! It’s unfortunate
that it’s limited to Windows. Ironically, though, everything I wrote for
this article, including the async/await demonstration, was originally
written on Linux using Mingw-w64 and tested using <a href="https://www.winehq.org/">Wine</a>. Only
after I was done did I even try it on Windows.</p>

<p>Before diving into how it works, there’s a general concept about the
Windows API that must be understood: <strong>All kernel objects can be in
either a signaled or unsignaled state.</strong> The API provides functions that
block on a kernel object until it is signaled. The two important ones
are <a href="https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitforsingleobject"><code class="language-plaintext highlighter-rouge">WaitForSingleObject()</code></a> and <a href="https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitformultipleobjects"><code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code></a>.
The latter behaves very much like <code class="language-plaintext highlighter-rouge">poll(2)</code> in POSIX.</p>

<p>Usually the signal is tied to some useful event, like a process or
thread exiting, the completion of an I/O operation (i.e. asynchronous
overlapped I/O), a semaphore being incremented, etc. It’s a generic way
to wait for some event. <strong>However, instead of blocking the thread,
wouldn’t it be nice to <em>await</em> on the kernel object?</strong> In my <code class="language-plaintext highlighter-rouge">aio</code>
library for Emacs, the fundamental “wait” object was a promise. For this
API it’s a kernel object handle.</p>

<p>So, the await function will take a kernel object, register it with the
scheduler, then yield to the scheduler. The scheduler — which is a
global variable, so there’s only one scheduler per process — looks like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">main_fiber</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">handles</span><span class="p">[</span><span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">];</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">fibers</span><span class="p">[</span><span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">];</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">dead_fiber</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span> <span class="n">async_loop</span><span class="p">;</span>
</code></pre></div></div>

<p>While fibers are symmetric, coroutines in my async/await implementation
are not. One fiber is the scheduler, <code class="language-plaintext highlighter-rouge">main_fiber</code>, and the other fibers
always yield to it.</p>

<p>There is an array of kernel object handles, <code class="language-plaintext highlighter-rouge">handles</code>, and an array of
<code class="language-plaintext highlighter-rouge">fibers</code>. The elements in these arrays are paired with each other, but
it’s convenient to store them separately, as I’ll show soon. <code class="language-plaintext highlighter-rouge">fibers[0]</code>
is waiting on <code class="language-plaintext highlighter-rouge">handles[0]</code>, and so on.</p>

<p>The array is a fixed size, <code class="language-plaintext highlighter-rouge">MAXIMUM_WAIT_OBJECTS</code> (64), because there’s
a hard limit on the number of fibers that can wait at once. This
pathetically small limitation is an unfortunate, hard-coded restriction
of the Windows API. It kills most practical uses of my little library.
Fortunately there’s no limit on the number of handles we might want to
wait on, just the number of co-existing fibers.</p>

<p>When a fiber is about to return from its start routine, it yields one
last time and registers itself on the <code class="language-plaintext highlighter-rouge">dead_fiber</code> member. The scheduler
will delete this fiber as soon as it’s given control. Fibers never
<em>truly</em> return since that would terminate the program.</p>

<p>With this, the await function, <code class="language-plaintext highlighter-rouge">async_await()</code>, is pretty simple. It
registers the handle with the scheduler, then yields to the scheduler
fiber.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_await</span><span class="p">(</span><span class="n">HANDLE</span> <span class="n">h</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">h</span><span class="p">;</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">GetCurrentFiber</span><span class="p">();</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="o">++</span><span class="p">;</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">main_fiber</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Caveat: The scheduler destroys this handle with <code class="language-plaintext highlighter-rouge">CloseHandle()</code> after it
signals, so don’t try to reuse it. This made my demonstration simpler,
but it might be better to not do this.</p>

<p>A fiber can exit at any time. Such an exit is inserted implicitly before
a fiber actually returns:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span> <span class="o">=</span> <span class="n">GetCurrentFiber</span><span class="p">();</span>
    <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">main_fiber</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The start routine given to <code class="language-plaintext highlighter-rouge">async_start()</code> is actually wrapped in the
real start routine. This is how <code class="language-plaintext highlighter-rouge">async_exit()</code> is injected:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">fiber_wrapper</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="o">*</span><span class="n">fw</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="n">fw</span><span class="o">-&gt;</span><span class="n">func</span><span class="p">(</span><span class="n">fw</span><span class="o">-&gt;</span><span class="n">arg</span><span class="p">);</span>
    <span class="n">async_exit</span><span class="p">();</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">async_start</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span> <span class="o">==</span> <span class="n">MAXIMUM_WAIT_OBJECTS</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">fiber_wrapper</span> <span class="n">fw</span> <span class="o">=</span> <span class="p">{</span><span class="n">func</span><span class="p">,</span> <span class="n">arg</span><span class="p">};</span>
        <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">CreateFiber</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">fiber_wrapper</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fw</span><span class="p">));</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The library provides a single awaitable function, <code class="language-plaintext highlighter-rouge">async_sleep()</code>. It
creates a “waitable timer” object, starts the countdown, and returns it.
(Notice how <code class="language-plaintext highlighter-rouge">SetWaitableTimer()</code> is a typically-ugly Win32 function with
excessive parameters.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HANDLE</span>
<span class="nf">async_sleep</span><span class="p">(</span><span class="kt">double</span> <span class="n">seconds</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">promise</span> <span class="o">=</span> <span class="n">CreateWaitableTimer</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">LARGE_INTEGER</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">t</span><span class="p">.</span><span class="n">QuadPart</span> <span class="o">=</span> <span class="p">(</span><span class="kt">long</span> <span class="kt">long</span><span class="p">)(</span><span class="n">seconds</span> <span class="o">*</span> <span class="o">-</span><span class="mi">10000000</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>
    <span class="n">SetWaitableTimer</span><span class="p">(</span><span class="n">promise</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">t</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">promise</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A more realistic example would be overlapped I/O. You’d open a file
(<code class="language-plaintext highlighter-rouge">CreateFile()</code>) in overlapped mode. To read from that file, you’d
create an event object (<code class="language-plaintext highlighter-rouge">CreateEvent()</code>), populate an overlapped I/O
structure with the event and the file offset, start the read
(<code class="language-plaintext highlighter-rouge">ReadFile()</code>), and finally await on the event object. The fiber
will be resumed when the operation is complete.</p>

<p>Side note: Unfortunately <a href="https://blog.libtorrent.org/2012/10/asynchronous-disk-io/">overlapped I/O doesn’t work correctly for
files</a>, and many operations can’t be done asynchronously, like
opening files. When it comes to files, you’re <a href="https://blog.libtorrent.org/2012/10/asynchronous-disk-io/">better off using
dedicated threads</a> as <a href="http://docs.libuv.org/en/v1.x/design.html#file-i-o">libuv does</a> instead of overlapped I/O.
You can still await on these operations. You’d just await on the signal
from the thread doing synchronous I/O, not from overlapped I/O.</p>

<p>The most complex part is the scheduler, and it’s really not complex at
all:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">async_run</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Wait for next event */</span>
        <span class="n">DWORD</span> <span class="n">nhandles</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="p">;</span>
        <span class="n">HANDLE</span> <span class="o">*</span><span class="n">handles</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">;</span>
        <span class="n">DWORD</span> <span class="n">r</span> <span class="o">=</span> <span class="n">WaitForMultipleObjects</span><span class="p">(</span><span class="n">nhandles</span><span class="p">,</span> <span class="n">handles</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">INFINITE</span><span class="p">);</span>

        <span class="cm">/* Remove event and fiber from waiting array */</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">fiber</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">r</span><span class="p">];</span>
        <span class="n">CloseHandle</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">r</span><span class="p">]);</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">handles</span><span class="p">[</span><span class="n">nhandles</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">async_loop</span><span class="p">.</span><span class="n">fibers</span><span class="p">[</span><span class="n">nhandles</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
        <span class="n">async_loop</span><span class="p">.</span><span class="n">count</span><span class="o">--</span><span class="p">;</span>

        <span class="cm">/* Run the fiber */</span>
        <span class="n">SwitchToFiber</span><span class="p">(</span><span class="n">fiber</span><span class="p">);</span>

        <span class="cm">/* Destroy the fiber if it exited */</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">DeleteFiber</span><span class="p">(</span><span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span><span class="p">);</span>
            <span class="n">async_loop</span><span class="p">.</span><span class="n">dead_fiber</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is why the handles are in their own array. The array can be passed
directly to <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code>. The return value indicates which
handle was signaled. The handle is closed, the entry removed from the
scheduler, and then the fiber is resumed.</p>

<p>That <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> is what limits the number of fibers.
It’s not possible to wait on more than 64 handles at once! This is
hard-coded into the API. How? A return value of 64 is an error code, and
changing this would break the API. Remember what I said about being
locked into bad design decisions of the past?</p>

<p>To be fair, <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> was a doomed API anyway, just
like <code class="language-plaintext highlighter-rouge">select(2)</code> and <code class="language-plaintext highlighter-rouge">poll(2)</code> in POSIX. It scales very poorly since the
entire array of objects being waited on must be traversed on each call.
That’s terribly inefficient when waiting on large numbers of objects.
This sort of problem is solved by interfaces like kqueue (BSD), epoll
(Linux), and IOCP (Windows). Unfortunately <a href="https://news.ycombinator.com/item?id=11866562">IOCP doesn’t really fit this
particular problem well</a> — awaiting on kernel objects — so I
couldn’t use it.</p>

<p>When the awaiting fiber count is zero and the scheduler has control, all
fibers must have completed and there’s nothing left to do. However, the
caller can schedule more fibers and then restart the scheduler if
desired.</p>

<p>That’s all there is to it. Have a look at <a href="https://github.com/skeeto/fiber-await/blob/master/demo.c"><code class="language-plaintext highlighter-rouge">demo.c</code></a> to see how
the API looks in some trivial examples. On Linux you can see it in
action with <code class="language-plaintext highlighter-rouge">make check</code>. On Windows, you just <a href="/blog/2016/06/13/">need to compile
it</a>, then run it like a normal program. If there were a better
function than <code class="language-plaintext highlighter-rouge">WaitForMultipleObjects()</code> in the Windows API, I would
have considered turning this demonstration into a real library.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Endlessh: an SSH Tarpit</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/03/22/"/>
    <id>urn:uuid:5429ee15-3d42-4af2-8690-f7f402870dd0</id>
    <updated>2019-03-22T17:26:45Z</updated>
    <category term="netsec"/><category term="python"/><category term="c"/><category term="posix"/><category term="asyncio"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=19465967">on Hacker News</a> (<a href="https://news.ycombinator.com/item?id=24491453">later</a>), <a href="https://old.reddit.com/r/programming/comments/b4iq00/endlessh_an_ssh_tarpit/">on
reddit</a> (<a href="https://old.reddit.com/r/netsec/comments/b4dwjl/endlessh_an_ssh_tarpit/">also</a>), and featured in <a href="https://www.youtube.com/watch?v=bM65iyRRW0A&amp;t=3m52s">BSD Now 294</a>.
Also check out <a href="https://github.com/bediger4000/ssh-tarpit-behavior">this Endlessh analysis</a>.</em></p>

<p>I’m a big fan of tarpits: a network service that intentionally inserts
delays in its protocol, slowing down clients by forcing them to wait.
This arrests the speed at which a bad actor can attack or probe the
host system, and it ties up some of the attacker’s resources that
might otherwise be spent attacking another host. When done well, a
tarpit imposes more cost on the attacker than the defender.</p>

<!--more-->

<p>The Internet is a very hostile place, and anyone who’s ever stood up
an Internet-facing IPv4 host has witnessed the immediate and
continuous attacks against their server. I’ve maintained <a href="/blog/2017/06/15/">such a
server</a> for nearly six years now, and more than 99% of my
incoming traffic has ill intent. One part of my defenses has been
tarpits in various forms. The latest addition is an SSH tarpit I wrote
a couple of months ago:</p>

<p><a href="https://github.com/skeeto/endlessh"><strong>Endlessh: an SSH tarpit</strong></a></p>

<p>This program opens a socket and pretends to be an SSH server. However,
it actually just ties up SSH clients with false promises indefinitely
— or at least until the client eventually gives up. After cloning the
repository, here’s how you can try it out for yourself (default port
2222):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make
$ ./endlessh &amp;
$ ssh -p2222 localhost
</code></pre></div></div>

<p>Your SSH client will hang there and wait for at least several days
before finally giving up. Like a mammoth in the La Brea Tar Pits, it
got itself stuck and can’t get itself out. As I write, my
Internet-facing SSH tarpit currently has 27 clients trapped in it. A
few of these have been connected for weeks. In one particular spike it
had 1,378 clients trapped at once, lasting about 20 hours.</p>

<p>My Internet-facing Endlessh server listens on port 22, which is the
standard SSH port. I long ago moved my real SSH server off to another
port where it sees a whole lot less SSH traffic — essentially none.
This makes the logs a whole lot more manageable. And (hopefully)
Endlessh convinces attackers not to look around for an SSH server on
another port.</p>

<p>How does it work? Endlessh exploits <a href="https://tools.ietf.org/html/rfc4253#section-4.2">a little paragraph in RFC
4253</a>, the SSH protocol specification. Immediately after the TCP
connection is established, and before negotiating the cryptography,
both ends send an identification string:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SSH-protoversion-softwareversion SP comments CR LF
</code></pre></div></div>

<p>The RFC also notes:</p>

<blockquote>
  <p>The server MAY send other lines of data before sending the version
string.</p>
</blockquote>

<p>There is no limit on the number of lines, just that these lines must
not begin with “SSH-” since that would be ambiguous with the
identification string, and lines must not be longer than 255
characters including CRLF. So <strong>Endlessh sends an <em>endless</em> stream of
randomly-generated “other lines of data”</strong> without ever intending to
send a version string. By default it waits 10 seconds between each
line. This slows down the protocol, but prevents it from actually
timing out.</p>

<p>This means Endlessh need not know anything about cryptography or the
vast majority of the SSH protocol. It’s dead simple.</p>
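
<p>The line rules from the RFC are simple enough to sketch in a few lines
of C. This is an illustration rather than the actual Endlessh source: the
<code class="language-plaintext highlighter-rouge">generate_line()</code> signature, the <code class="language-plaintext highlighter-rouge">MAX_LINE</code> name, and the use of
<code class="language-plaintext highlighter-rouge">rand()</code> are all assumptions. It picks a random length, fills the line
with printable ASCII, terminates it with CRLF, and patches the first byte
if the result happens to begin with “SSH-”:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

#define MAX_LINE 255  /* RFC 4253 limit, including CR LF */

/* Fill line (at least MAX_LINE + 1 bytes) with a random,
 * NUL-terminated line ending in CR LF. Returns its length. */
int
generate_line(char *line)
{
    /* 3 to MAX_LINE bytes: at least one character plus CR LF */
    int len = 3 + rand() % (MAX_LINE - 2);
    for (int i = 0; i &lt; len - 2; i++) {
        line[i] = 32 + rand() % 95;  /* printable ASCII, ' ' to '~' */
    }
    line[len - 2] = '\r';
    line[len - 1] = '\n';
    line[len] = 0;

    /* Must not be mistaken for the identification string */
    if (len &gt;= 4 &amp;&amp; !memcmp(line, "SSH-", 4)) {
        line[0] = 'X';
    }
    return len;
}
</code></pre></div></div>

<p>The buffer must hold at least 256 bytes: up to 255 for the line itself
plus the terminating NUL.</p>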

<h3 id="implementation-strategies">Implementation strategies</h3>

<p>Ideally the tarpit’s resource footprint should be as small as
possible. It’s just a security tool, and the server does have an
actual purpose that doesn’t include being a tarpit. It should tie up
the attacker’s resources, not the server’s, and should generally be
unnoticeable. (Take note all those who write the awful “security”
products I have to tolerate at my day job.)</p>

<p>Even when many clients have been trapped, Endlessh spends more than
99.999% of its time waiting around, doing nothing. It wouldn’t even be
accurate to call it I/O-bound. If anything, it’s <em>timer-bound</em>,
waiting around before sending off the next line of data. <strong>The most
precious resource to conserve is <em>memory</em>.</strong></p>

<h4 id="processes">Processes</h4>

<p>The most straightforward way to implement something like Endlessh is a
fork server: accept a connection, fork, and the child simply alternates
between <code class="language-plaintext highlighter-rouge">sleep(3)</code> and <code class="language-plaintext highlighter-rouge">write(2)</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">ssize_t</span> <span class="n">r</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">line</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>

    <span class="n">sleep</span><span class="p">(</span><span class="n">DELAY</span><span class="p">);</span>
    <span class="n">generate_line</span><span class="p">(</span><span class="n">line</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">write</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">line</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">line</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">errno</span> <span class="o">!=</span> <span class="n">EINTR</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A process per connection is a lot of overhead when connections are
expected to be up for hours or even weeks at a time. An attacker who knows
about this could exhaust the server’s resources with little effort by
opening up lots of connections.</p>

<h4 id="threads">Threads</h4>

<p>A better option is, instead of processes, to create a thread per
connection. On Linux <a href="/blog/2015/05/15/">this is practically the same thing</a>, but it’s
still better. However, you still have to allocate a stack for the thread
and the kernel will have to spend some resources managing the thread.</p>

<h4 id="poll">Poll</h4>

<p>For Endlessh I went for an even more lightweight version: a
single-threaded <code class="language-plaintext highlighter-rouge">poll(2)</code> server, analogous to stackless green threads.
The overhead per connection is about as low as it gets.</p>

<p>Clients that are being delayed are not registered in <code class="language-plaintext highlighter-rouge">poll(2)</code>. Their
only overhead is the socket object in the kernel, and another 78 bytes
to track them in Endlessh. Most of those bytes are used only for
accurate logging. Only those clients that are overdue for a new line
are registered for <code class="language-plaintext highlighter-rouge">poll(2)</code>.</p>
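
<p>One way to arrange this, sketched here with hypothetical names rather
than copied from the Endlessh source, is to keep waiting clients in a FIFO
queue ordered by the time their next line is due. Assuming every client
gets the same delay, appending to the tail keeps the queue sorted, so only
the head ever needs to be checked for overdue clients:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

struct client {
    struct client *next;
    long long send_next;  /* when the next line is due (ms) */
    int fd;
};

struct queue {
    struct client *head;
    struct client *tail;
};

/* Append a client whose deadline is no earlier than the tail's. */
static void
queue_append(struct queue *q, struct client *c)
{
    c-&gt;next = NULL;
    if (q-&gt;tail) {
        q-&gt;tail-&gt;next = c;
    } else {
        q-&gt;head = c;
    }
    q-&gt;tail = c;
}

/* Remove and return the head if it is due at time now, else NULL. */
static struct client *
queue_pop_overdue(struct queue *q, long long now)
{
    struct client *c = q-&gt;head;
    if (!c || c-&gt;send_next &gt; now) {
        return NULL;
    }
    q-&gt;head = c-&gt;next;
    if (!q-&gt;head) {
        q-&gt;tail = NULL;
    }
    return c;
}
</code></pre></div></div>

<p>Each trip around the main loop would pop overdue clients, register only
those with <code class="language-plaintext highlighter-rouge">poll(2)</code>, and append them again with a fresh deadline once
their line has been written.</p>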

<p>When clients are waiting, but no clients are overdue, <code class="language-plaintext highlighter-rouge">poll(2)</code> is
essentially used in place of <code class="language-plaintext highlighter-rouge">sleep(3)</code>. Though since it still needs
to manage the <em>accept</em> server socket, it (almost) never actually waits
on <em>nothing</em>.</p>
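
<p>To make the “in place of <code class="language-plaintext highlighter-rouge">sleep(3)</code>” idea concrete, here is a rough
sketch with hypothetical helper names (<code class="language-plaintext highlighter-rouge">poll_timeout()</code> and
<code class="language-plaintext highlighter-rouge">sleep_ms()</code> are not from Endlessh). The timeout handed to <code class="language-plaintext highlighter-rouge">poll(2)</code> is
the time until the earliest waiting client is due, or -1 to block
indefinitely when nobody is waiting. Called with zero descriptors,
<code class="language-plaintext highlighter-rouge">poll(2)</code> is just a sleep with millisecond resolution:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;limits.h&gt;
#include &lt;poll.h&gt;

/* Timeout in milliseconds for the next poll(2) call. nwaiting is the
 * number of delayed clients and earliest is the soonest deadline among
 * them, on the same millisecond clock as now. */
static int
poll_timeout(long long now, long long earliest, int nwaiting)
{
    if (nwaiting == 0) {
        return -1;  /* no deadlines: block until a new connection */
    }
    if (earliest &lt;= now) {
        return 0;   /* a client is already overdue */
    }
    long long ms = earliest - now;
    return ms &gt; INT_MAX ? INT_MAX : (int)ms;
}

/* With no descriptors registered, poll(2) just waits out the timeout. */
static int
sleep_ms(int ms)
{
    return poll(NULL, 0, ms);
}
</code></pre></div></div>

<p>In a real server the accept socket would normally be in the descriptor
array as well, which is why the wait is almost never on nothing.</p>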

<p>There’s an option to limit the total number of client connections so
that it doesn’t get out of hand. In this case it will stop polling the
accept socket until a client disconnects. I probably shouldn’t have
bothered with this option and instead relied on <code class="language-plaintext highlighter-rouge">ulimit</code>, a feature
already provided by the operating system.</p>

<p>I could have used epoll (Linux) or kqueue (BSD), which would be much
more efficient than <code class="language-plaintext highlighter-rouge">poll(2)</code>. The problem with <code class="language-plaintext highlighter-rouge">poll(2)</code> is that it’s
constantly registering and unregistering Endlessh on each of the
overdue sockets each time around the main loop. This is by far the
most CPU-intensive part of Endlessh, and it’s all inflicted on the
kernel. Most of the time, even with thousands of clients trapped in
the tarpit, only a small number of them are polled at once, so I opted
for better portability instead.</p>

<p>One consequence of not polling connections that are waiting is that
disconnections aren’t noticed in a timely fashion. This makes the logs
less accurate than I like, but otherwise it’s pretty harmless.
Unfortunately even if I wanted to fix this, the <code class="language-plaintext highlighter-rouge">poll(2)</code> interface
isn’t quite equipped for it anyway.</p>

<h4 id="raw-sockets">Raw sockets</h4>

<p>With a <code class="language-plaintext highlighter-rouge">poll(2)</code> server, the biggest overhead remaining is in the
kernel, where it allocates send and receive buffers for each client
and manages the proper TCP state. The next step to reducing this
overhead is Endlessh opening a <em>raw socket</em> and speaking TCP itself,
bypassing most of the operating system’s TCP/IP stack.</p>

<p>Much of the TCP connection state doesn’t matter to Endlessh and doesn’t
need to be tracked. For example, it doesn’t care about any data sent by
the client, so no receive buffer is needed, and any data that arrives
could be dropped on the floor.</p>

<p>Even more, raw sockets would allow for some even nastier tarpit tricks.
Despite the long delays between data lines, the kernel itself responds
very quickly on the TCP layer and below. ACKs are sent back quickly and
so on. An astute attacker could detect that the delay is artificial,
imposed above the TCP layer by an application.</p>

<p>If Endlessh worked at the TCP layer, it could <a href="https://nyman.re/super-simple-ssh-tarpit/">tarpit the TCP protocol
itself</a>. It could introduce artificial “noise” to the connection
that requires packet retransmissions, delay ACKs, etc. It would look a
lot more like network problems than a tarpit.</p>

<p>I haven’t taken Endlessh this far, nor do I plan to do so. At the
moment attackers either have a hard timeout, so this wouldn’t matter,
or they’re pretty dumb and Endlessh already works well enough.</p>

<h3 id="asyncio-and-other-tarpits">asyncio and other tarpits</h3>

<p>Since writing Endlessh <a href="/blog/2019/03/10/">I’ve learned about Python’s <code class="language-plaintext highlighter-rouge">asyncio</code></a>, and
it’s actually a near perfect fit for this problem. I should have just
used it in the first place. The hard part is already implemented within
<code class="language-plaintext highlighter-rouge">asyncio</code>, and the problem isn’t CPU-bound, so being written in Python
<a href="/blog/2019/02/24/">doesn’t matter</a>.</p>

<p>Here’s a simplified (no logging, no configuration, etc.) version of
Endlessh implemented in about 20 lines of Python 3.7:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">_reader</span><span class="p">,</span> <span class="n">writer</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
            <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'%x</span><span class="se">\r\n</span><span class="s">'</span> <span class="o">%</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">))</span>
            <span class="k">await</span> <span class="n">writer</span><span class="p">.</span><span class="n">drain</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">ConnectionResetError</span><span class="p">:</span>
        <span class="k">pass</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">server</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">start_server</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="s">'0.0.0.0'</span><span class="p">,</span> <span class="mi">2222</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">server</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">server</span><span class="p">.</span><span class="n">serve_forever</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>Since Python coroutines are stackless, the per-connection memory
overhead is comparable to the C version. So it seems asyncio is
perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up
attackers trying to exploit HTTP servers. It slowly sends a random,
endless HTTP header:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">_reader</span><span class="p">,</span> <span class="n">writer</span><span class="p">):</span>
    <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'HTTP/1.1 200 OK</span><span class="se">\r\n</span><span class="s">'</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
            <span class="n">header</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">)</span>
            <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s">'X-%x: %x</span><span class="se">\r\n</span><span class="s">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">header</span><span class="p">,</span> <span class="n">value</span><span class="p">))</span>
            <span class="k">await</span> <span class="n">writer</span><span class="p">.</span><span class="n">drain</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">ConnectionResetError</span><span class="p">:</span>
        <span class="k">pass</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">server</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">start_server</span><span class="p">(</span><span class="n">handler</span><span class="p">,</span> <span class="s">'0.0.0.0'</span><span class="p">,</span> <span class="mi">8080</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">server</span><span class="p">:</span>
        <span class="k">await</span> <span class="n">server</span><span class="p">.</span><span class="n">serve_forever</span><span class="p">()</span>

<span class="n">asyncio</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">())</span>
</code></pre></div></div>

<p>Try it out for yourself. Firefox and Chrome will spin on that server
for hours before giving up. I have yet to see curl actually time out on
its own in the default settings (<code class="language-plaintext highlighter-rouge">--max-time</code>/<code class="language-plaintext highlighter-rouge">-m</code> does work
correctly, though).</p>

<p>Parting exercise for the reader: Using the examples above as a starting
point, implement an SMTP tarpit using asyncio. Bonus points for using
TLS connections and testing it against real spammers.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>The Day I Fell in Love with Fuzzing</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/01/25/"/>
    <id>urn:uuid:9ab4d645-222e-37f6-0d41-6db1e5c126c6</id>
    <updated>2019-01-25T21:52:45Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p><em>Follow-up: <a href="/blog/2025/02/05/">Tips for more effective fuzz testing with AFL++</a></em></p>

<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=19019048">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/akrcyp/the_day_i_fell_in_love_with_fuzzing/">on reddit</a>.</em></p>

<p>In 2007 I wrote a pair of modding tools, <a href="https://github.com/skeeto/binitools">binitools</a>, for a space
trading and combat simulation game named <a href="https://en.wikipedia.org/wiki/Freelancer_(video_game)"><em>Freelancer</em></a>. The game
stores its non-art assets in the format of “binary INI” files, or “BINI”
files. The motivation for the binary format over traditional INI files
was probably performance: it’s faster to load and read these files than
it is to parse arbitrary text in INI format.</p>

<!--more-->

<p>Much of the in-game content can be changed simply by modifying these
files: changing item names, editing commodity prices, tweaking ship
statistics, or even adding new ships to the game. The binary nature
makes them unsuitable to in-place modification, so the natural approach
is to convert them to text INI files, make the desired modifications
using a text editor, then convert back to the BINI format and replace
the file in the game’s installation.</p>

<p>I didn’t reverse engineer the BINI format, nor was I the first person
to create tools to edit them. The existing tools weren’t to my tastes,
and I had my own vision for how they should work — an interface more
closely following <a href="http://www.catb.org/esr/writings/taoup/html/">the Unix tradition</a> despite the target being a
Windows game.</p>

<p>When I got started, I had just learned how to use <a href="http://pubs.opengroup.org/onlinepubs/9699919799/utilities/yacc.html">yacc</a> (really
<a href="https://www.gnu.org/software/bison/">Bison</a>) and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/utilities/lex.html">lex</a> (really <a href="https://github.com/westes/flex">flex</a>), as well as
Autoconf, so I went all-in with these newly-discovered tools. It was
exciting to try them out in a real-world situation, though I slavishly
aped the practices of other open source projects without really
understanding why things were the way they were. Due to the use of
yacc/lex and the configure script build, compiling the project required
a full, Unix-like environment. This is all visible in <a href="https://github.com/skeeto/binitools/tree/original">the original
version of the source</a>.</p>

<p>The project was moderately successful in two ways. First, I was able to
use the tools to modify the game. Second, other people were using the
tools, since the binaries I built show up in various collections of
Freelancer modding tools online.</p>

<h3 id="the-rewrite">The Rewrite</h3>

<p>That’s the way things were until mid-2018 when I revisited the project.
Ever look at your own old code and wonder what the heck you were
thinking? My INI format was far more rigid and strict than necessary, I
was doing questionable things when writing out binary data, and the
build wasn’t even working correctly.</p>

<p>With an additional decade of experience under my belt, I knew I could do
<em>way</em> better if I were to rewrite these tools today. So, over the course
of a few days, I did, from scratch. That’s what’s visible in the master
branch today.</p>

<p><a href="/blog/2017/03/30/">I like to keep things simple</a>, which meant no more Autoconf, and
instead <a href="/blog/2017/08/20/">a simple, portable Makefile</a>. No more yacc or lex, and
instead a hand-coded parser. Using only conforming, portable C. The
result was so simple that I can <a href="/blog/2016/06/13/">build using Visual Studio</a> in a
single, short command, so the Makefile isn’t all that necessary. With
one small tweak (replace <code class="language-plaintext highlighter-rouge">stdint.h</code> with a <code class="language-plaintext highlighter-rouge">typedef</code>), I can even <a href="/blog/2018/04/13/">build
and run binitools in DOS</a>.</p>

<p>The new version is faster, leaner, cleaner, and simpler. It’s far more
flexible about its INI input, so it’s easier to use. But is it more
correct?</p>

<h3 id="fuzzing">Fuzzing</h3>

<p>I’ve been interested in <a href="https://labs.mwrinfosecurity.com/blog/what-the-fuzz/">fuzzing</a> for years, especially
<a href="http://lcamtuf.coredump.cx/afl/">american fuzzy lop</a>, or <em>afl</em>. However, I wasn’t having success
with it. I’d fuzz some of the tools I use regularly, and it wouldn’t
find anything of note, at least not before I gave up. I fuzzed <a href="https://github.com/skeeto/pdjson">my
JSON library</a>, and somehow it turned up nothing. Surely my
JSON parser couldn’t be <em>that</em> robust already, could it? Fuzzing just
wasn’t accomplishing anything for me. (As it turns out, my JSON
library <em>is</em> quite robust, thanks in large part to various
contributors!)</p>

<p>So I’ve got this relatively new INI parser, and while it can
successfully parse and correctly re-assemble the game’s original set of
BINI files, it hasn’t <em>really</em> been exercised that much. Surely there’s
something in here for a fuzzer to find. Plus I don’t even have to write
a line of code in order to run afl against it. The tools already read
from standard input by default, which is perfect.</p>

<p>Assuming you’ve got the necessary tools installed (make, gcc, afl),
here’s how easy it is to start fuzzing binitools:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make CC=afl-gcc
$ mkdir in out
$ echo '[x]' &gt; in/empty
$ afl-fuzz -i in -o out -- ./bini
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">bini</code> utility takes INI as input and produces BINI as output, so
it’s far more interesting to fuzz than its inverse, <code class="language-plaintext highlighter-rouge">unbini</code>. Since
<code class="language-plaintext highlighter-rouge">unbini</code> parses relatively simple binary data, there are (probably) no
bugs for the fuzzer to find. I did try anyway just in case.</p>

<p><img src="/img/screenshot/afl.png" alt="" /></p>

<p>In my example above, I swapped out the default compiler for afl’s GCC
wrapper (<code class="language-plaintext highlighter-rouge">CC=afl-gcc</code>). It calls GCC in the background, but in doing so
adds its own instrumentation to the binary. When fuzzing, <code class="language-plaintext highlighter-rouge">afl-fuzz</code>
uses that instrumentation to monitor the program’s execution path. The
<a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">afl whitepaper</a> explains the technical details.</p>

<p>I also created input and output directories, placing a minimal, working
example into the input directory, which gives afl a starting point. As
afl runs, it mutates a queue of inputs and observes the changes on the
program’s execution. The output directory contains the results and, more
importantly, a corpus of inputs that cause unique execution paths. In
other words, the fuzzer output will be lots of inputs that exercise many
different edge cases.</p>

<p>The most exciting and dreaded result is a crash. The first time I ran it
against binitools, <code class="language-plaintext highlighter-rouge">bini</code> had <em>many</em> such crashes. Within minutes, afl
was finding a number of subtle and interesting bugs in my program, which
was <em>incredibly</em> useful. It even discovered an unlikely <a href="https://github.com/skeeto/binitools/commit/b695aec7d0021299cbd83c8c6983055f16d11507">stale pointer
bug</a> by exercising different orderings for various memory
allocations. This particular bug was the turning point that made me
realize the value of fuzzing.</p>
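
<p>To illustrate the general class of bug, here is a hypothetical sketch. It
is not the binitools code and not the bug linked above. A pointer into a
growable buffer goes stale as soon as <code class="language-plaintext highlighter-rouge">realloc()</code> moves the buffer, and
whether it moves depends on the order and size of earlier allocations. One
common defense, assumed here, is to hand out offsets rather than pointers.
The names <code class="language-plaintext highlighter-rouge">pool_push</code> and <code class="language-plaintext highlighter-rouge">pool_get</code> are made up for this example.</p>

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string pool; not taken from binitools. */
struct pool {
    char  *buf;
    size_t len;
    size_t cap;
};

/* Append a NUL-terminated string and return its OFFSET in the pool,
 * or (size_t)-1 if allocation fails. Returning a pointer into buf
 * instead would leave the caller with a stale pointer the next time
 * realloc() moves the buffer. */
static size_t
pool_push(struct pool *p, const char *s)
{
    size_t n = strlen(s) + 1;
    if (p->cap - p->len < n) {
        size_t cap = p->cap ? p->cap : 16;
        while (cap - p->len < n)
            cap *= 2;
        char *buf = realloc(p->buf, cap);
        if (!buf)
            return (size_t)-1;
        p->buf = buf;
        p->cap = cap;
    }
    memcpy(p->buf + p->len, s, n);
    p->len += n;
    return p->len - n;
}

/* Resolve an offset to a pointer only at the moment of use. */
static const char *
pool_get(const struct pool *p, size_t off)
{
    return p->buf + off;
}
```

<p>A zero-initialized <code class="language-plaintext highlighter-rouge">struct pool</code> is a valid empty pool. Offsets stay
valid across any number of later pushes, which is exactly the property a
saved pointer lacks.</p>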

<p>Not all the bugs it found led to crashes. I also combed through the
outputs to see what sorts of inputs were succeeding, what was failing,
and observe how my program handled various edge cases. It was rejecting
some inputs I thought should be valid, accepting some I thought should
be invalid, and interpreting some in ways I hadn’t intended. So even
after I fixed the crashing inputs, I still made tweaks to the parser to
fix each of these troublesome inputs.</p>

<h3 id="building-a-test-suite">Building a test suite</h3>

<p>Once I combed out all the fuzzer-discovered bugs, and I agreed with the
parser on how all the various edge cases should be handled, I turned the
fuzzer’s corpus into a test suite — though not directly.</p>

<p>First, I had run the fuzzer in parallel, a process that is explained in the
afl documentation, so I had lots of redundant inputs. By redundant I
mean that the inputs are different but have the same execution path.
Fortunately afl has a tool to deal with this: <code class="language-plaintext highlighter-rouge">afl-cmin</code>, the corpus
minimization tool. It eliminates all the redundant inputs.</p>

<p>Second, many of these inputs were longer than necessary in order to
invoke their unique execution path. There’s <code class="language-plaintext highlighter-rouge">afl-tmin</code>, the test case
minimizer, which I used to further shrink my test corpus.</p>

<p>I sorted the valid from invalid inputs and checked them into the
repository. Have a look at all the wacky inputs <a href="https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html">invented by</a> the
fuzzer starting from my single, minimal input:</p>

<ul>
  <li><a href="https://github.com/skeeto/binitools/tree/master/tests/valid">valid inputs</a></li>
  <li><a href="https://github.com/skeeto/binitools/tree/master/tests/invalid">invalid inputs</a></li>
</ul>

<p>This essentially locks down the parser, and the test suite ensures a
particular build behaves in a <em>very</em> specific way. This is most useful
for ensuring that builds on other platforms and by other compilers are
indeed behaving identically with respect to their outputs. My test suite
even revealed a bug in diet libc, as binitools doesn’t pass the tests
when linked against it. If I were to make non-trivial changes to the
parser, I’d essentially need to scrap the current test suite and start
over, having afl generate an entire new corpus for the new parser.</p>

<p>Fuzzing has certainly proven itself to be a powerful technique. It found
a number of bugs that I likely wouldn’t have otherwise discovered on my
own. I’ve since gotten more savvy on its use and have used it on other
software — not just software I’ve written myself — and discovered more
bugs. It’s got a permanent slot on my software developer toolbelt.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A JavaScript Typed Array Gotcha</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/01/23/"/>
    <id>urn:uuid:da8bd8c0-6003-3d77-5409-16db63420368</id>
    <updated>2019-01-23T02:50:30Z</updated>
    <category term="c"/><category term="javascript"/><category term="lang"/>
    <content type="html">
      <![CDATA[<p>JavaScript’s prefix increment and decrement operators can be
surprising when applied to typed arrays. It caught me by surprise when
I was <a href="https://github.com/skeeto/ulid-c">porting some C code</a> over <a href="https://github.com/skeeto/ulid-js">to JavaScript</a>. Just
using your brain to execute this code, what do you believe is the
value of <code class="language-plaintext highlighter-rouge">r</code>?</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">array</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Uint8Array</span><span class="p">([</span><span class="mi">255</span><span class="p">]);</span>
<span class="kd">let</span> <span class="nx">r</span> <span class="o">=</span> <span class="o">++</span><span class="nx">array</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
</code></pre></div></div>

<p>The increment and decrement operators originated in the B programming
language. Its closest living relative today is C, and, as far as these
operators are concerned, C can be considered an ancestor of JavaScript.
So what is the value of <code class="language-plaintext highlighter-rouge">r</code> in this similar C code?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint8_t</span> <span class="n">array</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">255</span><span class="p">};</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">++</span><span class="n">array</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
</code></pre></div></div>

<p>Of course, if they were the same then there would be nothing to write
about, so that should make it easier to guess if you aren’t sure. The
answer: In JavaScript, <code class="language-plaintext highlighter-rouge">r</code> is 256. In C, <code class="language-plaintext highlighter-rouge">r</code> is 0.</p>

<p>What happened to me was that I wrote an 80-bit integer increment
routine in C like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint8_t</span> <span class="n">array</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span>
<span class="cm">/* ... */</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">9</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">++</span><span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
        <span class="k">break</span><span class="p">;</span>
</code></pre></div></div>
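
<p>To make the carry behavior concrete, here is the same loop wrapped in a
function. This is a sketch: the name <code class="language-plaintext highlighter-rouge">increment80</code> and the overflow return
value are my additions for illustration, and the bytes are assumed to be
stored big-endian, as the loop direction implies.</p>

```c
#include <stdint.h>

/* Increment a big-endian 80-bit integer stored in ten bytes.
 * Returns 1 if the whole value wrapped around to zero, otherwise 0. */
static int
increment80(uint8_t array[10])
{
    for (int i = 9; i >= 0; i--)
        if (++array[i])  /* nonzero result: no carry out of this byte */
            return 0;
    return 1;            /* every byte wrapped from 255 to 0 */
}
```

<p>The loop only continues to the next byte when the current one wraps to
zero, so it relies on the expression <code class="language-plaintext highlighter-rouge">++array[i]</code> evaluating to the stored,
wrapped value.</p>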

<p>But I was getting the wrong result over in JavaScript from essentially
the same code:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">array</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Uint8Array</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="cm">/* ... */</span>
<span class="k">for</span> <span class="p">(</span><span class="kd">let</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">9</span><span class="p">;</span> <span class="nx">i</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span><span class="o">--</span><span class="p">)</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">++</span><span class="nx">array</span><span class="p">[</span><span class="nx">i</span><span class="p">])</span>
        <span class="k">break</span><span class="p">;</span>
</code></pre></div></div>

<p>So what’s going on here?</p>

<h3 id="javascript-specification">JavaScript specification</h3>

<p>The ES5 specification says this about <a href="https://es5.github.io/#x11.4.4">the prefix increment
operator</a>:</p>

<blockquote>
  <p>Let <em>expr</em> be the result of evaluating UnaryExpression.</p>

  <ol>
    <li>
      <p>Throw a SyntaxError exception if the following conditions are all
true: [omitted]</p>
    </li>
    <li>
      <p>Let <em>oldValue</em> be ToNumber(GetValue(<em>expr</em>)).</p>
    </li>
    <li>
      <p>Let <em>newValue</em> be the result of adding the value 1 to <em>oldValue</em>,
using the same rules as for the + operator (see 11.6.3).</p>
    </li>
    <li>
      <p>Call PutValue(<em>expr</em>, <em>newValue</em>).</p>
    </li>
  </ol>

  <p>Return <em>newValue</em>.</p>
</blockquote>

<p>So, <em>oldValue</em> is 255. This is a double precision float because all
numbers in JavaScript (outside of the bitwise operations) are double
precision floating point. Add 1 to this value to get 256, which is
<em>newValue</em>. When <em>newValue</em> is stored in the array via PutValue(), it’s
converted to an unsigned 8-bit integer, which truncates it to 0.</p>

<p>However, <em>newValue</em> is returned, not the value that was actually stored
in the array!</p>
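
<p>The steps quoted above can be modeled in C. This is only an illustration
under stated assumptions: a <code class="language-plaintext highlighter-rouge">double</code> stands in for a JavaScript number, the
store is modeled as a wrap modulo 256, and the function name
<code class="language-plaintext highlighter-rouge">js_prefix_increment</code> is made up.</p>

```c
#include <stdint.h>

/* Model of the ES5 prefix increment applied to a Uint8Array element.
 * Illustration only: a double stands in for a JavaScript number. */
static double
js_prefix_increment(uint8_t *element)
{
    double old_value = *element;       /* ToNumber(GetValue(expr)) */
    double new_value = old_value + 1;  /* e.g. 255 + 1 = 256 */

    /* PutValue(): the stored value wraps modulo 256. Go through
     * unsigned first, since converting an out-of-range double directly
     * to uint8_t is undefined in C. */
    *element = (uint8_t)(unsigned)new_value;

    return new_value;                  /* the unwrapped value */
}
```

<p>Starting from an element holding 255, the function returns 256 while the
element itself ends up holding 0.</p>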

<p>Since JavaScript is dynamically typed, this difference did not
actually matter until typed arrays were introduced. I suspect if typed
arrays were in JavaScript from the beginning, the specified behavior
would be more in line with C.</p>

<p>This behavior isn’t limited to the prefix operators. Consider
assignment:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">array</span> <span class="o">=</span> <span class="k">new</span> <span class="nb">Uint8Array</span><span class="p">([</span><span class="mi">255</span><span class="p">]);</span>
<span class="kd">let</span> <span class="nx">r</span> <span class="o">=</span> <span class="p">(</span><span class="nx">array</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="nx">array</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="kd">let</span> <span class="nx">s</span> <span class="o">=</span> <span class="p">(</span><span class="nx">array</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>

<p>Both <code class="language-plaintext highlighter-rouge">r</code> and <code class="language-plaintext highlighter-rouge">s</code> will still be 256. The result of the assignment
operators is a similar story:</p>

<blockquote>
  <p>LeftHandSideExpression = AssignmentExpression is evaluated as
 follows:</p>

  <ol>
    <li>
      <p>Let <em>lref</em> be the result of evaluating LeftHandSideExpression.</p>
    </li>
    <li>
      <p>Let <em>rref</em> be the result of evaluating AssignmentExpression.</p>
    </li>
    <li>
      <p>Let <em>rval</em> be GetValue(<em>rref</em>).</p>
    </li>
    <li>
      <p>Throw a SyntaxError exception if the following conditions are all
true: [omitted]</p>
    </li>
    <li>
      <p>Call PutValue(<em>lref</em>, <em>rval</em>).</p>
    </li>
    <li>
      <p>Return <em>rval</em>.</p>
    </li>
  </ol>
</blockquote>

<p>Again, the result of the expression is independent of how it was stored
with PutValue().</p>

<h3 id="c-specification">C specification</h3>

<p>I’ll be referencing the original C89/C90 standard. The C specification
requires a little more work to get to the bottom of the issue. Starting
with 3.3.3.1 (Prefix increment and decrement operators):</p>

<blockquote>
  <p>The value of the operand of the prefix ++ operator is incremented. The
result is the new value of the operand after incrementation. The
expression ++E is equivalent to (E+=1).</p>
</blockquote>

<p>Later in 3.3.16.2 (Compound assignment):</p>

<blockquote>
  <p>A compound assignment of the form E1 op = E2 differs from the simple
assignment expression E1 = E1 op (E2) only in that the lvalue E1 is
evaluated only once.</p>
</blockquote>

<p>Then finally in 3.3.16 (Assignment operators):</p>

<blockquote>
  <p>An assignment operator stores a value in the object designated by
the left operand. An assignment expression has the value of the left
operand <strong>after the assignment</strong>, but is not an lvalue.</p>
</blockquote>

<p>So the result is explicitly the value after assignment. Let’s look at
this step by step after rewriting the expression.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">array</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>

<p>In C, all integer operations are performed with <em>at least</em> <code class="language-plaintext highlighter-rouge">int</code>
precision. Smaller integers are implicitly promoted to <code class="language-plaintext highlighter-rouge">int</code> before the
operation. The value of array[0] is 255, and, since <code class="language-plaintext highlighter-rouge">uint8_t</code> is smaller
than <code class="language-plaintext highlighter-rouge">int</code>, it gets promoted to <code class="language-plaintext highlighter-rouge">int</code>. Additionally, the literal
constant 1 is also an <code class="language-plaintext highlighter-rouge">int</code>, so there are actually two reasons for this
promotion.</p>

<p>So since these are <code class="language-plaintext highlighter-rouge">int</code> values, the result of the addition is 256, like
in JavaScript. To store the result, this value is then demoted to
<code class="language-plaintext highlighter-rouge">uint8_t</code> and truncated to 0. Finally, this post-assignment 0 is the
result of the expression, not the right-hand result as in JavaScript.</p>
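
<p>Spelled out one step at a time, the C behavior looks roughly like this.
The function name <code class="language-plaintext highlighter-rouge">c_prefix_increment</code> is hypothetical. It is meant to
behave the same as <code class="language-plaintext highlighter-rouge">++*element</code> on a platform where <code class="language-plaintext highlighter-rouge">int</code> is wider than
8 bits.</p>

```c
#include <stdint.h>

/* Step-by-step equivalent of ++*element for a uint8_t operand. */
static int
c_prefix_increment(uint8_t *element)
{
    int promoted = *element;  /* integer promotion: uint8_t to int */
    int sum = promoted + 1;   /* 255 + 1 = 256, computed as an int */
    *element = (uint8_t)sum;  /* conversion on store wraps 256 to 0 */
    return *element;          /* value of the left operand after assignment */
}
```

<p>The last line is the one that differs from JavaScript: the result is
read back from the stored object rather than taken from <code class="language-plaintext highlighter-rouge">sum</code>.</p>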

<h3 id="specifications-are-useful">Specifications are useful</h3>

<p>These situations are why I prefer programming languages that have a
formal and approachable specification. If there’s no specification and
I’m observing <a href="https://old.reddit.com/r/matlab/comments/ae68bh/_/edmysxr/">undocumented, idiosyncratic behavior</a>, is this
just some subtle quirk of the current implementation — e.g. something
that might change without notice in the future — or is it intended
behavior that I can rely upon for correctness?</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A Survey of $RANDOM</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/12/25/"/>
    <id>urn:uuid:071e3ec5-fe1d-309a-3e66-3b590a96ac2c</id>
    <updated>2018-12-25T00:05:38Z</updated>
    <category term="linux"/><category term="bsd"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Most Bourne shell clones support a special <code class="language-plaintext highlighter-rouge">RANDOM</code> environment
variable that evaluates to a random value between 0 and 32,767 (i.e.
15 bits). Assignment to the variable seeds the generator. This variable
is an extension and <a href="http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/utilities/V3_chap02.html">did not appear</a> in the original Unix Bourne
shell. Despite this, the different Bourne-like shells that implement
it have converged to the same interface, but <em>only</em> the interface.
Each implementation differs in interesting ways. In this article we’ll
explore how <code class="language-plaintext highlighter-rouge">$RANDOM</code> is implemented in various Bourne-like shells.</p>

<p><del>Unfortunately I was unable to determine the origin of <code class="language-plaintext highlighter-rouge">$RANDOM</code>.</del>
Nobody was doing a good job tracking source code changes before the
mid-1990s, so that history appears to be lost. Bash was first released
in 1989, but the earliest version I could find was 1.14.7, released in 1996.
KornShell was first released in 1983, but the earliest source I could
find <a href="https://web.archive.org/web/20120613182836/http://www.research.att.com/sw/download/man/man1/ksh.html">was from 1993</a>. In both cases <code class="language-plaintext highlighter-rouge">$RANDOM</code> already existed. My
guess is that it first appeared in one of these two shells, probably
KornShell.</p>

<p><strong>Update</strong>: Quentin Barnes has informed me that his 1986 copy of
KornShell (a.k.a. ksh86) implements <code class="language-plaintext highlighter-rouge">$RANDOM</code>. This predates Bash and
makes it likely that this feature originated in KornShell.</p>

<h3 id="bash">Bash</h3>

<p>Of all the shells I’m going to discuss, Bash has the most interesting
history. It never made use of <code class="language-plaintext highlighter-rouge">srand(3)</code> / <code class="language-plaintext highlighter-rouge">rand(3)</code> and instead
uses its own generator — which is generally <a href="/blog/2017/09/21/">what I prefer</a>. Prior
to Bash 4.0, it used the crummy linear congruential generator (LCG)
<a href="http://port70.net/~nsz/c/c89/c89-draft.html#4.10.2.2">found in the C89 standard</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">rseed</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">brand</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="n">rseed</span> <span class="o">=</span> <span class="n">rseed</span> <span class="o">*</span> <span class="mi">1103515245</span> <span class="o">+</span> <span class="mi">12345</span><span class="p">;</span>
  <span class="k">return</span> <span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">)((</span><span class="n">rseed</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">32767</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For some reason it was naïvely decided that <code class="language-plaintext highlighter-rouge">$RANDOM</code> should never
produce the same value twice in a row. The caller of <code class="language-plaintext highlighter-rouge">brand()</code> filters
the output and discards repeats before returning to the shell script.
This actually <em>reduces</em> the quality of the generator further since it
increases correlation between separate outputs.</p>
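
<p>A minimal sketch of that filtering, assuming a caller shaped like the one
described. This is not Bash’s literal code, and the names <code class="language-plaintext highlighter-rouge">get_random</code> and
<code class="language-plaintext highlighter-rouge">last_value</code> are made up for illustration.</p>

```c
static unsigned long rseed = 1;

/* The C89 sample LCG, as in the listing above. */
static int
brand(void)
{
    rseed = rseed * 1103515245 + 12345;
    return (int)((rseed >> 16) & 32767);
}

/* Hypothetical caller: draw again until the value differs from the
 * previous result, so the same number never appears twice in a row. */
static int last_value = -1;

static int
get_random(void)
{
    int r;
    do
        r = brand();
    while (r == last_value);
    last_value = r;
    return r;
}
```

<p>With the filter in place, the chance that an output equals its
predecessor is exactly zero instead of about 1 in 32,768, so each output
now carries information about the previous one.</p>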

<p>When the shell starts up, <code class="language-plaintext highlighter-rouge">rseed</code> is seeded from the PID and the current
time in seconds. These values are literally summed and used as the seed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Note: not the literal code, but equivalent. */</span>
<span class="n">rseed</span> <span class="o">=</span> <span class="n">getpid</span><span class="p">()</span> <span class="o">+</span> <span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Subshells, which fork and initially share an <code class="language-plaintext highlighter-rouge">rseed</code>, are given similar
treatment:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rseed</span> <span class="o">=</span> <span class="n">rseed</span> <span class="o">+</span> <span class="n">getpid</span><span class="p">()</span> <span class="o">+</span> <span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Notice there’s no <a href="/blog/2018/07/31/">hashing</a> or <a href="http://www.pcg-random.org/posts/developing-a-seed_seq-alternative.html">mixing</a> of these values, so
there’s no avalanche effect. That would have prevented shells that start
around the same time from having related initial random sequences.</p>
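
<p>Here is a small sketch of how related those sequences are. It assumes
two shells started in the same second with adjacent PIDs. The helper names
<code class="language-plaintext highlighter-rouge">seed_sum</code> and <code class="language-plaintext highlighter-rouge">first_output</code> are hypothetical, and <code class="language-plaintext highlighter-rouge">first_output</code>
applies one step of the C89 LCG shown earlier.</p>

```c
/* Seed as in the scheme described above: a plain sum. */
static unsigned long
seed_sum(unsigned long pid, unsigned long now)
{
    return pid + now;
}

/* First value the C89 LCG produces from a given seed. */
static int
first_output(unsigned long seed)
{
    seed = seed * 1103515245 + 12345;
    return (int)((seed >> 16) & 32767);
}

/* Distance between the first outputs of two shells, modulo 32768. */
static int
first_output_gap(unsigned long seed_a, unsigned long seed_b)
{
    int d = first_output(seed_b) - first_output(seed_a);
    return (d + 32768) % 32768;
}
```

<p>Seeds that differ by one produce states that differ by the multiplier,
so <code class="language-plaintext highlighter-rouge">first_output_gap(s, s + 1)</code> is always 16838 or 16839. Knowing one
shell’s first value nearly determines the other’s.</p>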

<p>With Bash 4.0, released in 2009, the algorithm was changed to a
<a href="http://www.firstpr.com.au/dsp/rand31/p1192-park.pdf">Park–Miller multiplicative LCG</a> from 1988:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span>
<span class="nf">brand</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="kt">long</span> <span class="n">h</span><span class="p">,</span> <span class="n">l</span><span class="p">;</span>

  <span class="cm">/* can't seed with 0. */</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">rseed</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">rseed</span> <span class="o">=</span> <span class="mi">123459876</span><span class="p">;</span>
  <span class="n">h</span> <span class="o">=</span> <span class="n">rseed</span> <span class="o">/</span> <span class="mi">127773</span><span class="p">;</span>
  <span class="n">l</span> <span class="o">=</span> <span class="n">rseed</span> <span class="o">%</span> <span class="mi">127773</span><span class="p">;</span>
  <span class="n">rseed</span> <span class="o">=</span> <span class="mi">16807</span> <span class="o">*</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">2836</span> <span class="o">*</span> <span class="n">h</span><span class="p">;</span>
  <span class="k">return</span> <span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">)(</span><span class="n">rseed</span> <span class="o">&amp;</span> <span class="mi">32767</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s actually a subtle mistake in this implementation compared to the
generator described in the paper. This function will generate different
numbers than the paper, and it will generate different numbers on
different hosts! More on that later.</p>
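
<p>For comparison, here is a sketch of the generator as the paper defines
it: multiply by 16807 modulo 2<sup>31</sup> − 1. It assumes 64-bit integer
arithmetic is available, which Bash itself cannot assume, and the name
<code class="language-plaintext highlighter-rouge">minstd_next</code> is made up for this example.</p>

```c
#include <stdint.h>

/* Park-Miller "minimal standard": next = 16807 * seed mod (2^31 - 1).
 * The seed must be in the range [1, 2^31 - 2]. The 64-bit intermediate
 * avoids overflow without the divide-and-subtract trick. */
static int32_t
minstd_next(int32_t seed)
{
    return (int32_t)((int64_t)seed * 16807 % 2147483647);
}
```

<p>The paper supplies a check value for validating an implementation:
starting from a seed of 1, the 10,000th successive value is 1043618065.</p>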

<p>This algorithm is a <a href="http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html">much better choice</a> than the previous LCG.
There were many more options available in 2009 compared to 1989, but,
honestly, this generator is pretty reasonable for this application.
Bash is <em>so slow</em> that you’re never practically going to generate
enough numbers for the small state to matter. Since the Park–Miller
algorithm is older than Bash, they could have used this in the first
place.</p>

<p>I considered submitting a patch to switch to something more modern.
However, given Bash’s constraints, it’s easier said than done.
Portability to weird systems is still a concern, and I expect they’d
reject a patch that started making use of <code class="language-plaintext highlighter-rouge">long long</code> in the PRNG.
They still support pre-ANSI C compilers that don’t have 64-bit
arithmetic.</p>

<p>However, what still really <em>could</em> be improved is seeding. In Bash 4.x
here’s what it looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">seedrand</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv</span><span class="p">;</span>

  <span class="n">gettimeofday</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">tv</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
  <span class="n">sbrand</span> <span class="p">(</span><span class="n">tv</span><span class="p">.</span><span class="n">tv_sec</span> <span class="o">^</span> <span class="n">tv</span><span class="p">.</span><span class="n">tv_usec</span> <span class="o">^</span> <span class="n">getpid</span> <span class="p">());</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Seeding is both better and worse. It’s better that it’s seeded from a
higher resolution clock (microseconds), so two shells started close in
time have more variation. However, it’s “mixed” with XOR, which, in
this case, is worse than addition.</p>

<p>For example, imagine two Bash shells started one microsecond apart. Both
<code class="language-plaintext highlighter-rouge">tv_usec</code> and <code class="language-plaintext highlighter-rouge">getpid()</code> are incremented by one. Those increments are
likely to cancel each other out by an XOR, and they end up with the same
seed.</p>
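
<p>The cancellation is easy to reproduce. This sketch applies the same XOR
expression to plain integers, with the hypothetical name <code class="language-plaintext highlighter-rouge">seed_xor</code>.
Whenever <code class="language-plaintext highlighter-rouge">tv_usec</code> and the PID are both even, incrementing each of them
flips only the lowest bit of each, and the two flips cancel.</p>

```c
/* The Bash 4.x seed expression from the listing above, applied to
 * plain integers so that collisions are easy to demonstrate. */
static unsigned long
seed_xor(unsigned long sec, unsigned long usec, unsigned long pid)
{
    return sec ^ usec ^ pid;
}
```

<p>For instance, <code class="language-plaintext highlighter-rouge">seed_xor(t, 250000, 4242)</code> and
<code class="language-plaintext highlighter-rouge">seed_xor(t, 250001, 4243)</code> give the same seed for any <code class="language-plaintext highlighter-rouge">t</code>.</p>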

<p>Instead, each of those quantities should be hashed before mixing. Here’s
a rough example using my <a href="https://github.com/skeeto/hash-prospector#three-round-functions"><code class="language-plaintext highlighter-rouge">triple32()</code> hash</a> (adapted to glorious
GNU-style pre-ANSI C):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">long</span>
<span class="n">hash32</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span>
     <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">x</span><span class="p">;</span>
<span class="p">{</span>
  <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xed5ad4bbUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xac4c1b51UL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x31848babUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffUL</span><span class="p">;</span>
  <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">14</span><span class="p">;</span>
  <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">seedrand</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv</span><span class="p">;</span>

  <span class="n">gettimeofday</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">tv</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
  <span class="n">sbrand</span> <span class="p">(</span><span class="n">hash32</span> <span class="p">(</span><span class="n">tv</span><span class="p">.</span><span class="n">tv_sec</span><span class="p">)</span> <span class="o">^</span>
          <span class="n">hash32</span> <span class="p">(</span><span class="n">hash32</span> <span class="p">(</span><span class="n">tv</span><span class="p">.</span><span class="n">tv_usec</span><span class="p">)</span> <span class="o">^</span> <span class="n">getpid</span> <span class="p">()));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I had said there’s a mistake in the Bash implementation of
Park–Miller. Take a closer look at the types and the assignment to
<code class="language-plaintext highlighter-rouge">rseed</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="cm">/* The variables */</span>
  <span class="kt">long</span> <span class="n">h</span><span class="p">,</span> <span class="n">l</span><span class="p">;</span>
  <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">rseed</span><span class="p">;</span>

  <span class="cm">/* The assignment */</span>
  <span class="n">rseed</span> <span class="o">=</span> <span class="mi">16807</span> <span class="o">*</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">2836</span> <span class="o">*</span> <span class="n">h</span><span class="p">;</span>
</code></pre></div></div>

<p>The result of the subtraction can be negative, and that negative
value is converted to <code class="language-plaintext highlighter-rouge">unsigned long</code>. The C standard says
<code class="language-plaintext highlighter-rouge">ULONG_MAX + 1</code> is added to make the value positive. <code class="language-plaintext highlighter-rouge">ULONG_MAX</code>
varies by platform (typically <code class="language-plaintext highlighter-rouge">long</code> is either 32 bits or 64 bits),
so the results also vary. Here’s how the paper defined it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kt">long</span> <span class="n">test</span><span class="p">;</span>

  <span class="n">test</span> <span class="o">=</span> <span class="mi">16807</span> <span class="o">*</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">2836</span> <span class="o">*</span> <span class="n">h</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">test</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">rseed</span> <span class="o">=</span> <span class="n">test</span><span class="p">;</span>
  <span class="k">else</span>
    <span class="n">rseed</span> <span class="o">=</span> <span class="n">test</span> <span class="o">+</span> <span class="mi">2147483647</span><span class="p">;</span>
</code></pre></div></div>

<p>As far as I can tell, this mistake doesn’t hurt the quality of the
generator, but it does mean 32-bit and 64-bit builds produce different
sequences from the same seed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ 32/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 13634

$ 64/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 29115
</code></pre></div></div>
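
<p>The divergence can be reproduced on a single machine by pinning the
width of <code class="language-plaintext highlighter-rouge">long</code> with fixed-width types. This is only a sketch, not Bash’s
source: <code class="language-plaintext highlighter-rouge">bash_step32()</code>, <code class="language-plaintext highlighter-rouge">bash_step64()</code>, and <code class="language-plaintext highlighter-rouge">paper_step()</code> are
hypothetical names, and anything else Bash does around the update is
left out.</p>

```c
#include <stdint.h>

/* The update as Bash writes it, with "long" pinned to 32 bits.
   A negative difference wraps modulo 2^32 on assignment. */
static unsigned
bash_step32(uint32_t *rseed)
{
    int32_t h = (int32_t)(*rseed / 127773);
    int32_t l = (int32_t)(*rseed % 127773);
    *rseed = (uint32_t)(16807 * l - 2836 * h);
    return (unsigned)(*rseed & 32767);
}

/* The same update with "long" pinned to 64 bits.
   A negative difference now wraps modulo 2^64 instead. */
static unsigned
bash_step64(uint64_t *rseed)
{
    int64_t h = (int64_t)(*rseed / 127773);
    int64_t l = (int64_t)(*rseed % 127773);
    *rseed = (uint64_t)(16807 * l - 2836 * h);
    return (unsigned)(*rseed & 32767);
}

/* The update as the paper defines it: a non-positive result is
   corrected by adding 2^31 - 1, so the width of long never matters.
   The seed is assumed to be in the range 1 to 2^31 - 2. */
static unsigned
paper_step(int32_t *seed)
{
    int32_t h = *seed / 127773;
    int32_t l = *seed % 127773;
    int32_t test = 16807 * l - 2836 * h;
    *seed = test > 0 ? test : test + 2147483647;
    return (unsigned)(*seed & 32767);
}
```

<p>Starting from a seed of 127773, the first two functions agree on the
first value and part ways on the second, the same pattern as the two
runs above. <code class="language-plaintext highlighter-rouge">paper_step()</code> keeps its state within 31 bits regardless of
platform.</p>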

<h3 id="zsh">Zsh</h3>

<p>In contrast to Bash, Zsh is the most straightforward: defer to
<code class="language-plaintext highlighter-rouge">rand(3)</code>. Its <code class="language-plaintext highlighter-rouge">$RANDOM</code> can return the same value twice in a row,
assuming that <code class="language-plaintext highlighter-rouge">rand(3)</code> does.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">zlong</span>
<span class="nf">randomgetfn</span><span class="p">(</span><span class="n">UNUSED</span><span class="p">(</span><span class="n">Param</span> <span class="n">pm</span><span class="p">))</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">rand</span><span class="p">()</span> <span class="o">&amp;</span> <span class="mh">0x7fff</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">randomsetfn</span><span class="p">(</span><span class="n">UNUSED</span><span class="p">(</span><span class="n">Param</span> <span class="n">pm</span><span class="p">),</span> <span class="n">zlong</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">srand</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">)</span><span class="n">v</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A cool consequence is that you could override it, if you wanted, with <a href="https://xkcd.com/221/">a
custom generator</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">rand</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// chosen by fair dice roll.</span>
              <span class="c1">// guaranteed to be random.</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Usage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -fPIC -o rand.so rand.c
$ LD_PRELOAD=./rand.so zsh -c 'echo $RANDOM $RANDOM $RANDOM'
4 4 4
</code></pre></div></div>

<p>This trick also applies to the rest of the shells below.</p>

<h3 id="kornshell-ksh">KornShell (ksh)</h3>

<p>KornShell originated in 1983, but it was finally released under an open
source license in 2005. There’s a clone of KornShell called Public
Domain Korn Shell (pdksh) that’s been forked a dozen different ways, but
I’ll get to that next.</p>

<p>KornShell defers to <code class="language-plaintext highlighter-rouge">rand(3)</code>, but it does some additional naïve
filtering on the output. When the shell starts up, it generates 10
values from <code class="language-plaintext highlighter-rouge">rand()</code>. If any of them is larger than 32,767, then every
number it generates is shifted right by three.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define RANDMASK 0x7fff
</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// Don't use lower bits when rand() generates large numbers.</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">&gt;</span> <span class="n">RANDMASK</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">rand_shift</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Why not just look at <code class="language-plaintext highlighter-rouge">RAND_MAX</code>? I guess they didn’t think of it.</p>

<p><strong>Update</strong>: Quentin Barnes pointed out that <code class="language-plaintext highlighter-rouge">RAND_MAX</code> didn’t exist
until POSIX standardization in 1988. The constant <a href="https://github.com/dspinellis/unix-history-repo/commit/1cc1b02a4361">first appeared in
Unix in 1990</a>. This KornShell code either predates the standard
or needed to work on systems that predate the standard.</p>
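
<p>On a system that does define <code class="language-plaintext highlighter-rouge">RAND_MAX</code>, the probe could be replaced
by a direct comparison. A minimal sketch: <code class="language-plaintext highlighter-rouge">rand_shift_for()</code> is a
hypothetical name, not KornShell’s code, and the shift of three is
simply carried over from the original.</p>

```c
#include <stdlib.h>

#define RANDMASK 0x7fff

/* Hypothetical replacement for the start-up probe: choose the shift
   from the generator's advertised maximum instead of sampling it. */
static int
rand_shift_for(unsigned long max)
{
    return max > RANDMASK ? 3 : 0;
}
```

<p>The shell would then set <code class="language-plaintext highlighter-rouge">rand_shift = rand_shift_for(RAND_MAX)</code> once
at start-up, with no sampling involved.</p>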

<p>Like Bash, repeated values are not allowed. I suspect one shell got this
idea from the other.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">do</span> <span class="p">{</span>
        <span class="n">cur</span> <span class="o">=</span> <span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">&gt;&gt;</span> <span class="n">rand_shift</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">RANDMASK</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">cur</span> <span class="o">==</span> <span class="n">last</span><span class="p">);</span>
</code></pre></div></div>

<p>Who came up with this strange idea first?</p>

<h3 id="openbsds-public-domain-korn-shell-pdksh">OpenBSD’s Public Domain Korn Shell (pdksh)</h3>

<p>I picked the OpenBSD variant of pdksh since it’s the only pdksh fork I
ever touch in practice, and its <code class="language-plaintext highlighter-rouge">$RANDOM</code> is the most interesting of the
pdksh forks — at least since 2014.</p>

<p>Like Zsh, pdksh simply defers to <code class="language-plaintext highlighter-rouge">rand(3)</code>. However, OpenBSD’s <code class="language-plaintext highlighter-rouge">rand(3)</code>
is <a href="https://marc.info/?l=openbsd-tech&amp;m=141807224826859&amp;w=2">infamously and proudly non-standard</a>. By default it returns
<em>non-deterministic</em>, cryptographic-quality results seeded from system
entropy (via the misnamed <a href="https://man.openbsd.org/arc4random.3"><code class="language-plaintext highlighter-rouge">arc4random(3)</code></a>), à la <code class="language-plaintext highlighter-rouge">/dev/urandom</code>.
Its <code class="language-plaintext highlighter-rouge">$RANDOM</code> inherits this behavior.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">setint</span><span class="p">(</span><span class="n">vp</span><span class="p">,</span> <span class="p">(</span><span class="kt">int64_t</span><span class="p">)</span> <span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">&amp;</span> <span class="mh">0x7fff</span><span class="p">));</span>
</code></pre></div></div>

<p>However, if a value is assigned to <code class="language-plaintext highlighter-rouge">$RANDOM</code> in order to seed it, it
reverts to its old pre-2014 deterministic generation via
<a href="https://man.openbsd.org/rand"><code class="language-plaintext highlighter-rouge">srand_deterministic(3)</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">srand_deterministic</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">)</span><span class="n">intval</span><span class="p">(</span><span class="n">vp</span><span class="p">));</span>
</code></pre></div></div>

<p>OpenBSD’s deterministic <code class="language-plaintext highlighter-rouge">rand(3)</code> is the crummy LCG from the C89
standard, just like Bash 3.x. So if you assign to <code class="language-plaintext highlighter-rouge">$RANDOM</code>, you’ll get
nearly the same results as Bash 3.x and earlier — the only difference
being that it can repeat numbers.</p>

<p>That’s a slick upgrade to the old interface without breaking anything,
making it my favorite version of <code class="language-plaintext highlighter-rouge">$RANDOM</code> for any shell.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Why Aren't There C Conferences?</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/11/21/"/>
    <id>urn:uuid:78211c4b-f553-313d-659f-15cda1339893</id>
    <updated>2018-11-21T17:25:45Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was <a href="https://news.ycombinator.com/item?id=18504879">discussed on Hacker News</a>.</em></p>

<p>Most widely-used programming languages have at least one regular
conference dedicated to discussing it. Heck, even <a href="https://www.european-lisp-symposium.org/">Lisp has
one</a>. It’s a place to talk about the latest developments of the
language, recent and upcoming standards, and so on. However, C is a
notable exception. Despite <a href="https://skeeto.s3.amazonaws.com/share/onward17-essays2.pdf">its role as the foundation</a> of the
entire software ecosystem, there aren’t any regular conferences about
C. I have a couple of theories about why.</p>

<p>First, C is so fundamental and ubiquitous that a conference about C
would be too general. There are so many different uses ranging across
embedded development, operating system kernels, systems programming,
application development, and, most recently, web development
(WebAssembly). It’s just not a cohesive enough topic. Any conference
that might be about C is instead focused on some particular subset of
its application. It’s not a C conference, it’s a database conference,
or an embedded conference, or a Linux conference, or a BSD conference,
etc.</p>

<p>Second, C has a tendency to be conservative, changing and growing very
slowly. This is a feature, and one that is often undervalued by
developers. (In fact, I’d personally like to see a future revision
that makes the C language specification <em>smaller</em> and <em>simpler</em>,
rather than accumulate more features.) The last major revision to C
happened in 1999 (C99). There was a minor revision in 2011 (C11), and
an even smaller revision in 2018 (C17). If there was a C conference,
recent changes to the language wouldn’t be a very fruitful topic.</p>

<p>However, the <em>tooling</em> has advanced significantly in recent years,
especially with the advent of LLVM and Clang. This is largely driven
by the C++ community, and C has significantly benefited as a side
effect due to its overlap. Those are topics worthy of conferences, but
these are really C++ conferences.</p>

<p>The closest thing we have to a C conference every year is CppCon. A
lot of CppCon isn’t <em>really</em> just about C++, and the subjects of many
of the talks are easily applied to C, since C++ builds so much upon C.
In a sense, <strong>a subset of CppCon could be considered a C conference</strong>.
That’s what I’m looking for when I watch the CppCon presentations each
year on YouTube.</p>

<p>Starting last year, I began a list of all the talks that I thought
would be useful to C programmers. Some are entirely relevant to C,
others just have significant portions that are relevant to C. When
someone asks about where they can find a C conference, I send them my
list.</p>

<p>I’m sharing them here so you can bookmark this page and never return
again.</p>

<h3 id="2017">2017</h3>

<p>Here’s the list for CppCon 2017. These are <em>roughly</em> ordered from
highest to lowest recommendation:</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=bSkpMdDe4g4">Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler’s Lid”</a></li>
  <li><a href="https://www.youtube.com/watch?v=2EWejmkKlxs">Chandler Carruth “Going Nowhere Faster”</a></li>
  <li><a href="https://www.youtube.com/watch?v=JPQWQfDhICA">James McNellis “Everything You Ever Wanted to Know about DLLs”</a></li>
  <li><a href="https://www.youtube.com/watch?v=v1COuU2vU_w">John Regehr “Undefined Behavior in 2017 (part 1 of 2)”</a></li>
  <li><a href="https://www.youtube.com/watch?v=TPyLrJED0zQ">John Regehr “Undefined Behavior in 2017 (part 2 of 2)”</a></li>
  <li><a href="https://www.youtube.com/watch?v=ehyHyAIa5so">Piotr Padlewski “Undefined Behaviour is awesome!” </a></li>
  <li><a href="https://www.youtube.com/watch?v=1HqY9dPccMI">Tobias Fuchs “Multidimensional Index Sets for Data Locality in HPC Applications”</a></li>
  <li><a href="https://www.youtube.com/watch?v=iJ1rwgCI1Xc">John D. Woolverton “C Pointers”</a></li>
  <li><a href="https://www.youtube.com/watch?v=IfUPkUAEwrk">Charles Bailey “Enough x86 Assembly to Be Dangerous”</a></li>
  <li><a href="https://www.youtube.com/watch?v=HP2InVqgBFM">Tony Van Eerd “An Interesting Lock-free Queue - Part 2 of N”</a></li>
  <li><a href="https://www.youtube.com/watch?v=ncHmEUmJZf4">Matt Kulukundis “Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step”</a></li>
  <li><a href="https://www.youtube.com/watch?v=rxQ5K9lo034">Fedor Pikus “Read, Copy, Update, then what? RCU for non-kernel programmers”</a></li>
  <li><a href="https://www.youtube.com/watch?v=GNw3RXr-VJk">Ansel Sermersheim “Multithreading is the answer. What is the question? (part 1 of 2)”</a></li>
  <li><a href="https://www.youtube.com/watch?v=sDLQWivf1-I">Ansel Sermersheim “Multithreading is the answer. What is the question? (part 2 of 2)”</a></li>
  <li><a href="https://www.youtube.com/watch?v=NH1Tta7purM">Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”</a></li>
  <li><a href="https://www.youtube.com/watch?v=h7D88U-5pKc">Alfred Bratterud “Deconstructing the OS: The devil’s In the side effects”</a> (<a href="https://www.joyent.com/blog/unikernels-are-unfit-for-production">counterpoint</a>)</li>
  <li><a href="https://www.youtube.com/watch?v=YM8Xy6oKVQg">P. McKenney, M. Michael &amp; M. Wong “Is Parallel Programming still hard? PART 1 of 2”</a></li>
  <li><a href="https://www.youtube.com/watch?v=74QjNwYAJ7M">P. McKenney, M. Michael &amp; M. Wong “Is Parallel Programming still hard? PART 2 of 2”</a></li>
  <li><a href="https://www.youtube.com/watch?v=6wJ4-wP-nnA">Adrien Devresse “Nix: A functional package manager for your C++ software stack”</a></li>
  <li><a href="https://www.youtube.com/watch?v=Ia3IDPjA-d0">Mathieu Ropert “API &amp; ABI Versioning…”</a></li>
  <li><a href="https://www.youtube.com/watch?v=CVAVKIe7CnY">Paul Blinzer “Heterogeneous System Architecture - Why Should You Care?”</a></li>
  <li><a href="https://www.youtube.com/watch?v=eC9-iRN2b04">Mathieu Ropert “Using Modern CMake Patterns to Enforce a Good Modular Design”</a></li>
  <li><a href="https://www.youtube.com/watch?v=-8UZhDjgeZU">Allan Deutsch “Esoteric Data Structures and Where to Find Them”</a></li>
  <li><a href="https://www.youtube.com/watch?v=xA9yRX4Mdz0">D. Rodriguez-Losada Gonzalez “Faster Delivery of Large C/C++ Projects with…”</a></li>
</ul>

<h3 id="2018">2018</h3>

<p>The final CppCon 2018 videos were uploaded this week, so my 2018
listing can be complete:</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=sBP17HQAQjk">Robert Schumacher “Don’t package your libraries, write packagable libraries!”</a></li>
  <li><a href="https://www.youtube.com/watch?v=yy8jQgmhbAU">Stoyan Nikolov “OOP Is Dead, Long Live Data-oriented Design”</a></li>
  <li><a href="https://www.youtube.com/watch?v=m25p3EtBua4">Fedor Pikus “Design for Performance”</a></li>
  <li><a href="https://www.youtube.com/watch?v=JhUxIVf1qok">JF Bastien “Signed integers are two’s complement”</a></li>
  <li><a href="https://www.youtube.com/watch?v=EovBkh9wDnM">Alan Talbot “Moving Faster: Everyday efficiency in modern C++”</a></li>
  <li><a href="https://www.youtube.com/watch?v=s5PCh_FaMfM">Geoffrey Romer “What do you mean “thread-safe”?”</a></li>
  <li><a href="https://www.youtube.com/watch?v=_f7O3IfIR2k">Chandler Carruth “Spectre: Secrets, Side-Channels, Sandboxes, and Security”</a></li>
  <li><a href="https://www.youtube.com/watch?v=5FQ87-Ecb-A">Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”</a></li>
  <li><a href="https://www.youtube.com/watch?v=8nyq8SNUTSc">Nir Friedman “Understanding Optimizers: Helping the Compiler Help You”</a></li>
  <li><a href="https://www.youtube.com/watch?v=XEXpwis_deQ">Barbara Geller &amp; Ansel Sermersheim “Undefined Behavior is Not an Error” </a></li>
  <li><a href="https://www.youtube.com/watch?v=dOfucXtyEsU">Matt Godbolt “The Bits Between the Bits: How We Get to main()”</a></li>
  <li><a href="https://www.youtube.com/watch?v=DMNFb5ycpNY">Mike Shah “Let’s Intercept OpenGL Function Calls…for Logging!”</a></li>
  <li><a href="https://www.youtube.com/watch?v=lLEcbXidK2o">Kostya Serebryany “Memory Tagging and how it improves C/C++ memory safety”</a></li>
  <li><a href="https://www.youtube.com/watch?v=0S0QgQd75Sw">Patricia Aas “Software Vulnerabilities in C and C++” </a></li>
  <li><a href="https://www.youtube.com/watch?v=IupP8AFrOJk">Patricia Aas “Make It Fixable: Preparing for Security Vulnerability Reports”</a></li>
  <li><a href="https://www.youtube.com/watch?v=V1t6faOKjuQ">Greg Law “Debugging Linux C++”</a></li>
  <li><a href="https://www.youtube.com/watch?v=0DDrseUomfU">Simon Brand “How C++ Debuggers Work”</a></li>
  <li><a href="https://www.youtube.com/watch?v=gcRdG7dGMOw">Odin Holmes “Concurrency Challenges of Interrupt Service Routines”</a></li>
</ul>

<p>There were three talks strictly about C++ that I thought were
interesting from a language design perspective. So I think they’re
worth recommending, too. (In fact, they’re a sort of ammo <em>against</em>
using C++ due to its insane complexity.)</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=7DTlWPgX6zs">Nicolai Josuttis “The Nightmare of Initialization in C++”</a></li>
  <li><a href="https://www.youtube.com/watch?v=tsG95Y-C14k">Timur Doumler “Can I has grammar?”</a></li>
  <li><a href="https://www.youtube.com/watch?v=ZbVCGCy3mGQ">Richard Powell “How to Argue(ment)”</a></li>
</ul>

<h3 id="2019">2019</h3>

<p>Only three this year. The last is about C++, but I thought it was
interesting.</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=KJW_DLaVXIY">JF Bastien “Deprecating volatile”</a></li>
  <li><a href="https://www.youtube.com/watch?v=rQWjF8NvqAU">J. Bialek, S. Block “Killing Uninitialized Memory: Protecting the OS Without Destroying Performance”</a></li>
  <li><a href="https://www.youtube.com/watch?v=HG6c4Kwbv4I">Matt Godbolt “Path Tracing Three Ways: A Study of C++ Style”</a></li>
</ul>

<h3 id="2020">2020</h3>

<p>Four more worthwhile talks in 2020. The first is about the C++ abstract
machine, which is nearly identical to the C abstract machine. The second is
a proverbial warning about builds. The rest are about performance, and
while the context is C++, the concepts are entirely applicable to C.</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=ZAji7PkXaKY">Bob Steagall “Back to Basics - The Abstract Machine”</a></li>
  <li><a href="https://www.youtube.com/watch?v=54uVTkhinDE">Dave Steffen “Build Everything From Source - A Case Study in Fear”</a></li>
  <li><a href="https://www.youtube.com/watch?v=qejTqnxQRcw">Bob Steagall “Adventures in SIMD-Thinking”</a> (<a href="https://www.youtube.com/watch?v=qXleSwCCEvY">part 2</a>)</li>
  <li><a href="https://www.youtube.com/watch?v=1ir_nEfKQ7A">Alex Reinking “Halide - A Language for Fast, Portable Computation on Images and Tensors”</a></li>
</ul>

<h3 id="2021-and-2022">2021 and 2022</h3>

<p>CppCon’s current sponsor interferes with scheduling and video releases,
deliberately reducing accessibility to the outside (unlisted videos,
uploading talks multiple times, etc.). Since it’s too time-consuming to
track it all myself, I’ve given up on following CppCon, at least until
they get a better-behaved sponsor.</p>

<h3 id="bonus">Bonus</h3>

<p>Finally, here are a few more good presentations from other C++
conferences which you can just pretend are about C:</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=yG1OZ69H_-o">CppCon 2016: Chandler Carruth “Garbage In, Garbage Out: Arguing about Undefined Behavior…”</a></li>
  <li><a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">CppCon 2014: Mike Acton “Data-Oriented Design and C++”</a> (this is a personal favorite of mine)</li>
  <li><a href="https://www.youtube.com/watch?v=fHNmRkzxHWs">CppCon 2014: Chandler Carruth “Efficiency with Algorithms, Performance with Data Structures”</a></li>
  <li><a href="https://www.youtube.com/watch?v=nXaxk27zwlk">CppCon 2015: Chandler Carruth “Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!”</a></li>
  <li><a href="https://www.youtube.com/watch?v=vElZc6zSIXM">CppCon 2016: Chandler Carruth “High Performance Code 201: Hybrid Data Structures”</a></li>
  <li><a href="https://www.youtube.com/watch?v=FnGCDLhaxKU">Meeting Cpp 2015: Chandler Carruth: Understanding Compiler Optimization</a></li>
  <li><a href="https://www.youtube.com/watch?v=eR34r7HOU14">BoostCon 2013: Chandler Carruth: Optimizing the Emergent Structures of C++ </a></li>
  <li><a href="https://www.youtube.com/watch?v=kPR8h4-qZdk">CppCon 2016: Nicholas Ormrod “The strange details of std::string at Facebook” </a></li>
  <li><a href="https://vimeo.com/644068002">Handmade Seattle 2021: Context is Everything: Andreas Fredriksson</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A JIT Compiler Skirmish with SELinux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/11/15/"/>
    <id>urn:uuid:d4fa35ad-05c3-3b86-1083-d533dfacfb15</id>
    <updated>2018-11-15T18:57:47Z</updated>
    <category term="c"/><category term="linux"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>This is a debugging war story.</p>

<p>Once upon a time I wrote a fancy data conversion utility. The input
was a complex binary format defined by a data dictionary supplied at
run time by the user alongside the input data. Since the converter was
typically used to process massive quantities of input, and the nature
of that input wasn’t known until run time, I wrote <a href="/blog/2015/03/19/">an x86-64 JIT
compiler</a> to speed it up. The converter generated a fast, native
binary parser in memory according to the data dictionary
specification. Processing data now took much less time and everyone
rejoiced.</p>

<p>Then along came SELinux, Sheriff of Pedantry. Not liking all the
shenanigans with page protections, SELinux huffed and puffed and made
<code class="language-plaintext highlighter-rouge">mprotect(2)</code> return <code class="language-plaintext highlighter-rouge">EACCES</code> (“Permission denied”). Believing I was
following all the rules and so this would never happen, I foolishly
did not check the result and the converter was now crashing for its
users. What made SELinux so unhappy, and could this somehow be
resolved?</p>

<h3 id="allocating-memory">Allocating memory</h3>

<p>Before going further, let’s back up and review how this works. Suppose I
want to generate code at run time and execute it. In the old days this
was as simple as writing some machine code into a buffer and jumping to
that buffer — e.g. by converting the buffer to a function pointer and
calling it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="nf">int</span> <span class="p">(</span><span class="o">*</span><span class="n">jit_func</span><span class="p">)(</span><span class="kt">void</span><span class="p">);</span>

<span class="cm">/* NOTE: This doesn't work anymore! */</span>
<span class="n">jit_func</span>
<span class="nf">jit_compile</span><span class="p">(</span><span class="kt">int</span> <span class="n">retval</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">6</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* mov eax, retval */</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xb8</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="n">retval</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
        <span class="cm">/* ret */</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xc3</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">jit_func</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">jit_func</span> <span class="n">f</span> <span class="o">=</span> <span class="n">jit_compile</span><span class="p">(</span><span class="mi">1001</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"f() = %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">f</span><span class="p">());</span>
    <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This situation was far too easy for malicious actors to abuse. An
attacker could supply instructions of their own choosing — i.e. <em>shell
code</em> — as input and exploit a buffer overflow vulnerability to execute
the input buffer. These exploits were trivial to craft.</p>

<p>Modern systems have hardware checks to prevent this from happening.
Memory containing instructions must have its execute protection bit
set before those instructions can be executed. This is useful both for
making attackers work harder and for catching bugs in programs — no more
executing data by accident.</p>

<p>This is further complicated by the fact that memory protections have
page granularity. You can’t adjust the protections for a 6-byte
buffer. You do it for the entire surrounding page — typically 4kB, but
sometimes as large as 2MB. This requires replacing that <code class="language-plaintext highlighter-rouge">malloc(3)</code>
with a more careful allocation strategy. There are a few ways to go
about this.</p>

<h4 id="anonymous-memory-mapping">Anonymous memory mapping</h4>

<p>The most common and most sensible is to create an anonymous memory
mapping: a file memory map that’s not actually backed by a file. The
<code class="language-plaintext highlighter-rouge">mmap(2)</code> function has a flag specifically for this purpose:
<code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">anon_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately, <code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> is not part of POSIX. If you’re being super
strict with your includes — <a href="/blog/2017/03/30/">as I tend to be</a> — this flag won’t be
defined, even on systems where it’s supported.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span><span class="c1">// MAP_ANONYMOUS undefined!</span>
</code></pre></div></div>

<p>To get the flag, you must use the <code class="language-plaintext highlighter-rouge">_BSD_SOURCE</code>, or, more recently,
the <code class="language-plaintext highlighter-rouge">_DEFAULT_SOURCE</code> feature test macro to explicitly enable that
feature.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#define _DEFAULT_SOURCE </span><span class="cm">/* for MAP_ANONYMOUS */</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span></code></pre></div></div>

<p>The POSIX way to do this is to instead map <code class="language-plaintext highlighter-rouge">/dev/zero</code>. <strong>So, wanting to
be Mr. Portable, this is what I did in my tool.</strong> Take careful note of
this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="s">"/dev/zero"</span><span class="p">,</span> <span class="n">O_RDWR</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="aligned-allocation">Aligned allocation</h4>

<p>Another, less common (and less portable) strategy is to lean on the
existing C memory allocator, being careful to allocate on page
boundaries so that the page protections don’t affect other allocations.
The classic allocation functions, like <code class="language-plaintext highlighter-rouge">malloc(3)</code>, don’t allow for this
kind of control. However, there are a couple of aligned allocation
alternatives.</p>

<p>The first is <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">posix_memalign</span><span class="p">(</span><span class="kt">void</span> <span class="o">**</span><span class="n">ptr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">alignment</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>By choosing page alignment and a size that’s a multiple of the page
size, it’s guaranteed to return whole pages. When done, pages are freed
with <code class="language-plaintext highlighter-rouge">free(3)</code>. Though, unlike unmapping, the original page protections
must first be restored since those pages may be reused.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _POSIX_C_SOURCE 200112L
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="o">*</span>
<span class="nf">anon_alloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">pagesize</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGE_SIZE</span><span class="p">);</span> <span class="c1">// TODO: cache this</span>
    <span class="kt">size_t</span> <span class="n">roundup</span> <span class="o">=</span> <span class="p">(</span><span class="n">len</span> <span class="o">+</span> <span class="n">pagesize</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">pagesize</span> <span class="o">*</span> <span class="n">pagesize</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">posix_memalign</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="n">pagesize</span><span class="p">,</span> <span class="n">roundup</span><span class="p">)</span> <span class="o">?</span> <span class="mi">0</span> <span class="o">:</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
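
<p>The release side of this approach isn’t shown above. Here is a minimal
sketch, assuming the caller passes the same length it gave to
<code class="language-plaintext highlighter-rouge">anon_alloc()</code>. The name <code class="language-plaintext highlighter-rouge">aligned_free</code> and its <code class="language-plaintext highlighter-rouge">int</code> return value are
hypothetical, and the feature test macros are left out for brevity:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdlib.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

/* Hypothetical counterpart to the posix_memalign(3)-based anon_alloc().
 * Returns 0 on success, -1 if the protections could not be restored. */
int
aligned_free(void *p, size_t len)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
    /* The allocator may reuse these pages, so make them ordinary
     * read/write memory again before giving them back. */
    if (mprotect(p, roundup, PROT_READ | PROT_WRITE) == -1)
        return -1;
    free(p);
    return 0;
}
</code></pre></div></div>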

<p>If you’re using C11, there’s also <code class="language-plaintext highlighter-rouge">aligned_alloc(3)</code>. This is the most
uncommon of all since most C programmers refuse to switch to a new
standard until it’s at least old enough to drive a car.</p>
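
<p>With a C11 compiler the same allocation might be sketched as follows.
C11 requires the size passed to <code class="language-plaintext highlighter-rouge">aligned_alloc(3)</code> to be a multiple of the
alignment, so the rounding step stays. This is an illustration reusing the
<code class="language-plaintext highlighter-rouge">anon_alloc()</code> name from above, not code from the tool:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

void *
anon_alloc(size_t len)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    if (pagesize &lt; 1 || len &gt; (size_t)-1 - pagesize)
        return 0; /* unknown page size, or rounding would overflow */
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
    return aligned_alloc(pagesize, roundup); /* release with free(3) */
}
</code></pre></div></div>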

<h3 id="changing-page-protections">Changing page protections</h3>

<p>So we’ve allocated our memory, but it’s not going to start in an
executable state. Why? Because a <a href="https://en.wikipedia.org/wiki/W%5EX">W^X</a> (“write xor execute”)
policy is becoming increasingly common. Attempting to set both write
and execute protections at the same time may be denied. (In fact,
there’s an SELinux policy for this.)</p>

<p>As a JIT compiler, we need to write to a page <em>and</em> execute it. Again,
there are two strategies. The complicated strategy is to <a href="/blog/2016/04/10/">map the same
memory at two different places</a>, one with the execute protection,
one with the write protection. This allows the page to be modified as
it’s being executed without violating W^X.</p>

<p>The simpler and more secure strategy is to write the machine
instructions, then swap the page over to executable using <code class="language-plaintext highlighter-rouge">mprotect(2)</code>
once it’s ready. This is what I was doing in my tool.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">anon_alloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="cm">/* ... write instructions into the buffer ... */</span>
<span class="n">mprotect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="n">jit_func</span> <span class="n">func</span> <span class="o">=</span> <span class="p">(</span><span class="n">jit_func</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="n">func</span><span class="p">();</span>
</code></pre></div></div>

<p>At a high level, that’s pretty close to what I was actually doing. That
includes neglecting to check the result of <code class="language-plaintext highlighter-rouge">mprotect(2)</code>. This worked
fine and dandy for several years, when suddenly (shown here in the style
<a href="/blog/2018/06/23/">of strace</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mprotect(ptr, len, PROT_EXEC) = -1 EACCES (Permission denied)
</code></pre></div></div>

<p>Then the program would crash trying to execute the buffer. Suddenly it
wasn’t allowed to make this buffer executable. My program hadn’t
changed. What <em>had</em> changed was the SELinux security policy on this
particular system.</p>

<h3 id="asking-for-help">Asking for help</h3>

<p>The problem is that I don’t administer this (Red Hat) system. I can’t
access the logs and I didn’t set the policy. I don’t have any insight
on <em>why</em> this call was suddenly being denied. To make this more
challenging, the folks that manage this system didn’t have the
necessary knowledge to help with this either.</p>

<p>So to figure this out, I need to treat it like a black box and probe
at system calls until I can figure out just what SELinux policy I’m up
against. I only have practical experience administering Debian
systems (and derivatives like Ubuntu), which means I’ve hardly
ever had to deal with SELinux. I’m flying fairly blind here.</p>

<p>Since my real application is large and complicated, I code up a
minimal example, around a dozen lines of code: allocate a single page
of memory, write a single return (<code class="language-plaintext highlighter-rouge">ret</code>) instruction into it, set it
as executable, and call it. The program checks for errors, and I can
run it under strace if that’s not insightful enough. This program is
also something simple I could provide to the system administrators,
since they were willing to turn some of the knobs to help narrow down
the problem.</p>

<p>However, <strong>here’s where I made a major mistake</strong>. Assuming the problem
was solely in <code class="language-plaintext highlighter-rouge">mprotect(2)</code>, and wanting to keep this as absolutely
simple as possible, I used <code class="language-plaintext highlighter-rouge">posix_memalign(3)</code> to allocate that page. I
saw the same <code class="language-plaintext highlighter-rouge">EACCES</code> as before, and assumed I was demonstrating the
same problem. Take note of this, too.</p>
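
<p>A probe along those lines might look like the following sketch. It is an
illustration, not the original test program: <code class="language-plaintext highlighter-rouge">probe_exec</code> is a hypothetical
name, and the call through the pointer is only attempted on x86, where
<code class="language-plaintext highlighter-rouge">0xc3</code> encodes <code class="language-plaintext highlighter-rouge">ret</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;errno.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

/* Returns 0 if a freshly allocated page could be made executable,
 * otherwise the error number from the call that failed. */
int
probe_exec(void)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    if (pagesize &lt; 1)
        return EINVAL;
    void *p;
    int err = posix_memalign(&amp;p, pagesize, pagesize);
    if (err)
        return err;
    *(unsigned char *)p = 0xc3; /* x86 "ret" */
    if (mprotect(p, pagesize, PROT_EXEC) == -1) {
        err = errno; /* EACCES when a policy denies the change */
    } else {
#if defined(__x86_64__) || defined(__i386__)
        ((void (*)(void))p)(); /* returns immediately */
#endif
    }
    /* Restore the original protections before releasing the page. */
    mprotect(p, pagesize, PROT_READ | PROT_WRITE);
    free(p);
    return err;
}
</code></pre></div></div>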

<h3 id="finding-a-resolution">Finding a resolution</h3>

<p>Eventually I’d need to figure out what policy was blocking my JIT
compiler, then see if there was an alternative route. The system
loader still worked after all, and I could plainly see that with
strace. So it wasn’t a blanket policy that completely blocked the
execute protection. Perhaps the loader was given an exception?</p>

<p>However, the very first order of business was to actually check the
result from <code class="language-plaintext highlighter-rouge">mprotect(2)</code> and do something more graceful rather than
crash. In my case, that meant falling back to executing a byte-code
virtual machine. I added the check, and now the program ran slower
instead of crashing.</p>
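
<p>The check itself is small. Here is a minimal sketch, assuming a <code class="language-plaintext highlighter-rouge">jit_func</code>
type that takes no arguments and returns <code class="language-plaintext highlighter-rouge">int</code>. The name <code class="language-plaintext highlighter-rouge">jit_finalize</code> is
hypothetical and not taken from the actual tool:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;sys/mman.h&gt;

typedef int (*jit_func)(void); /* assumed signature */

/* Hypothetical helper: returns the buffer as a callable function, or a
 * null pointer if the pages could not be made executable. */
jit_func
jit_finalize(void *buf, size_t len)
{
    if (mprotect(buf, len, PROT_EXEC) == -1)
        return 0; /* e.g. EACCES: the caller falls back to the interpreter */
    return (jit_func)buf;
}
</code></pre></div></div>

<p>A caller would test the returned pointer and run the slower byte-code
interpreter whenever it’s null.</p>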

<p>The program runs on both Linux and Windows, and the allocation and
page protection management is abstracted. On Windows it uses
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> and <code class="language-plaintext highlighter-rouge">VirtualProtect()</code> instead of <code class="language-plaintext highlighter-rouge">mmap(2)</code> and
<code class="language-plaintext highlighter-rouge">mprotect(2)</code>. Neither implementation checked that the protection
change succeeded, so I fixed the Windows implementation while I was at
it.</p>

<p>Thanks to Mingw-w64, I actually do most of my <a href="/blog/2016/06/13/">Windows
development</a> on Linux. And, thanks to <a href="https://www.winehq.org/">Wine</a>, I mean
everything, including running and debugging. Calling
<code class="language-plaintext highlighter-rouge">VirtualProtect()</code> in Wine would ultimately call <code class="language-plaintext highlighter-rouge">mprotect(2)</code> in the
background, which I expected would be denied. So running the Windows
version with Wine under this SELinux policy would be the perfect test.
Right?</p>

<p><strong>Except that <code class="language-plaintext highlighter-rouge">mprotect(2)</code> succeeded under Wine!</strong> The Windows version
of my JIT compiler was working just fine on Linux. Huh?</p>

<p>This system doesn’t have Wine installed. I had built <a href="/blog/2018/03/27/">and packaged it
myself</a>. This Wine build definitely has no SELinux exceptions.
Not only did the Wine loader work correctly, it can change page
protections in ways my own Linux programs could not. What’s different?</p>

<p>Debugging this with all these layers is starting to look silly, but
this is exactly why doing Windows development on Linux is so useful. I
run my program under Wine under strace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace wine ./mytool.exe
</code></pre></div></div>

<p>I study the system calls around <code class="language-plaintext highlighter-rouge">mprotect(2)</code>. Perhaps there’s some
stricter alignment issue? No. Perhaps I need to include <code class="language-plaintext highlighter-rouge">PROT_READ</code>?
No. The only difference I can find is they’re using the
<code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> flag. So, armed with this knowledge, <strong>I modify my
minimal example to allocate 1024 pages instead of just one, and
suddenly it works correctly</strong>. I was most of the way to figuring this
all out.</p>

<h3 id="inside-glibc-allocation">Inside glibc allocation</h3>

<p>Why did increasing the allocation size change anything? This is a
typical Linux system, so my program is linked against the GNU C
library, glibc. This library allocates memory from two places
depending on the allocation size.</p>

<p>For small allocations, glibc uses <code class="language-plaintext highlighter-rouge">brk(2)</code> to extend the executable
image — i.e. to extend the <code class="language-plaintext highlighter-rouge">.bss</code> section. These resources are not
returned to the operating system after they’re freed with <code class="language-plaintext highlighter-rouge">free(3)</code>.
They’re reused.</p>

<p>For large allocations, glibc uses <code class="language-plaintext highlighter-rouge">mmap(2)</code> to create a new, anonymous
mapping for that allocation. When freed with <code class="language-plaintext highlighter-rouge">free(3)</code>, that memory is
unmapped and its resources are returned to the operating system.</p>

<p>By increasing the allocation size, it became a “large” allocation and
was backed by an anonymous mapping. Even though I didn’t use <code class="language-plaintext highlighter-rouge">mmap(2)</code>,
to the operating system this would be indistinguishable from what Wine was
doing (and succeeding at).</p>

<p>Consider this little example program:</p>

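<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span>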
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">malloc</span><span class="p">(</span><span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When <em>not</em> compiled as a Position Independent Executable (PIE), here’s
what the output looks like. The first pointer is near where the program
was loaded, low in memory. The second pointer is a randomly selected
address high in memory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x1077010
0x7fa9b998e010
</code></pre></div></div>

<p>And if you run it under strace, you’ll see that the first allocation
comes from <code class="language-plaintext highlighter-rouge">brk(2)</code> and the second comes from <code class="language-plaintext highlighter-rouge">mmap(2)</code>.</p>

<h3 id="two-selinux-policies">Two SELinux policies</h3>

<p>With a little bit of research, I found the <a href="https://akkadia.org/drepper/selinux-mem.html">two SELinux policies</a>
at play here. In my minimal example, I was blocked by <code class="language-plaintext highlighter-rouge">allow_execheap</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/selinux/booleans/allow_execheap
</code></pre></div></div>

<p>This prohibits programs from setting the execute protection on any
“heap” page.</p>

<blockquote>
  <p>The POSIX specification does not permit it, but the Linux
implementation of <code class="language-plaintext highlighter-rouge">mprotect</code> allows changing the access protection of
memory on the heap (e.g., allocated using <code class="language-plaintext highlighter-rouge">malloc</code>). This error
indicates that heap memory was supposed to be made executable. Doing
this is really a bad idea. If anonymous, executable memory is needed
it should be allocated using <code class="language-plaintext highlighter-rouge">mmap</code> which is the only portable
mechanism.</p>
</blockquote>

<p>Obviously this is pretty loose since I was still able to do it with
<code class="language-plaintext highlighter-rouge">posix_memalign(3)</code> once the allocation was large enough to be backed
by <code class="language-plaintext highlighter-rouge">mmap(2)</code>, even though, technically speaking, that still allocates from
the heap. So in practice this policy applies to pages mapped by <code class="language-plaintext highlighter-rouge">brk(2)</code>.</p>

<p>The second policy was <code class="language-plaintext highlighter-rouge">allow_execmod</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/selinux/booleans/allow_execmod
</code></pre></div></div>

<blockquote>
  <p>The program mapped from a file with <code class="language-plaintext highlighter-rouge">mmap</code> and the <code class="language-plaintext highlighter-rouge">MAP_PRIVATE</code> flag
and write permission. Then the memory region has been written to,
resulting in copy-on-write (COW) of the affected page(s). This memory
region is then made executable […]. The <code class="language-plaintext highlighter-rouge">mprotect</code> call will fail
with <code class="language-plaintext highlighter-rouge">EACCES</code> in this case.</p>
</blockquote>

<p>I don’t understand what purpose this policy serves, but this is what
was causing my original problem. Pages mapped to <code class="language-plaintext highlighter-rouge">/dev/zero</code> are not
<em>actually</em> considered anonymous by Linux, at least as far as this
policy is concerned. I think this is a mistake, and that mapping the
special <code class="language-plaintext highlighter-rouge">/dev/zero</code> device should result in effectively anonymous
pages.</p>

<p>From this I learned a little lesson about baking assumptions — that
<code class="language-plaintext highlighter-rouge">mprotect(2)</code> was solely at fault — into my minimal debugging examples.
And the fix was ultimately easy: I just had to suck it up and use the
slightly less pure <code class="language-plaintext highlighter-rouge">MAP_ANONYMOUS</code> flag.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Prospecting for Hash Functions</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/07/31/"/>
    <id>urn:uuid:e865266a-2896-30c5-3f7d-cfad767b1ae2</id>
    <updated>2018-07-31T22:32:45Z</updated>
    <category term="c"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update 2022</em>: <a href="https://github.com/skeeto/hash-prospector/issues/19">TheIronBorn has found even better permutations</a> using
a smarter technique. That thread completely eclipses my efforts in this
article.</p>

<p>I recently got an itch to design my own non-cryptographic integer hash
function. Firstly, I wanted to <a href="/blog/2017/09/15/">better understand</a> how hash
functions work, and the best way to learn is to do. For years I’d been
treating them like magic, shoving input into them and seeing
<a href="/blog/2018/02/07/">random-looking</a>, but deterministic, output come out the other
end. Just how is the avalanche effect achieved?</p>

<p>Secondly, could I apply my own particular strengths to craft a hash
function better than the handful of functions I could find online?
Especially the classic ones from <a href="https://gist.github.com/badboy/6267743">Thomas Wang</a> and <a href="http://burtleburtle.net/bob/hash/integer.html">Bob
Jenkins</a>. Instead of struggling with the mathematics, maybe I
could software engineer my way to victory, working from the advantage
of access to the excessive computational power of today.</p>

<p>Suppose, for example, I wrote a tool to generate a <strong>random hash
function definition</strong>, then <strong>JIT compile it</strong> to a native function in
memory, then execute that function across various inputs to <strong>evaluate
its properties</strong>. My tool could rapidly repeat this process in a loop
until it stumbled upon an incredible hash function the world had never
seen. That’s what I actually did. I call it the <strong>Hash Prospector</strong>:</p>

<p><strong><a href="https://github.com/skeeto/hash-prospector">https://github.com/skeeto/hash-prospector</a></strong></p>

<p>It only works on x86-64 because it uses the same <a href="/blog/2015/03/19/">JIT compiling
technique I’ve discussed before</a>: allocate a page of memory, write
some machine instructions into it, set the page to executable, cast the
page pointer to a function pointer, then call the generated code through
the function pointer.</p>

<h3 id="generating-a-hash-function">Generating a hash function</h3>

<p>My focus is on integer hash functions: a function that accepts an
<em>n</em>-bit integer and returns an <em>n</em>-bit integer. One of the important
properties of an <em>integer</em> hash function is that it maps its inputs to
outputs 1:1. In other words, there are <strong>no collisions</strong>. If there’s a
collision, then some outputs aren’t possible, and the function isn’t
making efficient use of its entropy.</p>

<p>This is actually a lot easier than it sounds. As long as every <em>n</em>-bit
integer operation used in the hash function is <em>reversible</em>, then the
hash function has this property. An operation is reversible if, given
its output, you can unambiguously compute its input.</p>

<p>For example, XOR with a constant is trivially reversible: XOR the
output with the same constant to reverse it. Addition with a constant
is reversed by subtraction with the same constant. Since the integer
operations are modular arithmetic, modulo 2^n for <em>n</em>-bit integers,
multiplication by an <em>odd</em> number is reversible. Odd numbers are
coprime with the power-of-two modulus, so there is some <em>modular
multiplicative inverse</em> that reverses the operation.</p>
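
<p>That inverse can be computed directly. The sketch below uses Newton’s
iteration, a standard technique that is not part of the hash prospector
itself. <code class="language-plaintext highlighter-rouge">modinv32</code> is a hypothetical name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Multiplicative inverse of an odd constant modulo 2^32. Any odd a
 * satisfies a*a == 1 (mod 8), so a is its own inverse in the low 3
 * bits. Each Newton step doubles the number of correct low bits. */
uint32_t
modinv32(uint32_t a)
{
    uint32_t x = a;  /*  3 bits correct */
    x *= 2 - a * x;  /*  6 bits */
    x *= 2 - a * x;  /* 12 bits */
    x *= 2 - a * x;  /* 24 bits */
    x *= 2 - a * x;  /* 48 bits, more than the 32 needed */
    return x;
}
</code></pre></div></div>

<p>Multiplying by <code class="language-plaintext highlighter-rouge">modinv32(c)</code> then undoes a multiplication by any odd <code class="language-plaintext highlighter-rouge">c</code>.</p>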

<p><a href="http://papa.bretmulvey.com/post/124027987928/hash-functions">Bret Mulvey’s hash function article</a> provides a convenient list
of some reversible operations available for constructing integer hash
functions. This list was the catalyst for my little project. Here are
the ones used by the hash prospector:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span>  <span class="o">=</span> <span class="o">~</span><span class="n">x</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">*=</span> <span class="n">constant</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// i.e. only odd constants</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">-=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">&lt;&lt;&lt;=</span> <span class="n">constant</span><span class="p">;</span> <span class="c1">// left rotation</span>
</code></pre></div></div>

<p>I’ve come across a couple more useful operations while studying existing
integer hash functions, but I didn’t put these in the prospector.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hash</span> <span class="o">+=</span> <span class="o">~</span><span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">);</span>
<span class="n">hash</span> <span class="o">-=</span> <span class="o">~</span><span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">);</span>
</code></pre></div></div>

<p>The prospector picks some operations at random and fills in their
constants randomly within their proper constraints. For example,
here’s an awful hash function I made it generate as an example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// do NOT use this!</span>
<span class="kt">uint32_t</span>
<span class="nf">badhash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x1eca7d79U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>
    <span class="n">x</span>  <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">);</span>
    <span class="n">x</span>  <span class="o">=</span> <span class="o">~</span><span class="n">x</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">5</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">+=</span> <span class="mh">0x10afe4e7U</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That function is reversible, and it would be <a href="https://naml.us/post/inverse-of-a-hash-function/">relatively
straightforward</a> to <a href="http://c42f.github.io/2015/09/21/inverting-32-bit-wang-hash.html">define its inverse</a>. However, it has
awful biases and poor avalanche. How do I know this?</p>

<h3 id="the-measure-of-a-hash-function">The measure of a hash function</h3>

<p>There are two key properties I’m looking for in randomly generated hash
functions.</p>

<ol>
  <li>
    <p>High avalanche effect. When I flip one input bit, the output bits
should each flip with a 50% chance.</p>
  </li>
  <li>
    <p>Low bias. Ideally there is no correlation between which output bits
flip for a particular flipped input bit.</p>
  </li>
</ol>

<p>Initially I screwed up and only measured the first property. This led
to some hash functions that <em>seemed</em> to be amazing before close
inspection, since, for a 32-bit hash function, it was flipping over 15
output bits on average. However, the particular bits being flipped
were heavily biased, resulting in obvious patterns in the output.</p>

<p>For example, when hashing a counter starting from zero, the high bits
would follow a regular pattern. 15 to 16 bits were being flipped each
time, but it was always the same bits.</p>

<p>Conveniently it’s easy to measure both properties at the same time. For
an <em>n</em>-bit integer hash function, create an <em>n</em> by <em>n</em> table initialized
to zero. The rows are input bits and the columns are output bits. The
<em>i</em>th row and <em>j</em>th column track the correlation between the <em>i</em>th input
bit and <em>j</em>th output bit.</p>

<p>Then exhaustively iterate over all 2^n inputs, and flip each bit one at
a time. Increment the appropriate element in the table if the output bit
flips.</p>

<p>When you’re done, ideally each element in the table is exactly 2^(n-1).
That is, each output bit was flipped exactly half the time by each input
bit. Therefore the <em>bias</em> of the hash function is the distance (the
error) of the computed table from the ideal table.</p>

<p>For example, the ideal bias table for an 8-bit hash function would be:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
</code></pre></div></div>

<p>The hash prospector computes the standard deviation in order to turn
this into a single, normalized measurement. Lower scores are better.</p>
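<p>As a rough sketch of the procedure above, here is an exhaustive measurement for a 16-bit hash, chosen only so that the loop finishes instantly. The names <code class="language-plaintext highlighter-rouge">toyhash16</code>, <code class="language-plaintext highlighter-rouge">identity16</code>, and <code class="language-plaintext highlighter-rouge">bias_mse16</code> are hypothetical, the toy constants are arbitrary odd numbers, and the normalization (mean squared relative error, whose square root is a standard deviation around the ideal) is an assumption rather than the prospector’s exact formula.</p>

```c
#include <stdint.h>

#define NBITS 16

// Hypothetical 16-bit mixer with arbitrary odd constants. It exists
// only to keep the exhaustive loop small.
static uint16_t
toyhash16(uint16_t x)
{
    x ^= x >> 8;
    x *= 0x88b5U;
    x ^= x >> 7;
    x *= 0xdb2dU;
    x ^= x >> 9;
    return x;
}

// Deliberately terrible baseline: each input bit flips exactly one
// output bit, every time.
static uint16_t
identity16(uint16_t x)
{
    return x;
}

// Build the NBITS x NBITS table over all 2^NBITS inputs, then return
// the mean squared relative error against the ideal of 2^(NBITS-1).
static double
bias_mse16(uint16_t (*hash)(uint16_t))
{
    long table[NBITS][NBITS] = {{0}};
    for (long x = 0; x < (1L << NBITS); x++) {
        uint16_t h0 = hash((uint16_t)x);
        for (int i = 0; i < NBITS; i++) {
            // Flip input bit i and record which output bits changed.
            uint16_t h1 = hash((uint16_t)(x ^ (1L << i)));
            uint16_t diff = h0 ^ h1;
            for (int j = 0; j < NBITS; j++) {
                table[i][j] += (diff >> j) & 1;
            }
        }
    }

    double ideal = (double)(1L << (NBITS - 1));
    double sum = 0.0;
    for (int i = 0; i < NBITS; i++) {
        for (int j = 0; j < NBITS; j++) {
            double e = (table[i][j] - ideal) / ideal;
            sum += e * e;
        }
    }
    return sum / (NBITS * NBITS);
}
```

<p>The identity function scores exactly 1.0 under this measure, since every table entry is either 0 or double the ideal, which makes it a handy sanity check before scoring a real candidate.</p>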

<p>However, there’s still one problem: the input space for a 32-bit hash
function is over 4 billion values. The full test takes my computer about
an hour and a half. Evaluating a 64-bit hash function is right out.</p>

<p>Again, <a href="/blog/2017/09/21/">Monte Carlo to the rescue</a>! Rather than sample the entire
space, just sample a random subset. This provides a good estimate in
less than a second, allowing lots of terrible hash functions to be
discarded early. The full test can be saved only for the known good
32-bit candidates. 64-bit functions will only ever receive the estimate.</p>
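<p>A sampled estimate might look something like the following sketch. The function names, the choice of SplitMix64 as the sampling generator, and the mean squared relative error score are all assumptions for illustration, not the prospector’s actual implementation.</p>

```c
#include <stdint.h>

// Generator for drawing sample inputs (SplitMix64). Any decent
// generator would do; this choice is an assumption.
static uint64_t
next_random(uint64_t *state)
{
    uint64_t z = (*state += 0x9e3779b97f4a7c15U);
    z ^= z >> 30;
    z *= 0xbf58476d1ce4e5b9U;
    z ^= z >> 27;
    z *= 0x94d049bb133111ebU;
    z ^= z >> 31;
    return z;
}

// Baseline with the worst possible avalanche.
static uint32_t
identity32(uint32_t x)
{
    return x;
}

// Estimate the bias from nsamples random inputs instead of all 2^32.
// Returns the mean squared relative error against the ideal nsamples/2.
static double
estimate_bias32(uint32_t (*hash)(uint32_t), long nsamples, uint64_t seed)
{
    long table[32][32] = {{0}};
    uint64_t state = seed;
    for (long n = 0; n < nsamples; n++) {
        uint32_t x = (uint32_t)(next_random(&state) >> 32);
        uint32_t h0 = hash(x);
        for (int i = 0; i < 32; i++) {
            // Flip input bit i and record which output bits changed.
            uint32_t diff = h0 ^ hash(x ^ ((uint32_t)1 << i));
            for (int j = 0; j < 32; j++) {
                table[i][j] += (diff >> j) & 1;
            }
        }
    }

    double ideal = nsamples / 2.0;
    double sum = 0.0;
    for (int i = 0; i < 32; i++) {
        for (int j = 0; j < 32; j++) {
            double e = (table[i][j] - ideal) / ideal;
            sum += e * e;
        }
    }
    return sum / (32 * 32);
}
```

<p>With only a few thousand samples the score is dominated by sampling noise, so it can only separate terrible functions from plausible ones. That is all the early filter needs.</p>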

<h3 id="what-did-i-find">What did I find?</h3>

<p>Once I got the bias issue sorted out, and after hours and hours of
running, followed up with some manual tweaking on my part, the
<strong>prospector stumbled across this little gem</strong>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// DO use this one!</span>
<span class="kt">uint32_t</span>
<span class="nf">prospector32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x2c1b3c6dU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x297a2d39U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>According to a full (i.e. not estimated) bias evaluation, this function
beats <em>the snot</em> out of most of the 32-bit hash functions I could find. It
even comes out ahead of this well known hash function that I <em>believe</em>
originates from the H2 SQL Database. (Update: Thomas Mueller has
confirmed that, indeed, this is his hash function.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s still an excellent hash function, just slightly more biased than
mine.</p>

<p>Very briefly, <code class="language-plaintext highlighter-rouge">prospector32()</code> was the best 32-bit hash function I could
find, and I thought I had a major breakthrough. Then I noticed the
finalizer function for <a href="https://en.wikipedia.org/wiki/MurmurHash#Algorithm">the 32-bit variant of MurmurHash3</a>. It’s
also a 32-bit hash function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">murmurhash32_mix32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x85ebca6bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">13</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xc2b2ae35U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This one is just <em>barely</em> less biased than mine. So I still haven’t
discovered the best 32-bit hash function, only the <em>second</em> best one.
:-)</p>

<h3 id="a-pattern-emerges">A pattern emerges</h3>

<p>If you’re paying close enough attention, you may have noticed that all
three functions above have the same structure. The prospector had
stumbled upon it all on its own without knowledge of the existing
functions. It may not be so obvious for the second function, but here it
is refactored:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I hadn’t noticed this until after the prospector had come across it on
its own. The pattern for all three is XOR-right-shift, multiply,
XOR-right-shift, multiply, XOR-right-shift. There’s something
particularly useful about this <a href="http://www.pcg-random.org/posts/developing-a-seed_seq-alternative.html#multiplyxorshift">multiply-xorshift construction</a>
(<a href="http://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/#designing-a-diffusion-function--by-example">also</a>). The XOR-right-shift diffuses bits rightward and the
multiply diffuses bits leftward. I like to think it’s “sloshing” the
bits right, left, right, left.</p>

<p>It seems that multiplication is particularly good at diffusion, so it
makes perfect sense to exploit it in non-cryptographic hash functions,
especially since modern CPUs are so fast at it. Despite this, it’s not
used much in cryptography due to <a href="http://cr.yp.to/snuffle/design.pdf">issues with completing it in constant
time</a>.</p>

<p>I like to think of this construction in terms of a five-tuple. For the
three functions it’s the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(15, 0x2c1b3c6d, 12, 0x297a2d39, 15)  // prospector32()
(16, 0x045d9f3b, 16, 0x045d9f3b, 16)  // hash32()
(16, 0x85ebca6b, 13, 0xc2b2ae35, 16)  // murmurhash32_mix32()
</code></pre></div></div>
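<p>To make the tuple view concrete, here is a small sketch that treats the five values as parameters. The <code class="language-plaintext highlighter-rouge">struct mxs32</code> type and the <code class="language-plaintext highlighter-rouge">mxs32_hash</code> name are hypothetical. The tuples are the three listed above, and <code class="language-plaintext highlighter-rouge">prospector32</code> is repeated for comparison.</p>

```c
#include <stdint.h>

// The five-tuple: shift, multiplier, shift, multiplier, shift.
struct mxs32 {
    int      s0;
    uint32_t m0;
    int      s1;
    uint32_t m1;
    int      s2;
};

// Generic XOR-right-shift, multiply, XOR-right-shift, multiply,
// XOR-right-shift. Shifts must be in the range 1..31.
static uint32_t
mxs32_hash(struct mxs32 p, uint32_t x)
{
    x ^= x >> p.s0;
    x *= p.m0;
    x ^= x >> p.s1;
    x *= p.m1;
    x ^= x >> p.s2;
    return x;
}

static const struct mxs32 PROSPECTOR32  = {15, 0x2c1b3c6dU, 12, 0x297a2d39U, 15};
static const struct mxs32 HASH32        = {16, 0x045d9f3bU, 16, 0x045d9f3bU, 16};
static const struct mxs32 MURMUR3_MIX32 = {16, 0x85ebca6bU, 13, 0xc2b2ae35U, 16};

// The fixed-constant version from earlier, for comparison.
static uint32_t
prospector32(uint32_t x)
{
    x ^= x >> 15;
    x *= 0x2c1b3c6dU;
    x ^= x >> 12;
    x *= 0x297a2d39U;
    x ^= x >> 15;
    return x;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">mxs32_hash(PROSPECTOR32, x)</code> computes the same value as <code class="language-plaintext highlighter-rouge">prospector32(x)</code>. A search over this family only has to vary the five fields.</p>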

<p>The prospector actually found lots of decent functions following this
pattern, especially where the middle shift is smaller than the outer
shift. Thinking of it in terms of this tuple, I specifically directed
it to try different tuple constants. That’s what I meant by
“tweaking.” Eventually my new function popped out with its really low
bias.</p>

<p>The prospector has a template option (<code class="language-plaintext highlighter-rouge">-p</code>) if you want to try it
yourself:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr,mul,xorr,mul,xorr
</code></pre></div></div>

<p>If you really have your heart set on certain constants, such as my
specific selection of shifts, you can lock those in while randomizing
the other constants:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr:15,mul,xorr:12,mul,xorr:15
</code></pre></div></div>

<p>Or the other way around:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr,mul:2c1b3c6d,xorr,mul:297a2d39,xorr
</code></pre></div></div>

<p>My function seems a little strange using shifts of 15 bits rather than
a nice, round 16 bits. However, changing those constants to 16
increases the bias. Similarly, neither of the two 32-bit constants is
a prime number, but <strong>nudging those constants to the nearest prime
increases the bias</strong>. These parameters really do seem to be a local
minimum in the bias, and using prime numbers isn’t important.</p>

<h3 id="what-about-64-bit-integer-hash-functions">What about 64-bit integer hash functions?</h3>

<p>So far I haven’t been able to improve on 64-bit hash functions. The main
function to beat is SplittableRandom / <a href="http://xoshiro.di.unimi.it/splitmix64.c">SplitMix64</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">splittable64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xbf58476d1ce4e5b9U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x94d049bb133111ebU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">31</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here’s its inverse since it’s sometimes useful:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">splittable64_r</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">31</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">62</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x319642b2d24d8ec3U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">54</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x96de1b173f119089U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">60</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
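<p>For the curious, an inverse like this can be derived mechanically. The sketch below is one way to do it, with hypothetical helper names: each multiply is undone by multiplying with the modular inverse of its odd constant, and each XOR-right-shift is undone by XORing in every multiple of the shift. The steps are then applied in reverse order.</p>

```c
#include <stdint.h>

// Multiplicative inverse of an odd constant modulo 2^64 by Newton's
// iteration. For odd a, a*a == 1 (mod 8), so the initial guess has 3
// correct bits, and each step doubles that: 6, 12, 24, 48, 96.
static uint64_t
modinv64(uint64_t a)
{
    uint64_t inv = a;
    for (int i = 0; i < 5; i++) {
        inv *= 2 - a * inv;
    }
    return inv;
}

// Undo "x ^= x >> s" for 1 <= s < 64.
static uint64_t
unxorshift64(uint64_t x, int s)
{
    uint64_t r = x;
    for (int k = s; k < 64; k += s) {
        r ^= x >> k;
    }
    return r;
}

// The forward function from above.
static uint64_t
splittable64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9U;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebU;
    x ^= x >> 31;
    return x;
}

// Inverse assembled from the helpers, undoing each step in reverse.
static uint64_t
splittable64_inverse(uint64_t x)
{
    x  = unxorshift64(x, 31);
    x *= modinv64(0x94d049bb133111ebU);
    x  = unxorshift64(x, 27);
    x *= modinv64(0xbf58476d1ce4e5b9U);
    x  = unxorshift64(x, 30);
    return x;
}
```

<p>The values returned by <code class="language-plaintext highlighter-rouge">modinv64</code> can be compared against the multipliers in <code class="language-plaintext highlighter-rouge">splittable64_r</code>, and the same two helpers, narrowed to 32 bits, would apply to the 32-bit functions.</p>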

<p>I also came across <a href="https://gist.github.com/degski/6e2069d6035ae04d5d6f64981c995ec2">this function</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">hash64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd6e8feb86659fd93U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd6e8feb86659fd93U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Again, these follow the same construction as before. There really is
something special about it, and many other people have noticed, too.</p>

<p>Both functions have about the same bias. (Remember, I can only estimate
the bias for 64-bit hash functions.) The prospector has found lots of
functions with about the same bias, but nothing provably better. Until
it does, I have no new 64-bit integer hash functions to offer.</p>

<h3 id="beyond-random-search">Beyond random search</h3>

<p>Right now the prospector does a completely random, unstructured search
hoping to stumble upon something good by chance. Perhaps it would be
worth using a genetic algorithm to breed those 5-tuples towards
optimum? Others have had <a href="https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html">success in this area with simulated
annealing</a>.</p>

<p>There’s probably more to exploit from the multiply-xorshift construction
that keeps popping up. If anything, the prospector is searching too
broadly, looking at constructions that could never really compete no
matter what the constants. In addition to everything above, I’ve been
looking for good 32-bit hash functions that don’t use any 32-bit
constants, but I’m really not finding any with a competitively low bias.</p>

<h3 id="update-after-one-week">Update after one week</h3>

<p>About one week after publishing this article I found an even better hash
function. I believe <strong>this is the least biased 32-bit integer hash
function <em>of this form</em> ever devised</strong>. It’s even less biased than the
MurmurHash3 finalizer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// exact bias: 0.17353355999581582</span>
<span class="kt">uint32_t</span>
<span class="nf">lowbias32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x7feb352dU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x846ca68bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// inverse</span>
<span class="kt">uint32_t</span>
<span class="nf">lowbias32_r</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x43021123U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x1d69e2a5U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you’re willing to use an additional round of multiply-xorshift, this
next function actually reaches the theoretical bias limit (bias =
~0.021) as exhibited by a perfect integer hash function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// exact bias: 0.020888578919738908</span>
<span class="kt">uint32_t</span>
<span class="nf">triple32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xed5ad4bbU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xac4c1b51U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x31848babU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">14</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><del>It’s statistically indistinguishable from a random permutation of all
32-bit integers.</del> (<em>Update 2025</em>: Peter Schmidt-Nielsen has provided a
second-order characteristic test that quickly identifies statistically
significant biases in <code class="language-plaintext highlighter-rouge">triple32</code>.)</p>

<h3 id="update-february-2020">Update, February 2020</h3>

<p>Some people have been experimenting with using my hash functions in GLSL
shaders, and the results are looking good:</p>

<ul>
  <li><a href="https://www.shadertoy.com/view/WttXWX">https://www.shadertoy.com/view/WttXWX</a></li>
  <li><a href="https://www.shadertoy.com/view/ttVGDV">https://www.shadertoy.com/view/ttVGDV</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>The Value of Undefined Behavior</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/07/20/"/>
    <id>urn:uuid:9758e9ea-46b6-3904-5166-52c7e6922892</id>
    <updated>2018-07-20T21:31:18Z</updated>
    <category term="c"/><category term="cpp"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In several places, the C and C++ language specifications use a
curious, and fairly controversial, phrase: <em>undefined behavior</em>. For
certain program constructs, the specification prescribes no specific
behavior, instead allowing <a href="http://www.catb.org/jargon/html/N/nasal-demons.html">anything to happen</a>. Such constructs
are considered erroneous, and so the result depends on the particulars
of the platform and implementation. The original purpose of undefined
behavior was for implementation flexibility. In other words, it’s
slack that allows a compiler to produce appropriate and efficient code
for its target platform.</p>

<p>Specifying a particular behavior would have put unnecessary burden on
implementations — especially in the earlier days of computing — making
for inefficient programs on some platforms. For example, if the result
of dereferencing a null pointer was defined to trap — to cause the
program to halt with an error — then platforms that do not have
hardware trapping, such as those without virtual memory, would be
required to instrument, in software, each pointer dereference.</p>

<p>In the 21st century, undefined behavior has taken on a somewhat
different meaning. Optimizers use it — or <em>ab</em>use it depending on your
point of view — to lift <a href="/blog/2016/12/22/">constraints</a> that would otherwise
inhibit more aggressive optimizations. It’s not so much a
fundamentally different application of undefined behavior, but it does
take the concept to an extreme.</p>

<p>The reasoning works like this: A program that evaluates a construct
whose behavior is undefined cannot, by definition, have any meaningful
behavior, and so that program would be useless. As a result,
<a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">compilers assume programs never invoke undefined behavior</a> and
use those assumptions to prove their optimizations.</p>

<p>Under this newer interpretation, mistakes involving undefined behavior
are more <a href="https://kristerw.blogspot.com/2017/09/why-undefined-behavior-may-call-never.html">punishing</a> and <a href="/blog/2018/05/01/">surprising</a> than before. Programs
that <em>seem</em> to make some sense when run on a particular architecture may
actually compile into a binary with a security vulnerability due to
conclusions reached from an analysis of its undefined behavior.</p>

<p>This can be frustrating if your programs are intended to run on a very
specific platform. In this situation, all behavior really <em>could</em> be
locked down and specified in a reasonable, predictable way. Such a
language would be like an extended, less portable version of C or C++.
But your toolchain still insists on running your program on the
<em>abstract machine</em> rather than the hardware you actually care about.
However, <strong>even in this situation undefined behavior can still be
desirable</strong>. I will provide a couple of examples in this article.</p>

<h3 id="signed-integer-overflow">Signed integer overflow</h3>

<p>To start things off, let’s look at one of my all time favorite examples
of useful undefined behavior, a situation involving signed integer
overflow. The result of a signed integer overflow isn’t just
unspecified, it’s undefined behavior. Full stop.</p>

<p>This goes beyond a simple matter of whether or not the underlying
machine uses a two’s complement representation. From the perspective of
the abstract machine, just the act of a signed integer overflowing is
enough to throw everything out the window, even if the overflowed result
is never actually used in the program.</p>

<p>On the other hand, unsigned integer overflow is defined — or, more
accurately, defined to wrap, <em>not</em> overflow. Both the undefined signed
overflow and defined unsigned overflow are useful in different
situations.</p>

<p>For example, here’s a fairly common situation, much like what <a href="https://www.youtube.com/watch?v=yG1OZ69H_-o&amp;t=38m18s">actually
happened in bzip2</a>. Consider these two versions of a function that does
substring comparison:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">cmp_signed</span><span class="p">(</span><span class="kt">int</span> <span class="n">i1</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i2</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i1</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">c2</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i2</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">c1</span> <span class="o">-</span> <span class="n">c2</span><span class="p">;</span>
        <span class="n">i1</span><span class="o">++</span><span class="p">;</span>
        <span class="n">i2</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">cmp_unsigned</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">i1</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">i2</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i1</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">c2</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i2</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">c1</span> <span class="o">-</span> <span class="n">c2</span><span class="p">;</span>
        <span class="n">i1</span><span class="o">++</span><span class="p">;</span>
        <span class="n">i2</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In these functions, the indices <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> will always be small,
non-negative values. Since they’re non-negative, they should be <code class="language-plaintext highlighter-rouge">unsigned</code>,
right? Not necessarily. That puts an extra constraint on code generation
and, at least on x86-64, makes for a less efficient function. Most of
the time you actually <em>don’t</em> want overflow to be defined, and instead
allow the compiler to assume it just doesn’t happen.</p>

<p>The constraint is that <strong>the behavior of <code class="language-plaintext highlighter-rouge">i1</code> or <code class="language-plaintext highlighter-rouge">i2</code> overflowing as an
unsigned integer is defined, and the compiler is obligated to implement
that behavior.</strong> On x86-64, where <code class="language-plaintext highlighter-rouge">int</code> is 32 bits, the result of the
operation must be truncated to 32 bits one way or another, requiring
extra instructions inside the loop.</p>
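
<p>To make the constraint concrete, here’s a minimal sketch. The helper
name <code class="language-plaintext highlighter-rouge">next_unsigned</code> is made up for illustration. Unsigned arithmetic is
reduced modulo <code class="language-plaintext highlighter-rouge">UINT_MAX + 1</code>, so the increment must produce zero at the
top of the range, and the compiler has to preserve that result even when
the index lives in a 64-bit register.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;limits.h&gt;

/* Hypothetical helper: advance an unsigned index by one. */
unsigned
next_unsigned(unsigned i)
{
    return i + 1;  /* defined to wrap: UINT_MAX + 1 == 0 */
}
</code></pre></div></div>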

<p>In the signed case, incrementing the integers cannot overflow since that
would be undefined behavior. This permits the compiler to perform the
increment only in 64-bit precision without truncation if it would be
more efficient, which, in this case, it is.</p>

<p>Here’s the output of Clang 6.0.0 with <code class="language-plaintext highlighter-rouge">-Os</code> on x86-64. Pay close
attention to the main loop, which I named <code class="language-plaintext highlighter-rouge">.loop</code>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">cmp_signed:</span>
        <span class="nf">movsxd</span> <span class="nb">rdi</span><span class="p">,</span> <span class="nb">edi</span>             <span class="c1">; use i1 as a 64-bit integer</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">movsxd</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">esi</span>             <span class="c1">; use i2 as a 64-bit integer</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rsi</span><span class="p">]</span>
        <span class="nf">jmp</span>    <span class="nv">.check</span>

<span class="nl">.loop:</span>  <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rdi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rsi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="nf">inc</span>    <span class="nb">rdx</span>                  <span class="c1">; increment only the base pointer</span>
<span class="nl">.check:</span> <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">je</span>     <span class="nv">.loop</span>

        <span class="nf">movzx</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">al</span>
        <span class="nf">movzx</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">sub</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ecx</span>             <span class="c1">; return c1 - c2</span>
        <span class="nf">ret</span>

<span class="nl">cmp_unsigned:</span>
        <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">edi</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rax</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">ecx</span><span class="p">,</span> <span class="nb">esi</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rcx</span><span class="p">]</span>
        <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">jne</span>    <span class="nv">.ret</span>
        <span class="nf">inc</span>    <span class="nb">edi</span>
        <span class="nf">inc</span>    <span class="nb">esi</span>

<span class="nl">.loop:</span>  <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">edi</span>             <span class="c1">; truncated i1 overflow</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rax</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">ecx</span><span class="p">,</span> <span class="nb">esi</span>             <span class="c1">; truncated i2 overflow</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rcx</span><span class="p">]</span>
        <span class="nf">inc</span>    <span class="nb">edi</span>                  <span class="c1">; increment i1</span>
        <span class="nf">inc</span>    <span class="nb">esi</span>                  <span class="c1">; increment i2</span>
        <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">je</span>     <span class="nv">.loop</span>

<span class="nl">.ret:</span>   <span class="nf">movzx</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">al</span>
        <span class="nf">movzx</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">sub</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ecx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>As unsigned values, <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> can overflow independently, so they
have to be handled as independent 32-bit unsigned integers. As signed
values they can’t overflow, so they’re treated as if they were 64-bit
integers and, instead, the pointer, <code class="language-plaintext highlighter-rouge">buf</code>, is incremented without
concern for overflow. The signed loop is much more efficient (5
instructions versus 8).</p>

<p>The signed integer helps to communicate the <em>narrow contract</em> of the
function — the limited range of <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> — to the compiler. In a
variant of C where signed integer overflow is defined (i.e. <code class="language-plaintext highlighter-rouge">-fwrapv</code>),
this capability is lost. In fact, using <code class="language-plaintext highlighter-rouge">-fwrapv</code> deoptimizes the signed
version of this function.</p>

<p>Side note: Using <code class="language-plaintext highlighter-rouge">size_t</code> (an unsigned integer) is even better on x86-64
for this example since it’s already 64 bits and the function doesn’t
need the initial sign/zero extension. However, this might simply move
the sign extension out to the caller.</p>
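
<p>As a rough sketch, the <code class="language-plaintext highlighter-rouge">size_t</code> variant might look like the following.
The name <code class="language-plaintext highlighter-rouge">cmp_size</code> is my own, and the parameter order is assumed to
match the functions above. Like them, it relies on the caller to
guarantee that a mismatch exists.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/* Assumed variant of the comparison above using 64-bit indices. */
int
cmp_size(size_t i1, size_t i2, const unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}
</code></pre></div></div>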

<h3 id="strict-aliasing">Strict aliasing</h3>

<p>Another controversial undefined behavior is <a href="https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8"><em>strict aliasing</em></a>.
This particular term doesn’t actually appear anywhere in the C
specification, but it’s the popular name for C’s aliasing rules. In
short, variables with types that aren’t compatible are not allowed to
alias through pointers.</p>

<p>Here’s the classic example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>    <span class="c1">// store</span>
    <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>    <span class="c1">// store</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">b</span><span class="p">;</span> <span class="c1">// load</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Naively one might assume the <code class="language-plaintext highlighter-rouge">return *b</code> could be optimized to a simple
<code class="language-plaintext highlighter-rouge">return 0</code>. However, since <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> have the same type, the compiler
must consider the possibility that they alias — that they point to the
same place in memory — and must generate code that works correctly under
these conditions.</p>
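
<p>A small sketch of the aliasing case, using the same <code class="language-plaintext highlighter-rouge">foo</code> and a
hypothetical caller, <code class="language-plaintext highlighter-rouge">call_with_alias</code>, that passes the same object
twice:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Same function as above. */
int
foo(int *a, int *b)
{
    *b = 0;
    *a = 1;
    return *b;
}

/* Hypothetical caller: both pointers refer to x. */
int
call_with_alias(void)
{
    int x;
    return foo(&amp;x, &amp;x);  /* the store through a overwrites *b */
}
</code></pre></div></div>

<p>Here the load observes the second store, so the function returns 1
rather than 0.</p>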

<p>If <code class="language-plaintext highlighter-rouge">foo</code> has a narrow contract that forbids <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> from aliasing, we
have a few options for helping our compiler.</p>

<p>First, we could manually resolve the aliasing issue by returning 0
explicitly. In more complicated functions this might mean making local
copies of values, working only with those local copies, then storing the
results back before returning. Then aliasing would no longer matter.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
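
<p>For the local-copy approach in a slightly larger function, here’s a
sketch. The function <code class="language-plaintext highlighter-rouge">accumulate</code> and its parameters are hypothetical.
Its narrow contract is that <code class="language-plaintext highlighter-rouge">total</code> doesn’t point into <code class="language-plaintext highlighter-rouge">values</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical example: add n values into *total. */
void
accumulate(int *total, const int *values, int n)
{
    int sum = *total;          /* local copy: nothing can alias it */
    for (int i = 0; i &lt; n; i++)
        sum += values[i];      /* no store through total in the loop */
    *total = sum;              /* store the result back once */
}
</code></pre></div></div>

<p>Since nothing is stored through <code class="language-plaintext highlighter-rouge">total</code> inside the loop, the compiler
doesn’t need to worry that a store changed <code class="language-plaintext highlighter-rouge">values</code> between
iterations.</p>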

<p>Second, C99 introduced a <code class="language-plaintext highlighter-rouge">restrict</code> qualifier to communicate to the
compiler that pointers passed to functions cannot alias. For example,
the pointers to <code class="language-plaintext highlighter-rouge">memcpy()</code> are qualified with <code class="language-plaintext highlighter-rouge">restrict</code> as of C99.
Passing aliasing pointers through <code class="language-plaintext highlighter-rouge">restrict</code> parameters is undefined
behavior, i.e. this doesn’t ever happen as far as a compiler is
concerned.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">b</span><span class="p">);</span>
</code></pre></div></div>
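
<p>A sketch of a definition to go with that prototype, assuming the body
is unchanged from the original <code class="language-plaintext highlighter-rouge">foo</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Callers promise that a and b point to different objects. */
int
foo(int *restrict a, int *restrict b)
{
    *b = 0;
    *a = 1;
    return *b;  /* the compiler may treat this as "return 0" */
}
</code></pre></div></div>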

<p>The third option is to design an interface that uses incompatible
types, exploiting strict aliasing. This happens all the time, usually
by accident. For example, <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">long</code> are never compatible even
when they have the same representation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>If you use an extended or modified version of C without strict
aliasing (<code class="language-plaintext highlighter-rouge">-fno-strict-aliasing</code>), then the compiler must assume
everything aliases all the time, generating a lot more precautionary
loads than necessary.</p>

<p>What <a href="https://lkml.org/lkml/2003/2/26/158">irritates</a> a lot of people is that compilers will still
apply the strict aliasing rule even when it’s trivial for the compiler
to prove that aliasing is occurring:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* note: forbidden */</span>
<span class="kt">long</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">a</span><span class="p">;</span>
</code></pre></div></div>

<p>It’s not just a simple matter of making exceptions for these cases.
The language specification would need to define all the rules about
when and where incompatible types are permitted to alias, and
developers would have to understand all these rules if they wanted to
take advantage of the exceptions. It can’t just come down to trusting
that the compiler is smart enough to see the aliasing when it’s
sufficiently simple. It would need to be carefully defined.</p>

<p>Besides, there are probably <a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">conforming, portable solutions</a>
that, with contemporary compilers, will safely compile to the efficient
code you actually want anyway.</p>
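
<p>As one illustration, here’s a sketch of reading a 32-bit little-endian
integer from a byte buffer without casting the pointer to
<code class="language-plaintext highlighter-rouge">uint32_t *</code>. The helper name <code class="language-plaintext highlighter-rouge">load_u32le</code> is hypothetical. Compilers
such as GCC and Clang typically recognize this pattern and, on x86-64,
emit a single load.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Hypothetical helper: assemble the value one byte at a time. */
uint32_t
load_u32le(const unsigned char *p)
{
    return (uint32_t)p[0]       |
           (uint32_t)p[1] &lt;&lt;  8 |
           (uint32_t)p[2] &lt;&lt; 16 |
           (uint32_t)p[3] &lt;&lt; 24;
}
</code></pre></div></div>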

<p>There <em>is</em> one special exception for strict aliasing: <code class="language-plaintext highlighter-rouge">char *</code> is
allowed to alias with anything. This is important to keep in mind both
when you intentionally want aliasing and when you want to avoid
it. Writing through a <code class="language-plaintext highlighter-rouge">char *</code> pointer could force the compiler to
generate additional, unnecessary loads.</p>
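
<p>On the intentional side, the exemption is what makes it legal to
inspect the bytes of any object. A minimal sketch, with a hypothetical
helper named <code class="language-plaintext highlighter-rouge">is_all_zero</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/* Hypothetical helper: test whether every byte of an object is zero. */
int
is_all_zero(const void *obj, size_t len)
{
    const unsigned char *p = obj;  /* character types may alias anything */
    for (size_t i = 0; i &lt; len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
</code></pre></div></div>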

<p>In fact, there’s a whole dimension to strict aliasing that, even today,
no compiler yet exploits: <code class="language-plaintext highlighter-rouge">uint8_t</code> is not necessarily <code class="language-plaintext highlighter-rouge">unsigned char</code>.
That’s just one possible <code class="language-plaintext highlighter-rouge">typedef</code> definition for it. It could instead
<code class="language-plaintext highlighter-rouge">typedef</code> to, say, some internal <code class="language-plaintext highlighter-rouge">__byte</code> type.</p>

<p>In other words, technically speaking, <code class="language-plaintext highlighter-rouge">uint8_t</code> does not have the strict
aliasing exemption. If you wanted to write bytes to a buffer without
worrying the compiler about aliasing issues with other pointers, this
would be the tool to accomplish it. Unfortunately there’s so much
existing code that violates this part of strict aliasing that no
toolchain is <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110">willing to exploit it</a> for optimization purposes.</p>

<h3 id="other-undefined-behaviors">Other undefined behaviors</h3>

<p>Some kinds of undefined behavior don’t have performance or portability
benefits. They’re only there to make the compiler’s job a little
simpler. Today, most of these are caught trivially at compile time as
syntax or semantic issues (e.g. a pointer cast to a float).</p>

<p>Some others are obvious about their performance benefits and don’t
require much explanation. For example, it’s undefined behavior to
index out of bounds (with some special exceptions for one past the
end), meaning compilers are not obligated to generate those checks,
instead relying on the programmer to arrange, by whatever means, that
it doesn’t happen.</p>

<p>Undefined behavior is like nitro, a dangerous, volatile substance that
makes things go really, really fast. You could argue that it’s <em>too</em>
dangerous to use in practice, but the aggressive use of undefined
behavior is <a href="http://thoughtmesh.net/publish/367.php">not without merit</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Intercepting and Emulating Linux System Calls with Ptrace</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/06/23/"/>
    <id>urn:uuid:a39b7709-d0a6-3b12-159f-7445d9524594</id>
    <updated>2018-06-23T20:41:08Z</updated>
    <category term="linux"/><category term="x86"/><category term="c"/><category term="bsd"/>
    <content type="html">
      <![CDATA[<p>The <code class="language-plaintext highlighter-rouge">ptrace(2)</code> (“process trace”) system call is usually associated with
debugging. It’s the primary mechanism through which native debuggers
monitor debuggees on unix-like systems. It’s also the usual approach for
implementing <a href="https://blog.plover.com/Unix/strace-groff.html">strace</a> — system call trace. With Ptrace, tracers
can pause tracees, <a href="/blog/2016/09/03/">inspect and set registers and memory</a>, monitor
system calls, or even <em>intercept</em> system calls.</p>

<p>By intercept, I mean that the tracer can mutate system call arguments,
mutate the system call return value, or even block certain system calls.
Reading between the lines, this means a tracer can fully service system
calls itself. This is particularly interesting because it also means <strong>a
tracer can emulate an entire foreign operating system</strong>. This is done
without any special help from the kernel beyond Ptrace.</p>

<p>The catch is that a process can only have one tracer attached at a time,
so it’s not possible to emulate a foreign operating system while also
debugging that process with, say, GDB. The other issue is that emulated
system calls will have higher overhead.</p>

<p>For this article I’m going to focus on <a href="http://man7.org/linux/man-pages/man2/ptrace.2.html">Linux’s Ptrace</a> on
x86-64, and I’ll be taking advantage of a few Linux-specific extensions.
For the article I’ll also be omitting error checks, but the full source
code listings will have them.</p>

<p>You can find runnable code for the examples in this article here:</p>

<p><strong><a href="https://github.com/skeeto/ptrace-examples">https://github.com/skeeto/ptrace-examples</a></strong></p>

<h3 id="strace">strace</h3>

<p>Before getting into the really interesting stuff, let’s start by
reviewing a bare bones implementation of strace. It’s <a href="/blog/2018/01/17/">no
DTrace</a>, but strace is still incredibly useful.</p>

<p>Ptrace has never been standardized. Its interface is similar across
different operating systems, especially in its core functionality, but
it’s still subtly different from system to system. The <code class="language-plaintext highlighter-rouge">ptrace(2)</code>
prototype generally looks something like this, though the specific
types may be different.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">ptrace</span><span class="p">(</span><span class="kt">int</span> <span class="n">request</span><span class="p">,</span> <span class="n">pid_t</span> <span class="n">pid</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">pid</code> is the tracee’s process ID. While a tracee can have only one
tracer attached at a time, a tracer can be attached to many tracees.</p>

<p>The <code class="language-plaintext highlighter-rouge">request</code> field selects a specific Ptrace function, just like the
<code class="language-plaintext highlighter-rouge">ioctl(2)</code> interface. For strace, only a few are needed:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_TRACEME</code>: This process is to be traced by its parent.</li>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code>: Continue, but stop at the next system call
entrance or exit.</li>
  <li><code class="language-plaintext highlighter-rouge">PTRACE_GETREGS</code>: Get a copy of the tracee’s registers.</li>
</ul>

<p>The other two fields, <code class="language-plaintext highlighter-rouge">addr</code> and <code class="language-plaintext highlighter-rouge">data</code>, serve as generic arguments for
the selected Ptrace function. One or both are often ignored, in which
case I pass zero.</p>

<p>The strace interface is essentially a prefix to another command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace [strace options] program [arguments]
</code></pre></div></div>

<p>My minimal strace doesn’t have any options, so the first thing to do —
assuming it has at least one argument — is <code class="language-plaintext highlighter-rouge">fork(2)</code> and <code class="language-plaintext highlighter-rouge">exec(2)</code> the
tracee process on the tail of <code class="language-plaintext highlighter-rouge">argv</code>. But before loading the target
program, the new process will inform the kernel that it’s going to be
traced by its parent. The tracee will be paused by this Ptrace system
call.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pid_t</span> <span class="n">pid</span> <span class="o">=</span> <span class="n">fork</span><span class="p">();</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">pid</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="cm">/* error */</span>
        <span class="n">FATAL</span><span class="p">(</span><span class="s">"%s"</span><span class="p">,</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">));</span>
    <span class="k">case</span> <span class="mi">0</span><span class="p">:</span>  <span class="cm">/* child */</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_TRACEME</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">execvp</span><span class="p">(</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">argv</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">FATAL</span><span class="p">(</span><span class="s">"%s"</span><span class="p">,</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The parent waits for the child’s <code class="language-plaintext highlighter-rouge">PTRACE_TRACEME</code> using <code class="language-plaintext highlighter-rouge">wait(2)</code>. When
<code class="language-plaintext highlighter-rouge">wait(2)</code> returns, the child will be paused.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Before allowing the child to continue, we tell the operating system that
the tracee should be terminated along with its parent. A real strace
implementation may want to set other options, such as
<code class="language-plaintext highlighter-rouge">PTRACE_O_TRACEFORK</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETOPTIONS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">PTRACE_O_EXITKILL</span><span class="p">);</span>
</code></pre></div></div>

<p>All that’s left is a simple, endless loop that catches system calls
one at a time. The body of the loop has four steps:</p>

<ol>
  <li>Wait for the process to enter the next system call.</li>
  <li>Print a representation of the system call.</li>
  <li>Allow the system call to execute and wait for the return.</li>
  <li>Print the system call return value.</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code> request is used in both waiting for the next system
call to begin, and waiting for that system call to exit. As before, a
<code class="language-plaintext highlighter-rouge">wait(2)</code> is needed to wait for the tracee to enter the desired state.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">wait(2)</code> returns, the registers for the thread that made the
system call are filled with the system call number and its arguments.
However, <em>the operating system has not yet serviced this system call</em>.
This detail will be important later.</p>

<p>The next step is to gather the system call information. This is where
it gets architecture specific. On x86-64, <a href="/blog/2015/05/15/">the system call number is
passed in <code class="language-plaintext highlighter-rouge">rax</code></a>, and the arguments (up to 6) are passed in
<code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">r10</code>, <code class="language-plaintext highlighter-rouge">r8</code>, and <code class="language-plaintext highlighter-rouge">r9</code>. Reading the registers is
another Ptrace call, though there’s no need to <code class="language-plaintext highlighter-rouge">wait(2)</code> since the
tracee isn’t changing state.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
<span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
<span class="kt">long</span> <span class="n">syscall</span> <span class="o">=</span> <span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">;</span>

<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"%ld(%ld, %ld, %ld, %ld, %ld, %ld)"</span><span class="p">,</span>
        <span class="n">syscall</span><span class="p">,</span>
        <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rdi</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rsi</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rdx</span><span class="p">,</span>
        <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r10</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r8</span><span class="p">,</span>  <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">r9</span><span class="p">);</span>
</code></pre></div></div>

<p>There’s one caveat. For <a href="https://web.archive.org/web/20190323050358/https://stackoverflow.com/a/6469069">internal kernel purposes</a>, the system
call number is stored in <code class="language-plaintext highlighter-rouge">orig_rax</code> rather than <code class="language-plaintext highlighter-rouge">rax</code>. All the other
system call arguments are straightforward.</p>

<p>Next it’s another <code class="language-plaintext highlighter-rouge">PTRACE_SYSCALL</code> and <code class="language-plaintext highlighter-rouge">wait(2)</code>, then another
<code class="language-plaintext highlighter-rouge">PTRACE_GETREGS</code> to fetch the result. The result is stored in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">" = %ld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">regs</span><span class="p">.</span><span class="n">rax</span><span class="p">);</span>
</code></pre></div></div>
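
<p>When interpreting that value, keep in mind that on Linux a raw system
call reports failure in-band: a result in the range -4095 to -1 is a
negated <code class="language-plaintext highlighter-rouge">errno</code> value. A small sketch of decoding it, where the helper
name <code class="language-plaintext highlighter-rouge">syscall_errno</code> is hypothetical:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;errno.h&gt;

/* Hypothetical helper: return the errno value encoded in a raw system
 * call result, or 0 if the result indicates success. */
int
syscall_errno(long rax)
{
    if (rax &lt; 0 &amp;&amp; rax &gt;= -4095)
        return (int)-rax;
    return 0;
}
</code></pre></div></div>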

<p>The output from this simple program is <em>very</em> crude. There is no
symbolic name for the system call and every argument is printed
numerically, even if it’s a pointer to a buffer. A more complete strace
would know which arguments are pointers and use <code class="language-plaintext highlighter-rouge">process_vm_readv(2)</code> to
read those buffers from the tracee in order to print them appropriately.</p>
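
<p>As a first step toward symbolic names, a tracer could map system call
numbers to strings using the <code class="language-plaintext highlighter-rouge">SYS_*</code> constants from
<code class="language-plaintext highlighter-rouge">&lt;sys/syscall.h&gt;</code>. This sketch, with a hypothetical <code class="language-plaintext highlighter-rouge">syscall_name</code>
function, covers only a handful of calls. A real strace carries a
complete table for each architecture.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;sys/syscall.h&gt;

/* Hypothetical lookup: name a few common system calls. */
const char *
syscall_name(long nr)
{
    switch (nr) {
    case SYS_read:       return "read";
    case SYS_write:      return "write";
    case SYS_close:      return "close";
    case SYS_execve:     return "execve";
    case SYS_exit_group: return "exit_group";
    default:             return "unknown";
    }
}
</code></pre></div></div>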

<p>However, this does lay the groundwork for system call interception.</p>

<h3 id="system-call-interception">System call interception</h3>

<p>Suppose we want to use Ptrace to implement something like OpenBSD’s
<a href="https://man.openbsd.org/pledge.2"><code class="language-plaintext highlighter-rouge">pledge(2)</code></a>, in which <a href="http://www.openbsd.org/papers/hackfest2015-pledge/mgp00001.html">a process <em>pledges</em> to use only a
restricted set of system calls</a>. The idea is that many
programs typically have an initialization phase where they need lots
of system access (opening files, binding sockets, etc.). After
initialization they enter a main loop in which they process input
and only a small set of system calls are needed.</p>

<p>Before entering this main loop, a process can limit itself to the few
operations that it needs. If <a href="/blog/2017/07/19/">the program has a flaw</a> allowing it
to be exploited by bad input, the pledge significantly limits what the
exploit can accomplish.</p>

<p>Using the same strace model, rather than print out all system calls,
we could either block certain system calls or simply terminate the
tracee when it misbehaves. Termination is easy: just call <code class="language-plaintext highlighter-rouge">exit(2)</code> in
the tracer, since it’s configured to also terminate the tracee.
Blocking the system call and allowing the child to continue is a
little trickier.</p>

<p>The tricky part is that <strong>there’s no way to abort a system call once
it’s started</strong>. When the tracer returns from <code class="language-plaintext highlighter-rouge">wait(2)</code> on the entrance to
the system call, the only way to stop a system call from happening is
to terminate the tracee.</p>

<p>However, not only can we mess with the system call arguments, we can
change the system call number itself, converting it to a system call
that doesn’t exist. On return we can report a “friendly” <code class="language-plaintext highlighter-rouge">EPERM</code> error
in <code class="language-plaintext highlighter-rouge">errno</code> <a href="/blog/2016/09/23/">via the normal in-band signaling</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="cm">/* Enter next system call */</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>

    <span class="cm">/* Is this system call permitted? */</span>
    <span class="kt">int</span> <span class="n">blocked</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">is_syscall_blocked</span><span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">blocked</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// set to invalid syscall</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="cm">/* Run system call and stop on exit */</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSCALL</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">blocked</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* errno = EPERM */</span>
        <span class="n">regs</span><span class="p">.</span><span class="n">rax</span> <span class="o">=</span> <span class="o">-</span><span class="n">EPERM</span><span class="p">;</span> <span class="c1">// Operation not permitted</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This simple example only checks against a whitelist or blacklist of
system calls. And there’s no nuance, such as allowing files to be
opened (<code class="language-plaintext highlighter-rouge">open(2)</code>) read-only but not as writable, allowing anonymous
memory maps but not non-anonymous mappings, etc. There’s also no way
for the tracee to dynamically drop privileges.</p>

<p>How <em>could</em> the tracee communicate to the tracer? Use an artificial
system call!</p>

<h3 id="creating-an-artificial-system-call">Creating an artificial system call</h3>

<p>For my new pledge-like system call — which I call <code class="language-plaintext highlighter-rouge">xpledge()</code> to
distinguish it from the real thing — I picked system call number 10000,
a nice high number that’s unlikely to ever be used for a real system
call.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYS_xpledge 10000
</span></code></pre></div></div>

<p>Just for demonstration purposes, I put together a minuscule interface
that’s not good for much in practice. It has little in common with
OpenBSD’s <code class="language-plaintext highlighter-rouge">pledge(2)</code>, which uses a <a href="https://www.tedunangst.com/flak/post/string-interfaces">string interface</a>.
<em>Actually</em> designing robust and secure sets of privileges is really
complicated, as the <code class="language-plaintext highlighter-rouge">pledge(2)</code> manpage shows. Here’s the entire
interface <em>and</em> implementation of the system call for the tracee:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="cp">#define XPLEDGE_RDWR  (1 &lt;&lt; 0)
#define XPLEDGE_OPEN  (1 &lt;&lt; 1)
</span>
<span class="cp">#define xpledge(arg) syscall(SYS_xpledge, arg)
</span></code></pre></div></div>

<p>If it passes zero for the argument, only a few basic system calls are
allowed, including those used to allocate memory (e.g. <code class="language-plaintext highlighter-rouge">brk(2)</code>). The
<code class="language-plaintext highlighter-rouge">XPLEDGE_RDWR</code> bit allows <a href="/blog/2017/03/01/">various</a> read and write system calls
(<code class="language-plaintext highlighter-rouge">read(2)</code>, <code class="language-plaintext highlighter-rouge">readv(2)</code>, <code class="language-plaintext highlighter-rouge">pread(2)</code>, <code class="language-plaintext highlighter-rouge">preadv(2)</code>, etc.). The
<code class="language-plaintext highlighter-rouge">XPLEDGE_OPEN</code> bit allows <code class="language-plaintext highlighter-rouge">open(2)</code>.</p>

<p>To prevent privileges from being escalated back, <code class="language-plaintext highlighter-rouge">xpledge()</code> blocks
itself — though this also prevents dropping more privileges later down
the line.</p>

<p>In the xpledge tracer, I just need to check for this system call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Handle entrance */</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">SYS_xpledge</span><span class="p">:</span>
        <span class="n">register_pledge</span><span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">rdi</span><span class="p">);</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
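<p>The tracer-side bookkeeping is not shown in this article, so here is a
minimal sketch of how it might look. The function names match the snippets
above, but the bodies, the <code class="language-plaintext highlighter-rouge">pledged</code> and <code class="language-plaintext highlighter-rouge">allowed</code> variables, and the exact
set of always-permitted system calls are assumptions made for illustration.
It tracks a single tracee only.</p>

```c
#include <sys/syscall.h>  /* SYS_read, SYS_openat, ... (Linux) */

#define SYS_xpledge   10000
#define XPLEDGE_RDWR  (1 << 0)
#define XPLEDGE_OPEN  (1 << 1)

/* Hypothetical tracer-side policy state for one tracee. */
static int pledged = 0;            /* has the tracee pledged yet? */
static unsigned long allowed = 0;  /* XPLEDGE_* bits it kept */

static void
register_pledge(unsigned long flags)
{
    pledged = 1;
    allowed = flags;
}

/* Returns nonzero if the system call should be replaced with an
 * invalid one and reported back to the tracee as EPERM. */
static int
is_syscall_blocked(long nr)
{
    if (!pledged)
        return 0;  /* no restrictions before the pledge */
    switch (nr) {
    /* Always allowed: memory allocation and exiting. */
    case SYS_brk:
    case SYS_exit:
    case SYS_exit_group:
        return 0;
    case SYS_read:
    case SYS_readv:
    case SYS_pread64:
    case SYS_preadv:
    case SYS_write:
    case SYS_writev:
    case SYS_pwrite64:
    case SYS_pwritev:
        return !(allowed & XPLEDGE_RDWR);
#ifdef SYS_open  /* absent on some newer architectures */
    case SYS_open:
#endif
    case SYS_openat:
        return !(allowed & XPLEDGE_OPEN);
    default:
        return 1;  /* everything else, including xpledge itself */
    }
}
```

<p>With this arrangement the tracer must call <code class="language-plaintext highlighter-rouge">is_syscall_blocked()</code> before
<code class="language-plaintext highlighter-rouge">register_pledge()</code> on each entrance. The first pledge passes because
nothing is restricted yet, and any later one lands in the default case.</p>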

<p>The operating system will return <code class="language-plaintext highlighter-rouge">ENOSYS</code> (Function not implemented)
since this isn’t a <em>real</em> system call. So on the way out I overwrite
this with a success (0).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Handle exit */</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">SYS_xpledge</span><span class="p">:</span>
        <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_POKEUSER</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="n">RAX</span> <span class="o">*</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
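<p>Overwriting <code class="language-plaintext highlighter-rouge">rax</code> works because of how Linux reports errors in-band: the
kernel leaves a small negative number in the result register, and the libc
wrapper turns that into a <code class="language-plaintext highlighter-rouge">-1</code> return plus <code class="language-plaintext highlighter-rouge">errno</code>. Roughly, the tracee-side
conversion looks like the following sketch. The name
<code class="language-plaintext highlighter-rouge">decode_syscall_result()</code> is made up for illustration, and the 4095 bound is
the conventional maximum errno value on Linux.</p>

```c
#include <errno.h>

/* Hypothetical stand-in for what a libc system call wrapper does with
 * the raw value left in rax. Values in [-4095, -1] are error codes;
 * anything else is a successful result and errno is left alone. */
static long
decode_syscall_result(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

<p>So a tracer that stores <code class="language-plaintext highlighter-rouge">-EPERM</code> in <code class="language-plaintext highlighter-rouge">rax</code> makes the tracee see
“Operation not permitted,” and one that stores zero makes the call look
successful.</p>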

<p>I wrote a little test program that opens <code class="language-plaintext highlighter-rouge">/dev/urandom</code>, makes a read,
tries to pledge, then tries to open <code class="language-plaintext highlighter-rouge">/dev/urandom</code> a second time, then
confirms it can read from the original <code class="language-plaintext highlighter-rouge">/dev/urandom</code> file descriptor.
Running without a pledge tracer, the output looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604
</code></pre></div></div>

<p>Making an invalid system call doesn’t crash an application. It just
fails, which is a rather convenient fallback. When run under the
tracer, it looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4
</code></pre></div></div>

<p>The pledge succeeds but the second <code class="language-plaintext highlighter-rouge">fopen(3)</code> does not since the tracer
blocked it with <code class="language-plaintext highlighter-rouge">EPERM</code>.</p>

<p>This concept could be taken much further, to, say, change file paths or
return fake results. A tracer could effectively chroot its tracee,
prepending some chroot path to the root of any path passed through a
system call. It could even lie to the process about what user it is,
claiming that it’s running as root. In fact, this is exactly how the
<a href="https://fakeroot-ng.lingnu.com/index.php/Home_Page">Fakeroot NG</a> program works.</p>
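<p>To make the chroot idea a little more concrete, here is a sketch of just
the path rewriting step. The <code class="language-plaintext highlighter-rouge">prefix_path()</code> helper is hypothetical. It does
not resolve <code class="language-plaintext highlighter-rouge">..</code> components or symbolic links, so on its own it is nowhere
near a security boundary.</p>

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write root followed by path into out.
 * Returns 0 on success, or -1 if path is not absolute or the result
 * does not fit in outlen bytes (including the terminator). */
static int
prefix_path(char *out, size_t outlen, const char *root, const char *path)
{
    if (path[0] != '/')
        return -1;
    int n = snprintf(out, outlen, "%s%s", root, path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

<p>A tracer would still need to copy the original path out of the tracee,
write the rewritten path somewhere in the tracee’s memory, and point the
argument register at it before letting the system call proceed.</p>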

<h3 id="foreign-system-emulation">Foreign system emulation</h3>

<p>Suppose you don’t just want to intercept <em>some</em> system calls, but
<em>all</em> system calls. You’ve got <a href="/blog/2017/11/30/">a binary intended to run on another
operating system</a>, so none of the system calls it makes will ever
work.</p>

<p>You could manage all this using only what I’ve described so far. The
tracer would always replace the system call number with a dummy, allow
it to fail, then service the system call itself. But that’s really
inefficient. That’s essentially three context switches for each system
call: one to stop on the entrance, one to make the always-failing
system call, and one to stop on the exit.</p>

<p>The Linux version of Ptrace has had a more efficient operation for
this technique since 2005: <code class="language-plaintext highlighter-rouge">PTRACE_SYSEMU</code>. Ptrace stops only <em>once</em>
per system call, and it’s up to the tracer to service that system
call before allowing the tracee to continue.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_SYSEMU</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>
    <span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_GETREGS</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>

    <span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">orig_rax</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="n">OS_read</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_write</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_open</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="k">case</span> <span class="n">OS_exit</span><span class="p">:</span>
            <span class="cm">/* ... */</span>

        <span class="cm">/* ... and so on ... */</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To run binaries for the same architecture from any system with a
stable (enough) system call ABI, you just need this <code class="language-plaintext highlighter-rouge">PTRACE_SYSEMU</code>
tracer, a loader (to take the place of <code class="language-plaintext highlighter-rouge">exec(2)</code>), and whatever system
libraries the binary needs (or only run static binaries).</p>

<p>In fact, this sounds like a fun weekend project.</p>

<h3 id="see-also">See also</h3>

<ul>
  <li><a href="https://www.youtube.com/watch?v=uXgxMDglxVM">Implementing a clone of OpenBSD pledge into the Linux kernel</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Minimalist C Libraries</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/06/10/"/>
    <id>urn:uuid:bb1ed0bd-ef15-3710-ad47-2365bda0822b</id>
    <updated>2018-06-10T01:40:16Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>In the past year I’ve written a number of minimalist C libraries,
particularly header libraries. The distinction for “minimalist” is, of
course, completely arbitrary and subjective. My definition in this
context isn’t about the library’s functionality being <a href="https://www.npmjs.com/package/is-odd">stupidly
trivial</a> or even necessarily simple. I’m talking about
interface (API) complexity and the library’s run time requirements.
Complex functionality can, in some cases, be tucked behind a simple
interface.</p>

<p>In this article I’ll give my definition for a minimalist C API, then take
you through some of my own recent examples.</p>

<h3 id="minimalist-properties">Minimalist properties</h3>

<p>A minimalist C library would generally have these properties.</p>

<h4 id="1-small-number-of-functions-perhaps-even-as-little-as-one">(1) Small number of functions, perhaps even as few as one.</h4>

<p>This one’s pretty obvious. More functions means more surface area in
the interface. Since these functions typically interact, the
relationship between complexity and number of functions will be
superlinear.</p>

<h4 id="2-no-dynamic-memory-allocations">(2) No dynamic memory allocations.</h4>

<p>The library mustn’t call <code class="language-plaintext highlighter-rouge">malloc()</code> internally. It’s up to the caller
to allocate memory for the library. What’s nice about this is that
it’s completely up to the application exactly how memory is allocated.
Maybe it’s using a custom allocator, or it’s not <a href="/blog/2016/01/31/">linked against the
standard library</a>.</p>

<p>A common approach is for the application to provide allocation functions
to the library — e.g. function pointers at run time, or define functions
with specific, expected names. The library would call these instead of
<code class="language-plaintext highlighter-rouge">malloc()</code> and <code class="language-plaintext highlighter-rouge">free()</code>. While that’s perfectly reasonable, it’s not
really <em>minimalist</em>, so I’m not including this technique.</p>

<p>Instead a minimalist API is designed such that it’s natural for the
application to make the allocations itself. Perhaps the library only
needs a single, fixed allocation for all its operations. Or maybe the
application specifies its requirements and the library communicates
how much memory is needed to meet those requirements. I’ll give
specific examples shortly.</p>

<p>One nice result of this property is that it eliminates one of the
common failure conditions: the out of memory error. If the library
doesn’t allocate memory, then it can’t run out of it!</p>

<p>Another convenient, minor outcome is the lack of casts from <code class="language-plaintext highlighter-rouge">void *</code> to
the appropriate type (e.g. on the return from <code class="language-plaintext highlighter-rouge">malloc()</code>). These casts
are implicit in C but must be made explicit in C++. Often, completely by
accident, my minimalist C libraries can be compiled as C++ without any
changes. This is only a minor benefit since these casts could be made
explicit in C, too, if C++ compatibility was desired. It’s just ugly.</p>

<h4 id="3-no-input-or-output">(3) No input or output.</h4>

<p>In simple terms, the library mustn’t use functions from <code class="language-plaintext highlighter-rouge">stdio.h</code> —
with the exception of the <code class="language-plaintext highlighter-rouge">sprintf()</code> family. Like with memory
allocation, it leaves input and output to the application, letting it
decide exactly how, where, and when information comes and goes.</p>

<p>Like with memory allocation, maybe the application prefers not to use
the C standard library’s buffered IO. Perhaps the application is using
<a href="/blog/2018/05/31/">cooperative</a> or green threads, and it would be bad for the
library to block internally on IO.</p>

<p>Also like avoiding memory allocation, a library that doesn’t perform IO
can’t have IO errors. Combined, this means it’s quite possible that <em>a
minimalist library may have no error cases at all</em>. Eliminating those
error handling paths makes the library a lot simpler. The major error
conditions left that are difficult to eliminate are those <a href="/blog/2017/07/19/">pesky
integer overflow checks</a>.</p>

<p>Communicating IO preferences to libraries can be a real problem with
C, since the standard library lacks generic input and output. Putting
<code class="language-plaintext highlighter-rouge">FILE *</code> pointers directly into an API mingles it with the C standard
library in potentially bad ways. Passing file names as strings is an
option, but this limits IO to files — versus, say, sockets. On POSIX
systems, at least it could talk about IO in terms of file descriptors,
but even that’s not entirely flexible — e.g. output to a memory
buffer, or anything not sufficiently file-like.</p>

<p>Again, a common way to deal with this is for the application to
provide IO function pointers to the library. But a minimalist
library’s API would be designed such that not even this is needed,
instead operating strictly on buffers. I’ll also have a couple
examples of this shortly.</p>

<p>With IO and memory allocation out of the picture, another frequent,
accidental result is no dependency on the C standard library. The only
significant functionality left in the standard library is the
mathematical functions (<code class="language-plaintext highlighter-rouge">math.h</code>), <a href="https://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">float parsing</a>, and a few of
the string functions (<code class="language-plaintext highlighter-rouge">string.h</code>), like <code class="language-plaintext highlighter-rouge">memset()</code> and <code class="language-plaintext highlighter-rouge">memmove()</code>.
These are valuable since they’re <a href="/blog/2018/05/01/">handled specially by the
compiler</a>.</p>

<h4 id="4-define-at-most-one-structure-and-perhaps-even-none">(4) Define at most one structure, and perhaps even none.</h4>

<p>More types means more complexity, perhaps even more so than having lots
of functions. Some minimalist libraries can be so straightforward that
they can operate solely on simple, homogeneous buffers. I’ll show some
examples of this, too.</p>

<p>As I said initially, minimalism is about interface, not implementation.
The library is free to define as many structures internally as it needs
since the application won’t be concerned with them.</p>

<p>One common way to avoid complicated types in an API is to make them
<em>opaque</em>. The structures aren’t defined in the API, and instead the
application only touches pointers, making them like handles.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">foo</span><span class="p">;</span>

<span class="k">struct</span> <span class="n">foo</span> <span class="o">*</span><span class="nf">foo_create</span><span class="p">(...);</span>
<span class="kt">int</span>         <span class="nf">foo_method</span><span class="p">(</span><span class="k">struct</span> <span class="n">foo</span> <span class="o">*</span><span class="p">,</span> <span class="p">...);</span>
<span class="kt">void</span>        <span class="nf">foo_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">foo</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>However, this is difficult to pull off when the library doesn’t allocate
its own memory.</p>

<h3 id="bitmap-library">Bitmap library</h3>

<p>The first example is a library for creating bitmap (BMP) images. As you
may already know, I <a href="/blog/2017/11/03/">strongly prefer Netpbm</a>, which is so simple
that it doesn’t even need a library. But nothing is quite so universally
supported as BMP.</p>

<p><strong><a href="https://github.com/skeeto/bmp">24-bit BMP (Bitmap) ANSI C header library</a></strong></p>

<p>This library is a perfect example of minimalist properties 2, 3, and 4.
It also doesn’t use any of the C standard library, though only by
accident.</p>

<p>It’s not a general purpose BMP library. It only supports 24-bit true
color, ignoring most BMP features such as palettes. Color is
represented as a 24-bit integer, packed <code class="language-plaintext highlighter-rouge">0xRRGGBB</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">bmp_size</span><span class="p">(</span><span class="kt">long</span> <span class="n">width</span><span class="p">,</span> <span class="kt">long</span> <span class="n">height</span><span class="p">);</span>
<span class="kt">void</span>          <span class="nf">bmp_init</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">long</span> <span class="n">width</span><span class="p">,</span> <span class="kt">long</span> <span class="n">height</span><span class="p">);</span>
<span class="kt">void</span>          <span class="nf">bmp_set</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">,</span> <span class="kt">long</span> <span class="n">y</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">color</span><span class="p">);</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">bmp_get</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">,</span> <span class="kt">long</span> <span class="n">y</span><span class="p">);</span>
</code></pre></div></div>

<p>Strictly speaking, even the <code class="language-plaintext highlighter-rouge">bmp_get()</code> function could be tossed since
the library is not intended to load external bitmap images. The
application really shouldn’t <em>need</em> to read back previously set pixels.</p>

<p>There is no allocation, no IO, and no data structures. The application
indicates the dimensions of the image it wants to create, and the library
says how large of a buffer it needs. The remaining functions all operate
on this opaque buffer. To write the image out, the application only
needs to dump the buffer to a file.</p>

<p>Here’s a complete, strict error checking example of its usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define RED   0xff0000UL
#define BLUE  0x0000ffUL
</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">size</span> <span class="o">=</span> <span class="n">bmp_size</span><span class="p">(</span><span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">size</span> <span class="o">||</span> <span class="n">size</span> <span class="o">&gt;</span> <span class="n">SIZE_MAX</span><span class="p">)</span> <span class="n">die</span><span class="p">(</span><span class="s">"invalid dimensions"</span><span class="p">);</span>

<span class="kt">void</span> <span class="o">*</span><span class="n">bmp</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">bmp</span><span class="p">)</span> <span class="n">die</span><span class="p">(</span><span class="s">"out of memory"</span><span class="p">);</span>
<span class="n">bmp_init</span><span class="p">(</span><span class="n">bmp</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>

<span class="cm">/* Checkerboard pattern */</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span>
        <span class="n">bmp_set</span><span class="p">(</span><span class="n">bmp</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="n">y</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">?</span> <span class="n">RED</span> <span class="o">:</span> <span class="n">BLUE</span><span class="p">);</span>

<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">fwrite</span><span class="p">(</span><span class="n">bmp</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">out</span><span class="p">))</span>
    <span class="n">die</span><span class="p">(</span><span class="s">"output error"</span><span class="p">);</span>

<span class="n">free</span><span class="p">(</span><span class="n">bmp</span><span class="p">);</span>
</code></pre></div></div>

<p>The only library function that can fail is <code class="language-plaintext highlighter-rouge">bmp_size()</code>. When the
given image dimensions would overflow one of the BMP header fields, it
returns zero to indicate as such.</p>
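<p>For a sense of what that overflow check involves, here is a sketch of a
size computation. It is not the library’s actual code. It assumes the
common layout of a 14-byte file header plus a 40-byte info header, 3 bytes
per pixel, and rows padded to a multiple of 4 bytes. The name
<code class="language-plaintext highlighter-rouge">sketch_bmp_size()</code> is made up to avoid confusion with the real function.</p>

```c
/* Sketch only, under the layout assumptions stated above. Returns the
 * total file size in bytes, or 0 if the dimensions are invalid or the
 * result would not fit the 32-bit size field in the header. */
static unsigned long
sketch_bmp_size(long width, long height)
{
    if (width < 1 || height < 1)
        return 0;
    /* Work in unsigned long long, which is at least 64 bits. */
    unsigned long long w = (unsigned long long)width;
    unsigned long long h = (unsigned long long)height;
    if (w > 0x7fffffffULL || h > 0x7fffffffULL)
        return 0;  /* header stores dimensions as signed 32-bit */
    unsigned long long row = (w * 3 + 3) / 4 * 4;  /* padded row bytes */
    unsigned long long total = 54 + row * h;  /* cannot wrap at 64 bits */
    if (total > 0xffffffffULL)
        return 0;  /* file size field is 32 bits */
    return (unsigned long)total;
}
```

<p>A 2×2 image, for example, has 6 bytes of pixel data per row padded to 8,
for 54 + 16 = 70 bytes in total.</p>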

<p>In <code class="language-plaintext highlighter-rouge">bmp_set()</code>, how does it know the dimensions of the image so that
it can find the pixel? It reads that from the buffer just like a BMP
reader would — <em>and</em> in an <a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">endian-agnostic manner</a>. There are
no bounds checks — that’s the caller’s job — so it only needs to read
the image’s <em>width</em> in order to find the pixel’s location.</p>
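<p>Reading a header field without caring about the host’s byte order comes
down to assembling it one byte at a time. This sketch is mine rather than
the library’s code, and it assumes the width sits at byte offset 18, as it
does in a BMP with a 14-byte file header followed by a 40-byte info header.
Both function names are hypothetical.</p>

```c
/* Assemble a little-endian 32-bit value from individual bytes. The
 * result is the same on big-endian and little-endian hosts. */
static unsigned long
read_le32(const unsigned char *p)
{
    return (unsigned long)p[0]       |
           (unsigned long)p[1] <<  8 |
           (unsigned long)p[2] << 16 |
           (unsigned long)p[3] << 24;
}

/* Fetch the image width from the buffer (offset 18 is an assumption
 * about the header layout, see above). */
static long
sketch_bmp_width(const void *bmp)
{
    return (long)read_le32((const unsigned char *)bmp + 18);
}
```

<p>No byte swapping or <code class="language-plaintext highlighter-rouge">#ifdef</code> on the host architecture is needed, since the
code only ever describes the byte order of the file format.</p>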

<p>Since IO is under control of the application, it can always choose to load
the original buffer contents <em>back</em> from a file, allowing a minimal sort
of BMP loading. However, this only works for <em>trusted</em> input as there
are no validation checks on the buffer.</p>

<h3 id="32-bit-integer-hash-set-library">32-bit integer hash set library</h3>

<p>The second example is an integer hash set library. It uses closed
hashing. I initially wrote this for an <a href="https://old.reddit.com/r/dailyprogrammer/">r/dailyprogrammer</a> solution and
then formalized it into a little reusable library.</p>

<p><strong><a href="https://github.com/skeeto/scratch/tree/master/set32">C99 32-bit integer hash set header library</a></strong></p>

<p>Here’s the entire API:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">set32_z</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">max</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">set32_insert</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">table</span><span class="p">,</span> <span class="kt">int</span> <span class="n">z</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">v</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">set32_remove</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">table</span><span class="p">,</span> <span class="kt">int</span> <span class="n">z</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">v</span><span class="p">);</span>
<span class="kt">int</span>  <span class="nf">set32_contains</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">table</span><span class="p">,</span> <span class="kt">int</span> <span class="n">z</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">v</span><span class="p">);</span>
</code></pre></div></div>

<p>Again, it’s a good example of properties 2, 3, and 4. Like the BMP
library, the application indicates the maximum number of integers it
will store in the hash set, and the library returns the <em>power of two</em>
number of <code class="language-plaintext highlighter-rouge">uint32_t</code> it needs to allocate (and zero-initialize).</p>

<p>In this API I just barely get away without defining a data structure.
The caller must pass both the table pointer and the power-of-two
exponent, and these two values would normally be bundled together into a
structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">z</span> <span class="o">=</span> <span class="n">set32_z</span><span class="p">(</span><span class="n">max</span><span class="p">);</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">1ULL</span> <span class="o">&lt;&lt;</span> <span class="n">z</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&gt;</span> <span class="n">SIZE_MAX</span><span class="p">)</span> <span class="n">die</span><span class="p">(</span><span class="s">"table too large"</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">table</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">table</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">)</span> <span class="n">die</span><span class="p">(</span><span class="s">"out of memory"</span><span class="p">);</span>

<span class="n">set32_insert</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>

<span class="k">if</span> <span class="p">(</span><span class="n">set32_contains</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="n">value</span><span class="p">))</span>
    <span class="cm">/* ... */</span><span class="p">;</span>

<span class="n">set32_remove</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>

<span class="n">free</span><span class="p">(</span><span class="n">table</span><span class="p">);</span>
</code></pre></div></div>

<p>Iteration is straightforward, which is why it’s not in the API: visit
each element in the allocated buffer. Zeroes are empty slots.</p>
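
<p>A minimal sketch of such an iteration, assuming only what is stated above: the table has 2<sup>z</sup> slots and zero marks an empty slot. The helper <code class="language-plaintext highlighter-rouge">sketch_set32_each</code> and its callback signature are hypothetical and not part of the library.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the set32 API: call visit() once for
 * each stored integer. The table holds 2^z slots and zero marks an
 * empty slot. Returns the number of occupied slots. */
static size_t sketch_set32_each(const uint32_t *table, int z,
                                void (*visit)(uint32_t, void *), void *arg)
{
    size_t count = 0;
    size_t nslots = (size_t)1 << z;
    for (size_t i = 0; i < nslots; i++) {
        if (table[i]) {
            if (visit)
                visit(table[i], arg);
            count++;
        }
    }
    return count;
}
```

<p>Passing a null callback just counts the elements. The visit order is whatever order the slots happen to be in, so callers should not rely on it.</p>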

<p>If a different maximum number of elements is needed, the application
initializes a new, separate table, then iterates over the old table
inserting each integer in turn.</p>

<p>Perhaps the most interesting part of the API is that <em>it has no errors</em>.
No function can fail.</p>

<p>Also, like the BMP library, it accidentally doesn’t use the standard
library, except for a <code class="language-plaintext highlighter-rouge">typedef</code> from <code class="language-plaintext highlighter-rouge">stdint.h</code>.</p>

<h3 id="fantasy-name-generator">Fantasy name generator</h3>

<p>Nearly a decade ago I <a href="/blog/2009/01/04">cloned in Perl</a> the <a href="http://www.rinkworks.com/namegen/">RinkWorks Fantasy Name
Generator</a>. This version was slow and terrible, and I’m sometimes
tempted to just delete it.</p>

<p>A few years later I <a href="/blog/2013/03/27/">rewrote it in JavaScript</a> using an entirely
different approach. In order to improve performance, it has a template
compilation step. The compiled template is a hierarchical composition of
simple generator objects. It’s much faster, and easily enabled some
extensions to the syntax.</p>

<p><a href="https://github.com/skeeto/fantasyname/pull/2">Germán Méndez Bravo ported the JavaScript version to C++</a>. This
C++ implementation <a href="https://github.com/Attnam/ivan/pull/363">was recently adopted</a> into <a href="https://github.com/Attnam/ivan">IVAN</a>, a
roguelike game.</p>

<p>This recent commotion made me realize something: <em>I hadn’t yet
implemented it in C!</em> So I did.</p>

<p><strong><a href="https://github.com/skeeto/fantasyname/blob/master/c/namegen.h">Fantasy name generator ANSI C header library</a></strong></p>

<p>The entire API is just a single function with four possible return
values. It’s a perfect example of minimalist property 1.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define NAMEGEN_SUCCESS    0
#define NAMEGEN_TRUNCATED  1  </span><span class="cm">/* Output was truncated */</span><span class="cp">
#define NAMEGEN_INVALID    2  </span><span class="cm">/* Pattern is invalid */</span><span class="cp">
#define NAMEGEN_TOO_DEEP   3  </span><span class="cm">/* Exceeds maximum nesting depth */</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">namegen</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">dest</span><span class="p">,</span>
            <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span>
            <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">,</span>
            <span class="kt">unsigned</span> <span class="kt">long</span> <span class="o">*</span><span class="n">seed</span><span class="p">);</span>
</code></pre></div></div>

<p>There’s no template compilation step, and it generates names straight
from the template.</p>

<p>There are three kinds of errors.</p>

<ol>
  <li>
    <p>If the output buffer wasn’t large enough, it warns about the name
being truncated.</p>
  </li>
  <li>
    <p>The template could be invalid — e.g. incorrectly paired brackets.</p>
  </li>
  <li>
    <p>The template could have too much nesting. I decided to hard code the
maximum nesting depth to a generous 32 levels. This limitation makes
the generator a lot simpler without any practical impact. It also
protects against unbounded memory usage — particularly stack
overflows — by arbitrarily complex patterns. This means it’s
perfectly safe to generate names from untrusted, arbitrarily long
input patterns.</p>
  </li>
</ol>
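
<p>To illustrate why a fixed nesting limit keeps memory bounded, here is a minimal sketch of a bracket check that uses a fixed-size stack. It is not the library's parser. It assumes the pattern syntax nests with <code class="language-plaintext highlighter-rouge">()</code> and <code class="language-plaintext highlighter-rouge">&lt;&gt;</code>, and the names <code class="language-plaintext highlighter-rouge">sketch_check_nesting</code> and <code class="language-plaintext highlighter-rouge">SKETCH_*</code> are hypothetical.</p>

```c
#include <stddef.h>

enum { SKETCH_OK, SKETCH_INVALID, SKETCH_TOO_DEEP };
#define SKETCH_MAX_DEPTH 32

/* Check bracket pairing with a fixed-size stack. Memory use is constant
 * no matter how long or how deeply nested the pattern is. */
static int sketch_check_nesting(const char *pattern)
{
    char stack[SKETCH_MAX_DEPTH];  /* expected closing brackets */
    int depth = 0;
    for (const char *p = pattern; *p; p++) {
        switch (*p) {
        case '(':
        case '<':
            if (depth == SKETCH_MAX_DEPTH)
                return SKETCH_TOO_DEEP;
            stack[depth++] = *p == '(' ? ')' : '>';
            break;
        case ')':
        case '>':
            if (depth == 0 || stack[depth - 1] != *p)
                return SKETCH_INVALID;
            depth--;
            break;
        }
    }
    return depth == 0 ? SKETCH_OK : SKETCH_INVALID;
}
```

<p>Because the stack is a local array of fixed size, even a hostile pattern costs only a single linear scan and never grows the stack.</p>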

<p>Here’s a usage example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">seed</span> <span class="o">=</span> <span class="mh">0xb9584b61UL</span><span class="p">;</span>
<span class="n">namegen</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">name</span><span class="p">),</span> <span class="s">"!sV'i (the |)!id"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">seed</span><span class="p">);</span>
<span class="cm">/* name = "Engia'pin the Doltolph" */</span>
</code></pre></div></div>

<p>The generator supports UTF-8, almost by accident. (I’d have to go out of
my way <em>not</em> to support it.)</p>

<p>Despite the lack of a compilation step, which requires parsing the
template for each generated name, it’s <em>an order of magnitude faster
than the C++ version</em>, which caught me by surprise. The high
performance is due to name generation being a single pass over the
template using <a href="https://en.wikipedia.org/wiki/Reservoir_sampling">reservoir sampling</a>.</p>

<p>Internally it maintains a stack of “reset” pointers, each pointing
into the output buffer where the current nesting level began its
output. Each time it hits an alternation (<code class="language-plaintext highlighter-rouge">|</code>), it generates a random
number and decides whether or not to use the new option. The first
time it’s a 1/2 chance it chooses the new option. The second time, a
1/3 chance. The third time a 1/4 chance, and so on. When the new
option is selected, the reset pointer is used to “undo” any previous
output for the current nesting level.</p>
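
<p>The selection rule above can be sketched in isolation. This is a minimal illustration, not the library's code: it picks an index rather than rewinding an output buffer, it uses a stand-in xorshift32 generator (the seed must be nonzero), and the modulo step carries a slight bias that a careful implementation might avoid. All names are hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in PRNG (xorshift32), not the library's generator.
 * The state must be nonzero. */
static uint32_t sketch_rng(uint32_t *s)
{
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

/* Choose one of n options in a single pass without knowing n up front.
 * Option i (counting from 0) replaces the current choice with
 * probability 1/(i+1): 1/2 for the second, 1/3 for the third, etc. */
static size_t sketch_pick(size_t n, uint32_t *seed)
{
    size_t chosen = 0;
    for (size_t i = 1; i < n; i++) {
        if (sketch_rng(seed) % (i + 1) == 0)
            chosen = i;  /* in the generator: rewind to the reset pointer */
    }
    return chosen;
}
```

<p>Each option ends up selected with probability 1/n overall, even though the number of options is only discovered as the scan proceeds.</p>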

<p>The reservoir sampling means it needs to generate more random numbers
(once per option) than the JavaScript and C++ versions (once per nesting
level). However, it uses <a href="/blog/2017/09/21/">its own, fast internal PRNG</a> rather than
<code class="language-plaintext highlighter-rouge">rand()</code>. Generating these random numbers is basically free.</p>

<p>Not using <code class="language-plaintext highlighter-rouge">rand()</code> means that, like the previous libraries, it doesn’t
need anything from the standard library. It also has better quality
results since the typical standard library <code class="language-plaintext highlighter-rouge">rand()</code> is total rubbish,
both in terms of speed and quality (and typically has a <a href="/blog/2018/05/27/">PLT
penalty</a>). Finally it means the results are identical across all
platforms for the same template and seed, which is one reason it’s part
of the API.</p>

<p>Another slight performance boost comes from the representation of
pattern substitutions, i.e. <code class="language-plaintext highlighter-rouge">i</code> will select a random “idiot” name from a
fixed selection of strings. The obvious representation is an array of
string pointers, as seen in the C++ version. However, there are a lot of
these little strings, which makes for a lot of pointers <a href="/blog/2016/12/23/">cluttering up
the relocation table</a>. Instead, I packed it all into a few small
pointerless tables, which on x86-64 are accessed efficiently via
RIP-relative addressing. It’s efficient, though not friendly to
modification.</p>
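
<p>As a rough picture of what a pointerless table can look like, here is a minimal sketch with placeholder strings. It is not the library's actual layout, and the names are hypothetical. The strings are packed into one array and located by small integer offsets instead of pointers.</p>

```c
#include <stddef.h>

/* All strings live in one array, separated by NUL bytes. */
static const char sketch_names[] = "alpha\0beta\0gamma";

/* Byte offset of each string within sketch_names. These are plain
 * integers, so neither array contains an address that would need a
 * load-time relocation. */
static const unsigned char sketch_offsets[] = {0, 6, 11};

static const char *sketch_name(size_t i)
{
    return sketch_names + sketch_offsets[i];
}
```

<p>The cost is visible right away: adding or editing a string means recomputing every offset after it, which is what makes this representation unfriendly to modification.</p>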

<p>I’m very happy with how this library turned out.</p>

<h3 id="utf-7-encoder-and-decoder">UTF-7 encoder and decoder</h3>

<p>The last example is a <a href="https://en.wikipedia.org/wiki/UTF-7">UTF-7</a> encoder and decoder. UTF-7 is a method
for encoding arbitrary Unicode text within ASCII text, created as a
nasty hack to allow Unicode messages to be sent over ASCII-limited email
infrastructure. The gist of it is that the Unicode parts of a message
are encoded as UTF-16, then base64 encoded, then interpolated into the
ASCII stream between delimiters.</p>
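
<p>A worked example may help. The sketch below encodes a single code point from the Basic Multilingual Plane as one delimited run: the 16-bit UTF-16 unit is padded to 18 bits and emitted as three base64 characters between <code class="language-plaintext highlighter-rouge">+</code> and <code class="language-plaintext highlighter-rouge">-</code>. It is only an illustration with a hypothetical name, not the library's encoder. A real encoder passes most ASCII through unchanged, uses surrogate pairs above U+FFFF, and packs adjacent characters into a shared run.</p>

```c
#include <stddef.h>

/* Hypothetical helper: encode one BMP code point as a standalone UTF-7
 * run. Writes six bytes including the terminating NUL. */
static void sketch_utf7_bmp(unsigned cp, char out[6])
{
    static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    /* One UTF-16 unit is 16 bits; pad with two zero bits to reach a
     * multiple of six (18 bits = three base64 characters). */
    unsigned long bits = (unsigned long)(cp & 0xffff) << 2;
    out[0] = '+';
    out[1] = b64[(bits >> 12) & 0x3f];
    out[2] = b64[(bits >>  6) & 0x3f];
    out[3] = b64[bits & 0x3f];
    out[4] = '-';
    out[5] = '\0';
}
```

<p>For U+00A3 (the pound sign) this produces <code class="language-plaintext highlighter-rouge">+AKM-</code>. The two padding bits are the reason adjacent characters can share a base64 character when encoded together.</p>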

<p>Einstein (allegedly) said “If you can’t explain it to a six year old,
you don’t understand it yourself.” The analog for programming is to
replace the six year old with a computer, and explaining an idea to a
computer is done by writing a program. I wanted to understand UTF-7, so
I implemented it.</p>

<p><strong><a href="https://github.com/skeeto/utf-7">A UTF-7 stream encoder and decoder in ANSI C</a></strong></p>

<p>Here’s the entire API. It’s modeled a little after the <a href="https://zlib.net/">zlib</a> API.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* utf7_encode() special code points */</span>
<span class="cp">#define UTF7_FLUSH       -1L
</span>
<span class="cm">/* return codes */</span>
<span class="cp">#define UTF7_OK          -1
#define UTF7_FULL        -2
#define UTF7_INCOMPLETE  -3
#define UTF7_INVALID     -4
</span>
<span class="k">struct</span> <span class="n">utf7</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="cm">/* then some "private" internal fields */</span>
<span class="p">};</span>

<span class="kt">void</span> <span class="nf">utf7_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">utf7</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">indirect</span><span class="p">);</span>
<span class="kt">int</span>  <span class="nf">utf7_encode</span><span class="p">(</span><span class="k">struct</span> <span class="n">utf7</span> <span class="o">*</span><span class="p">,</span> <span class="kt">long</span> <span class="n">codepoint</span><span class="p">);</span>
<span class="kt">long</span> <span class="nf">utf7_decode</span><span class="p">(</span><span class="k">struct</span> <span class="n">utf7</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally a library that defines a structure! The other fields (not shown)
hold important state information, but the application is only concerned
with <code class="language-plaintext highlighter-rouge">buf</code> and <code class="language-plaintext highlighter-rouge">len</code>: an input or output buffer. The same structure is
used for encoding and decoding, though only for one task at a time.</p>

<p>Following the minimalist library principle, there is no memory
allocation. When encoding a UTF-7 stream, the application’s job is to
point <code class="language-plaintext highlighter-rouge">buf</code> to an output buffer, indicating its length with <code class="language-plaintext highlighter-rouge">len</code>. Then
it feeds code points one at a time into the encoder. When the output is
full, it returns <code class="language-plaintext highlighter-rouge">UTF7_FULL</code>. The application must provide a new buffer
and try again.</p>

<p>This example usage is more complicated than I anticipated it would be.
Properly pumping code points through the encoder requires a loop (or
at least a second attempt).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buffer</span><span class="p">[</span><span class="mi">1024</span><span class="p">];</span>
<span class="k">struct</span> <span class="n">utf7</span> <span class="n">ctx</span><span class="p">;</span>

<span class="n">utf7_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">buf</span> <span class="o">=</span> <span class="n">buffer</span><span class="p">;</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">len</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>

<span class="cm">/* Assumes "wide character" input is Unicode */</span>
<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">wint_t</span> <span class="n">c</span> <span class="o">=</span> <span class="n">fgetwc</span><span class="p">(</span><span class="n">stdin</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">c</span> <span class="o">==</span> <span class="n">WEOF</span><span class="p">)</span>
        <span class="k">break</span><span class="p">;</span>

    <span class="k">while</span> <span class="p">(</span><span class="n">utf7_encode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span> <span class="o">!=</span> <span class="n">UTF7_OK</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Flush output and reset buffer */</span>
        <span class="n">fwrite</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
        <span class="n">ctx</span><span class="p">.</span><span class="n">buf</span> <span class="o">=</span> <span class="n">buffer</span><span class="p">;</span>
        <span class="n">ctx</span><span class="p">.</span><span class="n">len</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="cm">/* Flush all pending output */</span>
<span class="k">while</span> <span class="p">(</span><span class="n">utf7_encode</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">UTF7_FLUSH</span><span class="p">)</span> <span class="o">!=</span> <span class="n">UTF7_OK</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="n">ctx</span><span class="p">.</span><span class="n">buf</span> <span class="o">=</span> <span class="n">buffer</span><span class="p">;</span>
    <span class="n">ctx</span><span class="p">.</span><span class="n">len</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>

<span class="cm">/* Write remaining output */</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span> <span class="o">-</span> <span class="n">ctx</span><span class="p">.</span><span class="n">len</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>

<span class="cm">/* Check for errors */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">fflush</span><span class="p">(</span><span class="n">stdout</span><span class="p">))</span>
    <span class="n">die</span><span class="p">(</span><span class="s">"output error"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ferror</span><span class="p">(</span><span class="n">stdin</span><span class="p">))</span>
    <span class="n">die</span><span class="p">(</span><span class="s">"input error"</span><span class="p">);</span>
</code></pre></div></div>

<p>Flushing (<code class="language-plaintext highlighter-rouge">UTF7_FLUSH</code>) is necessary since, due to base64 encoding,
adjacent Unicode characters usually share a base64 character. Just
because a code point was absorbed into the encoder doesn’t mean it was
written into the output buffer. The encoding for that character may
depend on the next character to come. The special “flush” input forces
this out. It’s valid to flush in the middle of a stream, though this
<a href="/blog/2016/09/09/">may penalize encoding efficiency</a> (e.g. the output may be
larger than necessary).</p>

<p>It’s not possible for the encoder to fail, so there are no error
conditions to worry about from the library.</p>

<p>Decoding is a different matter. It works almost in reverse from the
encoder: <code class="language-plaintext highlighter-rouge">buf</code> points to the <em>input</em> and the decoder is pumped to
return one code point at a time. It returns one of:</p>

<ol>
  <li>
    <p>A non-negative value: a valid code point (including ASCII).</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">UTF7_OK</code>: Input was exhausted. Stopping here would be valid. This is
what you should get when there’s no more input.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">UTF7_INVALID</code>: The input was invalid. <code class="language-plaintext highlighter-rouge">buf</code> points at the invalid byte.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">UTF7_INCOMPLETE</code>: Input was exhausted, but more is expected. If
there is no more input, then the input must have been truncated,
which is an error.</p>
  </li>
</ol>

<p>So there are two possible errors for two kinds of invalid input.
Parsing errors are unavoidable when parsing input.</p>

<p>Again, this library accidentally doesn’t require the standard library.
It doesn’t even <a href="https://www.sigbus.info/how-i-wrote-a-self-hosting-c-compiler-in-40-days.html#day42">depend on the compiler’s locale</a> being
compatible with ASCII since none of its internal tables use string or
character literals. It behaves <em>exactly the same</em> across all conforming
platforms.</p>

<h3 id="more-examples">More examples</h3>

<p>I had a few more examples in mind, but this article has gone on long
enough.</p>

<ul>
  <li><a href="https://github.com/skeeto/scratch/tree/master/elsiefour">ANSI C implementation of ElsieFour (LC4)</a></li>
  <li><a href="https://github.com/skeeto/getopt">A minimal POSIX getopt() ANSI C header library</a></li>
  <li><a href="https://github.com/skeeto/growable-buf">Growable Memory Buffers for C99</a></li>
</ul>

<p>Instead I’ll save these for other articles!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>When FFI Function Calls Beat Native C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/27/"/>
    <id>urn:uuid:cb339e3b-382e-3762-4e5c-10cf049f7627</id>
    <updated>2018-05-27T20:03:15Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update: There’s a good discussion on <a href="https://news.ycombinator.com/item?id=17171252">Hacker News</a>.</em></p>

<p>Over on GitHub, David Yu has an interesting performance benchmark for
function calls of various Foreign Function Interfaces (<a href="https://en.wikipedia.org/wiki/Foreign_function_interface">FFI</a>):</p>

<p><a href="https://github.com/dyu/ffi-overhead">https://github.com/dyu/ffi-overhead</a></p>

<p>He created a shared object (<code class="language-plaintext highlighter-rouge">.so</code>) file containing a single, simple C
function. Then for each FFI he wrote a bit of code to call this function
many times, measuring how long it took.</p>

<p>For the C “FFI” he used standard dynamic linking, not <code class="language-plaintext highlighter-rouge">dlopen()</code>. This
distinction is important, since it really makes a difference in the
benchmark. There’s a potential argument about whether or not this is a
fair comparison to an actual FFI, but, regardless, it’s still
interesting to measure.</p>

<p>The most surprising result of the benchmark is that
<strong><a href="http://luajit.org/">LuaJIT’s</a> FFI is substantially faster than C</strong>. It’s about
25% faster than a native C function call to a shared object function.
How could a weakly and dynamically typed scripting language come out
ahead on a benchmark? Is this accurate?</p>

<p>It’s actually quite reasonable. The benchmark was run on Linux, so the
performance penalty we’re seeing comes from the <em>Procedure Linkage Table</em>
(PLT). I’ve put together a really simple experiment to demonstrate the
same effect in plain old C:</p>

<p><a href="https://github.com/skeeto/dynamic-function-benchmark">https://github.com/skeeto/dynamic-function-benchmark</a></p>

<p>Here are the results on an Intel i7-6700 (Skylake):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call
</code></pre></div></div>

<p>These are three different types of function calls:</p>

<ol>
  <li>Through the PLT</li>
  <li>An indirect function call (via <code class="language-plaintext highlighter-rouge">dlsym(3)</code>)</li>
  <li>A direct function call (via a JIT-compiled function)</li>
</ol>

<p>As shown, the last one is the fastest. It’s typically not an option
for C programs, but it’s natural in the presence of a JIT compiler,
including, apparently, LuaJIT.</p>

<p>In my benchmark, the function being called is named <code class="language-plaintext highlighter-rouge">empty()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">empty</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="p">}</span>
</code></pre></div></div>

<p>And to compile it into a shared object:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -fPIC -Os -o empty.so empty.c
</code></pre></div></div>

<p>Just as in my <a href="/blog/2017/09/21/">PRNG shootout</a>, the benchmark calls this function
repeatedly as many times as possible before an alarm goes off.</p>

<h3 id="procedure-linkage-tables">Procedure Linkage Tables</h3>

<p>When a program or library calls a function in another shared object,
the compiler cannot know where that function will be located in
memory. That information isn’t known until run time, after the program
and its dependencies are loaded into memory. These are usually at
randomized locations — e.g. <em>Address Space Layout Randomization</em>
(ASLR).</p>

<p>How is this resolved? Well, there are a couple of options.</p>

<p>One option is to make a note about each such call in the binary’s
metadata. The run-time dynamic linker can then <em>patch</em> in the correct
address at each call site. How exactly this would work depends on the
particular <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">code model</a> used when compiling the binary.</p>

<p>The downsides of this approach are slower loading, larger binaries, and
less <a href="/blog/2016/04/10/">sharing of code pages</a> between different processes. It’s
slower loading because every dynamic call site needs to be patched
before the program can begin execution. The binary is larger because
each of these call sites needs an entry in the relocation table. And the
lack of sharing is due to the code pages being modified.</p>

<p>On the other hand, the overhead for dynamic function calls would be
eliminated, giving JIT-like performance as seen in the benchmark.</p>

<p>The second option is to route all dynamic calls through a table. The
original call site calls into a stub in this table, which jumps to the
actual dynamic function. With this approach the code does not need to
be patched, meaning it’s <a href="/blog/2016/12/23/">trivially shared</a> between processes.
Only one place needs to be patched per dynamic function: the entries
in the table. Even more, these patches can be performed <em>lazily</em>, on
the first function call, making the load time even faster.</p>

<p>On systems using ELF binaries, this table is called the Procedure
Linkage Table (PLT). The PLT itself doesn’t actually get patched —
it’s mapped read-only along with the rest of the code. Instead the
<em>Global Offset Table</em> (GOT) gets patched. The PLT stub fetches the
dynamic function address from the GOT and <em>indirectly</em> jumps to that
address. To lazily load function addresses, these GOT entries are
initialized with an address of a function that locates the target
symbol, updates the GOT with that address, and then jumps to that
function. Subsequent calls use the lazily discovered address.</p>

<p><img src="/img/diagram/plt.svg" alt="" /></p>
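
<p>Lazy binding can be imitated in plain C with a function pointer standing in for a GOT entry. This is only a simulation under that assumption, with hypothetical names throughout. A real PLT stub is a few machine instructions, and the real resolver belongs to the dynamic linker.</p>

```c
/* Counts how many times the slow "symbol lookup" path runs. */
static int sketch_resolve_count;

/* The function the program ultimately wants to call. */
static int sketch_target(int x)
{
    return x + 1;
}

static int sketch_resolver(int x);

/* Stand-in for a GOT entry: initially points at the resolver. */
static int (*sketch_got_slot)(int) = sketch_resolver;

/* First call lands here: patch the slot, then forward the call. */
static int sketch_resolver(int x)
{
    sketch_resolve_count++;
    sketch_got_slot = sketch_target;
    return sketch_target(x);
}

/* Stand-in for the PLT stub: always jumps through the slot. */
static int sketch_plt_stub(int x)
{
    return sketch_got_slot(x);
}
```

<p>Only the first call through <code class="language-plaintext highlighter-rouge">sketch_plt_stub()</code> pays for resolution. Every later call still pays for the extra indirect jump through the slot, which is the overhead the benchmark measures.</p>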

<p>The downside of a PLT is extra overhead per dynamic function call,
which is what shows up in the benchmark. Since the benchmark <em>only</em>
measures function calls, this appears to be pretty significant, but in
practice it’s usually drowned out in noise.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Cleared by an alarm signal. */</span>
<span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">plt_benchmark</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">empty</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">empty()</code> is in the shared object, that call goes through the PLT.</p>

<h3 id="indirect-dynamic-calls">Indirect dynamic calls</h3>

<p>Another way to dynamically call functions is to bypass the PLT and
fetch the target function address within the program, e.g. via
<code class="language-plaintext highlighter-rouge">dlsym(3)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">h</span> <span class="o">=</span> <span class="n">dlopen</span><span class="p">(</span><span class="s">"path/to/lib.so"</span><span class="p">,</span> <span class="n">RTLD_NOW</span><span class="p">);</span>
<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">)</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"f"</span><span class="p">);</span>
<span class="n">f</span><span class="p">();</span>
</code></pre></div></div>

<p>Once the function address is obtained, the overhead is smaller than
function calls routed through the PLT. There’s no intermediate stub
function and no GOT access. (Caveat: If the program has a PLT entry for
the given function then <code class="language-plaintext highlighter-rouge">dlsym(3)</code> may actually return the address of
the PLT stub.)</p>

<p>However, this is still an <em>indirect</em> function call. On conventional
architectures, <em>direct</em> function calls have an immediate relative
address. That is, the target of the call is some hard-coded offset from
the call site. The CPU can see well ahead of time where the call is
going.</p>

<p>An indirect function call has more overhead. First, the address has to
be stored somewhere. Even if that somewhere is just a register, it
increases register pressure by using up a register. Second, it
provokes the CPU’s branch predictor since the call target isn’t
static, making for extra bookkeeping in the CPU. In the worst case the
function call may even cause a pipeline stall.</p>
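
<p>To make the distinction concrete, here is a minimal sketch that is separate from the benchmark itself. The names <code class="language-plaintext highlighter-rouge">counted()</code>, <code class="language-plaintext highlighter-rouge">call_direct()</code>, and <code class="language-plaintext highlighter-rouge">call_indirect()</code> are made up for illustration, and the <code class="language-plaintext highlighter-rouge">volatile</code> qualifier is only one assumed way to stop the compiler from turning the indirect call back into a direct one.</p>

```c
#include <assert.h>

static long calls;

/* Hypothetical stand-in for empty(); it counts calls so the effect is visible. */
static void counted(void)
{
    calls++;
}

/* Direct call: the target is known when the program is compiled and linked. */
static long call_direct(long n)
{
    calls = 0;
    for (long i = 0; i < n; i++)
        counted();
    return calls;
}

/* Indirect call: the target address is read from a pointer at run time.
 * The volatile qualifier forces a fresh load of the pointer on every
 * iteration, so the compiler cannot assume which function is called. */
static long call_indirect(void (*volatile target)(void), long n)
{
    calls = 0;
    for (long i = 0; i < n; i++)
        target();
    return calls;
}
```

<p>Both loops do the same work, but only the second has to fetch its target first. With optimization enabled the compiler is free to inline the direct call away entirely, which itself shows how much more it knows about a static target.</p>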

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">indirect_benchmark</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">f</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function passed to this benchmark is fetched with <code class="language-plaintext highlighter-rouge">dlsym(3)</code> so the
compiler can’t <a href="/blog/2018/05/01/">do something tricky</a> like convert that indirect
call back into a direct call.</p>

<p>If the body of the loop was complicated enough that there was register
pressure, thereby requiring the address to be spilled onto the stack,
this benchmark might not fare as well against the PLT benchmark.</p>

<h3 id="direct-function-calls">Direct function calls</h3>

<p>The first two types of dynamic function calls are simple and easy to
use. <em>Direct</em> calls to dynamic functions are trickier business since they
require modifying code at run time. In my benchmark I put together a
<a href="/blog/2015/03/19/">little JIT compiler</a> to generate the direct call.</p>

<p>There’s a gotcha to this: on x86-64 direct jumps are limited to a 2GB
range due to a signed 32-bit immediate. This means the JIT code has to
be placed virtually nearby the target function, <code class="language-plaintext highlighter-rouge">empty()</code>. If the JIT
code needed to call two different dynamic functions separated by more
than 2GB, then it’s not possible for both to be direct.</p>
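
<p>The range limit can be expressed as a small check. This is a sketch under the assumption of a 5-byte <code class="language-plaintext highlighter-rouge">call rel32</code> instruction; the function name <code class="language-plaintext highlighter-rouge">rel32_reachable()</code> is hypothetical and not part of the benchmark.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a 5-byte "call rel32" placed at address `site` can reach
 * `target`, otherwise 0. The displacement is measured from the end of the
 * instruction (site + 5) and has to fit in a signed 32-bit integer. */
static int rel32_reachable(uint64_t site, uint64_t target)
{
    /* Modular subtraction, then reinterpretation as a signed value.
     * This assumes the usual two's complement conversion. */
    int64_t disp = (int64_t)(target - (site + 5));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

<p>A JIT that has to reach two functions could run this check against each of them before committing to an address for its code.</p>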

<p>To keep things simple, my benchmark isn’t precise or very careful
about picking the JIT code address. After being given the target
function address, it blindly subtracts 4MB, rounds down to the nearest
page, allocates some memory, and writes code into it. To do this
correctly would mean inspecting the program’s own memory mappings to
find space, and there’s no clean, portable way to do this. On Linux
this <a href="/blog/2016/09/03/">requires parsing virtual files under <code class="language-plaintext highlighter-rouge">/proc</code></a>.</p>
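
<p>The "subtract and round down" step is plain integer arithmetic. Here is a rough sketch of it in isolation, where <code class="language-plaintext highlighter-rouge">MARGIN_BYTES</code>, <code class="language-plaintext highlighter-rouge">PAGE_BYTES</code>, and <code class="language-plaintext highlighter-rouge">jit_hint()</code> are illustrative names, and the 4096-byte page size is an assumption rather than something queried from the system.</p>

```c
#include <assert.h>
#include <stdint.h>

#define MARGIN_BYTES ((uintptr_t)4 << 20)  /* 4MB, as described in the text */
#define PAGE_BYTES   ((uintptr_t)4096)     /* assumed page size */

/* Step back from the target by a fixed margin, then round down to a page
 * boundary by clearing the low bits of the address. */
static uintptr_t jit_hint(uintptr_t target)
{
    return (target - MARGIN_BYTES) & ~(PAGE_BYTES - 1);
}
```

<p>Without <code class="language-plaintext highlighter-rouge">MAP_FIXED</code>, <code class="language-plaintext highlighter-rouge">mmap(2)</code> treats such an address only as a hint, so a careful program would verify where the mapping actually landed.</p>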

<p>Here’s what my JIT’s memory allocation looks like. It assumes
<a href="/blog/2016/05/30/">reasonable behavior for <code class="language-plaintext highlighter-rouge">uintptr_t</code> casts</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">jit_compile</span><span class="p">(</span><span class="k">struct</span> <span class="n">jit_func</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">empty</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">uintptr_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">desired</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="n">addr</span> <span class="o">-</span> <span class="n">SAFETY_MARGIN</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">PAGEMASK</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">desired</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It allocates two pages, one writable and the other containing
non-writable code. Similar to <a href="/blog/2017/01/08/">my closure library</a>, the lower
page is writable and holds the <code class="language-plaintext highlighter-rouge">running</code> variable that gets cleared by
the alarm. It needed to be nearby the JIT code in order to be an
efficient RIP-relative access, just like the other two benchmark
functions. The upper page contains this assembly:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">jit_benchmark:</span>
        <span class="nf">push</span>  <span class="nb">rbx</span>
        <span class="nf">xor</span>   <span class="nb">ebx</span><span class="p">,</span> <span class="nb">ebx</span>
<span class="nl">.loop:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">running</span><span class="p">]</span>
        <span class="nf">test</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">je</span>    <span class="nv">.done</span>
        <span class="nf">call</span>  <span class="nv">empty</span>
        <span class="nf">inc</span>   <span class="nb">ebx</span>
        <span class="nf">jmp</span>   <span class="nv">.loop</span>
<span class="nl">.done:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebx</span>
        <span class="nf">pop</span>   <span class="nb">rbx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">call empty</code> is the only instruction that is dynamically generated
— necessary to fill out the relative address appropriately (the minus
5 is because it’s relative to the <em>end</em> of the instruction):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// call empty</span>
    <span class="kt">uintptr_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">p</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="mh">0xe8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">empty()</code> wasn’t in a shared object and was instead located in the same
binary, this is essentially the direct call that the compiler would have
generated for <code class="language-plaintext highlighter-rouge">plt_benchmark()</code>, assuming somehow it didn’t inline
<code class="language-plaintext highlighter-rouge">empty()</code>.</p>
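
<p>The encoding can be exercised without executing anything by writing the instruction into an ordinary buffer and decoding it again. This is only a sketch: <code class="language-plaintext highlighter-rouge">emit_call()</code>, <code class="language-plaintext highlighter-rouge">decode_call()</code>, and <code class="language-plaintext highlighter-rouge">roundtrip()</code> are hypothetical helpers, and the addresses passed in are arbitrary numbers rather than real code locations.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Write a 5-byte x86-64 "call rel32" into buf. `site` is the address the
 * instruction would execute from and `target` is the function to call. */
static void emit_call(unsigned char *buf, uint64_t site, uint64_t target)
{
    /* The displacement is relative to the end of the instruction. */
    uint32_t rel = (uint32_t)(target - (site + 5));
    buf[0] = 0xe8;                        /* opcode for call rel32 */
    buf[1] = (unsigned char)(rel >>  0);  /* little-endian displacement */
    buf[2] = (unsigned char)(rel >>  8);
    buf[3] = (unsigned char)(rel >> 16);
    buf[4] = (unsigned char)(rel >> 24);
}

/* Recover the call target from an encoded instruction. */
static uint64_t decode_call(const unsigned char *buf, uint64_t site)
{
    uint32_t rel = (uint32_t)buf[1]       | (uint32_t)buf[2] <<  8 |
                   (uint32_t)buf[3] << 16 | (uint32_t)buf[4] << 24;
    /* Sign-extend the displacement (assumes two's complement conversion). */
    return site + 5 + (uint64_t)(int64_t)(int32_t)rel;
}

/* Encode, then decode; returns the target the CPU would jump to. */
static uint64_t roundtrip(uint64_t site, uint64_t target)
{
    unsigned char buf[5];
    emit_call(buf, site, target);
    return decode_call(buf, site);
}
```

<p>The round trip only reproduces the original target when the two addresses are within the 2GB range discussed earlier, since anything farther away gets truncated to 32 bits.</p>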

<p>Ironically, calling the JIT-compiled code requires an indirect call
(e.g. via a function pointer), and there’s no way around this. What
are you going to do, JIT compile another function that makes the
direct call? Fortunately this doesn’t matter since the part being
measured in the loop is only a direct call.</p>

<h3 id="its-no-mystery">It’s no mystery</h3>

<p>Given these results, it’s really no mystery that LuaJIT can generate
more efficient dynamic function calls than a PLT, <em>even if they still
end up being indirect calls</em>. In my benchmark, the non-PLT indirect
calls were 28% faster than the PLT, and the direct calls 43% faster
than the PLT. That’s a small edge that JIT-enabled programs have over
plain old native programs, though it comes at the cost of absolutely
no code sharing between processes.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>When the Compiler Bites</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/01/"/>
    <id>urn:uuid:02b974e1-e25b-397d-a16f-c754338e9c1e</id>
    <updated>2018-05-01T23:28:06Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="ai"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p><em>Update: There are discussions <a href="https://old.reddit.com/r/cpp/comments/8gfhq3/when_the_compiler_bites/">on Reddit</a> and <a href="https://news.ycombinator.com/item?id=16974770">on Hacker
News</a>.</em></p>

<p>So far this year I’ve been bitten three times by compiler edge cases
in GCC and Clang, each time catching me totally by surprise. Two were
caused by historical artifacts, where an ambiguous specification led
to diverging implementations. The third was a compiler optimization
being far more clever than I expected, behaving almost like an
artificial intelligence.</p>

<p>In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.</p>

<h3 id="x86-64-abi-ambiguity">x86-64 ABI ambiguity</h3>

<p>The first time I was bit — or, well, narrowly avoided being bit — was
when I examined a missed floating point optimization in both Clang and
GCC. Consider this function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function multiplies its argument by zero and returns the result. Any
number multiplied by zero is zero, so this should always return zero,
right? Unfortunately, no. IEEE 754 floating point arithmetic supports
NaN, infinities, and signed zeros. This function can return NaN,
positive zero, or negative zero. (In some cases, the operation could
also potentially produce a hardware exception.)</p>
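
<p>These special cases are easy to observe. In the sketch below, <code class="language-plaintext highlighter-rouge">times_zero()</code> and <code class="language-plaintext highlighter-rouge">classify()</code> are illustrative helpers, and the <code class="language-plaintext highlighter-rouge">volatile</code> is an assumed way to keep the compiler from folding the multiply at compile time. It presumes IEEE 754 arithmetic without <code class="language-plaintext highlighter-rouge">-ffast-math</code>.</p>

```c
#include <assert.h>
#include <math.h>

/* Multiply by zero at run time; volatile prevents constant folding. */
static double times_zero(double x)
{
    volatile double v = x;
    return v * 0.0;
}

/* Classify a result: 'n' for NaN, '+' for positive zero,
 * '-' for negative zero, and '?' for anything else. */
static char classify(double r)
{
    if (isnan(r))
        return 'n';
    if (r != 0.0)
        return '?';
    return signbit(r) ? '-' : '+';
}
```

<p>Finite positive inputs give positive zero, finite negative inputs give negative zero, and both infinity and NaN give NaN.</p>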

<p>As a result, both GCC and Clang perform the multiply:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorpd</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">mulsd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-ffast-math</code> option relaxes the C standard floating point rules,
permitting an optimization at the cost of conformance and
<a href="https://possiblywrong.wordpress.com/2017/09/12/floating-point-agreement-between-matlab-and-c/">consistency</a>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Side note: <code class="language-plaintext highlighter-rouge">-ffast-math</code> doesn’t necessarily mean “less precise.”
Sometimes it will actually <a href="https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add">improve precision</a>.</p>

<p>Here’s a modified version of the function that’s a little more
interesting. I’ve changed the argument to a <code class="language-plaintext highlighter-rouge">short</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply_short</span><span class="p">(</span><span class="kt">short</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s no longer possible for the argument to be one of those special
values. The <code class="language-plaintext highlighter-rouge">short</code> will be promoted to one of 65,536 possible <code class="language-plaintext highlighter-rouge">double</code>
values, each of which results in 0.0 when multiplied by 0.0. GCC misses
this optimization (<code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">movsx</span>     <span class="nb">edi</span><span class="p">,</span> <span class="nb">di</span>       <span class="c1">; sign-extend 16-bit argument</span>
    <span class="nf">xorps</span>     <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>    <span class="c1">; xmm1 = 0.0</span>
    <span class="nf">cvtsi2sd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nb">edi</span>     <span class="c1">; convert int to double</span>
    <span class="nf">mulsd</span>     <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Clang also misses this optimization:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">cvtsi2sd</span> <span class="nv">xmm1</span><span class="p">,</span> <span class="nb">edi</span>
    <span class="nf">xorpd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">mulsd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>But hang on a minute. This is shorter by one instruction. What
happened to the sign-extension (<code class="language-plaintext highlighter-rouge">movsx</code>)? Clang is treating that
<code class="language-plaintext highlighter-rouge">short</code> argument as if it were a 32-bit value. Why do GCC and Clang
differ? Is GCC doing something unnecessary?</p>

<p>It turns out that the <a href="https://www.uclibc.org/docs/psABI-x86_64.pdf">x86-64 ABI</a> doesn’t specify what happens with
the upper bits in argument registers. Are they garbage? Are they zeroed?
GCC takes the conservative position of assuming the upper bits are
arbitrary garbage. Clang takes the boldest position of assuming
arguments smaller than 32 bits have been promoted to 32 bits by the
caller. This is what the ABI specification <em>should</em> have said, but
currently it does not.</p>

<p>Fortunately GCC is also conservative when passing arguments. It promotes
arguments to 32 bits as necessary, so there are no conflicts when
linking against Clang-compiled code. However, this is not true for
Intel’s ICC compiler: <a href="https://web.archive.org/web/20180908113552/https://stackoverflow.com/a/36760539"><strong>Clang and ICC are not ABI-compatible on
x86-64</strong></a>.</p>

<p>I don’t use ICC, so that particular issue wouldn’t bite me, <em>but</em> if I
was ever writing assembly routines that called Clang-compiled code, I’d
eventually get bit by this.</p>
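
<p>Returning briefly to <code class="language-plaintext highlighter-rouge">zero_multiply_short()</code>: the input space is small enough to check exhaustively. The counting helpers below are hypothetical, and the sketch assumes IEEE 754 arithmetic without <code class="language-plaintext highlighter-rouge">-ffast-math</code>. Every product compares equal to zero, but negative inputs produce negative zero, so the results are not all the same bit pattern. That may be part of the reason a strictly conforming compiler keeps the multiply.</p>

```c
#include <assert.h>
#include <limits.h>
#include <math.h>

static double zero_multiply_short(short x)
{
    return x * 0.0;
}

/* Count inputs whose product does not compare equal to zero. */
static long count_nonzero(void)
{
    long n = 0;
    for (long i = SHRT_MIN; i <= SHRT_MAX; i++)
        if (zero_multiply_short((short)i) != 0.0)
            n++;
    return n;
}

/* Count inputs whose product has its sign bit set, i.e. is negative zero. */
static long count_negative_zero(void)
{
    long n = 0;
    for (long i = SHRT_MIN; i <= SHRT_MAX; i++)
        if (signbit(zero_multiply_short((short)i)))
            n++;
    return n;
}
```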

<h3 id="floating-point-precision">Floating point precision</h3>

<p>Without looking it up or trying it, what does this function return?
Think carefully.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">float_compare</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Confident in your answer? This is a trick question, because it can
return either 0 or 1 depending on the compiler. Boy was I confused when
this comparison returned 0 in my real world code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1
</code></pre></div></div>

<p>So what’s going on here? The original ANSI C specification wasn’t
clear about how intermediate floating point values get rounded, and
implementations <a href="https://news.ycombinator.com/item?id=16974770">all did it differently</a>. The C99 specification
cleaned this all up and introduced <a href="https://en.wikipedia.org/wiki/C99#IEEE_754_floating_point_support"><code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD</code></a>.
Implementations can still differ, but at least you can now determine
at compile-time what the compiler would do by inspecting that macro.</p>

<p>Back in the late 1980s or early 1990s when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in <code class="language-plaintext highlighter-rouge">long double</code>
precision and truncated afterward (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 2</code>).</p>

<p>In <code class="language-plaintext highlighter-rouge">float_compare()</code> the left-hand side is truncated to a <code class="language-plaintext highlighter-rouge">float</code> by the
assignment, but the right-hand side, <em>despite being a <code class="language-plaintext highlighter-rouge">float</code> literal</em>,
is actually “1.3” at 80 bits of precision as far as GCC is concerned.
That’s pretty unintuitive!</p>
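
<p>One way to see which behavior to expect is to look at the macro next to the comparison. This is a small sketch rather than a full diagnosis, and <code class="language-plaintext highlighter-rouge">eval_method()</code> is an illustrative wrapper name.</p>

```c
#include <assert.h>
#include <float.h>

/* The comparison from the text, unchanged. */
static int float_compare(void)
{
    float x = 1.3f;
    return x == 1.3f;
}

/* Report the evaluation method the compiler advertises (C99 and later):
 *   0  evaluate each operation in the type of its operands
 *   1  evaluate float and double operations as double
 *   2  evaluate all operations as long double
 *  -1  indeterminable */
static int eval_method(void)
{
    return FLT_EVAL_METHOD;
}
```

<p>When <code class="language-plaintext highlighter-rouge">eval_method()</code> reports 0 the comparison returns 1. For other values the result depends on how the compiler handles the excess precision.</p>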

<p>The remnants of this high precision trend are still in JavaScript, where
all arithmetic is double precision (even if <a href="http://thibaultlaurens.github.io/javascript/2013/04/29/how-the-v8-engine-works/#more-example-on-how-v8-optimized-javascript-code">simulated using
integers</a>), and great pains have been made <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">to work around</a>
the performance consequences of this. <a href="http://tirania.org/blog/archive/2018/Apr-11.html">Until recently</a>, Mono had
similar issues.</p>

<p>The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this
became the default (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 0</code>). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323">backwards compatible</a> GCC on the old x86.</p>

<p>I’m a little ashamed that I’m only finding out about this now. However,
by the time I was competent enough to notice and understand this issue,
I was already doing nearly all my programming on the x86-64.</p>

<h3 id="built-in-function-elimination">Built-in Function Elimination</h3>

<p>I’ve saved this one for last since it’s my favorite. Suppose we have
this little function, <code class="language-plaintext highlighter-rouge">new_image()</code>, that allocates a greyscale image
for, say, <a href="/blog/2017/11/03/">some multimedia library</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a static function because this would be part of some <a href="https://github.com/nothings/stb">slick
header library</a> (and, secretly, because it’s necessary for
illustrating the issue). Being a responsible citizen, the function
even <a href="/blog/2017/07/19/">checks for integer overflow</a> before allocating anything.</p>
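
<p>The overflow test can be pulled out into its own helper, which makes it easier to exercise on its own. This is a sketch with hypothetical names (<code class="language-plaintext highlighter-rouge">mul_would_overflow()</code>, <code class="language-plaintext highlighter-rouge">alloc_pixels()</code>), not code from any particular library.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if w * h cannot be represented in a size_t, otherwise 0.
 * Dividing instead of multiplying avoids performing the overflow. */
static int mul_would_overflow(size_t w, size_t h)
{
    return w != 0 && h > SIZE_MAX / w;
}

/* Allocate w * h bytes, or return NULL if the product would wrap around. */
static void *alloc_pixels(size_t w, size_t h)
{
    if (mul_would_overflow(w, h))
        return NULL;
    return malloc(w * h);
}
```

<p>The <code class="language-plaintext highlighter-rouge">w != 0</code> guard is what keeps the division itself safe.</p>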

<p>I write a unit test to make sure it detects overflow. This function
should return 0.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_overflow</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far my test passes. Good.</p>

<p>I’d also like to make sure it correctly returns NULL — or, more
specifically, that it doesn’t crash — if the allocation fails. But how
can I make <code class="language-plaintext highlighter-rouge">malloc()</code> fail? As a hack I can pass image dimensions that
I know cannot ever practically be allocated. Essentially I want to
force a <code class="language-plaintext highlighter-rouge">malloc(SIZE_MAX)</code>, i.e. allocate every available byte in my
virtual address space. For a conventional 64-bit machine, that’s 16
exbibytes of memory, and it leaves space for nothing else, including
the program itself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_oom</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I compile with GCC, test passes. I compile with Clang and the test
fails. That is, <strong>the test somehow managed to allocate 16 exbibytes of
memory, <em>and</em> initialize it</strong>. Wat?</p>

<p>Disassembling the test reveals what’s going on:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">test_new_image_overflow:</span>
    <span class="nf">xor</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
    <span class="nf">ret</span>

<span class="nl">test_new_image_oom:</span>
    <span class="nf">mov</span>  <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The first test is actually being evaluated at compile time by the
compiler. The function being tested was inlined into the unit test
itself. This permits the compiler to collapse the whole thing down to
a single instruction. The path with <code class="language-plaintext highlighter-rouge">malloc()</code> became dead code and
was trivially eliminated.</p>

<p>In the second test, Clang correctly determined that the image buffer is
not actually being used, despite the <code class="language-plaintext highlighter-rouge">memset()</code>, so it eliminated the
allocation altogether and then <em>simulated</em> a successful allocation
despite it being absurdly large. Allocating memory is not an observable
side effect as far as the language specification is concerned, so it’s
allowed to do this. My thinking was wrong, and the compiler outsmarted
me.</p>
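
<p>If the goal is to make the allocation really happen in a test, one option is to let the pointer escape. The sketch below is an assumption about how that could be done, with made-up names <code class="language-plaintext highlighter-rouge">sink</code> and <code class="language-plaintext highlighter-rouge">can_allocate()</code>. Storing the pointer in a <code class="language-plaintext highlighter-rouge">volatile</code> object is an observable effect, so in practice the compiler has to call the allocator to obtain a value to store.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Publishing the pointer through a volatile object makes the result of
 * the allocation observable. */
static void *volatile sink;

/* Returns 1 if an allocation of `size` bytes succeeded, otherwise 0. */
static int can_allocate(size_t size)
{
    volatile size_t n = size;  /* hide the constant from the optimizer */
    void *p = malloc(n);
    int ok = p != NULL;
    sink = p;                  /* the pointer escapes here */
    free(p);
    sink = NULL;               /* do not leave a dangling pointer behind */
    return ok;
}
```

<p>This assumes an allocator that reports failure by returning NULL for an impossibly large request, which is the conventional behavior but not something every environment guarantees.</p>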

<p>I soon realized I can take this further and trick Clang into
performing an invalid optimization, <a href="https://bugs.llvm.org/show_bug.cgi?id=37304">revealing a bug</a>. Consider
this slightly-optimized version that uses <code class="language-plaintext highlighter-rouge">calloc()</code> when the shade is
zero (black). The <code class="language-plaintext highlighter-rouge">calloc()</code> function does its own overflow check, so
<code class="language-plaintext highlighter-rouge">new_image()</code> doesn’t need to do it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">shade</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// shortcut</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With this change, my overflow unit test is now also failing. The
situation is even worse than before. The <code class="language-plaintext highlighter-rouge">calloc()</code> is being
eliminated <em>despite the overflow</em>, and replaced with a simulated
success. This time it’s actually a bug in Clang. While failing a unit
test is mostly harmless, <strong>this could introduce a vulnerability in a
real program</strong>. The OpenBSD folks are so worried about this sort of
thing that <a href="https://marc.info/?l=openbsd-cvs&amp;m=150125592126437&amp;w=2">they’ve disabled this optimization</a>.</p>

<p>Here’s a slightly-contrived example of this. Imagine a program that
maintains a table of unsigned integers, and we want to keep track of
how many times the program has accessed each table entry. The “access
counter” table is initialized to zero, but the table of values need
not be initialized, since they’ll be written before first access (or
something like that).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">table</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">counter</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">table_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">table</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Overflow already tested above */</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">free</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">);</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// success</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function relies on the overflow test in <code class="language-plaintext highlighter-rouge">calloc()</code> for the second
<code class="language-plaintext highlighter-rouge">malloc()</code> allocation. However, this is a static function that’s
likely to get inlined, as we saw before. If the program doesn’t
actually make use of the <code class="language-plaintext highlighter-rouge">counter</code> table, and Clang is able to
statically determine this fact, it may eliminate the <code class="language-plaintext highlighter-rouge">calloc()</code>. This
would also <strong>eliminate the overflow test, introducing a
vulnerability</strong>. If an attacker can control <code class="language-plaintext highlighter-rouge">n</code>, then they can
overwrite arbitrary memory through that <code class="language-plaintext highlighter-rouge">values</code> pointer.</p>
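
<p>One way to avoid leaning on <code class="language-plaintext highlighter-rouge">calloc()</code> for the check is to test for overflow explicitly before either allocation. What follows is a minimal sketch under that assumption, not code from the real program; the name <code class="language-plaintext highlighter-rouge">table_init_checked</code> is hypothetical.</p>

```c
#include <stdint.h>
#include <stdlib.h>

struct table {
    unsigned *counter;
    unsigned *values;
};

/* Hypothetical variant of table_init() that performs its own overflow
 * check rather than relying on the one inside calloc().
 * Returns 1 on success, 0 on failure. */
static int
table_init_checked(struct table *t, size_t n)
{
    /* Both arrays hold unsigned elements, so one check covers both. */
    if (n > SIZE_MAX / sizeof(*t->values)) {
        return 0; /* n * sizeof(*t->values) would overflow */
    }
    t->counter = calloc(n, sizeof(*t->counter));
    if (!t->counter) {
        return 0;
    }
    t->values = malloc(n * sizeof(*t->values));
    if (!t->values) {
        free(t->counter);
        return 0;
    }
    return 1;
}
```

<p>Because the comparison is ordinary code with an observable effect on the return value, removing an unused <code class="language-plaintext highlighter-rouge">calloc()</code> should not take the overflow test with it.</p>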

<h3 id="the-takeaway">The takeaway</h3>

<p>Besides this surprising little bug, the main lesson for me is that I
should probably isolate unit tests from the code being tested. The
easiest solution is to put them in separate translation units and avoid
link-time optimization (LTO). Allowing tested functions to be
inlined into the unit tests is probably a bad idea.</p>

<p>The unit test issues in my <em>real</em> program, which was <a href="https://github.com/skeeto/growable-buf">a bit more
sophisticated</a> than what was presented here, gave me artificial
intelligence vibes. It’s that situation where a computer algorithm did
something really clever and I felt it outsmarted me. It’s creepy to
consider <a href="https://wiki.lesswrong.com/wiki/Paperclip_maximizer">how far that can go</a>. I’ve gotten that even from
observing <a href="/blog/2017/04/27/">AI I’ve written myself</a>, and I know for sure no human
taught it some particularly clever trick.</p>

<p>My favorite AI story along these lines is about <a href="https://www.youtube.com/watch?v=xOCurBYI_gY">an AI that learned
how to play games on the Nintendo Entertainment System</a>. It
didn’t understand the games it was playing. Its optimization task was
simply to choose controller inputs that maximized memory values,
because that’s generally associated with doing well — higher scores,
more progress, etc. The most unexpected part came when playing Tetris.
Eventually the screen would fill up with blocks, and the AI would face
the inevitable situation of losing the game, with all that memory
being reinitialized to low values. So what did it do?</p>

<p>Just before the end it would pause the game and wait… forever.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Blast from the Past: Borland C++ on Windows 98</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/04/13/"/>
    <id>urn:uuid:298d2dbe-31eb-30c4-21c2-e019dc5449f6</id>
    <updated>2018-04-13T20:01:31Z</updated>
    <category term="vim"/><category term="c"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>My first exposure to C and C++ was a little over 20 years ago. I
remember it being some version of <a href="https://en.wikipedia.org/wiki/Borland_C%2B%2B">Borland C++</a>, either 4.x
or 5.x, running on Windows 95. I didn’t have <a href="/blog/2016/09/02/">a mentor</a>, so I
did the best I could slowly working through what was probably a poorly
written beginner C++ book, typing out the examples and exercises with
little understanding. Since I didn’t learn much from the experience,
there was a 7 or 8 year gap before I’d revisit C and C++ in college.</p>

<p><a href="/img/win98/enchive.png"><img src="/img/win98/enchive-thumb.png" alt="" /></a></p>

<p>I thought it would be interesting to revisit this software, to
reevaluate it from a far more experienced perspective. Keep in mind
that C++ wasn’t even standardized yet, and the most recent C standard
was from 1989. Given this, what was it like to be a professional
software developer using a Borland toolchain on Windows 20 years ago?
Was it miserable, made bearable only by ignorance of how much better
the tooling could be? Or maybe it actually wasn’t so bad, and these
tools are better than I expect?</p>

<p>Ultimately my conclusion is that it’s a little bit of both. There are
some significant capability gaps compared to today, but the core
toolchain itself is actually quite reasonable, especially for the mid
1990s.</p>

<h3 id="the-setup">The setup</h3>

<p>Before getting into the evaluation, let’s discuss how I got it all up
and running. While it’s <em>technically</em> possible to run Windows 95 on a
modern x86-64 machine thanks to <a href="/blog/2014/12/09/">the architecture’s extreme backwards
compatibility</a>, it’s more compatible, simpler, and safer to
virtualize it. Most importantly, I can emulate older hardware that
will have better driver support.</p>

<p>Despite that early start in Windows all those years ago, I’m primarily
a Linux user. The premier virtualization solution on Linux these days
is KVM, a kernel module <a href="https://www.redhat.com/en/topics/virtualization/what-is-KVM">that turns Linux into a hypervisor</a> and
makes efficient use of hardware virtualization extensions.
Unfortunately pre-XP Windows doesn’t work well on KVM, so instead I’m
using <a href="https://www.qemu.org/">QEmu</a> (with KVM disabled), a hardware emulator closely
associated with KVM. Since it doesn’t take advantage of hardware
virtualization extensions, it will be slower. This is fine since my
goal is to emulate slow, 20+ year old hardware anyway.</p>

<p>There’s very little practical difference between Windows 95 and
Windows 98. Since Windows 98 runs a lot smoother virtualized, I
decided to go with that instead. This will be perfectly sufficient for
my toolchain evaluation.</p>

<h4 id="software">Software</h4>

<p>To get started, I’ll need an installer for Windows 98. I thought this
would be difficult to find, but there’s a copy available on the
Internet Archive. I don’t know how “legitimate” this is, but it works.
Since it’s running in a virtual machine without network access, I also
don’t really care if this copy is somehow infected with malware.</p>

<p>Internet Archive: <a href="https://archive.org/details/win98se_201607">Windows 98 Second Edition</a></p>

<p>Also on the Internet Archive is a complete copy of Borland C++ 5.02,
with the same caveats of legitimacy. It works, which is good enough for
my purposes.</p>

<p>Internet Archive: <a href="https://archive.org/details/BorlandC5.02">Borland C++ 5.02</a></p>

<p>Thank you Internet Archive!</p>

<h4 id="hardware">Hardware</h4>

<p>I’ve got my software, now to set up the virtualized hardware. First I
create a drive image:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qemu-img create -fqcow2 win98.img 8G
</code></pre></div></div>

<p>I gave it 8GB, which is actually a bit overkill. Giving Windows 98 a
virtual hard drive with modern sizes would probably break the
installer. This sort of issue is a common theme among old software,
where there may be complaints about negative available disk space due
to signed integer overflow.</p>
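
<p>To illustrate how a byte count can turn negative, here’s a small sketch. It assumes, purely for illustration, that an old program squeezes the count into a signed 32-bit integer; the helper <code class="language-plaintext highlighter-rouge">as_int32</code> is hypothetical and not taken from any real installer.</p>

```c
#include <stdint.h>

/* Hypothetical helper: view a byte count the way a program limited to
 * signed 32-bit integers might, keeping only the low 32 bits and
 * interpreting them as two's complement. */
static int32_t
as_int32(uint64_t bytes)
{
    uint32_t low = (uint32_t)(bytes & 0xffffffffu);
    if (low >= 0x80000000u) {
        /* Top bit set: compute the negative value by hand to avoid an
         * implementation-defined unsigned-to-signed conversion. */
        return (int32_t)(low - 0x80000000u) - 0x7fffffff - 1;
    }
    return (int32_t)low;
}
```

<p>Under this assumption a 3GB count comes out negative, and an 8GB count wraps all the way around to zero.</p>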

<p>I decided to give the machine 256MB of memory (<code class="language-plaintext highlighter-rouge">-m 256</code>). This is also a
little excessive, but I wanted to be sure memory didn’t limit Borland’s
capabilities. This amount of memory is close to the upper bound, and
going much beyond will likely cause problems with Windows 98.</p>

<p>For the CPU I settled on a Pentium (<code class="language-plaintext highlighter-rouge">-cpu pentium</code>). My original goal
was to go a little simpler with a 486 (<code class="language-plaintext highlighter-rouge">-cpu 486</code>), but the Windows 98
installer kept crashing when I tried this.</p>

<p>I experimented with different configurations for the network card, but
I couldn’t get anything to work. So I’ve disabled networking (<code class="language-plaintext highlighter-rouge">-net
none</code>). The only reason I’d want this is that it would be easier to
move files in and out of the virtual machine.</p>

<p>Finally, here’s how I ran QEmu. The last two lines are only needed when
installing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qemu-system-x86_64 \
    -localtime \
    -cpu pentium \
    -no-acpi \
    -no-hpet \
    -m 256 \
    -hda win98.img \
    -soundhw sb16 \
    -vga cirrus \
    -net none \
    -cdrom "Windows 98 Second Edition.iso" \
    -boot d
</code></pre></div></div>

<p><a href="/img/win98/install.png"><img src="/img/win98/install-thumb.png" alt="" /></a></p>

<h4 id="installation">Installation</h4>

<p>Installation is just a matter of following the instructions. You’ll
need that product key listed on the Internet Archive site.</p>

<p><a href="/img/win98/base.png"><img src="/img/win98/base-thumb.png" alt="" /></a></p>

<p>That copy of Borland is just a big .zip file. This presents two
problems.</p>

<ol>
  <li>
    <p>Without network access, I’ll need to figure out how to get this
inside the virtual machine.</p>
  </li>
  <li>
    <p>This version of Windows doesn’t come with software to unzip this
file. I’d need to find and install an unzip tool first.</p>
  </li>
</ol>

<p>Fortunately I can kill two birds with one stone by converting that .zip
archive into a .iso and mounting it in the virtual machine.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip "BORLAND C++.zip"
genisoimage -R -J -o borland.iso "BORLAND C++"
</code></pre></div></div>

<p>Then in the QEmu console (<kbd>C-A-2</kbd>) I attach it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>change ide1-cd0 borland.iso
</code></pre></div></div>

<p>This little trick of generating .iso files and mounting them is how I
will be moving all the other files into the virtual machine.</p>

<h3 id="borland-c">Borland C++</h3>

<p>The first thing I did was play around with the Borland IDE. This is
what I would have been using 20 years ago.</p>

<p><a href="/img/win98/ide.png"><img src="/img/win98/ide-thumb.png" alt="" /></a></p>

<p>Despite being Borland <em>C++</em>, I’m personally most interested in its ANSI
C compiler. As I already pointed out, this software pre-dates C++’s
standardization, and a lot has changed over the past two decades. On the
other hand, C <em>hasn’t really changed all that much</em>. The 1999 update to
the C standard (i.e. “C99”) was big and important, but otherwise little
has changed. The biggest drawback is the lack of “declare anywhere”
variables, including in for-loop initializers. Otherwise it’s the same
as writing C today.</p>
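
<p>As a rough illustration of that restriction, here’s a small function written the way a C89 compiler requires. The function <code class="language-plaintext highlighter-rouge">sum_bytes</code> is a made-up example, not something from Borland or Enchive.</p>

```c
#include <stddef.h>

/* C89 style: every declaration sits at the top of its block, before
 * any statement, and the loop counter is declared outside the for. */
static unsigned long
sum_bytes(const unsigned char *buf, size_t len)
{
    size_t i;               /* C99 would allow: for (size_t i = 0; ...) */
    unsigned long sum = 0;

    for (i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}
```

<p>The same source also compiles unchanged with a modern compiler, which is part of what makes this style so portable.</p>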

<p>To test drive the IDE, I made a couple of test projects, built and ran
them with different options, and poked around with the debugger. The
debugger is actually pretty decent, especially for the 1990s. It can be
operated via the IDE or standalone, so I could use it without firing up
the IDE and making a project.</p>

<p>The toolchain includes an assembler, and I can inspect the compiler’s
assembly output. To nobody’s surprise this is Intel-flavored assembly,
which <a href="http://x86asm.net/articles/what-i-dislike-about-gas/">is very welcome</a>. Imagining myself as a software developer
in the mid 1990s, this means I can see exactly what the compiler’s doing
as well as write some of the performance sensitive parts in assembly if
necessary.</p>

<p>The built-in editor is the worst part of the IDE, which is unfortunate
since it really spoils the whole experience. It’s easy to jump between
warnings and errors, it has incremental search, and it has good syntax
highlighting. But these are the only positive things I can say about it.
If I had to work with this editor full-time, I’d spend my days pretty
irritated.</p>

<h3 id="switch-to-command-line-tools">Switch to command line tools</h3>

<p>Like with the debugger, the Borland people did a good job modularizing
their development tools. As part of the installation process, all of the
Borland command line tools are added to the system <code class="language-plaintext highlighter-rouge">PATH</code> (reminder:
this is a single-user system). This includes compiler, linker,
assembler, debugger, and even an <a href="/blog/2017/08/20/">incomplete implementation</a> of
<code class="language-plaintext highlighter-rouge">make</code>.</p>

<p>With this, I can essentially pretend the IDE doesn’t exist and replace
that crummy editor with something better: Vim.</p>

<p>The last version of Vim to support MS-DOS and Windows 95/98 is Vim 7.3,
released in 2010. I download those binaries, trim a few things from my
<a href="https://github.com/skeeto/dotfiles/blob/master/_vimrc">.vimrc</a>, and smuggle it all into my virtual machine via a
virtual CD. I’ve now got a powerful text editor in Windows 98 and my
situation has drastically improved.</p>

<p><a href="/img/win98/vim.png"><img src="/img/win98/vim-thumb.png" alt="" /></a></p>

<p>Since I hardly use features added since Vim 7.3, this feels <a href="/blog/2017/04/01/">right at
home</a> to me. I can <a href="/blog/2017/08/22/">invoke the build</a> from Vim, and it
can populate the quickfix list from Borland’s output, so I could
actually be fairly productive in these circumstances! I’m honestly
really impressed with how well this all works together.</p>

<p>At this point I only have two significant annoyances:</p>

<ol>
  <li>
    <p>Borland’s command line tools belong to that category of irritating
programs that print their version banner on every invocation.
There’s not even a command line switch to turn this off. All this
noise is quickly tiresome. The <a href="/blog/2016/06/13/">Visual Studio toolchain</a> does
the same thing by default, though it can be turned off (<code class="language-plaintext highlighter-rouge">-nologo</code>).
I dislike that some GNU tools also commit this sin, but at least
GNU limits this to interactive programs.</p>
  </li>
  <li>
    <p>The Windows/DOS command shell and console is <em>even worse</em> <a href="/blog/2017/11/30/">than it
is today</a>. I didn’t think that was possible. This is back when
it was still genuinely DOS and not just pretending to be (e.g. in
NT). The worst part by far is the lack of command history. There’s
no using the up-arrow to get previous commands. There’s no tab
completion. Forward slash is not a substitute for backslash in
paths. If I wanted to improve my productivity, replacing this
console and shell would be the first priority.</p>
  </li>
</ol>

<p><strong>Update</strong>: In an email, Aristotle Pagaltzis informed me that Windows 98
comes with <a href="https://en.wikipedia.org/wiki/DOSKEY">DOSKEY.COM</a>, which provides command history for
COMMAND.COM. Alternatively there’s <a href="http://paulhoule.com/doskey/">Enhanced DOSKEY.com</a>, an
open source, alternative implementation that also provides tab
completion for commands and filenames. This makes the console a lot
more usable (and, honestly, in some ways better than the modern
defaults).</p>

<h3 id="building-enchive-with-borland">Building Enchive with Borland</h3>

<p>Last year I wrote <a href="/blog/2017/03/12/">a backup encryption tool called Enchive</a>,
and I still use it regularly. One of my design goals was high
portability since it may be needed to decrypt something important in
the distant future. It should be as <a href="https://en.wikipedia.org/wiki/Software_rot">bit-rot</a>-proof as
possible. <strong>In software, the best way to <em>future</em>-proof is to
<em>past</em>-proof.</strong></p>

<p>If I had a time machine that could send source code back in time, and
I sent Enchive to a competent developer 20 years ago, would they be
able to compile it and run it? If the answer is yes, then that means
Enchive already has 20 years of future-proofing built into it.</p>

<p>To accomplish this, Enchive is 3,300 lines of strict ANSI C,
1989-style, with no dependencies other than the C standard library and
a handful of operating system functions — i.e. functionality not in
the C standard library. In practice, any ANSI C compiler targeting
either POSIX, or Windows 95 or later, should be able to compile it.</p>

<p>My Windows 98 virtual machine includes an ANSI C compiler, and can be
used to simulate this time machine. I generated an “amalgamation” build
(<code class="language-plaintext highlighter-rouge">make amalgamation</code>) — essentially a concatenation of all the source
files — and sent this into the virtual machine. Before Borland was able
to compile it, I needed to make three small changes.</p>

<p>First, Enchive includes <code class="language-plaintext highlighter-rouge">stdint.h</code> to get fixed-width integers needed
for the encryption routines. This header comes from C99, and C89 has
no equivalent. I anticipated this problem from the beginning and made
it easy for the person performing the build to correct it. This header
is included exactly once, in <code class="language-plaintext highlighter-rouge">config.h</code>, and this is placed at the top
of the amalgamation build. The include only needs to be replaced with
a handful of manual typedefs. For Borland that looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">char</span>    <span class="kt">uint8_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">short</span>   <span class="kt">uint16_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">long</span>    <span class="kt">uint32_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="kt">unsigned</span> <span class="n">__int64</span> <span class="kt">uint64_t</span><span class="p">;</span>

<span class="k">typedef</span> <span class="kt">long</span>             <span class="kt">int32_t</span><span class="p">;</span>
<span class="k">typedef</span> <span class="n">__int64</span>          <span class="kt">int64_t</span><span class="p">;</span>

<span class="cp">#define INT8_C(n)   (n)
#define INT16_C(n)  (n)
#define INT32_C(n)  (n##L)
</span></code></pre></div></div>

<p>Second, in more recent versions of Windows, <code class="language-plaintext highlighter-rouge">GetFileAttributes()</code> can
return the value <code class="language-plaintext highlighter-rouge">INVALID_FILE_ATTRIBUTES</code>. Checking for an error that
cannot happen is harmless, but this value isn’t defined in Borland’s
SDK. I only had to eliminate that check.</p>

<p>Third, the <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa379942(v=vs.85).aspx"><code class="language-plaintext highlighter-rouge">CryptGenRandom()</code></a> interface isn’t defined in
Borland’s SDK. This is used by Enchive to generate keys. MSDN reports
this function wasn’t available until Windows XP, but it’s definitely
there in Windows 98, exported by ADVAPI32.dll. I’m able to call it,
though it always reports an error. Perhaps it’s been disabled in this
version due to <a href="https://en.wikipedia.org/wiki/Export_of_cryptography_from_the_United_States">cryptographic export restrictions</a>?</p>

<p>Regardless of what’s wrong, I ripped this out and replaced it with a
fatal error. This version of Enchive can’t generate new keys — unless
derived from a passphrase — nor encrypt files, including the use of a
protection key to encrypt the secret key. However, it <em>can</em> decrypt
files, which is the important part that needs to be future-proofed.</p>

<p>With these three changes — which took me about 10 minutes to sort out —
Enchive builds and runs, and it correctly decrypts files I encrypted on
Linux. So Enchive has at least 20 years of past-proofing! The
screenshot at the top of this article shows it running successfully in
an MS-DOS console window.</p>

<h3 id="whats-wrong-whats-missing">What’s wrong? What’s missing?</h3>

<p>I mentioned that there were some gaps. The most obvious is the lack of
the standard POSIX utilities, especially a decent shell. I don’t know if
any had been ported to Windows in the mid 1990s. But that could be
solved one way or another without too much trouble, even if it meant
doing some of that myself.</p>

<p>No, the biggest capability I’d miss, and which wouldn’t be easily
obtained, is Git, or at least a decent source control system. I really
don’t want to work without proper source control. Git’s support for
Windows is second tier, and the port to modern Windows is already a
bit of a hack. Getting it to run in Windows 98 would probably be a
challenge, especially if I had to compile it with Borland.</p>

<p>The other major issue is the lack of stability. In this experiment, I’ve
been seeing this screen <em>a lot</em>:</p>

<p><a href="/img/win98/bsod.png"><img src="/img/win98/bsod-thumb.png" alt="" /></a></p>

<p>I remember Windows crashing a lot back in those days, and it certainly
had a bad reputation for being unstable, but this is far worse than I
remembered. While the hardware emulator may be <em>somewhat</em> at fault here,
keep in mind that I never installed third party drivers. Most of these
crashes are Windows’ fault. I found I can reliably bring the whole
system down with a single <code class="language-plaintext highlighter-rouge">GetProcAddress()</code> call on a system DLL. The
only way I can imagine this instability was so tolerated back then was
general ignorance that computing could be so much better.</p>

<p>I was tempted to write this article in Vim on Windows 98, but all this
crashing made me too nervous. I didn’t want some stupid filesystem
corruption to wipe out my work. Too risky.</p>

<h3 id="a-better-alternative">A better alternative</h3>

<p>If I was stuck working in Windows 98 — or was at least targeting it as a
platform — but had access to a modern tooling ecosystem, could I do
better than Borland? Yes! Programs built by <a href="https://mingw-w64.org/doku.php">Mingw-w64</a> can be
run even as far back as Windows 95.</p>

<p>Now, there’s a catch. I thought it would be this simple:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ i686-w64-mingw32-gcc -Os hello.c
</code></pre></div></div>

<p>But when I brought the resulting binary into the virtual machine it
crashed when I ran it: illegal instruction. Turns out it contained a
conditional move (<code class="language-plaintext highlighter-rouge">cmov</code>) which is an instruction not available until
the Pentium Pro (686). The “pentium” emulation is just a 586.</p>

<p>I tried to disable <code class="language-plaintext highlighter-rouge">cmov</code> by picking the specific architecture:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ i686-w64-mingw32-gcc -march=pentium -Os hello.c
</code></pre></div></div>

<p>This still didn’t work because the statically-linked part of the CRT
contained the <code class="language-plaintext highlighter-rouge">cmov</code>. I’d have to recompile that as well.</p>

<p>I could have switched the QEmu options to “upgrade” to a Pentium Pro,
but remember that my goal was really the 486. Fortunately this was easy
to fix: compile my own Mingw-w64 cross-compiler. I’ve done this a number
of times before, so I knew it wouldn’t be difficult.</p>

<p>I could go step by step, but it’s all fairly well documented in the
Mingw-w64 “howto-build” document. I used GCC 7.3 (the latest version),
and for the target I picked “i486-w64-mingw32”. When it was done I could
compile binaries on Linux to run in my Windows 98 virtual machine:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ i486-w64-mingw32-gcc -Os hello.c
</code></pre></div></div>

<p>This should enable quite a bit of modern software to run inside my
virtual machine if I so wanted. I didn’t actually try this (yet?),
but, to take this concept all the way, I could use this cross-compiler
to cross-compile Mingw-w64 itself to run inside the virtual machine,
directly replacing Borland C++.</p>

<p>And the only thing I’d miss about Borland is its debugger.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A Crude Personal Package Manager</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/03/27/"/>
    <id>urn:uuid:b100f50f-c8f8-3a08-e149-a04b2308226b</id>
    <updated>2018-03-27T02:10:35Z</updated>
    <category term="c"/><category term="posix"/><category term="linux"/><category term="bsd"/>
    <content type="html">
      <![CDATA[<p>For the past couple of months I’ve been using a custom package manager
to manage a handful of software packages within various unix-like
environments. Packages are <a href="/blog/2017/06/19/">installed in my home directory</a> under
<code class="language-plaintext highlighter-rouge">~/.local</code>, and the package manager itself is just a 110 line Bourne
shell script. It’s not intended to replace the system’s package
manager but, instead, complement it in some cases where I need more
flexibility. I use it to run custom versions of specific pieces of
software — newer or older than the system-installed versions, or with my
own patches and modifications — without interfering with the rest of
the system, and without a need for root access. It’s worked out <em>really</em>
well so far and I expect to continue making heavy use of it in the
future.</p>

<p>It’s so simple that I haven’t even bothered putting the script in its
own repository. It sits unadorned within my dotfiles repository with the
name <em>qpkg</em> (“quick package”):</p>

<ul>
  <li><a href="https://github.com/skeeto/dotfiles/blob/master/bin/qpkg">https://github.com/skeeto/dotfiles/blob/master/bin/qpkg</a></li>
</ul>

<p>Sitting alongside my dotfiles means it’s always there when I need it,
just as if it was a built-in command.</p>

<p>I say it’s crude because its “install” (<code class="language-plaintext highlighter-rouge">-I</code>) procedure is little more
than a wrapper around tar. It doesn’t invoke libtool after installing a
library, and there’s no post-install script — or <code class="language-plaintext highlighter-rouge">postinst</code> as Debian
calls it. It doesn’t check for conflicts between packages, though
there’s a command for doing so manually ahead of time. It doesn’t manage
dependencies, nor even have them as a concept. That’s all on the user to
screw up.</p>

<p>In other words, it doesn’t attempt to solve most of the hard problems
tackled by package managers… <em>except</em> for three important issues:</p>

<ol>
  <li>
    <p>It provides a clean, guaranteed-to-work uninstall procedure. Some
Makefiles <em>do</em> have a token “uninstall” target, but it’s often
unreliable.</p>
  </li>
  <li>
    <p>Unlike blindly using a Makefile “install” target, I can check for
conflicts <em>before</em> installing the software. I’ll know if and how a
package clobbers an already-installed package, and I can manage, or
ignore, that conflict manually as needed.</p>
  </li>
  <li>
    <p>It produces a compact, reusable package file that I can reinstall
later, even on a different machine (with a couple of caveats). I
don’t need to keep around the original source and build directories
should I want to install or uninstall later. I can also rapidly
switch back and forth between different builds of the same software.</p>
  </li>
</ol>

<p>The first caveat is that the package will be configured for exactly my
own home directory, so I usually can’t share it with other users, or
install it on machines where I have a different home directory. Though I
could still create packages for different installation prefixes.</p>

<p>The second caveat is that some builds tailor themselves by default to
the host (e.g. <code class="language-plaintext highlighter-rouge">-march=native</code>). If care isn’t taken, those packages may
not be very portable. This is more common than I had expected and has
mildly annoyed me.</p>

<h3 id="birth-of-a-package-manager">Birth of a package manager</h3>

<p>While the package manager is new, I’ve been building and installing
software in my home directory for years. I’d follow the normal process
of setting the install <em>prefix</em> to <code class="language-plaintext highlighter-rouge">$HOME/.local</code>, running the build,
and then letting the “install” target do its thing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
</code></pre></div></div>

<p>This worked well enough for years. However, I’ve come to rely a lot on
this technique, and I’m using it for increasingly sophisticated
purposes, such as building custom cross-compiler toolchains.</p>

<p>A common difficulty has been handling the release of new versions of
software. I’d like to upgrade to the new version, but lack a way to
cleanly uninstall the previous version. Simply clobbering the old
version by installing the new one on top <em>usually</em> works. Occasionally it
doesn’t, and I have to blow away <code class="language-plaintext highlighter-rouge">~/.local</code> and start all over again.
With more and more software installed in my home directory, restarting
has become more and more of a chore that I’d like to avoid.</p>

<p>What I needed was a way to track exactly which files were installed so
that I could remove them later when I needed to uninstall. Fortunately
there’s a widely-used convention for exactly this purpose: <code class="language-plaintext highlighter-rouge">DESTDIR</code>.</p>

<p>It’s expected that when a Makefile provides an “install” target, it
prefixes the installation path with the <code class="language-plaintext highlighter-rouge">DESTDIR</code> macro, which is
assigned to the empty string by default. This allows the user to install
the software to a temporary location for the purposes of packaging.
Unlike the installation prefix (<code class="language-plaintext highlighter-rouge">--prefix</code>) configured before the build
began, the software is not expected to function properly when run in the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> location.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=$DESTDIR install
</code></pre></div></div>

<p>A different tool will be used to copy these files into place and actually
install the software. This tool can track what files were installed, allowing them
to be removed later when uninstalling. My package manager uses the tar
program for both purposes. First it creates a package by packing up the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> (at the root of the actual install prefix):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar czf package.tgz -C $DESTDIR$HOME/.local .
</code></pre></div></div>

<p>So a package is nothing more than a gzipped tarball. To install, it
unpacks the tarball in <code class="language-plaintext highlighter-rouge">~/.local</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd $HOME/.local
$ tar xzf ~/package.tgz
</code></pre></div></div>

<p>But how does it uninstall a package? It didn’t keep track of what was
installed. Easy! The tarball itself contains the package list, and it’s
printed with tar’s <code class="language-plaintext highlighter-rouge">t</code> mode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd $HOME/.local
for file in $(tar tzf package.tgz | grep -v '/$'); do
    rm -f "$file"
done
</code></pre></div></div>

<p>I’m using <code class="language-plaintext highlighter-rouge">grep</code> to skip directories, which are conveniently listed with
a trailing slash. Note that in the example above, there are a couple of
issues with file names containing whitespace. If a file name contains a
space character, it will word split incorrectly in the <code class="language-plaintext highlighter-rouge">for</code> loop. A
Makefile couldn’t handle such a file in the first place, but, in case
it’s still necessary, my package manager sets <code class="language-plaintext highlighter-rouge">IFS</code> to just a newline.</p>

<p>If the file name contains a newline, then my package manager relies on
<a href="http://dinaburg.org/bitsquatting.html">a cosmic ray striking just the right bit at just the right
instant</a> to make it all work out, because no version of tar can
unambiguously print such file names. Crossing your fingers during this
process may help.</p>

<h3 id="commands">Commands</h3>

<p>There are five commands, each assigned to a capital letter: <code class="language-plaintext highlighter-rouge">-B</code>, <code class="language-plaintext highlighter-rouge">-C</code>,
<code class="language-plaintext highlighter-rouge">-I</code>, <code class="language-plaintext highlighter-rouge">-V</code>,  and <code class="language-plaintext highlighter-rouge">-U</code>. It’s an interface pattern inspired by <a href="https://www.openbsd.org/papers/bsdcan-signify.html">Ted
Unangst’s signify</a> (see <a href="https://man.openbsd.org/signify.1"><code class="language-plaintext highlighter-rouge">signify(1)</code></a>). I also used this
pattern with <a href="/blog/2017/09/15/">Blowpipe</a> and, in retrospect, wish I had also used it
with <a href="/blog/2017/03/12/">Enchive</a>.</p>

<h4 id="build--b">Build (<code class="language-plaintext highlighter-rouge">-B</code>)</h4>

<p>Unlike the other four commands, the “build” command isn’t essential,
and is just for convenience. It assumes the build uses an Autoconf-like
configure script and runs it automatically, followed by <code class="language-plaintext highlighter-rouge">make</code> with the
appropriate <code class="language-plaintext highlighter-rouge">-j</code> (jobs) option. It automatically sets the <code class="language-plaintext highlighter-rouge">--prefix</code>
argument when running the configure script.</p>

<p>If the build uses something other than an Autoconf-like configure script,
such as CMake, then you can’t use the “build” command and must perform
the build yourself. For example, I must do this when building LLVM and
Clang.</p>

<p>Before using the “build” command, the package must first be unpacked and
patched if necessary. Then the package manager can take over to run the
build.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 &lt; ../0001.patch
$ patch -p1 &lt; ../0002.patch
$ patch -p1 &lt; ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/
</code></pre></div></div>

<p>In this example I’m doing an out-of-source build by invoking the
configure script from a different directory. Did you know Autoconf
scripts support this? I didn’t know until recently! Unfortunately some
hand-written Autoconf-like scripts don’t, though this will
be immediately obvious.</p>

<p>Once <code class="language-plaintext highlighter-rouge">qpkg</code> returns, the program will be fully built — or stuck on a
build error if you’re unlucky. If you need to pass custom configure
options, just tack them on the <code class="language-plaintext highlighter-rouge">qpkg</code> command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses
</code></pre></div></div>

<p>Since creating the build directory and moving into it are such common
steps, there’s an optional switch for them: <code class="language-plaintext highlighter-rouge">-d</code>.
This option’s argument is the build directory. <code class="language-plaintext highlighter-rouge">qpkg</code> creates that
directory and runs the build inside it. In practice I just use “x” for
the build directory since it’s so quick to add “dx” to the command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/
</code></pre></div></div>

<p>With the software compiled, the next step is creating the package.</p>

<h4 id="create--c">Create (<code class="language-plaintext highlighter-rouge">-C</code>)</h4>

<p>The “create” command creates the <code class="language-plaintext highlighter-rouge">DESTDIR</code> (<code class="language-plaintext highlighter-rouge">_destdir</code> in the working
directory) and runs the “install” Makefile target to fill it with files.
Continuing with the example above and its <code class="language-plaintext highlighter-rouge">x/</code> build directory:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -Cdx name
</code></pre></div></div>

<p>Where “name” is the name of the package, without any file name
extension. Like with “build”, extra arguments after the package name are
passed to <code class="language-plaintext highlighter-rouge">make</code> in case there needs to be any additional tweaking.</p>

<p>When the “create” command finishes, there will be a new package named
<code class="language-plaintext highlighter-rouge">name.tgz</code> in the working directory. At this point the source and build
directories are no longer needed, assuming everything went fine.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ rm -rf name-version/
$ rm -rf x/
</code></pre></div></div>

<p>This package is ready to install, though you may want to verify it
first.</p>

<h4 id="verify--v">Verify (<code class="language-plaintext highlighter-rouge">-V</code>)</h4>

<p>The “verify” command checks for collisions against installed packages.
It works like uninstallation, but rather than deleting files, it checks
if any of the files already exist. If they do, it means there’s a
conflict with an existing package. These file names are printed.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -V name.tgz
</code></pre></div></div>

<p>The most common conflict I’ve seen is in the info index (<code class="language-plaintext highlighter-rouge">info/dir</code>)
file, which is safe to ignore since I don’t care about it.</p>

<p>If the package has already been installed, there will of course be tons
of conflicts. This is the easiest way to check if a package has been
installed.</p>

<h4 id="install--i">Install (<code class="language-plaintext highlighter-rouge">-I</code>)</h4>

<p>The “install” command is just the dumb <code class="language-plaintext highlighter-rouge">tar xzf</code> explained above. It
will clobber anything in its way without warning, which is why, if that
matters, “verify” should be used first.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -I name.tgz
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">qpkg</code> returns, the package has been installed and is probably
ready to go. A lot of packages complain that you need to run libtool to
finalize an installation, but I’ve never had a problem skipping it. This
dumb unpacking generally works fine.</p>

<h4 id="uninstall--u">Uninstall (<code class="language-plaintext highlighter-rouge">-U</code>)</h4>

<p>Obviously the last command is “uninstall”. As explained above, this
needs the original package that was given to the “install” command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qpkg -U name.tgz
</code></pre></div></div>

<p>Just as “install” is dumb, so is “uninstall,” blindly deleting anything
listed in the tarball. One thing I like about dumb tools is that there
are no surprises.</p>

<p>I typically suffix the package name with the version number to help keep
the packages organized. When upgrading to a new version of a piece of
software, I build the new package, which, thanks to the version suffix,
will have a distinct name. Then I uninstall the old package, and,
finally, install the new one in its place. So far I’ve been keeping the
old package around in case I still need it, though I could always
rebuild it in a pinch.</p>

<h3 id="package-by-accumulation">Package by accumulation</h3>

<p>Building a GCC cross-compiler toolchain is a tricky case that doesn’t
fit so well with the build, create, and install process illustrated
above. It would be nice for the cross-compiler to be a single, big
package, but due to the way it’s built, it would need to be five or so
packages, a couple of which will conflict (one being a subset of
another):</p>

<ol>
  <li>binutils</li>
  <li>C headers</li>
  <li>core GCC</li>
  <li>C runtime</li>
  <li>rest of GCC</li>
</ol>

<p>Each step needs to be installed before the next step will work. (I don’t
even want to think about cross-compiling a cross-compiler.)</p>

<p>To deal with this, I added a “keep” (<code class="language-plaintext highlighter-rouge">-k</code>) option that leaves the
<code class="language-plaintext highlighter-rouge">DESTDIR</code> around after creating the package. To keep things tidy, the
intermediate packages exist and are installed, but the final, big
cross-compiler package <em>accumulates</em> into the <code class="language-plaintext highlighter-rouge">DESTDIR</code>. The final
package at the end is actually the whole cross compiler in one package,
a superset of them all.</p>

<p>Complicated situations like these are where I can really understand the
value of Debian’s <a href="https://wiki.debian.org/FakeRoot">fakeroot</a> tool.</p>

<h3 id="my-use-case-and-an-alternative">My use case, and an alternative</h3>

<p>The role filled by my package manager is actually pretty well suited for
<a href="https://www.pkgsrc.org/">pkgsrc</a>, which is NetBSD’s ports system made available to other
unix-like systems. However, I just need something really lightweight
that gives me absolute control — even more than I get with pkgsrc — in
the dozen or so cases where I <em>really</em> need it.</p>

<p>All I need is a standard C toolchain in a unix-like environment (even a
really old one), the source tarballs for the software I need, my
110-line shell script package manager, and one to two cans of elbow grease.
From there I can bootstrap everything I might need without root access,
even <a href="/blog/2017/04/01/">in a disaster</a>. If the software I need isn’t written in C, it
can ultimately get bootstrapped from some crusty old C compiler, which
might even involve building some newer C compilers in between. After a
certain point it’s C all the way down.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Inspiration from Data-dependent Rotations</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/02/07/"/>
    <id>urn:uuid:917b72f1-3aad-3af3-075a-a4b0a6eb8a4d</id>
    <updated>2018-02-07T23:59:59Z</updated>
    <category term="c"/><category term="crypto"/><category term="meta"/>
    <content type="html">
      <![CDATA[<p>This article is an expanded email I wrote in response to a question
from Frank Muller. He asked how I arrived at my solution to a
<a href="/blog/2017/10/06/">branchless UTF-8 decoder</a>:</p>

<blockquote>
  <p>I mean, when you started, I’m pretty the initial solution was using
branches, right? Then, you’ve applied some techniques to eliminate
them.</p>
</blockquote>

<p>A bottom-up approach that begins with branches and then proceeds to
eliminate them one at a time sounds like a plausible story. However,
this story is the inverse of how it actually played out. It began when I
noticed a branchless decoder could probably be done, then I put together
the pieces one at a time without introducing any branches. But what
sparked that initial idea?</p>

<p>The two prior posts reveal my train of thought at the time: <a href="/blog/2017/09/15/">a look at
the Blowfish cipher</a> and <a href="/blog/2017/09/21/">a 64-bit PRNG shootout</a>. My
layman’s study of Blowfish was actually part of an examination of a
number of different block ciphers. For example, I also read the NSA’s
<a href="http://eprint.iacr.org/2013/404.pdf">Speck and Simon paper</a> and then <a href="https://github.com/skeeto/scratch/tree/master/speck">implemented the 128/128 variant
of Speck</a> — a 128-bit key and 128-bit block. I didn’t take the
time to write an article about it, but note how the entire cipher —
key schedule, encryption, and decryption — is just 40 lines of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">speck</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">k</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
<span class="p">};</span>

<span class="kt">void</span>
<span class="nf">speck_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">speck</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">x</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">31</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">);</span>
        <span class="n">x</span> <span class="o">+=</span> <span class="n">y</span><span class="p">;</span>
        <span class="n">x</span> <span class="o">^=</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">&lt;&lt;</span> <span class="mi">3</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">y</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
        <span class="n">y</span> <span class="o">^=</span> <span class="n">x</span><span class="p">;</span>
        <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">speck_encrypt</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">speck</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">32</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="o">*</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">);</span>
        <span class="o">*</span><span class="n">x</span> <span class="o">+=</span> <span class="o">*</span><span class="n">y</span><span class="p">;</span>
        <span class="o">*</span><span class="n">x</span> <span class="o">^=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="o">*</span><span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">y</span> <span class="o">&lt;&lt;</span> <span class="mi">3</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="o">*</span><span class="n">y</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
        <span class="o">*</span><span class="n">y</span> <span class="o">^=</span> <span class="o">*</span><span class="n">x</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">speck_decrypt</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">speck</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">32</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">y</span> <span class="o">^=</span> <span class="o">*</span><span class="n">x</span><span class="p">;</span>
        <span class="o">*</span><span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">y</span> <span class="o">&gt;&gt;</span> <span class="mi">3</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="o">*</span><span class="n">y</span> <span class="o">&lt;&lt;</span> <span class="mi">61</span><span class="p">);</span>
        <span class="o">*</span><span class="n">x</span> <span class="o">^=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">[</span><span class="mi">31</span> <span class="o">-</span> <span class="n">i</span><span class="p">];</span>
        <span class="o">*</span><span class="n">x</span> <span class="o">-=</span> <span class="o">*</span><span class="n">y</span><span class="p">;</span>
        <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="o">*</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">56</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Isn’t that just beautiful? It’s so tiny and fast. Other than the
not-very-arbitrary selection of 32 rounds, and the use of 3-bit and
8-bit rotations, there are no magic values. One could fairly
reasonably commit this cipher to memory if necessary, similar to the
late RC4. Speck is probably my favorite block cipher right now,
<em>except</em> that I couldn’t figure out the key schedule for any of the
other key/block size variants.</p>

<p>Another cipher I studied, though in less depth, was <a href="http://people.csail.mit.edu/rivest/Rivest-rc5rev.pdf">RC5</a> (1994),
a block cipher by (obviously) Ron Rivest. The most novel part of RC5
is its use of data dependent rotations. This was a very deliberate
decision, and the paper makes this clear:</p>

<blockquote>
  <p>RC5 should <em>highlight the use of data-dependent rotations</em>, and
encourage the assessment of the cryptographic strength data-dependent
rotations can provide.</p>
</blockquote>

<p>What’s a data-dependent rotation? In the Speck cipher shown above,
notice how the right-hand side of all the rotation operations is a
constant (3, 8, 56, and 61). Suppose that these operands were not
constant, but were instead based on some part of the value of the
block:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">y</span> <span class="o">&amp;</span> <span class="mh">0x0f</span><span class="p">;</span>
    <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="n">r</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="o">*</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="p">((</span><span class="mi">64</span> <span class="o">-</span> <span class="n">r</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">63</span><span class="p">));</span>
</code></pre></div></div>

<p>Such “random” rotations “frustrate differential cryptanalysis” according
to the paper, increasing the strength of the cipher.</p>
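<p>One practical wrinkle: a rotation count taken from data can be zero, and
in C shifting a 64-bit value by 64 is undefined. Here is a minimal sketch
of a rotation that is safe for any count. The names <code class="language-plaintext highlighter-rouge">rotr64</code> and
<code class="language-plaintext highlighter-rouge">mix_step</code> are my own for illustration and are not taken from RC5 or
Speck.</p>

```c
#include <stdint.h>

/* Rotate x right by r bits, where r comes from data rather than a
 * constant. Masking both counts keeps every shift in the range 0..63,
 * so r == 0 (or any multiple of 64) never causes an undefined shift. */
static uint64_t
rotr64(uint64_t x, unsigned r)
{
    return (x >> (r & 63)) | (x << (-r & 63));
}

/* Hypothetical mixing step: the low four bits of y select how far x
 * is rotated, so the rotation depends on the block's own contents. */
static uint64_t
mix_step(uint64_t x, uint64_t y)
{
    unsigned r = (unsigned)(y & 0x0f);
    return rotr64(x, r);
}
```

<p>Compilers commonly recognize this masked idiom and emit a single rotate
instruction where the target has one, though that is worth checking in the
generated assembly rather than assuming.</p>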

<p>Another algorithm that uses a data-dependent shift is the <a href="http://www.pcg-random.org/">PCG family of
PRNGs</a>. Honestly, the data-dependent “permutation” shift is <em>the</em>
defining characteristic of PCG. As a reminder, here’s the simplified PCG
from my shootout:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">spcg32</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="n">shift</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice how the final shift depends on the high order bits of the PRNG
state. (This one weird trick from Melissa O’Neill will significantly
improve your PRNG. Xorshift experts hate her.)</p>

<p>I think this raises a really interesting question: Why did it take until
2014 for someone to apply a data-dependent shift to a PRNG? Similarly,
why are <a href="https://crypto.stackexchange.com/q/20325">data-dependent rotations not used in many ciphers</a>?</p>

<p>My own theory is that this is because many older instruction set
architectures can’t perform data-dependent shift operations efficiently.</p>

<p>Many instruction sets only have a fixed shift (e.g. 1-bit), or can
only shift by an immediate (i.e. a constant). In these cases, a
data-dependent shift would require a loop. These loops would be a ripe
source of side channel attacks in ciphers due to the difficulty of
making them operate in constant time. It would also be relatively slow
for video game PRNGs, which often needed to run on constrained
hardware with limited instruction sets. For example, the 6502 (Atari,
Apple II, NES, Commodore 64) and the Z80 (too many to list) can only
shift/rotate one bit at a time.</p>
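<p>To make the timing problem concrete, here is a rough C model of a
variable shift on a machine that can only shift one bit at a time. It is a
sketch, not real 6502 or Z80 code, and the function name and step counter
are made up for illustration.</p>

```c
#include <stdint.h>

/* Shift an 8-bit value left by n using only 1-bit shifts, roughly what
 * a CPU without a variable shift instruction has to do. The number of
 * loop iterations is written to *steps as a stand-in for running time. */
static uint8_t
shl8_loop(uint8_t x, unsigned n, unsigned *steps)
{
    *steps = 0;
    while (n > 0) {
        x = (uint8_t)(x << 1);  /* the only shift the hardware offers */
        n--;
        (*steps)++;
    }
    return x;
}
```

<p>The step count equals the shift amount, so if that amount is derived
from secret data, the running time reveals something about it.</p>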

<p>Even on an architecture with an instruction for data-dependent shifts,
such as the x86, those shifts will be slower than constant shifts, at
least in part due to the additional data dependency.</p>

<p>It turns out there are also some patent issues (e.g. <a href="https://www.google.com/patents/US5724428">1</a>, <a href="https://www.google.com/patents/US6269163">2</a>).
Fortunately most of these patents have now expired, and one in
particular is set to expire this June. I still like my theory better.</p>

<h3 id="to-branchless-decoding">To branchless decoding</h3>

<p>So I was thinking about data-dependent shifts, and I had also noticed I
could trivially check the length of a UTF-8 code point using a small
lookup table — the first step in my decoder. What if I combined these: a
data-dependent shift based on a table lookup? This would become the last
step in my decoder. The idea for a branchless UTF-8 decoder was
essentially born out of connecting these two thoughts, and then filling
in the middle.</p>
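<p>Those two ends can be sketched in a few lines. This is a simplified
illustration rather than the decoder from the linked article: it does no
validation, and it assumes the input buffer always has at least four
readable bytes (pad with zeros). The function and table names here are
placeholders.</p>

```c
#include <stdint.h>

/* Sketch: decode one UTF-8 code point from buf without branches.
 * Assumes at least four readable bytes at buf and valid input.
 * Returns the encoded length in bytes, or 0 when the first byte
 * cannot start a sequence (in which case *c is meaningless). */
static int
utf8_decode_sketch(const unsigned char *buf, uint32_t *c)
{
    /* First step: look up the length from the top five bits. */
    static const unsigned char lengths[32] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
    };
    static const unsigned char masks[5]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    static const unsigned char shifts[5] = {0, 18, 12, 6, 0};

    int len = lengths[buf[0] >> 3];

    /* Middle: assemble the bits as if the sequence were four bytes. */
    uint32_t v = (uint32_t)(buf[0] & masks[len]) << 18;
    v |= (uint32_t)(buf[1] & 0x3f) << 12;
    v |= (uint32_t)(buf[2] & 0x3f) << 6;
    v |= (uint32_t)(buf[3] & 0x3f);

    /* Last step: a data-dependent shift, chosen by the table lookup,
     * discards the bits contributed by bytes past the real length. */
    *c = v >> shifts[len];
    return len;
}
```

<p>Shorter sequences pick up junk from the following bytes in the low
bits, and the final shift throws exactly that junk away, which is why no
branch on the length is needed.</p>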

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Render Multimedia in Pure C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/11/03/"/>
    <id>urn:uuid:4b36dd78-e85d-3637-8cd5-e44a2d3e683a</id>
    <updated>2017-11-03T22:31:15Z</updated>
    <category term="c"/><category term="media"/><category term="trick"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><em>Update 2020</em>: I’ve produced <a href="/blog/2020/06/29/">many more examples</a> over the years
(<a href="https://github.com/skeeto/scratch/tree/master/animation">even more</a>).</p>

<p>In a previous article <a href="/blog/2017/07/02/">I demonstrated video filtering with C and a
unix pipeline</a>. Thanks to the ubiquitous support for the
ridiculously simple <a href="https://en.wikipedia.org/wiki/Netpbm_format">Netpbm formats</a> — specifically the “Portable
PixMap” (<code class="language-plaintext highlighter-rouge">.ppm</code>, <code class="language-plaintext highlighter-rouge">P6</code>) binary format — it’s trivial to parse and
produce image data in any language without image libraries. Video
decoders and encoders at the ends of the pipeline do the heavy lifting
of processing the complicated video formats actually used to store and
transmit video.</p>

<p>Naturally this same technique can be used to <em>produce</em> new video in a
simple program. All that’s needed are a few functions to render
artifacts — lines, shapes, etc. — to an RGB buffer. With a bit of
basic sound synthesis, the same concept can be applied to create audio
in a separate audio stream — in this case using the simple (but not as
simple as Netpbm) WAV format. Put them together and a small,
standalone program can create multimedia.</p>

<p>Here’s the demonstration video I’ll be going through in this article.
It animates and visualizes various in-place sorting algorithms (<a href="/blog/2016/09/05/">see
also</a>). The elements are rendered as colored dots, ordered by
hue, with red at 12 o’clock. A dot’s distance from the center is
proportional to its corresponding element’s distance from its correct
position. Each dot emits a sinusoidal tone with a unique frequency
when it swaps places in a particular frame.</p>

<p><a href="/video/?v=sort-circle"><img src="/img/sort-circle/video.png" alt="" /></a></p>

<p>Original credit for this visualization concept goes to <a href="https://www.youtube.com/watch?v=sYd_-pAfbBw">w0rthy</a>.</p>

<p>All of the source code (fewer than 600 lines of C), ready to run, can be
found here:</p>

<ul>
  <li><strong><a href="https://github.com/skeeto/sort-circle">https://github.com/skeeto/sort-circle</a></strong></li>
</ul>

<p>On any modern computer, rendering is real-time, even at 60 FPS, so you
may be able to pipe the program’s output directly into your media player
of choice. (If not, consider getting a better media player!)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./sort | mpv --no-correct-pts --fps=60 -
</code></pre></div></div>

<p>VLC requires some help from <a href="http://mjpeg.sourceforge.net/">ppmtoy4m</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./sort | ppmtoy4m -F60:1 | vlc -
</code></pre></div></div>

<p>Or you can just encode it to another format. Recent versions of
libavformat can input PPM images directly, which means <code class="language-plaintext highlighter-rouge">x264</code> can read
the program’s output directly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./sort | x264 --fps 60 -o video.mp4 /dev/stdin
</code></pre></div></div>

<p>By default there is no audio output. I wish there were a nice way to
embed audio with the video stream, but this requires a container and
that would destroy all the simplicity of this project. So instead, the
<code class="language-plaintext highlighter-rouge">-a</code> option captures the audio in a separate file. Use <code class="language-plaintext highlighter-rouge">ffmpeg</code> to
combine the audio and video into a single media file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./sort -a audio.wav | x264 --fps 60 -o video.mp4 /dev/stdin
$ ffmpeg -i video.mp4 -i audio.wav -vcodec copy -acodec mp3 \
         combined.mp4
</code></pre></div></div>

<p>You might think you’ll be clever by using <code class="language-plaintext highlighter-rouge">mkfifo</code> (i.e. a named pipe)
to pipe both audio and video into ffmpeg at the same time. This will
only result in a deadlock since neither program is prepared for this.
One will be blocked writing one stream while the other is blocked
reading on the other stream.</p>

<p>Several years ago <a href="/blog/2016/09/02/">my intern and I</a> used the exact same pure C
rendering technique to produce these raytracer videos:</p>

<p>
<video width="600" controls="controls">
  <source type="video/webm" src="https://skeeto.s3.amazonaws.com/netray/bigdemo_full.webm" />
</video>
</p>

<p>
<video width="600" controls="controls">
  <source type="video/webm" src="https://skeeto.s3.amazonaws.com/netray/bounce720.webm" />
</video>
</p>

<p>I also used this technique to <a href="/blog/2017/09/07/">illustrate gap buffers</a>.</p>

<h3 id="pixel-format-and-rendering">Pixel format and rendering</h3>

<p>This program really only has one purpose: rendering a sorting video
with a fixed, square resolution. So rather than write generic image
rendering functions, some assumptions are hard coded. For example, the
video is assumed to be square with a compile-time side length, which
keeps the code simpler and faster. I chose 800x800 as the default:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define S     800
</span></code></pre></div></div>

<p>Rather than define some sort of color struct with red, green, and blue
fields, color will be represented by a 24-bit integer (<code class="language-plaintext highlighter-rouge">unsigned long</code>). I
arbitrarily chose red to be the most significant 8 bits. This has
nothing to do with the order of the individual channels in Netpbm
since these integers are never dumped out. (This would have stupid
byte-order issues anyway.) “Color literals” are particularly
convenient and familiar in this format. For example, the constant for
pink: <code class="language-plaintext highlighter-rouge">0xff7f7fUL</code>.</p>

<p>In practice the color channels will be operated upon separately, so
here are a couple of helper functions to convert the channels between
this format and normalized floats (0.0–1.0).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">rgb_split</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">c</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">r</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">g</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">r</span> <span class="o">=</span> <span class="p">((</span><span class="n">c</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">/</span> <span class="mi">255</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
    <span class="o">*</span><span class="n">g</span> <span class="o">=</span> <span class="p">(((</span><span class="n">c</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xff</span><span class="p">)</span> <span class="o">/</span> <span class="mi">255</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="p">((</span><span class="n">c</span> <span class="o">&amp;</span> <span class="mh">0xff</span><span class="p">)</span> <span class="o">/</span> <span class="mi">255</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">long</span>
<span class="nf">rgb_join</span><span class="p">(</span><span class="kt">float</span> <span class="n">r</span><span class="p">,</span> <span class="kt">float</span> <span class="n">g</span><span class="p">,</span> <span class="kt">float</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">ir</span> <span class="o">=</span> <span class="n">roundf</span><span class="p">(</span><span class="n">r</span> <span class="o">*</span> <span class="mi">255</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">ig</span> <span class="o">=</span> <span class="n">roundf</span><span class="p">(</span><span class="n">g</span> <span class="o">*</span> <span class="mi">255</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">ib</span> <span class="o">=</span> <span class="n">roundf</span><span class="p">(</span><span class="n">b</span> <span class="o">*</span> <span class="mi">255</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">ir</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">ig</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="n">ib</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Originally I decided the integer form would be sRGB, and these
functions handled the conversion to and from sRGB. Since it had no
noticeable effect on the output video, I discarded it. In more
sophisticated rendering you may want to take this into account.</p>
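
<p>For reference, the conversion in question looks roughly like the
following. This is a sketch of the standard sRGB transfer functions
applied per channel, not the code that was discarded, and the function
names are made up for illustration.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;math.h&gt;

/* Decode one sRGB-encoded channel (0.0 to 1.0) to linear light. */
static float
srgb_to_linear(float c)
{
    if (c &lt;= 0.04045f)
        return c / 12.92f;
    return powf((c + 0.055f) / 1.055f, 2.4f);
}

/* Encode one linear-light channel (0.0 to 1.0) as sRGB. */
static float
linear_to_srgb(float c)
{
    if (c &lt;= 0.0031308f)
        return c * 12.92f;
    return 1.055f * powf(c, 1.0f / 2.4f) - 0.055f;
}
</code></pre></div></div>

<p>Under this scheme a splitting function would decode each channel after
dividing by 255, blending would happen on the linear values, and a
joining function would encode again just before rounding.</p>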

<p>The RGB buffer where images are rendered is just a plain old byte
buffer with the same pixel format as PPM. The <code class="language-plaintext highlighter-rouge">ppm_set()</code> function
writes a color to a particular pixel in the buffer, assumed to be <code class="language-plaintext highlighter-rouge">S</code>
by <code class="language-plaintext highlighter-rouge">S</code> pixels. The complement to this function is <code class="language-plaintext highlighter-rouge">ppm_get()</code>, which
will be needed for blending.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">ppm_set</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">color</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">buf</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">color</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">buf</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">color</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
    <span class="n">buf</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">color</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">long</span>
<span class="nf">ppm_get</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">r</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">0</span><span class="p">];</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">g</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">1</span><span class="p">];</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">b</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">+</span> <span class="mi">2</span><span class="p">];</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">r</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">g</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="n">b</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the buffer is already in the right format, writing an image is
dead simple. I like to flush after each frame so that observers
generally see clean, complete frames. It helps in debugging.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">ppm_write</span><span class="p">(</span><span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">fprintf</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="s">"P6</span><span class="se">\n</span><span class="s">%d %d</span><span class="se">\n</span><span class="s">255</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">S</span><span class="p">,</span> <span class="n">S</span><span class="p">);</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="n">S</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
    <span class="n">fflush</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="dot-rendering">Dot rendering</h3>

<p>If you zoom into one of those dots, you may notice it has a nice
smooth edge. Here’s one rendered at 30x the normal resolution. I did
not render, then scale this image in another piece of software. This
is straight out of the C program.</p>

<p><img src="/img/sort-circle/dot.png" alt="" /></p>

<p>In an early version of this program I used a dumb dot rendering
routine. It took a color and a hard, integer pixel coordinate. All the
pixels within a certain distance of this coordinate were set to the
color, everything else was left alone. This had two bad effects:</p>

<ul>
  <li>
    <p>Dots <em>jittered</em> as they moved around since their positions were
rounded to the nearest pixel for rendering. A dot would be centered on
one pixel, then suddenly centered on another pixel. This looked bad
even when those pixels were adjacent.</p>
  </li>
  <li>
    <p>There was no blending between dots when they overlapped, making the
lack of anti-aliasing even more pronounced.</p>
  </li>
</ul>

<video src="/img/sort-circle/flyby.mp4" loop="loop" autoplay="autoplay" width="600">
</video>

<p>Instead the dot’s position is computed in floating point and is
actually rendered as if it were between pixels. This is done with a
shader-like routine that uses <a href="https://en.wikipedia.org/wiki/Smoothstep">smoothstep</a> — just as <a href="/tags/opengl/">found in
shader languages</a> — to give the dot a smooth edge. That edge
is blended into the image, whether that’s the background or a
previously-rendered dot. The input to the smoothstep is the distance
from the floating point coordinate to the center (or corner?) of the
pixel being rendered, maintaining that between-pixel smoothness.</p>

<p>Rather than dump the whole function here, let’s look at it piece by
piece. I have two new constants to define the inner dot radius and the
outer dot radius. It’s smooth between these radii.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define R0    (S / 400.0f)  // dot inner radius
#define R1    (S / 200.0f)  // dot outer radius
</span></code></pre></div></div>

<p>The dot-drawing function takes the image buffer, the dot’s coordinates,
and its foreground color.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">ppm_dot</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">fgc</span><span class="p">);</span>
</code></pre></div></div>

<p>The first thing to do is extract the color components.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">float</span> <span class="n">fr</span><span class="p">,</span> <span class="n">fg</span><span class="p">,</span> <span class="n">fb</span><span class="p">;</span>
    <span class="n">rgb_split</span><span class="p">(</span><span class="n">fgc</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fg</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fb</span><span class="p">);</span>
</code></pre></div></div>

<p>Next determine the range of pixels over which the dot will be drawn.
These are based on the outer radius and will be used for looping.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">miny</span> <span class="o">=</span> <span class="n">floorf</span><span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">R1</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">maxy</span> <span class="o">=</span> <span class="n">ceilf</span><span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">R1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">minx</span> <span class="o">=</span> <span class="n">floorf</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">R1</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">maxx</span> <span class="o">=</span> <span class="n">ceilf</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">R1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s the loop structure. Everything else will be inside the innermost
loop. The <code class="language-plaintext highlighter-rouge">dx</code> and <code class="language-plaintext highlighter-rouge">dy</code> are the floating point distances from the center
of the dot.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">py</span> <span class="o">=</span> <span class="n">miny</span><span class="p">;</span> <span class="n">py</span> <span class="o">&lt;=</span> <span class="n">maxy</span><span class="p">;</span> <span class="n">py</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">float</span> <span class="n">dy</span> <span class="o">=</span> <span class="n">py</span> <span class="o">-</span> <span class="n">y</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">px</span> <span class="o">=</span> <span class="n">minx</span><span class="p">;</span> <span class="n">px</span> <span class="o">&lt;=</span> <span class="n">maxx</span><span class="p">;</span> <span class="n">px</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">float</span> <span class="n">dx</span> <span class="o">=</span> <span class="n">px</span> <span class="o">-</span> <span class="n">x</span><span class="p">;</span>
            <span class="cm">/* ... */</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Use the x and y distances to compute the distance and smoothstep
value, which will be the alpha. Within the inner radius the color is
at 100%. Outside the outer radius it’s 0%. Elsewhere it’s something in
between. The edges are passed in descending order (outer radius first)
so that alpha falls as the distance grows.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            <span class="kt">float</span> <span class="n">d</span> <span class="o">=</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">dy</span> <span class="o">*</span> <span class="n">dy</span> <span class="o">+</span> <span class="n">dx</span> <span class="o">*</span> <span class="n">dx</span><span class="p">);</span>
            <span class="kt">float</span> <span class="n">a</span> <span class="o">=</span> <span class="n">smoothstep</span><span class="p">(</span><span class="n">R1</span><span class="p">,</span> <span class="n">R0</span><span class="p">,</span> <span class="n">d</span><span class="p">);</span>
</code></pre></div></div>
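
<p>The <code class="language-plaintext highlighter-rouge">smoothstep()</code> function itself is not shown in this article. A
minimal sketch, assuming the usual clamped Hermite polynomial from shader
languages, might look like this. The repository’s definition may differ
in detail.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Clamped Hermite interpolation: 0 at edge0, 1 at edge1. With the
 * edges in descending order (edge0 &gt; edge1) the ramp is flipped, so
 * the result is 1 for x &lt;= edge1 and 0 for x &gt;= edge0. */
static float
smoothstep(float edge0, float edge1, float x)
{
    float t = (x - edge0) / (edge1 - edge0);
    if (t &lt; 0.0f) t = 0.0f;
    if (t &gt; 1.0f) t = 1.0f;
    return t * t * (3.0f - 2.0f * t);
}
</code></pre></div></div>

<p>With an outer radius of 4 and an inner radius of 2, a distance of 0
gives an alpha of 1, a distance of 3 gives 0.5, and a distance of 5
gives 0.</p>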

<p>Get the background color, extract its components, and blend the
foreground and background according to the computed alpha value. Finally
write the pixel back into the buffer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">bgc</span> <span class="o">=</span> <span class="n">ppm_get</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">px</span><span class="p">,</span> <span class="n">py</span><span class="p">);</span>
            <span class="kt">float</span> <span class="n">br</span><span class="p">,</span> <span class="n">bg</span><span class="p">,</span> <span class="n">bb</span><span class="p">;</span>
            <span class="n">rgb_split</span><span class="p">(</span><span class="n">bgc</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">br</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">bg</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">bb</span><span class="p">);</span>

            <span class="kt">float</span> <span class="n">r</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">fr</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">*</span> <span class="n">br</span><span class="p">;</span>
            <span class="kt">float</span> <span class="n">g</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">fg</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">*</span> <span class="n">bg</span><span class="p">;</span>
            <span class="kt">float</span> <span class="n">b</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">fb</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">*</span> <span class="n">bb</span><span class="p">;</span>
            <span class="n">ppm_set</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">px</span><span class="p">,</span> <span class="n">py</span><span class="p">,</span> <span class="n">rgb_join</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">));</span>
</code></pre></div></div>

<p>That’s all it takes to render a smooth dot anywhere in the image.</p>
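
<p>For anyone who wants to try it, here are the pieces above assembled
into one compilable unit. Two parts are assumptions rather than
something shown in this article: the <code class="language-plaintext highlighter-rouge">smoothstep()</code> definition, and the
clipping of the loop bounds to the image, which only matters if a dot is
drawn within a few pixels of the edge.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;math.h&gt;

#define S     800
#define R0    (S / 400.0f)  // dot inner radius
#define R1    (S / 200.0f)  // dot outer radius

static void
rgb_split(unsigned long c, float *r, float *g, float *b)
{
    *r = ((c &gt;&gt; 16) / 255.0f);
    *g = (((c &gt;&gt; 8) &amp; 0xff) / 255.0f);
    *b = ((c &amp; 0xff) / 255.0f);
}

static unsigned long
rgb_join(float r, float g, float b)
{
    unsigned long ir = roundf(r * 255.0f);
    unsigned long ig = roundf(g * 255.0f);
    unsigned long ib = roundf(b * 255.0f);
    return (ir &lt;&lt; 16) | (ig &lt;&lt; 8) | ib;
}

static void
ppm_set(unsigned char *buf, int x, int y, unsigned long color)
{
    buf[y * S * 3 + x * 3 + 0] = color &gt;&gt; 16;
    buf[y * S * 3 + x * 3 + 1] = color &gt;&gt;  8;
    buf[y * S * 3 + x * 3 + 2] = color &gt;&gt;  0;
}

static unsigned long
ppm_get(unsigned char *buf, int x, int y)
{
    unsigned long r = buf[y * S * 3 + x * 3 + 0];
    unsigned long g = buf[y * S * 3 + x * 3 + 1];
    unsigned long b = buf[y * S * 3 + x * 3 + 2];
    return (r &lt;&lt; 16) | (g &lt;&lt; 8) | b;
}

/* Assumed definition: clamped Hermite ramp, 0 at e0 and 1 at e1. */
static float
smoothstep(float e0, float e1, float x)
{
    float t = (x - e0) / (e1 - e0);
    if (t &lt; 0.0f) t = 0.0f;
    if (t &gt; 1.0f) t = 1.0f;
    return t * t * (3.0f - 2.0f * t);
}

static void
ppm_dot(unsigned char *buf, float x, float y, unsigned long fgc)
{
    float fr, fg, fb;
    rgb_split(fgc, &amp;fr, &amp;fg, &amp;fb);

    int miny = floorf(y - R1 - 1);
    int maxy = ceilf(y + R1 + 1);
    int minx = floorf(x - R1 - 1);
    int maxx = ceilf(x + R1 + 1);

    /* Assumption: clip to the image so edge dots stay in bounds. */
    if (miny &lt; 0) miny = 0;
    if (minx &lt; 0) minx = 0;
    if (maxy &gt; S - 1) maxy = S - 1;
    if (maxx &gt; S - 1) maxx = S - 1;

    for (int py = miny; py &lt;= maxy; py++) {
        float dy = py - y;
        for (int px = minx; px &lt;= maxx; px++) {
            float dx = px - x;
            float d = sqrtf(dy * dy + dx * dx);
            float a = smoothstep(R1, R0, d);

            unsigned long bgc = ppm_get(buf, px, py);
            float br, bg, bb;
            rgb_split(bgc, &amp;br, &amp;bg, &amp;bb);

            float r = a * fr + (1 - a) * br;
            float g = a * fg + (1 - a) * bg;
            float b = a * fb + (1 - a) * bb;
            ppm_set(buf, px, py, rgb_join(r, g, b));
        }
    }
}
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">ppm_dot(buf, 400.0f, 400.0f, 0xff7f7fUL)</code> on a zeroed <code class="language-plaintext highlighter-rouge">S</code> by
<code class="language-plaintext highlighter-rouge">S</code> buffer leaves a pink dot in the middle of an otherwise black image.</p>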

<h3 id="rendering-the-array">Rendering the array</h3>

<p>The array being sorted is just a global variable. This simplifies some
of the sorting functions since a few are implemented recursively. They
can call for a frame to be rendered without needing to pass the full
array. With the dot-drawing routine done, rendering a frame is easy:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define N     360           // number of dots
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">array</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">frame</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">S</span> <span class="o">*</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">3</span><span class="p">];</span>
    <span class="n">memset</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buf</span><span class="p">));</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">float</span> <span class="n">delta</span> <span class="o">=</span> <span class="n">abs</span><span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">N</span> <span class="o">/</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
        <span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="o">-</span><span class="n">sinf</span><span class="p">(</span><span class="n">i</span> <span class="o">*</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">*</span> <span class="n">PI</span> <span class="o">/</span> <span class="n">N</span><span class="p">);</span>
        <span class="kt">float</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="n">cosf</span><span class="p">(</span><span class="n">i</span> <span class="o">*</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">*</span> <span class="n">PI</span> <span class="o">/</span> <span class="n">N</span><span class="p">);</span>
        <span class="kt">float</span> <span class="n">r</span> <span class="o">=</span> <span class="n">S</span> <span class="o">*</span> <span class="mi">15</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">/</span> <span class="mi">32</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">-</span> <span class="n">delta</span><span class="p">);</span>
        <span class="kt">float</span> <span class="n">px</span> <span class="o">=</span> <span class="n">r</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">S</span> <span class="o">/</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
        <span class="kt">float</span> <span class="n">py</span> <span class="o">=</span> <span class="n">r</span> <span class="o">*</span> <span class="n">y</span> <span class="o">+</span> <span class="n">S</span> <span class="o">/</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">ppm_dot</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">px</span><span class="p">,</span> <span class="n">py</span><span class="p">,</span> <span class="n">hue</span><span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">]));</span>
    <span class="p">}</span>
    <span class="n">ppm_write</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The buffer is <code class="language-plaintext highlighter-rouge">static</code> since it will be rather large, especially if <code class="language-plaintext highlighter-rouge">S</code>
is cranked up. Otherwise it’s likely to overflow the stack. The
<code class="language-plaintext highlighter-rouge">memset()</code> fills it with black. If you wanted a different background
color, here’s where you change it.</p>
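
<p>Note that <code class="language-plaintext highlighter-rouge">memset()</code> writes the same value to every byte, so on its own
it can only produce shades of gray. For an arbitrary background, a small
fill loop does the job. The helper below, <code class="language-plaintext highlighter-rouge">ppm_fill()</code>, is a hypothetical
name and is not part of the original program.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define S     800

/* Set every pixel of an S by S buffer to one packed 0xRRGGBB color. */
static void
ppm_fill(unsigned char *buf, unsigned long color)
{
    for (long i = 0; i &lt; (long)S * S; i++) {
        buf[i * 3 + 0] = color &gt;&gt; 16;
        buf[i * 3 + 1] = color &gt;&gt;  8;
        buf[i * 3 + 2] = color &gt;&gt;  0;
    }
}
</code></pre></div></div>

<p>Replacing the <code class="language-plaintext highlighter-rouge">memset()</code> call with something like
<code class="language-plaintext highlighter-rouge">ppm_fill(buf, 0x102030UL)</code> would give a dark blue background, and the
dot blending works unchanged since it reads back whatever is already in
the buffer.</p>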

<p>For each element, compute its delta from the proper array position,
which becomes its distance from the center of the image. The angle is
based on its actual position. The <code class="language-plaintext highlighter-rouge">hue()</code> function (not shown in this
article) returns the color for the given element.</p>
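
<p>Since <code class="language-plaintext highlighter-rouge">hue()</code> is not shown, here is one plausible sketch of it: a walk
around the fully saturated color wheel in six linear segments, using
only integer math. It assumes values in the range 0 to <code class="language-plaintext highlighter-rouge">N - 1</code> and an
<code class="language-plaintext highlighter-rouge">N</code> divisible by 6. The real function in the repository may be written
differently.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define N     360           // number of dots

/* Map v in [0, N) to a fully saturated 0xRRGGBB color (a sketch). */
static unsigned long
hue(int v)
{
    int sector = N / 6;                              /* one sixth of the wheel */
    int h = (v % N) / sector;                        /* which sixth: 0 to 5 */
    unsigned long t = 255UL * (v % sector) / sector; /* rising channel */
    unsigned long q = 255UL - t;                     /* falling channel */
    switch (h) {
    case 0:  return 0xff0000UL | (t &lt;&lt; 8);    /* red to yellow */
    case 1:  return (q &lt;&lt; 16) | 0x00ff00UL;   /* yellow to green */
    case 2:  return 0x00ff00UL | t;           /* green to cyan */
    case 3:  return (q &lt;&lt; 8) | 0x0000ffUL;    /* cyan to blue */
    case 4:  return (t &lt;&lt; 16) | 0x0000ffUL;   /* blue to magenta */
    default: return 0xff0000UL | q;           /* magenta to red */
    }
}
</code></pre></div></div>

<p>Element 0 comes out pure red, which lands at 12 o’clock once the array
is sorted, and the colors cycle through green and blue back toward red
as the value approaches <code class="language-plaintext highlighter-rouge">N</code>.</p>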

<p>With the <code class="language-plaintext highlighter-rouge">frame()</code> function complete, all I need is a sorting function
that calls <code class="language-plaintext highlighter-rouge">frame()</code> at appropriate times. Here are a couple of
examples:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">shuffle</span><span class="p">(</span><span class="kt">int</span> <span class="n">array</span><span class="p">[</span><span class="n">N</span><span class="p">],</span> <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">rng</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">N</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">pcg32</span><span class="p">(</span><span class="n">rng</span><span class="p">)</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">swap</span><span class="p">(</span><span class="n">array</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">r</span><span class="p">);</span>
        <span class="n">frame</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">sort_bubble</span><span class="p">(</span><span class="kt">int</span> <span class="n">array</span><span class="p">[</span><span class="n">N</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">c</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">c</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">array</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
                <span class="n">swap</span><span class="p">(</span><span class="n">array</span><span class="p">,</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
                <span class="n">c</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">frame</span><span class="p">();</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">c</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
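<p>Other sorts can follow the same pattern. As a rough sketch, here is insertion sort emitting one frame per inserted element. To keep it self-contained, <code class="language-plaintext highlighter-rouge">frame_stub()</code> and <code class="language-plaintext highlighter-rouge">swap_stub()</code> are hypothetical stand-ins that only count frames and exchange elements, and the array length is passed as a parameter rather than taken from <code class="language-plaintext highlighter-rouge">N</code>.</p>

```c
static int nframes;  // stub state: counts frames instead of rendering

// Hypothetical stand-in for frame().
static void
frame_stub(void)
{
    nframes++;
}

// Hypothetical stand-in for swap(), without the audio bookkeeping.
static void
swap_stub(int *a, int i, int j)
{
    int tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
}

static void
sort_insertion(int *a, int n)
{
    for (int i = 1; i < n; i++) {
        // Sink element i into the sorted prefix one swap at a time.
        for (int j = i; j > 0 && a[j - 1] > a[j]; j--)
            swap_stub(a, j - 1, j);
        frame_stub();  // one frame per inserted element
    }
}
```

<p>Where the frame call goes is a pacing decision: inside the inner loop the animation would be slower and show every individual swap.</p>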

<h3 id="synthesizing-audio">Synthesizing audio</h3>

<p>To add audio I need to keep track of which elements were swapped in
this frame. When producing a frame I need to generate and mix tones
for each element that was swapped.</p>

<p>Notice the <code class="language-plaintext highlighter-rouge">swap()</code> function above? That’s not just for convenience.
That’s also how things are tracked for the audio.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="n">swaps</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">swap</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">[</span><span class="n">N</span><span class="p">],</span> <span class="kt">int</span> <span class="n">i</span><span class="p">,</span> <span class="kt">int</span> <span class="n">j</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">tmp</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
    <span class="n">a</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span>
    <span class="n">swaps</span><span class="p">[(</span><span class="n">a</span> <span class="o">-</span> <span class="n">array</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span>
    <span class="n">swaps</span><span class="p">[(</span><span class="n">a</span> <span class="o">-</span> <span class="n">array</span><span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Before we get ahead of ourselves I need to write a <a href="http://soundfile.sapp.org/doc/WaveFormat/">WAV header</a>.
Without getting into the purpose of each field, just note that the
header has 13 fields, followed immediately by 16-bit little endian PCM
samples. There will be only one channel (mono).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define HZ    44100         // audio sample rate
</span>
<span class="k">static</span> <span class="kt">void</span>
<span class="nf">wav_init</span><span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">emit_u32be</span><span class="p">(</span><span class="mh">0x52494646UL</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span> <span class="c1">// "RIFF"</span>
    <span class="n">emit_u32le</span><span class="p">(</span><span class="mh">0xffffffffUL</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span> <span class="c1">// file length</span>
    <span class="n">emit_u32be</span><span class="p">(</span><span class="mh">0x57415645UL</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span> <span class="c1">// "WAVE"</span>
    <span class="n">emit_u32be</span><span class="p">(</span><span class="mh">0x666d7420UL</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span> <span class="c1">// "fmt "</span>
    <span class="n">emit_u32le</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span>           <span class="n">f</span><span class="p">);</span> <span class="c1">// struct size</span>
    <span class="n">emit_u16le</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span>            <span class="n">f</span><span class="p">);</span> <span class="c1">// PCM</span>
    <span class="n">emit_u16le</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span>            <span class="n">f</span><span class="p">);</span> <span class="c1">// mono</span>
    <span class="n">emit_u32le</span><span class="p">(</span><span class="n">HZ</span><span class="p">,</span>           <span class="n">f</span><span class="p">);</span> <span class="c1">// sample rate (i.e. 44.1 kHz)</span>
    <span class="n">emit_u32le</span><span class="p">(</span><span class="n">HZ</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span>       <span class="n">f</span><span class="p">);</span> <span class="c1">// byte rate</span>
    <span class="n">emit_u16le</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span>            <span class="n">f</span><span class="p">);</span> <span class="c1">// block size</span>
    <span class="n">emit_u16le</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span>           <span class="n">f</span><span class="p">);</span> <span class="c1">// bits per sample</span>
    <span class="n">emit_u32be</span><span class="p">(</span><span class="mh">0x64617461UL</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span> <span class="c1">// "data"</span>
    <span class="n">emit_u32le</span><span class="p">(</span><span class="mh">0xffffffffUL</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span> <span class="c1">// byte length</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Rather than tackle the annoying problem of figuring out the total
length of the audio ahead of time, I just wave my hands and write the
maximum possible number of bytes (<code class="language-plaintext highlighter-rouge">0xffffffff</code>). Most software that
can read WAV files will understand this to mean the entire rest of the
file contains samples.</p>
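<p>The <code class="language-plaintext highlighter-rouge">emit_*</code> helpers used by the header aren’t shown here. A minimal sketch of how they might be written follows. The signatures are inferred from the calls above and are an assumption, not the article’s definitions. Writing one byte at a time keeps the output independent of the host’s byte order.</p>

```c
#include <stdio.h>

// Hypothetical byte-order helpers. Each writes an integer one byte at
// a time, so the result does not depend on the host's endianness.
static void
emit_u16le(unsigned v, FILE *f)
{
    fputc((int)((v >> 0) & 0xff), f);
    fputc((int)((v >> 8) & 0xff), f);
}

static void
emit_u32le(unsigned long v, FILE *f)
{
    fputc((int)((v >>  0) & 0xff), f);
    fputc((int)((v >>  8) & 0xff), f);
    fputc((int)((v >> 16) & 0xff), f);
    fputc((int)((v >> 24) & 0xff), f);
}

static void
emit_u32be(unsigned long v, FILE *f)
{
    fputc((int)((v >> 24) & 0xff), f);
    fputc((int)((v >> 16) & 0xff), f);
    fputc((int)((v >>  8) & 0xff), f);
    fputc((int)((v >>  0) & 0xff), f);
}
```

<p>The big endian variant is only a convenience for the four-character tags: <code class="language-plaintext highlighter-rouge">0x52494646</code> written most significant byte first comes out as the bytes <code class="language-plaintext highlighter-rouge">RIFF</code>.</p>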

<p>With the header out of the way all I have to do is write 1/60th of a
second worth of samples to this file each time a frame is produced.
That’s 735 samples (1,470 bytes) at 44.1kHz.</p>

<p>The simplest place to do audio synthesis is in <code class="language-plaintext highlighter-rouge">frame()</code> right after
rendering the image.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define FPS   60            // output framerate
#define MINHZ 20            // lowest tone
#define MAXHZ 1000          // highest tone
</span>
<span class="k">static</span> <span class="kt">void</span>
<span class="nf">frame</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="cm">/* ... rendering ... */</span>

    <span class="cm">/* ... synthesis ... */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With the largest tone frequency at 1kHz, <a href="https://en.wikipedia.org/wiki/Nyquist_frequency">Nyquist</a> says we only
need to sample at just over 2kHz. 8kHz is a very common sample rate and gives
plenty of headroom, making it a good choice. However, I found that
audio encoding software was a lot happier to accept the standard CD
sample rate of 44.1kHz, so I stuck with that.</p>

<p>The first thing to do is to allocate and zero a buffer for this
frame’s samples.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">nsamples</span> <span class="o">=</span> <span class="n">HZ</span> <span class="o">/</span> <span class="n">FPS</span><span class="p">;</span>
    <span class="k">static</span> <span class="kt">float</span> <span class="n">samples</span><span class="p">[</span><span class="n">HZ</span> <span class="o">/</span> <span class="n">FPS</span><span class="p">];</span>
    <span class="n">memset</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">samples</span><span class="p">));</span>
</code></pre></div></div>

<p>Next determine how many “voices” there are in this frame. This is used
to mix the samples by averaging them. If an element was swapped more
than once this frame, it’s a little louder than the others — i.e. it’s
played twice at the same time, in phase.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">voices</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">voices</span> <span class="o">+=</span> <span class="n">swaps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</code></pre></div></div>

<p>Here’s the most complicated part. I use <code class="language-plaintext highlighter-rouge">sinf()</code> to produce the
sinusoidal wave based on the element’s frequency. I also use a parabola
as an <em>envelope</em> to shape the beginning and ending of this tone so that
it fades in and fades out. Otherwise you get the nasty, high-frequency
“pop” sound as the wave is given a hard cutoff.</p>

<p><img src="/img/sort-circle/envelope.svg" alt="" /></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">swaps</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="kt">float</span> <span class="n">hz</span> <span class="o">=</span> <span class="n">i</span> <span class="o">*</span> <span class="p">(</span><span class="n">MAXHZ</span> <span class="o">-</span> <span class="n">MINHZ</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="n">N</span> <span class="o">+</span> <span class="n">MINHZ</span><span class="p">;</span>
            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">nsamples</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
                <span class="kt">float</span> <span class="n">u</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">-</span> <span class="n">j</span> <span class="o">/</span> <span class="p">(</span><span class="kt">float</span><span class="p">)(</span><span class="n">nsamples</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
                <span class="kt">float</span> <span class="n">parabola</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">-</span> <span class="p">(</span><span class="n">u</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">u</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
                <span class="kt">float</span> <span class="n">envelope</span> <span class="o">=</span> <span class="n">parabola</span> <span class="o">*</span> <span class="n">parabola</span> <span class="o">*</span> <span class="n">parabola</span><span class="p">;</span>
                <span class="kt">float</span> <span class="n">v</span> <span class="o">=</span> <span class="n">sinf</span><span class="p">(</span><span class="n">j</span> <span class="o">*</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">*</span> <span class="n">PI</span> <span class="o">/</span> <span class="n">HZ</span> <span class="o">*</span> <span class="n">hz</span><span class="p">)</span> <span class="o">*</span> <span class="n">envelope</span><span class="p">;</span>
                <span class="n">samples</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">swaps</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">v</span> <span class="o">/</span> <span class="n">voices</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>
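<p>To see the envelope in isolation, here is the same arithmetic pulled out into a function of the sample index. The name <code class="language-plaintext highlighter-rouge">envelope_at</code> is made up for this sketch. The envelope is zero at both ends of the frame and peaks at one in the middle.</p>

```c
#include <assert.h>

// The envelope from the synthesis loop as a function of the sample
// index j within a frame of nsamples samples.
static float
envelope_at(int j, int nsamples)
{
    assert(nsamples > 1 && j >= 0 && j < nsamples);
    float u = 1.0f - j / (float)(nsamples - 1);         // runs from 1 down to 0
    float parabola = 1.0f - (u * 2 - 1) * (u * 2 - 1);  // 0 at the ends, 1 mid-frame
    return parabola * parabola * parabola;              // cubing flattens the ends
}
```

<p>At 735 samples per frame, samples 0 and 734 are silent and sample 367 is at full volume.</p>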

<p>Finally I write out each sample as a signed 16-bit value. I flush the
frame audio just like I flushed the frame image, keeping them somewhat
in sync from an outsider’s perspective.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">nsamples</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">s</span> <span class="o">=</span> <span class="n">samples</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="mh">0x7fff</span><span class="p">;</span>
        <span class="n">emit_u16le</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">wav</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">fflush</span><span class="p">(</span><span class="n">wav</span><span class="p">);</span>
</code></pre></div></div>
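<p>Because the mix is an average of enveloped sine waves, each sample stays within [-1, 1] and the scaled value fits in 16 bits. As a sketch of that conversion on its own, with hypothetical names, here it is with a defensive clamp added. The clamp is my assumption for illustration and isn’t in the code above.</p>

```c
// Converts a sample in [-1, 1] to a signed 16-bit PCM value. The clamp
// is an extra safety net that the article's code does not have.
static int
to_pcm16(float x)
{
    if (x >  1.0f) x =  1.0f;
    if (x < -1.0f) x = -1.0f;
    return (int)(x * 0x7fff);
}

// The low 16 bits as they would be written to the file. Converting a
// negative int to unsigned yields its two's complement bit pattern.
static unsigned
pcm16_bits(int s)
{
    return (unsigned)s & 0xffffu;
}
```

<p>A full-scale negative sample of -32767 ends up in the file as <code class="language-plaintext highlighter-rouge">0x8001</code>, low byte first.</p>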

<p>Before returning, reset the swap counter for the next frame.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">memset</span><span class="p">(</span><span class="n">swaps</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">swaps</span><span class="p">));</span>
</code></pre></div></div>

<h3 id="font-rendering">Font rendering</h3>

<p>You may have noticed there was text rendered in the corner of the video
announcing the sort function. There’s font bitmap data in <code class="language-plaintext highlighter-rouge">font.h</code> which
gets sampled to render that text. It’s not terribly complicated, but
you’ll have to study the code on your own to see how that works.</p>

<h3 id="learning-more">Learning more</h3>

<p>This simple video rendering technique has served me well for some
years now. All it takes is a bit of knowledge about rendering. I
learned quite a bit just from watching <a href="https://www.youtube.com/user/handmadeheroarchive">Handmade Hero</a>, where
Casey writes a software renderer from scratch, then implements a
nearly identical renderer with OpenGL. The more I learn about
rendering, the better this technique works.</p>

<p>Before writing this post I spent some time experimenting with using a
media player as an interface to a game. For example, rather than render
the game using OpenGL or similar, render it as PPM frames and send it
to the media player to be displayed, just as game consoles drive
television sets. Unfortunately the latency is <em>horrible</em> — multiple
seconds — so that idea just doesn’t work. So while this technique is
fast enough for real time rendering, it’s no good for interaction.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>A Branchless UTF-8 Decoder</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/10/06/"/>
    <id>urn:uuid:d62a6a1f-0e34-325e-9196-d66a354bc9b1</id>
    <updated>2017-10-06T23:29:02Z</updated>
    <category term="c"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>This week I took a crack at writing a branchless <a href="https://tools.ietf.org/html/rfc3629">UTF-8</a> decoder:
a function that decodes a single UTF-8 code point from a byte stream
without any <code class="language-plaintext highlighter-rouge">if</code> statements, loops, short-circuit operators, or other
sorts of conditional jumps. You can find the source code here along
with a test suite and benchmark:</p>

<ul>
  <li><a href="https://github.com/skeeto/branchless-utf8">https://github.com/skeeto/branchless-utf8</a></li>
</ul>

<p>In addition to decoding the next code point, it detects any errors and
returns a pointer to the next code point. It’s the complete package.</p>

<p>Why branchless? Because high performance CPUs are pipelined. That is,
a single instruction is executed over a series of stages, and many
instructions are executed in overlapping time intervals, each at a
different stage.</p>

<p>The usual analogy is laundry. You can have more than one load of
laundry in process at a time because laundry is typically a pipelined
process. There’s a washing machine stage, dryer stage, and folding
stage. One load can be in the washer, a second in the dryer, and a
third being folded, all at once. This greatly increases throughput
because, under ideal circumstances with a full pipeline, an
instruction is completed each clock cycle despite any individual
instruction taking many clock cycles to complete.</p>

<p>Branches are the enemy of pipelines. The CPU can’t begin work on the
next instruction if it doesn’t know which instruction will be executed
next. It must finish computing the branch condition before it can
know. To deal with this, pipelined CPUs are also equipped with <em>branch
predictors</em>. A predictor guesses which branch will be taken and the CPU begins
executing instructions on that branch. The prediction is initially
made using static heuristics, and later those predictions are improved
<a href="http://www.agner.org/optimize/microarchitecture.pdf">by learning from previous behavior</a>. This even includes
predicting the number of iterations of a loop so that the final
iteration isn’t mispredicted.</p>

<p>A mispredicted branch has two dire consequences. First, all the
progress on the incorrect branch will need to be discarded. Second,
the pipeline will be flushed, and the CPU will be inefficient until
the pipeline fills back up with instructions on the correct branch.
With a sufficiently deep pipeline, it can easily be <strong>more efficient
to compute and discard an unneeded result than to avoid computing it
in the first place</strong>. Eliminating branches means eliminating the
hazards of misprediction.</p>
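<p>As a small illustration of computing and discarding, here’s a sketch of a branchless select. Both inputs are already computed, and a mask picks one. The function name is hypothetical, and a compiler is free to turn this back into a branch, or a branch into this, so treat it as a demonstration of the idea rather than a guarantee about the generated code.</p>

```c
#include <stdint.h>

// Returns a if cond is nonzero, otherwise b, with no conditional jump
// in the source. The unused input is simply masked away.
static uint32_t
select_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = -(uint32_t)(cond != 0);  // all ones or all zeros
    return (a & mask) | (b & ~mask);
}
```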

<p>Another hazard for pipelines is <em>dependencies</em>. If an instruction
depends on the result of a previous instruction, it may have to wait for
the previous instruction to make sufficient progress before it can
complete one of its stages. This is known as a <em>pipeline stall</em>, and it
is an important consideration in instruction set architecture (ISA)
design.</p>

<p>For example, on the x86-64 architecture, storing a 32-bit result in a
64-bit register will automatically clear the upper 32 bits of that
register. Any further use of that destination register cannot depend on
prior instructions since all bits have been set. This particular
optimization was missed in the design of the i386: Writing a 16-bit
result to a 32-bit register leaves the upper 16 bits intact, creating
false dependencies.</p>

<p>Dependency hazards are mitigated using <em>out-of-order execution</em>.
Rather than execute two dependent instructions back to back, which
would result in a stall, the CPU may instead execute an independent
instruction further away in between. A good compiler will also try to
spread out dependent instructions in its own instruction scheduling.</p>
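<p>Dependencies can sometimes be loosened in the source, too. In this sketch, with made-up function names, the first function forms one long chain where every addition waits on the previous one. The second splits the work into two independent chains that the CPU may overlap. Whether this helps in practice depends on the CPU and on what the compiler already does, so measure before relying on it.</p>

```c
#include <stddef.h>
#include <stdint.h>

// One accumulator: every addition depends on the one before it.
static uint32_t
sum_chain(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

// Two accumulators: the chains are independent of each other.
static uint32_t
sum_split(const uint32_t *v, size_t n)
{
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i + 0];
        s1 += v[i + 1];
    }
    if (i < n)
        s0 += v[i];  // odd element left over
    return s0 + s1;
}
```

<p>Unsigned addition wraps around and is associative, so both functions return the same result for any input.</p>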

<p>The effects of out-of-order execution are typically not visible to a
single thread, where everything will appear to have executed in order.
However, when multiple processes or threads can access the same memory,
<a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act/">out-of-order execution can be observed</a>. It’s one of the
many <a href="/blog/2014/09/02/">challenges of writing multi-threaded software</a>.</p>

<p>The focus of my UTF-8 decoder was to be branchless, but there was one
interesting dependency hazard that neither GCC nor Clang was able to
resolve themselves. More on that later.</p>

<h3 id="what-is-utf-8">What is UTF-8?</h3>

<p>Without getting into the history of it, you can generally think of
<a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a> as a method for encoding a series of 21-bit integers
(<em>code points</em>) into a stream of bytes.</p>

<ul>
  <li>
    <p>Shorter integers encode to fewer bytes than larger integers. The
shortest available encoding must be chosen, meaning there is one
canonical encoding for a given sequence of code points.</p>
  </li>
  <li>
    <p>Certain code points are off limits: <em>surrogate halves</em>. These are
code points <code class="language-plaintext highlighter-rouge">U+D800</code> through <code class="language-plaintext highlighter-rouge">U+DFFF</code>. Surrogates are used in UTF-16
to represent code points above U+FFFF and serve no purpose in UTF-8.
This has <a href="https://simonsapin.github.io/wtf-8/">interesting consequences</a> for pseudo-Unicode
strings, such as “wide” strings in the Win32 API, where surrogates may
appear unpaired. Such sequences cannot legally be represented in
UTF-8.</p>
  </li>
</ul>

<p>Keeping in mind these two rules, the entire format is summarized by
this table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>length byte[0]  byte[1]  byte[2]  byte[3]
1      0xxxxxxx
2      110xxxxx 10xxxxxx
3      1110xxxx 10xxxxxx 10xxxxxx
4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">x</code> placeholders are the bits of the encoded code point.</p>
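<p>To make the table concrete, here’s a sketch of the opposite direction, an encoder that fills in the <code class="language-plaintext highlighter-rouge">x</code> bits. The name <code class="language-plaintext highlighter-rouge">utf8_encode</code> and its interface are hypothetical, and it assumes the caller passes a valid code point.</p>

```c
#include <assert.h>

// Encodes code point c following the table and returns the number of
// bytes written (1 to 4). Assumes c is in range and not a surrogate.
static int
utf8_encode(unsigned char *out, long c)
{
    assert(c >= 0 && c <= 0x10ffff && !(c >= 0xd800 && c <= 0xdfff));
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    } else if (c < 0x800) {
        out[0] = (unsigned char)(0xc0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3f));
        return 2;
    } else if (c < 0x10000) {
        out[0] = (unsigned char)(0xe0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (c & 0x3f));
        return 3;
    } else {
        out[0] = (unsigned char)(0xf0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (c & 0x3f));
        return 4;
    }
}
```

<p>Picking the row by magnitude is what enforces the shortest-encoding rule. For example, U+20AC encodes to the three bytes <code class="language-plaintext highlighter-rouge">E2 82 AC</code>.</p>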

<p>UTF-8 has some really useful properties:</p>

<ul>
  <li>
    <p>It’s backwards compatible with ASCII, which never used the highest
bit.</p>
  </li>
  <li>
    <p>Sort order is preserved. Sorting a set of code point sequences has the
same result as sorting their UTF-8 encodings byte-wise.</p>
  </li>
  <li>
    <p>No additional zero bytes are introduced. In C we can continue using
null terminated <code class="language-plaintext highlighter-rouge">char</code> buffers, often without even realizing they
hold UTF-8 data.</p>
  </li>
  <li>
    <p>It’s self-synchronizing. A leading byte will never be mistaken for a
continuation byte. This allows for byte-wise substring searches,
meaning UTF-8 unaware functions like <code class="language-plaintext highlighter-rouge">strstr(3)</code> continue to work
without modification (except for normalization issues). It also
allows for unambiguous recovery of a damaged stream.</p>
  </li>
</ul>
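<p>The self-synchronizing property is easy to demonstrate. Continuation bytes always match <code class="language-plaintext highlighter-rouge">10xxxxxx</code>, so a reader dropped into the middle of a stream can skip them to find the next boundary, and code points can be counted without decoding anything. The function names in this sketch are hypothetical, and it assumes the input is otherwise valid UTF-8.</p>

```c
#include <assert.h>
#include <stddef.h>

// Continuation bytes are exactly those matching 10xxxxxx.
static int
is_continuation(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}

// Skips forward to the next byte that could begin a code point.
static const unsigned char *
utf8_resync(const unsigned char *s, const unsigned char *end)
{
    assert(s <= end);
    while (s < end && is_continuation(*s))
        s++;
    return s;
}

// Counts code points by counting the bytes that are not continuations.
static size_t
utf8_count(const unsigned char *s, const unsigned char *end)
{
    size_t n = 0;
    for (; s < end; s++)
        n += !is_continuation(*s);
    return n;
}
```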

<p>A straightforward approach to decoding might look something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">utf8_simple</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mh">0x80</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xe0</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x1f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xf0</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xe0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x0f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">3</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xf8</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xf0</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mh">0xf4</span><span class="p">))</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">18</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">4</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// invalid</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// skip this byte</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span> <span class="o">&gt;=</span> <span class="mh">0xd800</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">c</span> <span class="o">&lt;=</span> <span class="mh">0xdfff</span><span class="p">)</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// surrogate half</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It branches off on the highest bits of the leading byte, extracts all of
those <code class="language-plaintext highlighter-rouge">x</code> bits from each byte, concatenates those bits, checks if it’s a
surrogate half, and returns a pointer to the next character. (This
implementation does <em>not</em> check that the highest two bits of each
continuation byte are correct.)</p>

<p>The CPU must correctly predict the length of the code point or else it
will suffer a hazard. An incorrect guess will stall the pipeline and
slow down decoding.</p>

<p>In real world text this is probably not a serious issue. For the
English language, the encoded length is nearly always a single byte.
However, even for non-English languages, text is <a href="http://utf8everywhere.org/">usually accompanied
by markup from the ASCII range of characters</a>, and, overall,
the encoded lengths will still have consistency. As I said, the CPU
predicts branches based on the program’s previous behavior, so this
means it will temporarily learn some of the statistical properties of
the language being actively decoded. Pretty cool, eh?</p>

<p>Eliminating branches from the decoder side-steps any issues with
mispredicting encoded lengths. Only errors in the stream will cause
stalls. Since that’s probably the unusual case, the branch predictor
will be very successful by continually predicting success. That’s one
optimistic CPU.</p>

<h3 id="the-branchless-decoder">The branchless decoder</h3>

<p>Here’s the interface to my branchless decoder:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">utf8_decode</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">e</span><span class="p">);</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">void *</code> for the buffer so that it doesn’t care what type was
actually chosen to represent the buffer. It could be a <code class="language-plaintext highlighter-rouge">uint8_t</code>,
<code class="language-plaintext highlighter-rouge">char</code>, <code class="language-plaintext highlighter-rouge">unsigned char</code>, etc. Doesn’t matter. The decoder accesses it
only as bytes.</p>

<p>On the other hand, with this interface you’re forced to use <code class="language-plaintext highlighter-rouge">uint32_t</code>
to represent code points. You could always change the function to suit
your own needs, though.</p>

<p>Errors are returned in <code class="language-plaintext highlighter-rouge">e</code>. It’s zero for success and non-zero when an
error was detected, without any particular meaning for different values.
Error conditions are mixed into this integer, so a zero simply means the
absence of error.</p>

<p>This is where you could accuse me of “cheating” a little bit. The
caller probably wants to check for errors, and so <em>they</em> will have to
branch on <code class="language-plaintext highlighter-rouge">e</code>. It seems I’ve just smuggled the branches outside of the
decoder.</p>

<p>However, as I pointed out, unless you’re expecting lots of errors, the
real cost is branching on encoded lengths. Furthermore, the caller
could instead accumulate the errors: count them, or make the error
“sticky” by ORing all <code class="language-plaintext highlighter-rouge">e</code> values together. Neither of these requires a
branch. The caller could decode a huge stream and only check for
errors at the very end. The only branch would be the main loop (“are
we done yet?”), which is trivial to predict with high accuracy.</p>

<p>The first thing the function does is extract the encoded length of the
next code point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="n">lengths</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
        <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span>
    <span class="p">};</span>

    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">lengths</span><span class="p">[</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">3</span><span class="p">];</span>
</code></pre></div></div>

<p>Looking back to the UTF-8 table above, only the highest 5 bits determine
the length. That’s 32 possible values. The zeros are for invalid
prefixes. This will later cause a bit to be set in <code class="language-plaintext highlighter-rouge">e</code>.</p>

<p>With the length in hand, it can compute the position of the next code
point in the buffer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="n">len</span> <span class="o">+</span> <span class="o">!</span><span class="n">len</span><span class="p">;</span>
</code></pre></div></div>

<p>Originally this expression was the return value, computed at the very
end of the function. However, after inspecting the compiler’s assembly
output, I decided to move it up, and the result was a solid performance
boost. That’s because it spreads out dependent instructions. With the
address of the next code point known so early, <a href="https://www.youtube.com/watch?v=2EWejmkKlxs">the instructions that
decode the next code point can get started early</a>.</p>

<p>The reason for the <code class="language-plaintext highlighter-rouge">!len</code> is so that the pointer is advanced one byte
even in the face of an error (length of zero). Adding that <code class="language-plaintext highlighter-rouge">!len</code> is
actually somewhat costly, though I couldn’t figure out why.</p>

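<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">masks</span><span class="p">[]</span>  <span class="o">=</span> <span class="p">{</span><span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x7f</span><span class="p">,</span> <span class="mh">0x1f</span><span class="p">,</span> <span class="mh">0x0f</span><span class="p">,</span> <span class="mh">0x07</span><span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">shiftc</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>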

    <span class="o">*</span><span class="n">c</span>  <span class="o">=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">masks</span><span class="p">[</span><span class="n">len</span><span class="p">])</span> <span class="o">&lt;&lt;</span> <span class="mi">18</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">&gt;&gt;=</span> <span class="n">shiftc</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
</code></pre></div></div>

<p>This reads four bytes regardless of the actual length. Skipping the
extra reads would itself take a branch, so this can’t be helped. The unneeded bits are
shifted out based on the length. That’s all it takes to decode UTF-8
without branching.</p>

<p>One important consequence of always reading four bytes is that <strong>the
caller <em>must</em> zero-pad the buffer to at least four bytes</strong>. In practice,
this means padding the entire buffer with three zero bytes in case the last
character is a single byte.</p>

<p>The padding must be zero in order to detect errors. Otherwise the
padding might look like legal continuation bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">mins</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">4194304</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">2048</span><span class="p">,</span> <span class="mi">65536</span><span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">shifte</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="o">*</span><span class="n">e</span>  <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span> <span class="o">&lt;</span> <span class="n">mins</span><span class="p">[</span><span class="n">len</span><span class="p">])</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">((</span><span class="o">*</span><span class="n">c</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x1b</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">7</span><span class="p">;</span>  <span class="c1">// surrogate half?</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>       <span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">^=</span> <span class="mh">0x2a</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">&gt;&gt;=</span> <span class="n">shifte</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
</code></pre></div></div>

<p>The first line checks if the shortest encoding was used, setting a bit
in <code class="language-plaintext highlighter-rouge">e</code> if it wasn’t. For a length of 0, this always fails.</p>

<p>The second line checks for a surrogate half by checking for a certain
prefix.</p>

<p>The next three lines accumulate the highest two bits of each
continuation byte into <code class="language-plaintext highlighter-rouge">e</code>. Each should be the bits <code class="language-plaintext highlighter-rouge">10</code>. These bits are
“compared” to <code class="language-plaintext highlighter-rouge">101010</code> (<code class="language-plaintext highlighter-rouge">0x2a</code>) using XOR. The XOR clears these bits as
long as they exactly match.</p>

<p><img src="/img/diagram/utf8-bits.svg" alt="" /></p>

<p>Finally the continuation prefix bits that don’t matter are shifted out.</p>

<h3 id="the-goal">The goal</h3>

<p>My primary — and totally arbitrary — goal was to beat the performance of
<a href="http://bjoern.hoehrmann.de/utf-8/decoder/dfa/">Björn Höhrmann’s DFA-based decoder</a>. Under favorable (and
artificial) benchmark conditions I had moderate success. You can try it
out on your own system by cloning the repository and running <code class="language-plaintext highlighter-rouge">make
bench</code>.</p>

<p>With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the
DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.</p>

<p><em>Update</em>: <a href="https://github.com/skeeto/branchless-utf8/issues/1">Björn pointed out</a> that his site includes a faster
variant of his DFA decoder. It is only 10% slower than the branchless
decoder with GCC, and it’s 20% faster than the branchless decoder with
Clang. So, in a sense, it’s still faster on average, even on a
benchmark that favors a branchless decoder.</p>

<p>The benchmark operates very similarly to <a href="/blog/2017/09/21/">my PRNG shootout</a> (e.g.
<code class="language-plaintext highlighter-rouge">alarm(2)</code>). First a buffer is filled with random UTF-8 data, then the
decoder decodes it again and again until the alarm fires. The
measurement is the number of bytes decoded.</p>

<p>The number of errors is printed at the end (always 0) in order to force
errors to actually get checked for each code point. Otherwise the sneaky
compiler omits the error checking from the branchless decoder, making it
appear much faster than it really is — a serious letdown once I noticed
my error. Since the other decoder is a DFA and error checking is built
into its graph, the compiler can’t really omit its error checking.</p>

<p>I called this “favorable” because the buffer being decoded isn’t
anything natural. Each time a code point is generated, first a length is
chosen uniformly: 1, 2, 3, or 4. Then a code point that encodes to that
length is generated. The <strong>even distribution of lengths greatly favors a
branchless decoder</strong>. The random distribution inhibits branch
prediction. Real text has a far more favorable distribution.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">randchar</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">rand32</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">r</span> <span class="o">&amp;</span> <span class="mh">0x3</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">&gt;&gt;=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">r</span> <span class="o">%</span> <span class="mi">128</span><span class="p">;</span>
        <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">128</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">2048</span> <span class="o">-</span> <span class="mi">128</span><span class="p">);</span>
        <span class="k">case</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">2048</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">65536</span> <span class="o">-</span> <span class="mi">2048</span><span class="p">);</span>
        <span class="k">case</span> <span class="mi">4</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">65536</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">131072</span> <span class="o">-</span> <span class="mi">65536</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">abort</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Given the odd input zero-padding requirement and the artificial
parameters of the benchmark, despite the supposed 20% speed boost
under GCC, my branchless decoder is not really any better than the DFA
decoder in practice. It’s just a different approach. In practice I’d
prefer Björn’s DFA decoder.</p>

<p><em>Update</em>: Bryan Donlan has followed up with <a href="https://github.com/bdonlan/branchless-utf8/commit/3802d3b0e10ea16810dd40f8116243971ff7603d">a SIMD UTF-8 decoder</a>.</p>

<p><em>Update 2024</em>: NRK has followed up with <a href="https://nrk.neocities.org/articles/utf8-pext.html">a parallel extract decoder</a>.</p>

<p><em>Update 2025</em>: Charles Eckman followed up <a href="https://cceckman.com/writing/branchless-utf8-encoding/">sharing a branchless
encoder</a>, which inspired me to <a href="https://github.com/skeeto/scratch/blob/master/misc/utf8_branchless.c">give it a shot</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Finding the Best 64-bit Simulation PRNG</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/09/21/"/>
    <id>urn:uuid:637af55f-6e33-31e5-25fa-edb590a16d44</id>
    <updated>2017-09-21T21:25:00Z</updated>
    <category term="c"/><category term="compsci"/><category term="x86"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>August 2018 Update</strong>: <em>xoroshiro128+ fails <a href="http://pracrand.sourceforge.net/">PractRand</a> very
badly. Since this article was published, its authors have supplanted it
with <strong>xoshiro256**</strong>. It has essentially the same performance, but
better statistical properties. xoshiro256** is now my preferred PRNG.</em></p>

<p>I use pseudo-random number generators (PRNGs) a whole lot. They’re an
essential component in lots of algorithms and processes.</p>

<ul>
  <li>
    <p><strong>Monte Carlo simulations</strong>, where PRNGs are used to <a href="https://possiblywrong.wordpress.com/2015/09/15/kanoodle-iq-fit-and-dancing-links/">compute
numeric estimates</a> for problems that are difficult or impossible
to solve analytically.</p>
  </li>
  <li>
    <p><a href="/blog/2017/04/27/"><strong>Monte Carlo tree search AI</strong></a>, where massive numbers of games
are played out randomly in search of an optimal move. This is a
specific application of the last item.</p>
  </li>
  <li>
    <p><a href="https://github.com/skeeto/carpet-fractal-genetics"><strong>Genetic algorithms</strong></a>, where a PRNG creates the initial
population, and then later guides in mutation and breeding of selected
solutions.</p>
  </li>
  <li>
    <p><a href="https://blog.cr.yp.to/20140205-entropy.html"><strong>Cryptography</strong></a>, where cryptographically-secure PRNGs
(CSPRNGs) produce output that is predictable for recipients who know
a particular secret, but not for anyone else. This article is only
concerned with plain PRNGs.</p>
  </li>
</ul>

<p>For the first three “simulation” uses, there are two primary factors
that drive the selection of a PRNG. These factors can be at odds with
each other:</p>

<ol>
  <li>
    <p>The PRNG should be <em>very</em> fast. The application should spend its
time running the actual algorithms, not generating random numbers.</p>
  </li>
  <li>
    <p>PRNG output should have robust statistical qualities. Bits should
appear to be independent and the output should closely follow the
desired distribution. Poor quality output will negatively affect
the algorithms using it. Also just as important is <a href="http://mumble.net/~campbell/2014/04/28/uniform-random-float">how you use
it</a>, but this article will focus only on generating bits.</p>
  </li>
</ol>

<p>In other situations, such as in cryptography or online gambling,
another important property is that an observer can’t learn anything
meaningful about the PRNG’s internal state from its output. For the
three simulation cases I care about, this is not a concern. Only speed
and quality properties matter.</p>

<p>Depending on the programming language, the PRNGs found in various
standard libraries may be of dubious quality. They’re slower than they
need to be, or have poorer quality than required. In some cases, such
as <code class="language-plaintext highlighter-rouge">rand()</code> in C, the algorithm isn’t specified, and you can’t rely on
it for anything outside of trivial examples. In other cases the
algorithm and behavior <em>is</em> specified, but you could easily do better
yourself.</p>

<p>My preference is to BYOPRNG: <em>Bring Your Own Pseudo-random Number
Generator</em>. You get reliable, identical output everywhere. Also, in
the case of C and C++ — and if you do it right — by embedding the PRNG
in your project, it will get inlined and unrolled, making it far more
efficient than a <a href="/blog/2016/10/27/">slow call into a dynamic library</a>.</p>

<p>A fast PRNG is going to be small, making it a great candidate for
embedding as, say, a header library. That leaves just one important
question, “Can the PRNG be small <em>and</em> have high quality output?” In
the 21st century, the answer to this question is an emphatic “yes!”</p>

<p>For the past few years my main go-to for a drop-in PRNG has been
<a href="https://en.wikipedia.org/wiki/Xorshift">xorshift*</a>. The body of the function is 6 lines of C, and its
entire state is a 64-bit integer, directly seeded. However, there are a
number of choices here, including other variants of Xorshift. How do I
know which one is best? The only way to know is to test it, hence my
64-bit PRNG shootout:</p>

<ul>
  <li><a href="https://github.com/skeeto/prng64-shootout"><strong>64-bit PRNG Shootout</strong></a></li>
</ul>

<p>Sure, there <a href="http://xoroshiro.di.unimi.it/">are other such shootouts</a>, but they’re all missing
something I want to measure. I also want to test in an environment very
close to how I’d use these PRNGs myself.</p>

<h3 id="shootout-results">Shootout results</h3>

<p>Before getting into the details of the benchmark and each generator,
here are the results. These tests were run on an i7-6700 (Skylake)
running Linux 4.9.0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                               Speed (MB/s)
PRNG           FAIL  WEAK  gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline          X     X      15000       13100
blowfishcbc16     0     1        169         157
blowfishcbc4      0     5        725         676
blowfishctr16     1     3        187         184
blowfishctr4      1     5        890        1000
mt64              1     7       1700        1970
pcg64             0     4       4150        3290
rc4               0     5        366         185
spcg64            0     8       5140        4960
xoroshiro128+     0     6       8100        7720
xorshift128+      0     2       7660        6530
xorshift64*       0     3       4990        5060
</code></pre></div></div>

<p><strong>The clear winner is <a href="http://xoroshiro.di.unimi.it/">xoroshiro128+</a></strong>, with a function body of
just 7 lines of C. It’s clearly the fastest, and the output had no
observed statistical failures. However, that’s not the whole story. A
couple of the other PRNGs have advantages that situationally make
them better suited than xoroshiro128+. I’ll go over these in the
discussion below.</p>

<p>These two versions of GCC and Clang were chosen because these are the
latest available in Debian 9 “Stretch.” It’s easy to build and run the
benchmark yourself if you want to try a different version.</p>

<h3 id="speed-benchmark">Speed benchmark</h3>

<p>In the speed benchmark, the PRNG is initialized, a 1-second <code class="language-plaintext highlighter-rouge">alarm(1)</code>
is set, then the PRNG fills a large <code class="language-plaintext highlighter-rouge">volatile</code> buffer of 64-bit unsigned
integers again and again as quickly as possible until the alarm fires.
The amount of memory written is measured as the PRNG’s speed.</p>

<p>The baseline “PRNG” writes zeros into the buffer. This represents the
absolute speed limit that no PRNG can exceed.</p>

<p>The purpose for making the buffer <code class="language-plaintext highlighter-rouge">volatile</code> is to force the entire
output to actually be “consumed” as far as the compiler is concerned.
Otherwise the compiler plays nasty tricks to make the program do as
little work as possible. Another way to deal with this would be to
<code class="language-plaintext highlighter-rouge">write(2)</code> the buffer, but of course I didn’t want to introduce
unnecessary I/O into a benchmark.</p>

<p>On Linux, SIGALRM was impressively consistent between runs, meaning it
was perfectly suitable for this benchmark. To account for any process
scheduling wonkiness, the benchmark was run 8 times and only the
fastest run was kept.</p>

<p>The SIGALRM handler sets a <code class="language-plaintext highlighter-rouge">volatile</code> global variable that tells the
generator to stop. The PRNG call was unrolled 8 times to keep the
alarm check from significantly impacting the benchmark. You can see
the effect for yourself by changing <code class="language-plaintext highlighter-rouge">UNROLL</code> to 1 (i.e. “don’t
unroll”) in the code. Unrolling beyond 8 times had no measurable
effect in my tests.</p>
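
<p>As a rough sketch of the loop described above (not the actual benchmark
source), the pieces fit together something like this. The buffer length
<code class="language-plaintext highlighter-rouge">BUFLEN</code> and the names <code class="language-plaintext highlighter-rouge">bench</code>, <code class="language-plaintext highlighter-rouge">on_alarm</code>, and <code class="language-plaintext highlighter-rouge">running</code> are my
own assumptions for illustration, and the inner loop merely relies on the
compiler to unroll a fixed trip count:</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define BUFLEN (1 << 16)  /* assumed buffer length in 64-bit words */
#define UNROLL 8          /* outputs generated between alarm checks */

static volatile sig_atomic_t running;
static volatile uint64_t buffer[BUFLEN];

/* SIGALRM handler: tell the generator loop to stop. */
static void
on_alarm(int signum)
{
    (void)signum;
    running = 0;
}

/* xoroshiro128+ as listed in this article, repeated so the sketch
 * stands alone. */
static uint64_t
xoroshiro128plus(uint64_t s[2])
{
    uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    uint64_t result = s0 + s1;
    s1 ^= s0;
    s[0] = ((s0 << 55) | (s0 >> 9)) ^ s1 ^ (s1 << 14);
    s[1] = (s1 << 36) | (s1 >> 28);
    return result;
}

/* Fill the volatile buffer over and over until the alarm fires, then
 * return the number of bytes written. */
static uint64_t
bench(uint64_t s[2], unsigned seconds)
{
    uint64_t nbytes = 0;
    size_t i = 0;
    running = 1;
    signal(SIGALRM, on_alarm);
    alarm(seconds);
    while (running) {
        /* Fixed trip count: the flag is checked once per UNROLL outputs. */
        for (size_t j = 0; j < UNROLL; j++) {
            buffer[i + j] = xoroshiro128plus(s);
        }
        i = (i + UNROLL) % BUFLEN;
        nbytes += UNROLL * sizeof(buffer[0]);
    }
    return nbytes;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">bench(s, 1)</code> with a seeded state returns the bytes written in
roughly one second, which is the figure reported as the PRNG’s speed.</p>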

<p>Due to the PRNGs being inlined, this unrolling makes the benchmark
less realistic, and it shows in the results. Using <code class="language-plaintext highlighter-rouge">volatile</code> for the
buffer helped to counter this effect and reground the results. This is
a fuzzy problem, and there’s not really any way to avoid it, but I
will also discuss this below.</p>

<h3 id="statistical-benchmark">Statistical benchmark</h3>

<p>To measure the statistical quality of each PRNG — mostly as a sanity
check — the raw binary output was run through <a href="http://webhome.phy.duke.edu/~rgb/General/dieharder.php">dieharder</a> 3.31.1:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prng | dieharder -g200 -a -m4
</code></pre></div></div>

<p>This statistical analysis has no timing characteristics and the
results should be the same everywhere. You would only need to re-run
it to test with a different version of dieharder, or a different
analysis tool.</p>
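
<p>For anyone wanting to repeat this with another generator, the <code class="language-plaintext highlighter-rouge">prng</code>
side of that pipe only needs to write raw 64-bit outputs to standard
output. Here’s a minimal sketch of the idea, using xorshift64* from later
in this article. The function name <code class="language-plaintext highlighter-rouge">dump_raw</code> and the 1,024-word block
size are assumptions of mine, and the words are written in native byte
order:</p>

```c
#include <stdint.h>
#include <stdio.h>

/* xorshift64* as listed later in this article. */
static uint64_t
xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}

/* Write nblocks blocks of raw generator output to the stream.
 * Returns 1 on success, 0 if a write fails (e.g. the reader went away). */
static int
dump_raw(FILE *out, uint64_t s[1], long nblocks)
{
    uint64_t buf[1024];
    for (long n = 0; n < nblocks; n++) {
        for (int i = 0; i < 1024; i++) {
            buf[i] = xorshift64star(s);
        }
        if (fwrite(buf, sizeof(buf), 1, out) != 1) {
            return 0;
        }
    }
    return 1;
}
```

<p>A real driver would call this on <code class="language-plaintext highlighter-rouge">stdout</code> in a loop until the write
fails, which happens once dieharder has read all it wants and exits.</p>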

<p>There’s not much information to glean from this part of the shootout.
It mostly confirms that all of these PRNGs would work fine for
simulation purposes. The WEAK results are not very significant and are
only useful for breaking ties. Even a true RNG will get some WEAK
results. For example, the <a href="https://en.wikipedia.org/wiki/RdRand">x86 RDRAND</a> instruction (not
included in the actual shootout) got 7 WEAK results in my tests.</p>

<p>The FAIL results are more significant, but a single failure doesn’t
mean much. A non-failing PRNG should be preferred to an otherwise
equal PRNG with a failure.</p>

<h3 id="individual-prngs">Individual PRNGs</h3>

<p>Admittedly the definition for “64-bit PRNG” is rather vague. My high
performance targets are all 64-bit platforms, so the highest PRNG
throughput will be built on 64-bit operations (<a href="/blog/2015/07/10/">if not wider</a>).
The original plan was to focus on PRNGs built from 64-bit operations.</p>

<p>Curiosity got the best of me, so I included some PRNGs that don’t use
<em>any</em> 64-bit operations. I just wanted to see how they stacked up.</p>

<h4 id="blowfish">Blowfish</h4>

<p>One of the reasons I <a href="/blog/2017/09/15/">wrote a Blowfish implementation</a> was to
evaluate its performance and statistical qualities, so naturally I
included it in the benchmark. It only uses 32-bit addition and 32-bit
XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit
integer. There are two different properties that combine to make four
variants in the benchmark: number of rounds and block mode.</p>

<p>Blowfish normally uses 16 rounds. This makes it a lot slower than a
non-cryptographic PRNG but gives it a <em>security margin</em>. I don’t care
about the security margin, so I included a 4-round variant. As
expected, it’s about four times faster.</p>

<p>The other feature I tested is the block mode: <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#CBC">Cipher Block
Chaining</a> (CBC) versus <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29">Counter</a> (CTR) mode. In CBC mode it
encrypts zeros as plaintext. This just means it’s encrypting its last
output. The ciphertext is the PRNG’s output.</p>

<p>In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster
than CBC in the 16-round variant and 23% faster in the 4-round variant.
The reason is simple, and it’s in part an artifact of unrolling the
generation loop in the benchmark.</p>

<p>In CBC mode, each output depends on the previous, but in CTR mode all
blocks are independent. Work can begin on the next output before the
previous output is complete. The x86 architecture uses out-of-order
execution to achieve many of its performance gains: Instructions may
be executed in a different order than they appear in the program,
though their observable effects must <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act/">generally be ordered
correctly</a>. Breaking dependencies between instructions allows
out-of-order execution to be fully exercised. It also gives the
compiler more freedom in instruction scheduling, though the <code class="language-plaintext highlighter-rouge">volatile</code>
accesses cannot be reordered with respect to each other (hence it
helping to reground the benchmark).</p>
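
<p>To make the dependency difference concrete, here’s a small sketch of the
two modes used as PRNGs. The <code class="language-plaintext highlighter-rouge">toy_encrypt</code> function is a hypothetical
stand-in and is <em>not</em> Blowfish: it’s just a cheap invertible mixing
function so that the example is self-contained. The struct and function
names are likewise my own:</p>

```c
#include <stdint.h>

/* Stand-in for a keyed 64-bit block cipher. NOT Blowfish and not
 * secure; it only gives the two modes something to call. */
static uint64_t
toy_encrypt(uint64_t key, uint64_t block)
{
    block ^= key;
    block ^= block >> 33;
    block *= UINT64_C(0xff51afd7ed558ccd);
    block ^= block >> 33;
    return block;
}

struct cbc_prng { uint64_t key, prev; };
struct ctr_prng { uint64_t key, counter; };

/* CBC with all-zero plaintext: zero XOR the previous ciphertext is just
 * the previous ciphertext, so each output is the encrypted last output.
 * The next call cannot start until this one finishes. */
static uint64_t
cbc_next(struct cbc_prng *g)
{
    g->prev = toy_encrypt(g->key, 0 ^ g->prev);
    return g->prev;
}

/* CTR: each output depends only on the counter value, never on a prior
 * output, so consecutive calls are independent of each other. */
static uint64_t
ctr_next(struct ctr_prng *g)
{
    return toy_encrypt(g->key, g->counter++);
}
```

<p>In the unrolled benchmark loop, eight back-to-back <code class="language-plaintext highlighter-rouge">ctr_next</code> calls
have no data dependency beyond the trivial counter increment, while
eight <code class="language-plaintext highlighter-rouge">cbc_next</code> calls form a serial chain.</p>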

<p>Statistically, the 4-round cipher was not significantly worse than the
16-round cipher. For simulation purposes the 4-round cipher would be
perfectly sufficient, though xoroshiro128+ is still more than 9 times
faster without sacrificing quality.</p>

<p>On the other hand, CTR mode had a single failure in both the 4-round
(dab_filltree2) and 16-round (dab_filltree) variants. At least for
Blowfish, is there something that makes CTR mode less suitable than CBC
mode as a PRNG?</p>

<p>In the end Blowfish is too slow and too complicated to serve as a
simulation PRNG. This was entirely expected, but it’s interesting to see
how it stacks up.</p>

<h4 id="mersenne-twister-mt19937-64">Mersenne Twister (MT19937-64)</h4>

<p>Nobody ever got fired for choosing <a href="https://en.wikipedia.org/wiki/Mersenne_Twister">Mersenne Twister</a>. It’s the
classical choice for simulations, and is still usually recommended to
this day. However, Mersenne Twister’s best days are behind it. I
tested the 64-bit variant, MT19937-64, and there are four problems:</p>

<ul>
  <li>
    <p>It’s between 1/4 and 1/5 the speed of xoroshiro128+.</p>
  </li>
  <li>
    <p>It’s got a large state: 2,500 bytes, versus xoroshiro128+’s 16 bytes.</p>
  </li>
  <li>
    <p>Its implementation is three times bigger than xoroshiro128+, and much
more complicated.</p>
  </li>
  <li>
    <p>It had one statistical failure (dab_filltree2).</p>
  </li>
</ul>

<p>Curiously my implementation is 16% faster with Clang than GCC. Since
Mersenne Twister isn’t seriously in the running, I didn’t take time to
dig into why.</p>

<p>Ultimately I would never choose Mersenne Twister for anything anymore.
This was also not surprising.</p>

<h4 id="permuted-congruential-generator-pcg">Permuted Congruential Generator (PCG)</h4>

<p>The <a href="http://www.pcg-random.org/">Permuted Congruential Generator</a> (PCG) has some really
interesting history behind it, particularly with its somewhat <a href="http://www.pcg-random.org/paper.html">unusual
paper</a>, controversial for both its excessive length (58 pages)
and informal style. It’s in close competition with Xorshift and
xoroshiro128+. I was really interested in seeing how it stacked up.</p>

<p>PCG is really just a Linear Congruential Generator (LCG) that doesn’t
output the lowest bits (too poor quality), and has an extra
permutation step to make up for the LCG’s other weaknesses. I included
two variants in my benchmark: the official PCG and a “simplified” PCG
(sPCG) with a simple permutation step. sPCG is just the first PCG
presented in the paper (34 pages in!).</p>

<p>Here’s essentially what the simplified version looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">spcg32</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="n">shift</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The third line with the modular multiplication and addition is the
LCG. The bit shift is the permutation. This PCG uses the most
significant three bits of the result to determine which 32 bits to
output. That’s <em>the</em> novel component of PCG.</p>

<p>The two constants are entirely my own devising. They’re two 64-bit primes
generated using Emacs’ <code class="language-plaintext highlighter-rouge">M-x calc</code>: <code class="language-plaintext highlighter-rouge">2 64 ^ k r k n k p k p k p</code>.</p>

<p>Heck, that’s so simple that I could easily memorize this and code it
from scratch on demand. Key takeaway: This is <strong>one way that PCG is
situationally better than xoroshiro128+</strong>. In a pinch I could use Emacs
to generate a couple of primes and code the rest from memory. If you
participate in coding competitions, take note.</p>

<p>However, you probably also noticed PCG only generates 32-bit integers
despite using 64-bit operations. To properly generate a 64-bit value
we’d need 128-bit operations, which would need to be implemented in
software.</p>

<p>Instead, I doubled up on everything to run two PRNGs in parallel.
Despite the doubling in state size, the period doesn’t get any larger
since the PRNGs don’t interact with each other. We get something in
return, though. Remember what I said about out-of-order execution?
Except for the last step combining their results, since the two PRNGs
are independent, doubling up shouldn’t <em>quite</em> halve the performance,
particularly with the benchmark loop unrolling business.</p>

<p>Here’s my doubled-up version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">spcg64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span>  <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a0</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a1</span> <span class="o">=</span> <span class="mh">0x8b260b70b8e98891</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">p0</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">p1</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">p0</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a0</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p1</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">r0</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="n">p0</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">r1</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="n">p1</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="kt">uint64_t</span> <span class="n">high</span> <span class="o">=</span> <span class="n">p0</span> <span class="o">&gt;&gt;</span> <span class="n">r0</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">low</span>  <span class="o">=</span> <span class="n">p1</span> <span class="o">&gt;&gt;</span> <span class="n">r1</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">high</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span><span class="p">)</span> <span class="o">|</span> <span class="n">low</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The “full” PCG has some extra shifts that make it 25% (GCC) to 50%
(Clang) slower than the “simplified” PCG, but it does halve the WEAK
results.</p>

<p>In this 64-bit form, both are significantly slower than xoroshiro128+.
However, if you find yourself only needing 32 bits at a time (always
throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is
faster than using xoroshiro128+ and throwing away half its output.</p>

<h4 id="rc4">RC4</h4>

<p>This is another CSPRNG where I was curious how it would stack up. It
only uses 8-bit operations, and it generates a 64-bit integer one byte
at a time. It’s the slowest after 16-round Blowfish and generally not
useful as a simulation PRNG.</p>

<h4 id="xoroshiro128">xoroshiro128+</h4>

<p>xoroshiro128+ is the obvious winner in this benchmark and it seems to be
the best 64-bit simulation PRNG available. If you need a fast, quality
PRNG, just drop these 11 lines into your C or C++ program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xoroshiro128plus</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">s0</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">s1</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">result</span> <span class="o">=</span> <span class="n">s0</span> <span class="o">+</span> <span class="n">s1</span><span class="p">;</span>
    <span class="n">s1</span> <span class="o">^=</span> <span class="n">s0</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">s0</span> <span class="o">&lt;&lt;</span> <span class="mi">55</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">s0</span> <span class="o">&gt;&gt;</span> <span class="mi">9</span><span class="p">))</span> <span class="o">^</span> <span class="n">s1</span> <span class="o">^</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&lt;&lt;</span> <span class="mi">14</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&lt;&lt;</span> <span class="mi">36</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&gt;&gt;</span> <span class="mi">28</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s one important caveat: <strong>That 16-byte state must be
well-seeded.</strong> Having lots of zero bytes will lead to <em>terrible</em> initial
output until the generator mixes it all up. Having all zero bytes will
completely break the generator. If you’re going to seed from, say, the
Unix epoch, then XOR it with 16 static random bytes.</p>
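
<p>A minimal sketch of that seeding advice might look like the following.
The function name and both mask constants are placeholders I made up for
illustration, so substitute 16 random bytes of your own:</p>

```c
#include <stdint.h>

/* Placeholder masks standing in for 16 static random bytes. These
 * particular values are arbitrary; generate your own. */
#define SEED_MASK0 UINT64_C(0x3a8f05c27e9d41b6)
#define SEED_MASK1 UINT64_C(0xc5172d9e60b4f83a)

/* Fold a low-entropy seed such as the Unix epoch into the 16-byte
 * state. Since the second word is a nonzero constant, the state can
 * never end up all zeros. */
static void
xoroshiro128plus_seed(uint64_t s[2], uint64_t epoch)
{
    s[0] = SEED_MASK0 ^ epoch;
    s[1] = SEED_MASK1;
}
```

<p>Typical usage would be <code class="language-plaintext highlighter-rouge">xoroshiro128plus_seed(s, (uint64_t)time(0))</code>
with <code class="language-plaintext highlighter-rouge">time.h</code> included. Keep in mind this only avoids the mostly-zero
state problem. Two runs started in the same second still get the same
sequence.</p>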

<h4 id="xorshift128-and-xorshift64">xorshift128+ and xorshift64*</h4>

<p>These generators are closely related and, like I said, xorshift64* was
what I used for years. Looks like it’s time to retire it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xorshift64star</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">25</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x2545f4914f6cdd1d</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will
tolerate weak seeding so long as it’s not literally zero. Zero will also
break this generator.</p>

<p>If it weren’t for xoroshiro128+, then xorshift128+ would have been the
winner of the benchmark and my new favorite choice.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xorshift128plus</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">y</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">23</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">^</span> <span class="n">y</span> <span class="o">^</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">y</span> <span class="o">&gt;&gt;</span> <span class="mi">26</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a lot like xoroshiro128+, including the need to be well-seeded,
but it’s just slow enough to lose out. There’s no reason to use
xorshift128+ instead of xoroshiro128+.</p>

<h3 id="conclusion">Conclusion</h3>

<p>My own takeaway (until I re-evaluate some years in the future):</p>

<ul>
  <li>The best 64-bit simulation PRNG is xoroshiro128+.</li>
  <li>“Simplified” PCG can be useful in a pinch.</li>
  <li>When only 32-bit integers are necessary, use PCG.</li>
</ul>

<p>Things can change significantly between platforms, though. Here’s the
shootout on an ARM Cortex-A53:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                    Speed (MB/s)
PRNG         gcc-5.4.0   clang-3.8.0
------------------------------------
baseline          2560        2400
blowfishcbc16       36.5        45.4
blowfishcbc4       135         173
blowfishctr16       36.4        45.2
blowfishctr4       133         168
mt64               207         254
pcg64              980         712
rc4                 96.6        44.0
spcg64            1021         948
xoroshiro128+     2560        1570
xorshift128+      2560        1520
xorshift64*       1360        1080
</code></pre></div></div>

<p>LLVM is not as mature on this platform, but, with GCC, both
xoroshiro128+ and xorshift128+ matched the baseline! It seems memory
is the bottleneck.</p>

<p>So don’t necessarily take my word for it. You can run this shootout in
your own environment — perhaps even tossing in more PRNGs — to find
what’s appropriate for your own situation.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Blowpipe: a Blowfish-encrypted, Authenticated Pipe</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/09/15/"/>
    <id>urn:uuid:1cddecb9-44b1-346c-ded6-c099069ce013</id>
    <updated>2017-09-15T23:59:59Z</updated>
    <category term="crypto"/><category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p><a href="https://github.com/skeeto/blowpipe"><strong>Blowpipe</strong></a> is a <em>toy</em> crypto tool that creates a
<a href="https://www.schneier.com/academic/blowfish/">Blowfish</a>-encrypted pipe. It doesn’t open any files and instead
encrypts and decrypts from standard input to standard output. This
pipe can encrypt individual files or even encrypt a network
connection (à la netcat).</p>

<p>Most importantly, since Blowpipe is intended to be used as a pipe
(duh), it will <em>never</em> output decrypted plaintext that hasn’t been
<em>authenticated</em>. That is, it will detect tampering of the encrypted
stream and truncate its output, reporting an error, without producing
the manipulated data. Some very similar tools that <em>aren’t</em> considered
toys lack this important feature, such as <a href="http://loop-aes.sourceforge.net/aespipe.README">aespipe</a>.</p>

<h3 id="purpose">Purpose</h3>

<p>Blowpipe came about because I wanted to study Blowfish, a 64-bit block
cipher designed by Bruce Schneier in 1993. It’s played an important
role in the history of cryptography and has withstood cryptanalysis
for 24 years. Its major weakness is its small block size, leaving it
vulnerable to birthday attacks regardless of any other property of the
cipher. Even in 1993 the 64-bit block size was a bit on the small
side, but Blowfish was intended as a drop-in replacement for the Data
Encryption Standard (DES) and the International Data Encryption
Algorithm (IDEA), other 64-bit block ciphers.</p>

<p>The main reason I’m calling this program a toy is that, outside of
legacy interfaces, it’s simply <a href="https://sweet32.info/">not appropriate to deploy a 64-bit
block cipher in 2017</a>. Blowpipe shouldn’t be used to encrypt
more than a few tens of GBs of data at a time. Otherwise I’m <em>fairly</em>
confident in both my message construction and my implementation. One
detail is a little uncertain, and I’ll discuss it later when
describing message format.</p>

<p>A tool that I <em>am</em> confident about is <a href="https://github.com/skeeto/enchive">Enchive</a>, though since
it’s <a href="/blog/2017/03/12/">intended for file encryption</a>, it’s not appropriate for use
as a pipe. It doesn’t authenticate until after it has produced most of
its output. Enchive does try its best to delete files containing
unauthenticated output when authentication fails, but this doesn’t
prevent you from consuming this output before it can be deleted,
particularly if you pipe the output into another program.</p>

<h3 id="usage">Usage</h3>

<p>As you might expect, there are two modes of operation: encryption (<code class="language-plaintext highlighter-rouge">-E</code>)
and decryption (<code class="language-plaintext highlighter-rouge">-D</code>). The simplest usage is encrypting and decrypting a
file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ blowpipe -E &lt; data.gz &gt; data.gz.enc
$ blowpipe -D &lt; data.gz.enc | gunzip &gt; data.txt
</code></pre></div></div>

<p>In both cases you will be prompted for a passphrase which can be up to
72 bytes in length. The only verification for the key is the first
Message Authentication Code (MAC) in the datastream, so Blowpipe
cannot tell the difference between damaged ciphertext and an incorrect
key.</p>

<p>In a script it would be smart to check Blowpipe’s exit code after
decrypting. The output will be truncated should authentication fail
somewhere in the middle. Since Blowpipe isn’t aware of files, it can’t
clean up for you.</p>

<p>Another use case is securely transmitting files over a network with
netcat. In this example I’ll use a pre-shared key file, <code class="language-plaintext highlighter-rouge">keyfile</code>.
Rather than prompt for a key, Blowpipe will use the raw bytes of a given
file. Here’s how I would create a key file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ head -c 32 /dev/urandom &gt; keyfile
</code></pre></div></div>

<p>First the receiver listens on a socket (<code class="language-plaintext highlighter-rouge">bind(2)</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nc -lp 2000 | blowpipe -D -k keyfile &gt; data.zip
</code></pre></div></div>

<p>Then the sender connects (<code class="language-plaintext highlighter-rouge">connect(2)</code>) and pipes Blowpipe through:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ blowpipe -E -k keyfile &lt; data.zip | nc -N hostname 2000
</code></pre></div></div>

<p>If all went well, Blowpipe will exit with 0 on the receiver side.</p>

<p>Blowpipe doesn’t buffer its output (but see <code class="language-plaintext highlighter-rouge">-w</code>). It performs one
<code class="language-plaintext highlighter-rouge">read(2)</code>, encrypts whatever it got, prepends a MAC, and calls
<code class="language-plaintext highlighter-rouge">write(2)</code> on the result. This means it can comfortably transmit live
sensitive data across the network:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nc -lp 2000 | blowpipe -D

# dmesg -w | blowpipe -E | nc -N hostname 2000
</code></pre></div></div>

<p>Kernel messages will appear on the other end as they’re produced by
<code class="language-plaintext highlighter-rouge">dmesg</code>. Though keep in mind that the size of each line will be known to
eavesdroppers. Blowpipe doesn’t pad it with noise or otherwise try to
disguise the length. Those lengths may leak useful information.</p>

<h3 id="blowfish">Blowfish</h3>

<p>This whole project started when I wanted to <a href="/blog/2017/09/21/">play with Blowfish</a>
as a small drop-in library. I wasn’t satisfied with <a href="https://www.schneier.com/academic/blowfish/download.html">the
selection</a>, so I figured it would be a good exercise to write my
own. Besides, the <a href="https://www.schneier.com/academic/archives/1994/09/description_of_a_new.html">specification</a> is both an enjoyable and easy
read (and recommended). It justifies the need for a new cipher and
explains the various design decisions.</p>

<p>I coded from the specification, including writing <a href="https://github.com/skeeto/blowpipe/blob/master/gen-tables.sh">a script</a>
to generate the subkey initialization tables. Subkeys are initialized
to the binary representation of pi (the first ~10,000 decimal digits).
After a couple hours of work I hooked up the official test vectors to
see how I did, and all the tests passed on the first run. This wasn’t
reasonable, so I spent a while longer figuring out how I screwed up my
tests. Turns out I absolutely <em>nailed it</em> on my first shot. It’s a
really great sign for Blowfish that it’s so easy to implement
correctly.</p>

<p>Blowfish’s key schedule produces five subkeys requiring 4,168 bytes of
storage. The key schedule is unusually complex: Subkeys are repeatedly
encrypted with themselves as they are being computed. This complexity
inspired the <a href="https://www.usenix.org/legacy/events/usenix99/provos/provos_html/node1.html">bcrypt</a> password hashing scheme, which
essentially works by iterating the key schedule many times in a loop,
then encrypting a constant 24-byte string. My bcrypt implementation
wasn’t nearly as successful on my first attempt, and it took hours of
debugging in order to match OpenBSD’s outputs.</p>

<p>The encryption and decryption algorithms are nearly identical, as is
typical for, and a feature of, Feistel ciphers. There are no branches
(preventing some side-channel attacks), and the only operations are
32-bit XOR and 32-bit addition. This makes it ideal for implementation
on 32-bit computers.</p>

<p>One tricky point is that encryption and decryption operate on a pair
of 32-bit integers (another giveaway that it’s a Feistel cipher). To
put the cipher to practical use, these integers have to be <a href="/blog/2016/11/22/">serialized
into a byte stream</a>. The specification doesn’t choose a byte
order, even for mixing the key into the subkeys. The official test
vectors are also 32-bit integers, not byte arrays. An implementer
could choose little endian, big endian, or even something else.</p>

<p>However, there’s one place in which this decision <em>is</em> formally made:
the official test vectors mix the key into the first subkey in big
endian byte order. By luck I happened to choose big endian as well,
which is why my tests passed on the first try. OpenBSD’s version of
bcrypt also uses big endian for all integer encoding steps, further
cementing big endian as the standard way to encode Blowfish integers.</p>
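
<p>For illustration, big endian serialization of one of those 32-bit halves
can be sketched as below. The helper names <code class="language-plaintext highlighter-rouge">load_be32</code> and
<code class="language-plaintext highlighter-rouge">store_be32</code> are my own for this example and aren’t necessarily what
the library uses:</p>

```c
#include <stdint.h>

/* Read a 32-bit integer from 4 bytes, most significant byte first. */
static uint32_t
load_be32(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3];
}

/* Write a 32-bit integer as 4 bytes, most significant byte first. */
static void
store_be32(unsigned char *p, uint32_t x)
{
    p[0] = (unsigned char)(x >> 24);
    p[1] = (unsigned char)(x >> 16);
    p[2] = (unsigned char)(x >>  8);
    p[3] = (unsigned char)(x >>  0);
}
```

<p>An 8-byte block then becomes two loads before encrypting and two stores
after. Because these use only shifts, they behave the same regardless
of the host’s native byte order.</p>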

<h3 id="blowfish-library">Blowfish library</h3>

<p>The Blowpipe repository contains a ready-to-use, public domain Blowfish
library written in strictly conforming C99. The interface is just three
functions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">blowfish_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">blowfish</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="kt">int</span> <span class="n">len</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">blowfish_encrypt</span><span class="p">(</span><span class="k">struct</span> <span class="n">blowfish</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">blowfish_decrypt</span><span class="p">(</span><span class="k">struct</span> <span class="n">blowfish</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Technically the key can be up to 72 bytes long, but the last 16 bytes
have an incomplete effect on the subkeys, so only the first 56 bytes
should matter. Since bcrypt runs the key schedule multiple times, all
72 bytes have full effect.</p>

<p>The library also includes a bcrypt implementation, though it will only
produce the raw password hash, not the base-64 encoded form. The main
reason for including bcrypt is to support Blowpipe.</p>

<h3 id="message-format">Message format</h3>

<p>The main goal of Blowpipe was to build a robust, authenticated
encryption tool using <em>only</em> Blowfish as a cryptographic primitive.</p>

<ol>
  <li>
    <p>It uses bcrypt with a moderately-high cost as a key derivation
function (KDF). Not terrible, but bcrypt is not a memory-hard KDF, and
memory hardness is important for protecting against cheap hardware
brute force attacks.</p>
  </li>
  <li>
    <p>Encryption is Blowfish in “counter” <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29">CTR mode</a>. A 64-bit
counter is incremented and encrypted, producing a keystream. The
plaintext is XORed with this keystream like a stream cipher. This
allows the last block to be truncated when output and eliminates
some padding issues. Since CTR mode is trivially malleable, the MAC
becomes even more important. In CTR mode, <code class="language-plaintext highlighter-rouge">blowfish_decrypt()</code> is
never called. In fact, Blowpipe never uses it.</p>
  </li>
  <li>
    <p>The authentication scheme is Blowfish-CBC-MAC with a unique key and
<a href="https://moxie.org/blog/the-cryptographic-doom-principle/">encrypt-then-authenticate</a> (something I harmlessly got wrong
with Enchive). It essentially encrypts the ciphertext again with a
different key, this time in <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#CBC">Cipher Block Chaining mode</a> (CBC), and
it only saves the final block. The final block is prepended to the
ciphertext as the MAC. On decryption the same block is computed
again to ensure that it matches. Only someone who knows the MAC key
can compute it.</p>
  </li>
</ol>
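
<p>To make the CTR step concrete, here is a rough sketch of the keystream
loop. The <code class="language-plaintext highlighter-rouge">toy_encrypt()</code> function is a hypothetical stand-in for
<code class="language-plaintext highlighter-rouge">blowfish_encrypt()</code> (it takes no key context and is not a real cipher) so
that the example is self-contained. The counter layout and big endian
keystream bytes are also assumptions of the sketch, not a description of
Blowpipe's actual format.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for blowfish_encrypt(): NOT a real cipher, just a fixed
 * mixing function so that the sketch compiles on its own. */
static void
toy_encrypt(uint32_t *l, uint32_t *r)
{
    for (int i = 0; i < 4; i++) {
        uint32_t t = *l ^ (*r * 0x9e3779b9u + (uint32_t)i);
        *l = *r;
        *r = t << 7 | t >> 25;
    }
}

/* XOR buf with the keystream, starting at block number *ctr. Running
 * it again from the same starting counter recovers the input. Only
 * the final call in a stream should pass a partial block. */
static void
ctr_xor(unsigned char *buf, size_t len, uint64_t *ctr)
{
    for (size_t i = 0; i < len; (*ctr)++) {
        /* Encrypt the counter to get 8 bytes of keystream. */
        uint32_t l = (uint32_t)(*ctr >> 32);
        uint32_t r = (uint32_t)*ctr;
        toy_encrypt(&l, &r);
        unsigned char ks[8] = {
            (unsigned char)(l >> 24), (unsigned char)(l >> 16),
            (unsigned char)(l >>  8), (unsigned char)(l >>  0),
            (unsigned char)(r >> 24), (unsigned char)(r >> 16),
            (unsigned char)(r >>  8), (unsigned char)(r >>  0)
        };
        /* The last block is simply truncated: no padding needed. */
        for (int j = 0; j < 8 && i < len; j++, i++) {
            buf[i] ^= ks[j];
        }
    }
}
```

<p>Note that only the encrypt direction of the block cipher appears, for both
encryption and decryption of the message.</p>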

<p>Of all three Blowfish uses, I’m least confident about authentication.
<a href="https://blog.cryptographyengineering.com/2013/02/15/why-i-hate-cbc-mac/">CBC-MAC is tricky to get right</a>, though I am following the
rules: fixed length messages using a different key than encryption.</p>

<p>Wait a minute. Blowpipe is pipe-oriented and can output data without
buffering the entire pipe. How can there be fixed-length messages?</p>

<p>The pipe datastream is broken into 64kB <em>chunks</em>. Each chunk is
authenticated with its own MAC. Both the MAC and chunk length are
written in the chunk header, and the length is authenticated by the
MAC. Furthermore, just like the keystream, the MAC is continued from
the previous chunk, preventing chunks from being reordered. Blowpipe can
output the content of a chunk and discard it once it’s been
authenticated. If any chunk fails to authenticate, it aborts.</p>

<p><img src="/img/diagram/blowpipe.svg" alt="" /></p>

<p>This also leads to another useful trick: The pipe is terminated with a
zero length chunk, preventing an attacker from appending to the
datastream. Everything after the zero-length chunk is discarded. Since
the length is authenticated by the MAC, the attacker also cannot
truncate the pipe since that would require knowledge of the MAC key.</p>

<p>The pipe itself has a 17 byte header: a 16 byte random bcrypt salt and 1
byte for the bcrypt cost. The salt is like an initialization vector (IV)
that allows keys to be safely reused in different Blowpipe instances.
The cost byte is the only distinguishing byte in the stream. Since even
the chunk lengths are encrypted, everything else in the datastream
should be indistinguishable from random data.</p>

<h3 id="portability">Portability</h3>

<p>Blowpipe runs on POSIX systems and Windows (Mingw-w64 and MSVC). I
initially wrote it for POSIX (on Linux) of course, but I took an unusual
approach when it came time to port it to Windows. Normally I’d <a href="/blog/2017/03/01/">invent a
generic OS interface</a> that makes the appropriate host system
calls. This time I kept the POSIX interface (<code class="language-plaintext highlighter-rouge">read(2)</code>, <code class="language-plaintext highlighter-rouge">write(2)</code>,
<code class="language-plaintext highlighter-rouge">open(2)</code>, etc.) and implemented the tiny subset of POSIX that I needed
in terms of Win32. That implementation can be found under <code class="language-plaintext highlighter-rouge">w32-compat/</code>.
I even dropped in a copy of <a href="https://github.com/skeeto/getopt">my own <code class="language-plaintext highlighter-rouge">getopt()</code></a>.</p>

<p>One really cool feature of this technique is that, on Windows, Blowpipe
will still “open” <code class="language-plaintext highlighter-rouge">/dev/urandom</code>. It’s intercepted by my own <code class="language-plaintext highlighter-rouge">open(2)</code>,
which in response to that filename actually calls
<a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa379886(v=vs.85).aspx"><code class="language-plaintext highlighter-rouge">CryptAcquireContext()</code></a> and pretends like it’s a file. It’s all
hidden behind the file descriptor. That’s the unix way.</p>

<p>I’m considering giving Enchive the same treatment since it would simplify
and reduce much of the interface code. In fact, this project has taught
me a number of ways that Enchive could be improved. That’s the value of
writing “toys” such as Blowpipe.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Gap Buffers Are Not Optimized for Multiple Cursors</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/09/07/"/>
    <id>urn:uuid:8c80d068-2342-356a-9b78-f180806418a4</id>
    <updated>2017-09-07T01:34:04Z</updated>
    <category term="emacs"/><category term="c"/><category term="vim"/>
    <content type="html">
      <![CDATA[<p>Gap buffers are a common data structure for representing a text buffer
in a text editor. Emacs famously uses gap buffers — long-standing proof
that gap buffers are a perfectly sufficient way to represent a text
buffer.</p>

<ul>
  <li>
    <p>Gap buffers are <em>very</em> easy to implement. A bare minimum
implementation is about 60 lines of C.</p>
  </li>
  <li>
    <p>Gap buffers are especially efficient for the majority of typical
editing commands, which tend to be clustered in a small area.</p>
  </li>
  <li>
    <p>Except for the gap, the content of the buffer is contiguous, making
the search and display implementations simpler and more efficient.
There’s also the potential for most of the gap buffer to be
memory-mapped to the original file, though typical encoding and
decoding operations prevent this from being realized.</p>
  </li>
  <li>
    <p>Due to having contiguous content, saving a gap buffer is basically
just two <code class="language-plaintext highlighter-rouge">write(2)</code> system calls. (Plus <a href="https://www.youtube.com/watch?v=LMe7hf2G1po"><code class="language-plaintext highlighter-rouge">fsync(2)</code>, etc.</a>)</p>
  </li>
</ul>

<p>A gap buffer is really a pair of buffers where one buffer holds all of
the content before the cursor (or <em>point</em> for Emacs), and the other
buffer holds the content after the cursor. When the cursor is moved
through the buffer, characters are copied from one buffer to the
other. Inserts and deletes close to the gap are very efficient.</p>

<p>Typically it’s implemented as a single large buffer, with the
pre-cursor content at the beginning, the post-cursor content at the
end, and the gap spanning the middle. Here’s an illustration:</p>

<p><img src="/img/gap-buffer/intro.gif" alt="" /></p>

<p>The top of the animation is the display of the text content and cursor
as the user would see it. The bottom is the gap buffer state, where
each character is represented as a gray block, and a literal gap for
the cursor.</p>

<p>Ignoring for a moment more complicated concerns such as undo and
Unicode, a gap buffer could be represented by something as simple as
the following:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">gapbuf</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">total</span><span class="p">;</span>  <span class="cm">/* total size of buf */</span>
    <span class="kt">size_t</span> <span class="n">front</span><span class="p">;</span>  <span class="cm">/* size of content before cursor */</span>
    <span class="kt">size_t</span> <span class="n">gap</span><span class="p">;</span>    <span class="cm">/* size of the gap */</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This is close to <a href="http://git.savannah.gnu.org/cgit/emacs.git/tree/src/buffer.h?h=emacs-25.2#n425">how Emacs represents it</a>. In the structure
above, the size of the content after the cursor isn’t tracked directly,
but can be computed on the fly from the other three quantities. That is
to say, this data structure is <em>normalized</em>.</p>
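
<p>As a rough sketch of how little code this takes, here are the derived
size, a single-character insert, and cursor movement built on that
structure. The function names are invented for this example, and growing
the buffer when the gap runs out is left to the caller.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct gapbuf {
    char *buf;
    size_t total;  /* total size of buf */
    size_t front;  /* size of content before cursor */
    size_t gap;    /* size of the gap */
};

/* Size of the content after the cursor, derived rather than stored. */
static size_t
gapbuf_back(const struct gapbuf *b)
{
    return b->total - b->front - b->gap;
}

/* Insert one character at the cursor. The gap must not be empty. */
static void
gapbuf_insert(struct gapbuf *b, char c)
{
    assert(b->gap > 0);
    b->buf[b->front++] = c;
    b->gap--;
}

/* Move the cursor to position pos by sliding content across the gap. */
static void
gapbuf_move(struct gapbuf *b, size_t pos)
{
    assert(pos <= b->front + gapbuf_back(b));
    if (pos < b->front) {
        /* Moving left: shift [pos, front) to the far side of the gap. */
        size_t n = b->front - pos;
        memmove(b->buf + pos + b->gap, b->buf + pos, n);
    } else {
        /* Moving right: pull the first n bytes after the gap forward. */
        size_t n = pos - b->front;
        memmove(b->buf + b->front, b->buf + b->front + b->gap, n);
    }
    b->front = pos;
}
```

<p>An insert is a single store, while a cursor movement costs one
<code class="language-plaintext highlighter-rouge">memmove(3)</code> proportional to the distance moved.</p>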

<p>As an optimization, the cursor could be tracked separately from the
gap such that non-destructive cursor movement is essentially free. The
difference between cursor and gap would only need to be reconciled for
a destructive change — an insert or delete.</p>

<p>A gap buffer certainly isn’t the only way to do it. For example, the
original <a href="https://ecc-comp.blogspot.com/2015/05/a-brief-glance-at-how-5-text-editors.html">vi used an array of lines</a>, which sort of explains
some of its quirky <a href="http://vimhelp.appspot.com/options.txt.html#'backspace'">line-oriented idioms</a>. The BSD clone of vi, nvi,
<a href="https://en.wikipedia.org/wiki/Nvi">uses an entire database</a> to represent buffers. Vim uses a fairly
complex <a href="https://en.wikipedia.org/wiki/Rope_(data_structure)">rope</a>-like <a href="https://github.com/vim/vim/blob/e723c42836d971180d1bf9f98916966c5543fff1/src/memline.c">data structure</a> with <a href="http://www.free-soft.org/FSM/english/issue01/vim.html">page-oriented
blocks</a>, which may be stored out-of-order in its swap file.</p>

<h3 id="multiple-cursors">Multiple cursors</h3>

<p><a href="http://emacsrocks.com/e13.html"><em>Multiple cursors</em></a> is a fairly recent text editor invention that
has gained a lot of popularity in recent years. It seems every major
editor either has the feature built in or a readily-available
extension. I myself used Magnar Sveen’s <a href="https://github.com/magnars/multiple-cursors.el">well-polished package</a>
for several years. Though obviously the concept didn’t originate in
Emacs or else it would have been called <em>multiple points</em>, which
doesn’t roll off the tongue quite the same way.</p>

<p>The concept is simple: If the same operation needs to be done in many
different places in a buffer, you place a cursor at each position, then
drive them all in parallel using the same commands. It’s super flashy
and great for impressing all your friends.</p>

<p>However, as a result of <a href="/blog/2017/04/01/">improving my typing skills</a>, I’ve
come to the conclusion that <a href="https://medium.com/@schtoeffel/you-don-t-need-more-than-one-cursor-in-vim-2c44117d51db">multiple cursors is all hat and no
cattle</a>. It doesn’t compose well with other editing commands, it
doesn’t scale up to large operations, and it’s got all sorts of flaky
edge cases (off-screen cursors). Nearly anything you can do with
multiple cursors, you can do better with old, well-established editing
paradigms.</p>

<p>Somewhere around 99% of my multiple cursors usage was adding a common
prefix to a contiguous series of lines. As similar brute force
options, Emacs already has rectangular editing, and Vim already has
visual block mode.</p>

<p>The most sophisticated, flexible, and robust alternative is a good old
macro. You can play it back anywhere it’s needed. You can zip it across
a huge buffer. The only downside is that it’s less flashy and so you’ll
get invited to a slightly smaller number of parties.</p>

<p>But if you don’t buy my arguments about multiple cursors being
tasteless, there’s still a good technical argument: <strong>Gap buffers are
not designed to work well in the face of multiple cursors!</strong></p>

<p>For example, suppose we have a series of function calls and we’d like to
add the same set of arguments to each. It’s a classic situation for a
macro or for multiple cursors. Here’s the original code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>foo();
bar();
baz();
</code></pre></div></div>

<p>The example is tiny so that it will fit in the animations to come.
Here’s the desired code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>foo(x, y);
bar(x, y);
baz(x, y);
</code></pre></div></div>

<p>With multiple cursors you would place a cursor inside each set of
parentheses, then type <code class="language-plaintext highlighter-rouge">x, y</code>. Visually it looks something like this:</p>

<p><img src="/img/gap-buffer/illusion.gif" alt="" /></p>

<p>Text is magically inserted in parallel in multiple places at a time.
However, if this is a text editor that uses a gap buffer, the
situation underneath isn’t quite so magical. The entire edit doesn’t
happen at once. First the <code class="language-plaintext highlighter-rouge">x</code> is inserted in each location, then the
comma, and so on. The edits are not clustered so nicely.</p>

<p>From the gap buffer’s point of view, here’s what it looks like:</p>

<p><img src="/img/gap-buffer/multicursors.gif" alt="" /></p>

<p>For every individual character insertion the buffer has to visit each
cursor in turn, performing lots of copying back and forth. The more
cursors there are, the worse it gets. For an edit of length <code class="language-plaintext highlighter-rouge">n</code> with
<code class="language-plaintext highlighter-rouge">m</code> cursors, that’s <code class="language-plaintext highlighter-rouge">O(n * m)</code> calls to <code class="language-plaintext highlighter-rouge">memmove(3)</code>. Multiple cursors
scales badly.</p>

<p>Compare that to the old school hacker who can’t be bothered with
something as tacky and <em>modern</em> (eww!) as multiple cursors, instead
choosing to record a macro, then play it back:</p>

<p><img src="/img/gap-buffer/macros.gif" alt="" /></p>

<p>The entire edit is done locally before moving on to the next location.
It’s perfectly in tune with the gap buffer’s expectations, only needing
<code class="language-plaintext highlighter-rouge">O(m)</code> calls to <code class="language-plaintext highlighter-rouge">memmove(3)</code>. Most of the work flows neatly into the
gap.</p>
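
<p>The difference can be checked with a small model. This is only a sketch:
it tracks the gap position rather than any real text, counts one
<code class="language-plaintext highlighter-rouge">memmove(3)</code> whenever the gap has to be relocated before an insert, and
assumes the cursor positions are given in strictly ascending order. The
function name is made up for the example.</p>

```c
#include <stddef.h>

/* Count gap relocations when typing n characters at each of m cursors.
 * interleaved != 0 models multiple cursors (one character at every
 * cursor per round); interleaved == 0 models macro playback (all n
 * characters at one cursor before moving to the next). */
static long
count_moves(const size_t *cursors, size_t m, size_t n, int interleaved)
{
    long moves = 0;
    size_t gap = 0;  /* the gap starts at the beginning of the buffer */
    for (size_t step = 0; step < m * n; step++) {
        size_t c = interleaved ? step % m : step / n;  /* cursor index */
        size_t k = interleaved ? step / m : step % n;  /* typed so far at c */
        /* Text typed at earlier cursors pushes this one to the right. */
        size_t shift = interleaved ? c * (k + 1) : c * n;
        size_t pos = cursors[c] + shift + k;
        if (pos != gap) {
            moves++;  /* the gap is elsewhere: one memmove(3) */
        }
        gap = pos + 1;  /* after the insert the gap follows the new char */
    }
    return moves;
}
```

<p>For the example above, the cursors sit at offsets 4, 11, and 18, and
<code class="language-plaintext highlighter-rouge">x, y</code> is four characters. The model counts 12 relocations for multiple
cursors and 3 for macro playback.</p>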

<p>So, don’t waste your time with multiple cursors, especially if you’re
using a gap buffer text editor. Instead get more comfortable with your
editor’s macro feature. If your editor doesn’t have a good macro
feature, get a new editor.</p>

<p>If you want to make your own gap buffer animations, here’s the source
code. It includes a tiny gap buffer implementation:</p>

<ul>
  <li><a href="https://github.com/skeeto/gap-buffer-animator">https://github.com/skeeto/gap-buffer-animator</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>A Tutorial on Portable Makefiles</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/08/20/"/>
    <id>urn:uuid:dc6580f0-1703-389b-7bb2-ac29899fd22c</id>
    <updated>2017-08-20T03:03:51Z</updated>
    <category term="tutorial"/><category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>In my first decade writing Makefiles, I developed the bad habit of
liberally using GNU Make’s extensions. I didn’t know the line between
GNU Make and the portable features guaranteed by POSIX. Usually it
didn’t matter much, but it would become an annoyance when building on
non-Linux systems, such as on the various BSDs. I’d have to specifically
install GNU Make, then remember to invoke it (i.e. as <code class="language-plaintext highlighter-rouge">gmake</code>) instead
of the system’s make.</p>

<p>I’ve since become familiar and comfortable with <a href="http://pubs.opengroup.org/onlinepubs/9699919799/utilities/make.html">make’s official
specification</a>, and I’ve spent the last year writing strictly
portable Makefiles. Not only are my builds now portable across all
unix-like systems, my Makefiles are cleaner and more robust. Many of the
common make extensions — conditionals in particular — lead to fragile,
complicated Makefiles and are best avoided anyway. It’s important to be
able to trust your build system to do its job correctly.</p>

<p><strong>This tutorial should be suitable for make beginners who have never
written their own Makefiles before, as well as experienced developers
who want to learn how to write portable Makefiles.</strong> Regardless, in
order to understand the examples you must be familiar with the usual
steps for building programs on the command line (compiler, linker,
object files, etc.). I’m not going to suggest any fancy tricks nor
provide any sort of standard starting template. Makefiles should be dead
simple when the project is small, and grow in a predictable, clean
fashion alongside the project.</p>

<p>I’m not going to cover every feature. You’ll need to read the
specification for yourself to learn it all. This tutorial will go over
the important features as well as the common conventions. It’s important
to follow established conventions so that people using your Makefiles
will know what to expect and how to accomplish the basic tasks.</p>

<p>If you’re running Debian, or a Debian derivative such as Ubuntu, the
<code class="language-plaintext highlighter-rouge">bmake</code> and <code class="language-plaintext highlighter-rouge">freebsd-buildutils</code> packages will provide the <code class="language-plaintext highlighter-rouge">bmake</code> and
<code class="language-plaintext highlighter-rouge">fmake</code> programs respectively. These alternative make implementations
are very useful for testing your Makefiles’ portability, should you
accidentally make use of a GNU Make feature. It’s not perfect since each
implements some of the same extensions as GNU Make, but it will catch
some common mistakes.</p>

<h3 id="whats-in-a-makefile">What’s in a Makefile?</h3>

<blockquote>
  <p>I am free, no matter what rules surround me. If I find them tolerable,
I tolerate them; if I find them too obnoxious, I break them. I am free
because I know that I alone am morally responsible for everything I
do. ―Robert A. Heinlein</p>
</blockquote>

<p>At make’s core are one or more dependency trees, constructed from
<em>rules</em>. Each vertex in the tree is called a <em>target</em>. The final
products of the build (executable, document, etc.) are the tree roots. A
Makefile specifies the dependency trees and supplies the shell commands
to produce a target from its <em>prerequisites</em>.</p>

<p><img src="/img/make/game.svg" alt="" /></p>

<p>In this illustration, the “.c” files are source files that are written
by hand, not generated by commands, so they have no prerequisites. The
syntax for specifying one or more edges in this dependency tree is
simple:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>target [target...]: [prerequisite...]
</code></pre></div></div>

<p>While technically multiple targets can be specified in a single rule,
this is unusual. Typically each target is specified in its own rule. To
specify the tree in the illustration above:</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c</span>
</code></pre></div></div>

<p>The order of these rules doesn’t matter. The entire Makefile is parsed
before any actions are taken, so the tree’s vertices and edges can be
specified in any order. There’s one exception: the first non-special
target in a Makefile is the <em>default target</em>. This target is selected
implicitly when make is invoked without choosing a target. It should be
something sensible, so that a user can blindly run make and get a useful
result.</p>

<p>A target can be specified more than once. Any new prerequisites are
appended to the previously-given prerequisites. For example, this
Makefile is equivalent to the previous one, though it’s typically not written
this way:</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">physics.o</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c</span>
</code></pre></div></div>

<p>There are six <em>special targets</em> that are used to change the behavior
of make itself. All have uppercase names and start with a period.
Names fitting this pattern are reserved for use by make. According to
the standard, in order to get reliable POSIX behavior, the first
non-comment line of the Makefile <em>must</em> be <code class="language-plaintext highlighter-rouge">.POSIX</code>. Since this is a
special target, it’s not a candidate for the default target, so <code class="language-plaintext highlighter-rouge">game</code>
will remain the default target:</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c</span>
</code></pre></div></div>

<p>In practice, even a simple program will have header files, and sources
that include a header file should also have an edge on the dependency
tree for it. If the header file changes, targets that include it should
also be rebuilt.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
</code></pre></div></div>

<h3 id="adding-commands-to-rules">Adding commands to rules</h3>

<p>We’ve constructed a dependency tree, but we still haven’t told make how
to actually build any targets from its prerequisites. The rules also
need to specify the shell commands that produce a target from its
prerequisites.</p>

<p>If you were to create the source files in the example and invoke make,
you would find that it actually <em>does</em> know how to build the object
files. This is because make is initially configured with certain
<em>inference rules</em>, a topic which will be covered later. For now, we’ll
add the <code class="language-plaintext highlighter-rouge">.SUFFIXES</code> special target to the top, erasing all the built-in
inference rules.</p>

<p>Commands immediately follow the target/prerequisite line in a rule. Each
command line must start with a tab character. This can be awkward if
your text editor isn’t configured for it, and it will be awkward if you
try to copy the examples from this page.</p>

<p>Each line is run in its own shell, so be mindful of using commands like
<code class="language-plaintext highlighter-rouge">cd</code>, which won’t affect later lines.</p>

<p>The simplest thing to do is literally specify the same commands you’d
type at the shell:</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">.SUFFIXES</span><span class="o">:</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">cc</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">input.c</span>
</code></pre></div></div>

<h3 id="invoking-make-and-choosing-targets">Invoking make and choosing targets</h3>

<blockquote>
  <p>I tried to walk into Target, but I missed. ―Mitch Hedberg</p>
</blockquote>

<p>When invoking make, it accepts zero or more targets from the dependency
tree, and it will build these targets (i.e. run the commands in the
target’s rule) if the target is <em>out-of-date</em>. A target is out-of-date
if it is older than any of its prerequisites.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># build the "game" binary (default target)
$ make

# build just the object files
$ make graphics.o physics.o input.o
</code></pre></div></div>
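
<p>The out-of-date test itself is just a timestamp comparison. As a rough
sketch of the idea in C (not how any particular make implements it), using
POSIX <code class="language-plaintext highlighter-rouge">stat(2)</code> and a made-up function name:</p>

```c
#include <sys/stat.h>

/* Return 1 if target is missing or older than any of its n
 * prerequisites, otherwise 0. A prerequisite that doesn't exist is
 * ignored here, whereas a real make would try to build it first. */
static int
out_of_date(const char *target, const char **prereqs, int n)
{
    struct stat t;
    if (stat(target, &t) != 0) {
        return 1;  /* target doesn't exist yet, so it must be built */
    }
    for (int i = 0; i < n; i++) {
        struct stat p;
        if (stat(prereqs[i], &p) == 0 && p.st_mtime > t.st_mtime) {
            return 1;  /* prerequisite is newer than the target */
        }
    }
    return 0;
}
```

<p>This sketch compares whole seconds via <code class="language-plaintext highlighter-rouge">st_mtime</code>, so two files written
within the same second compare as equally old.</p>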

<p>This effect cascades up the dependency tree and causes further targets
to be rebuilt until all of the requested targets are up-to-date. There’s
a lot of room for parallelism since different branches of the tree can
be updated independently. It’s common for make implementations to
support parallel builds with the <code class="language-plaintext highlighter-rouge">-j</code> option. This is non-standard, but
it’s a fantastic feature that doesn’t require anything special in the
Makefile to work correctly.</p>

<p>Similar to parallel builds is make’s <code class="language-plaintext highlighter-rouge">-k</code> (“keep going”) option, which
<em>is</em> standard. This tells make not to stop on the first error, and to
continue updating targets that are unaffected by the error. This is nice
for fully populating <a href="http://vimdoc.sourceforge.net/htmldoc/quickfix.html">Vim’s quickfix list</a> or <a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Compilation.html">Emacs’ compilation
buffer</a>.</p>

<p>It’s common to have multiple targets that should be built by default. If
the first rule selects the default target, how do we solve the problem
of needing multiple default targets? The convention is to use <em>phony
targets</em>. These are called “phony” because there is no corresponding
file, and so phony targets are never up-to-date. It’s convention for a
phony “all” target to be the default target.</p>

<p>I’ll make <code class="language-plaintext highlighter-rouge">game</code> a prerequisite of a new “all” target. More real targets
could be added as necessary to turn them into defaults. Users of this
Makefile will also expect <code class="language-plaintext highlighter-rouge">make all</code> to build the entire project.</p>

<p>Another common phony target is “clean” which removes all of the built
files. Users will expect <code class="language-plaintext highlighter-rouge">make clean</code> to delete all generated files.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">.SUFFIXES</span><span class="o">:</span>
<span class="nl">all</span><span class="o">:</span> <span class="nf">game</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">cc</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
    <span class="err">cc</span> <span class="err">-c</span> <span class="err">input.c</span>
<span class="nl">clean</span><span class="o">:</span>
    <span class="err">rm</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
</code></pre></div></div>

<h3 id="customize-the-build-with-macros">Customize the build with macros</h3>

<p>So far the Makefile hardcodes <code class="language-plaintext highlighter-rouge">cc</code> as the compiler, and doesn’t use any
compiler flags (warnings, optimization, hardening, etc.). The user
should be able to easily control all these things, but right now they’d
have to edit the entire Makefile to do so. Perhaps the user has both
<code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">clang</code> installed, and wants to choose one or the other
without changing which is installed as <code class="language-plaintext highlighter-rouge">cc</code>.</p>

<p>To solve this, make has <em>macros</em> that expand into strings when
referenced. The convention is to use the macro named <code class="language-plaintext highlighter-rouge">CC</code> when talking
about the C compiler, <code class="language-plaintext highlighter-rouge">CFLAGS</code> when talking about flags passed to the C
compiler, <code class="language-plaintext highlighter-rouge">LDFLAGS</code> for flags passed to the C compiler when linking, and
<code class="language-plaintext highlighter-rouge">LDLIBS</code> for flags about libraries when linking. The Makefile should
supply defaults as needed.</p>

<p>A macro is expanded with <code class="language-plaintext highlighter-rouge">$(...)</code>. It’s valid (and normal) to reference
a macro that hasn’t been defined, in which case it expands to an empty string. This
will be the case with <code class="language-plaintext highlighter-rouge">LDFLAGS</code> below.</p>

<p>Macro values can contain other macros, which will be expanded
recursively each time the macro is expanded. Some make implementations
allow the name of the macro being expanded to itself be a macro, which
<a href="/blog/2016/04/30/">is Turing complete</a>, but this behavior is non-standard.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">.SUFFIXES</span><span class="o">:</span>
<span class="nv">CC</span>     <span class="o">=</span> cc
<span class="nv">CFLAGS</span> <span class="o">=</span> <span class="nt">-W</span> <span class="nt">-O</span>
<span class="nv">LDLIBS</span> <span class="o">=</span> <span class="nt">-lm</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">game</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">$(CC)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span> <span class="err">$(LDLIBS)</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="err">graphics.c</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="err">physics.c</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="err">input.c</span>
<span class="nl">clean</span><span class="o">:</span>
    <span class="err">rm</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
</code></pre></div></div>

<p>Macros are overridden by macro definitions given as command line
arguments in the form <code class="language-plaintext highlighter-rouge">name=value</code>. This allows the user to select their
own build configuration. <strong>This is one of make’s most powerful and
under-appreciated features.</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make CC=clang CFLAGS='-O3 -march=native'
</code></pre></div></div>

<p>If the user doesn’t want to specify these macros on every invocation,
they can (cautiously) use make’s <code class="language-plaintext highlighter-rouge">-e</code> flag to set overriding macro
definitions from the environment.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export CC=clang
$ export CFLAGS=-O3
$ make -e all
</code></pre></div></div>

<p>Some make implementations have other special kinds of macro assignment
operators beyond simple assignment (<code class="language-plaintext highlighter-rouge">=</code>). These are unnecessary, so
don’t worry about them.</p>

<h3 id="inference-rules-so-that-you-can-stop-repeating-yourself">Inference rules so that you can stop repeating yourself</h3>

<blockquote>
  <p>The road itself tells us far more than signs do. ―Tom Vanderbilt,
Traffic: Why We Drive the Way We Do</p>
</blockquote>

<p>There’s repetition across the three different object files. Wouldn’t it
be nice if there was a way to communicate this pattern? Fortunately
there is, in the form of <em>inference rules</em>. An inference rule says that a target with
a certain extension, with a prerequisite with another certain extension,
is built a certain way. This will make more sense with an example.</p>

<p>In an inference rule, the target indicates the extensions. The <code class="language-plaintext highlighter-rouge">$&lt;</code>
macro expands to the prerequisite, which is essential to making
inference rules work generically. Unfortunately this macro is not
available in target rules, as much as that would be useful.</p>

<p>For example, here’s an inference rule that teaches make how to build an
object file from a C source file. This particular rule is one that
is pre-defined by make, so you’ll never need to write this one yourself.
I’ll include it for completeness.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.c.o</span><span class="o">:</span>
    <span class="err">$(CC)</span> <span class="err">$(CFLAGS)</span> <span class="err">-c</span> <span class="err">$&lt;</span>
</code></pre></div></div>

<p>These extensions must be added to <code class="language-plaintext highlighter-rouge">.SUFFIXES</code> before they will work.
With that, the commands for the rules about object files can be omitted.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nl">.SUFFIXES</span><span class="o">:</span>
<span class="nv">CC</span>     <span class="o">=</span> cc
<span class="nv">CFLAGS</span> <span class="o">=</span> <span class="nt">-W</span> <span class="nt">-O</span>
<span class="nv">LDLIBS</span> <span class="o">=</span> <span class="nt">-lm</span>

<span class="nl">all</span><span class="o">:</span> <span class="nf">game</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">$(CC)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span> <span class="err">$(LDLIBS)</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
<span class="nl">clean</span><span class="o">:</span>
    <span class="err">rm</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>

<span class="nl">.SUFFIXES</span><span class="o">:</span> <span class="nf">.c .o</span>
<span class="nl">.c.o</span><span class="o">:</span>
    <span class="err">$(CC)</span> <span class="err">$(CFLAGS)</span> <span class="err">-c</span> <span class="err">$&lt;</span>
</code></pre></div></div>

<p>The first empty <code class="language-plaintext highlighter-rouge">.SUFFIXES</code> clears the suffix list. The second one adds
<code class="language-plaintext highlighter-rouge">.c</code> and <code class="language-plaintext highlighter-rouge">.o</code> to the now-empty suffix list.</p>

<h3 id="other-target-conventions">Other target conventions</h3>

<blockquote>
  <p>Conventions are, indeed, all that shield us from the shivering void,
though often they do so but poorly and desperately. ―Robert Aickman</p>
</blockquote>

<p>Users usually expect an “install” target that installs the built
program, libraries, man pages, etc. By convention this target should use
the <code class="language-plaintext highlighter-rouge">PREFIX</code> and <code class="language-plaintext highlighter-rouge">DESTDIR</code> macros.</p>

<p>The <code class="language-plaintext highlighter-rouge">PREFIX</code> macro should default to <code class="language-plaintext highlighter-rouge">/usr/local</code>, and since it’s a
macro the user can override it to install elsewhere, <a href="/blog/2017/06/19/">such as in their
home directory</a>. The user should override it for both building and
installing, since the prefix may need to be built into the binary (e.g.
<code class="language-plaintext highlighter-rouge">-DPREFIX=$(PREFIX)</code>).</p>

<p>The <code class="language-plaintext highlighter-rouge">DESTDIR</code> macro is used for <em>staged builds</em>, so that the program gets
installed under a fake root directory for the sake of packaging. Unlike
<code class="language-plaintext highlighter-rouge">PREFIX</code>, the program will not actually be run from this directory.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.POSIX</span><span class="o">:</span>
<span class="nv">CC</span>     <span class="o">=</span> cc
<span class="nv">CFLAGS</span> <span class="o">=</span> <span class="nt">-W</span> <span class="nt">-O</span>
<span class="nv">LDLIBS</span> <span class="o">=</span> <span class="nt">-lm</span>
<span class="nv">PREFIX</span> <span class="o">=</span> /usr/local

<span class="nl">all</span><span class="o">:</span> <span class="nf">game</span>
<span class="nl">install</span><span class="o">:</span> <span class="nf">game</span>
    <span class="err">mkdir</span> <span class="err">-p</span> <span class="err">$(DESTDIR)$(PREFIX)/bin</span>
    <span class="err">mkdir</span> <span class="err">-p</span> <span class="err">$(DESTDIR)$(PREFIX)/share/man/man1</span>
    <span class="err">cp</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">$(DESTDIR)$(PREFIX)/bin</span>
    <span class="err">gzip</span> <span class="err">&lt;</span> <span class="err">game.1</span> <span class="err">&gt;</span> <span class="err">$(DESTDIR)$(PREFIX)/share/man/man1/game.1.gz</span>
<span class="nl">game</span><span class="o">:</span> <span class="nf">graphics.o physics.o input.o</span>
    <span class="err">$(CC)</span> <span class="err">$(LDFLAGS)</span> <span class="err">-o</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span> <span class="err">$(LDLIBS)</span>
<span class="nl">graphics.o</span><span class="o">:</span> <span class="nf">graphics.c graphics.h</span>
<span class="nl">physics.o</span><span class="o">:</span> <span class="nf">physics.c physics.h</span>
<span class="nl">input.o</span><span class="o">:</span> <span class="nf">input.c input.h graphics.h physics.h</span>
<span class="nl">clean</span><span class="o">:</span>
    <span class="err">rm</span> <span class="err">-f</span> <span class="err">game</span> <span class="err">graphics.o</span> <span class="err">physics.o</span> <span class="err">input.o</span>
</code></pre></div></div>

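<p>You may also want to provide an “uninstall” phony target that does the
opposite. Since <code class="language-plaintext highlighter-rouge">PREFIX</code> is a macro, installing under the user’s home
directory is a matter of overriding it on the command line:</p>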

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make PREFIX=$HOME/.local install
</code></pre></div></div>

<p>Other common targets are “mostlyclean” (like “clean” but don’t delete
some slow-to-build targets), “distclean” (delete even more than
“clean”), “test” or “check” (run the test suite), and “dist” (create a
package).</p>

<h3 id="complexity-and-growing-pains">Complexity and growing pains</h3>

<p>One of make’s big weak points is scaling up as a project grows in size.</p>

<h4 id="recursive-makefiles">Recursive Makefiles</h4>

<p>As your growing project is broken into subdirectories, you may be
tempted to put a Makefile in each subdirectory and invoke them
recursively.</p>

<p><a href="http://aegis.sourceforge.net/auug97.pdf"><strong>Don’t use recursive Makefiles</strong></a>. Recursion breaks the dependency
tree across separate instances of make and typically results in a
fragile build. There’s nothing good about it. Have one Makefile at the
root of your project and invoke make there. You may have to teach your
text editor how to do this.</p>

<p>When talking about files in subdirectories, just include the
subdirectory in the name. Everything will work the same as far as make
is concerned, including inference rules.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">src/graphics.o</span><span class="o">:</span> <span class="nf">src/graphics.c</span>
<span class="nl">src/physics.o</span><span class="o">:</span> <span class="nf">src/physics.c</span>
<span class="nl">src/input.o</span><span class="o">:</span> <span class="nf">src/input.c</span>
</code></pre></div></div>

<h4 id="out-of-source-builds">Out-of-source builds</h4>

<p>Keeping your object files separate from your source files is a nice
idea. When it comes to make, there’s good news and bad news.</p>

<p>The good news is that make can do this. You can pick whatever file names
you like for targets and prerequisites.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">obj/input.o</span><span class="o">:</span> <span class="nf">src/input.c</span>
</code></pre></div></div>

<p>The bad news is that inference rules are not compatible with
out-of-source builds. You’ll need to repeat the same commands for each
rule as if inference rules didn’t exist. This is tedious for large
projects, so you may want to have some sort of “configure” script, even
if hand-written, to generate all this for you. This is essentially what
CMake is all about. That, plus dependency management.</p>

<h4 id="dependency-management">Dependency management</h4>

<p>Another problem with scaling up is tracking the project’s ever-changing
dependencies across all the source files. Missing a dependency means the
build may not be correct unless you <code class="language-plaintext highlighter-rouge">make clean</code> first.</p>

<p>If you go the route of using a script to generate the tedious parts of
the Makefile, both GCC and Clang have a nice feature for generating all
the Makefile dependencies for you (<code class="language-plaintext highlighter-rouge">-MM</code>, <code class="language-plaintext highlighter-rouge">-MT</code>), at least for C and
C++. There are lots of tutorials for doing this dependency generation on
the fly as part of the build, but it’s fragile and slow. Much better to
do it all up front and “bake” the dependencies into the Makefile so that
make can do its job properly. If the dependencies change, rebuild your
Makefile.</p>

<p>For example, here’s what it looks like invoking gcc’s dependency
generator against the imaginary <code class="language-plaintext highlighter-rouge">input.c</code> for an out-of-source build:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc $CFLAGS -MM -MT '$(BUILD)/input.o' input.c
$(BUILD)/input.o: input.c input.h graphics.h physics.h
</code></pre></div></div>

<p>Notice the output is in Makefile rule format.</p>

<p>Unfortunately this feature strips the leading paths from the target, so,
in practice, using it is always more complicated than it should be (e.g.
it requires the use of <code class="language-plaintext highlighter-rouge">-MT</code>).</p>

<h4 id="microsofts-nmake">Microsoft’s Nmake</h4>

<p>Microsoft has an implementation of make called Nmake, which <a href="/blog/2016/06/13/">comes with
Visual Studio</a>. It’s <em>nearly</em> a POSIX-compatible make, but
necessarily breaks from the standard in some places. Their cl.exe
compiler uses <code class="language-plaintext highlighter-rouge">.obj</code> as the object file extension and <code class="language-plaintext highlighter-rouge">.exe</code> for
binaries, both of which differ from the unix world, so it has different
built-in inference rules. Windows also lacks a Bourne shell and the
standard unix tools, so all of the commands will necessarily be
different.</p>

<p>There’s no equivalent of <code class="language-plaintext highlighter-rouge">rm -f</code> on Windows, so good luck writing a
proper “clean” target. No, <code class="language-plaintext highlighter-rouge">del /f</code> isn’t the same.</p>

<p>So while it’s close to POSIX make, it’s not practical to write a
Makefile that will simultaneously work properly with both POSIX make
and Nmake. These need to be separate Makefiles.</p>

<h3 id="may-your-makefiles-be-portable">May your Makefiles be portable</h3>

<p>It’s nice to have reliable, portable Makefiles that just work anywhere.
<a href="/blog/2017/03/30/">Code to the standards</a> and you don’t need feature tests or
other sorts of special treatment.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Integer Overflow into Information Disclosure</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/07/19/"/>
    <id>urn:uuid:c85545c5-23a4-3147-b654-6dc7a62ee426</id>
    <updated>2017-07-19T01:57:36Z</updated>
    <category term="netsec"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Last week I was discussing <a href="https://security-tracker.debian.org/tracker/CVE-2017-7529">CVE-2017-7529</a> with <a href="/blog/2016/09/02/">my intern</a>.
Specially crafted input to Nginx causes an integer overflow which has the
potential to leak sensitive information. But how could an integer overflow
be abused to trick a program into leaking information? To answer this
question, I put together the simplest practical example I could imagine.</p>

<ul>
  <li><a href="https://github.com/skeeto/integer-overflow-demo">https://github.com/skeeto/integer-overflow-demo</a></li>
</ul>

<p>This small C program converts a vector image from a custom format
(described below) into a <a href="https://en.wikipedia.org/wiki/Netpbm_format">Netpbm image</a>, a <a href="/blog/2017/07/02/">conveniently simple
format</a>. The program defensively and carefully parses its input, but
still makes a subtle, fatal mistake. This mistake not only leads to
sensitive information disclosure, but, with a more sophisticated attack,
could be used to execute arbitrary code.</p>

<p>After getting the hang of the interface for the program, I encourage you
to take some time to work out an exploit yourself. Regardless, I’ll reveal
a functioning exploit and explain how it works.</p>

<h3 id="a-new-vector-format">A new vector format</h3>

<p>The input format is line-oriented and very similar to Netpbm itself. The
first line is the header, starting with the magic number <code class="language-plaintext highlighter-rouge">V2</code> (ASCII)
followed by the image dimensions. The target output format is Netpbm’s
“P2” (text gray scale) format, so the “V2” parallels it. The file must end
with a newline.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>V2 &lt;width&gt; &lt;height&gt;
</code></pre></div></div>

<p>What follows is a series of drawing commands, one per line. For example, the <code class="language-plaintext highlighter-rouge">s</code>
command sets the value of a particular pixel.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s &lt;x&gt; &lt;y&gt; &lt;00–ff&gt;
</code></pre></div></div>

<p>Since it’s not important for the demonstration, this is the only command I
implemented. It’s easy to imagine additional commands to draw lines,
circles, Bezier curves, etc.</p>
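<p>As a rough sketch of what handling one such line involves, here is a hypothetical <code class="language-plaintext highlighter-rouge">apply_set()</code> helper. It is not taken from the demo program: it uses <code class="language-plaintext highlighter-rouge">sscanf()</code> for brevity, and it assumes the caller passes a row-major buffer that really does hold <code class="language-plaintext highlighter-rouge">width * height</code> bytes.</p>

```c
#include <stdio.h>

/* Hypothetical helper, not from the demo program.
 * Applies one "s <x> <y> <00-ff>" line to a row-major gray buffer.
 * Assumes pixels holds width * height bytes.
 * Returns 1 on success, 0 if the line is malformed or out of bounds. */
int
apply_set(const char *line, unsigned char *pixels,
          unsigned long width, unsigned long height)
{
    unsigned long x, y;
    unsigned v;
    char extra;

    /* Exactly three fields must convert; a fourth means trailing junk. */
    if (sscanf(line, "s %lu %lu %2x %c", &x, &y, &v, &extra) != 3)
        return 0;
    if (x >= width || y >= height)
        return 0;
    pixels[y * width + x] = (unsigned char)v;
    return 1;
}
```

<p>With a 4×4 buffer, the line <code class="language-plaintext highlighter-rouge">s 1 2 ff</code> sets the byte at index 9, while <code class="language-plaintext highlighter-rouge">s 4 0 ff</code> is rejected because the x coordinate is out of bounds.</p>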

<p>Here’s an example (<code class="language-plaintext highlighter-rouge">example.txt</code>) that draws a single white point in the
middle of the image:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>V2 256 256
s 127 127 ff
</code></pre></div></div>

<p>The rendering tool reads from standard input and writes to standard output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ render &lt; example.txt &gt; example.pgm
</code></pre></div></div>

<p>Here’s what it looks like rendered:</p>

<p><img src="/img/int-overflow/example.png" alt="" /></p>

<p>However, you will notice that when you run the rendering tool, it prompts
you for a username and password. This is silly, of course, but it’s an
excuse to get “sensitive” information into memory. It will accept any
username/password combination where the username and password don’t match
each other. The key is this: <strong>It’s possible to craft a valid image that
leaks the entered password.</strong></p>

<h3 id="tour-of-the-implementation">Tour of the implementation</h3>

<p>Without spoiling anything yet, let’s look at how this program works. The
first thing to notice is that I’m using a custom “<a href="http://www.gnu.org/software/libc/manual/html_node/Obstacks.html">obstack</a>”
allocator instead of <code class="language-plaintext highlighter-rouge">malloc()</code> and <code class="language-plaintext highlighter-rouge">free()</code>. Real-world allocators have
some defenses against this particular vulnerability. Plus a specific
exploit would have to target a specific libc. By using my own allocator,
the exploit will mostly be portable, making for a better and easier
demonstration.</p>

<p>The allocator interface should be pretty self-explanatory, except for two
details. This is an <em>obstack</em> allocator, so freeing an object also frees
every object allocated after it. Also, it doesn’t call <code class="language-plaintext highlighter-rouge">malloc()</code> in the
background. At initialization you give it a buffer from which to allocate
all memory.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">mstack</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">top</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">max</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[];</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">mstack</span> <span class="o">*</span><span class="nf">mstack_init</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span>          <span class="o">*</span><span class="nf">mstack_alloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">mstack</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span>           <span class="nf">mstack_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">mstack</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>There are no vulnerabilities in these functions (I hope!). The allocator is just
here for predictability.</p>
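<p>To make those two details concrete, here is a minimal sketch of how an allocator with this interface might be written. It is not the implementation from the demo repository. It assumes byte-granular allocations (no alignment padding) and a caller-supplied buffer that is suitably aligned for <code class="language-plaintext highlighter-rouge">struct mstack</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only; the real demo's allocator may differ. */
struct mstack {
    char *top;   /* next free byte */
    char *max;   /* one past the end of the caller's buffer */
    char buf[];  /* storage follows the header */
};

/* Place the allocator header at the start of the caller's buffer. */
struct mstack *
mstack_init(void *buf, size_t size)
{
    if (size < sizeof(struct mstack))
        return 0;
    struct mstack *m = buf;
    m->top = m->buf;
    m->max = (char *)buf + size;
    return m;
}

/* Bump allocation: hand out the next size bytes, or null when full. */
void *
mstack_alloc(struct mstack *m, size_t size)
{
    if (size > (size_t)(m->max - m->top))
        return 0;
    void *p = m->top;
    m->top += size;
    return p;
}

/* Freeing p also frees every object allocated after p. */
void
mstack_free(struct mstack *m, void *p)
{
    assert((char *)p >= m->buf && (char *)p <= m->top);
    m->top = p;
}
```

<p>In this sketch, allocating <code class="language-plaintext highlighter-rouge">a</code> and then <code class="language-plaintext highlighter-rouge">b</code> and calling <code class="language-plaintext highlighter-rouge">mstack_free(m, a)</code> releases both objects, and the next allocation reuses the address of <code class="language-plaintext highlighter-rouge">a</code>. Freed memory is not cleared.</p>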

<p>Next here’s the “authentication” function. It reads a username and
password combination from <code class="language-plaintext highlighter-rouge">/dev/tty</code>. It’s only an excuse to get a flag in
memory for this capture-the-flag game. The username and password must be
less than 32 characters each.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">authenticate</span><span class="p">(</span><span class="k">struct</span> <span class="n">mstack</span> <span class="o">*</span><span class="n">m</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">FILE</span> <span class="o">*</span><span class="n">tty</span> <span class="o">=</span> <span class="n">fopen</span><span class="p">(</span><span class="s">"/dev/tty"</span><span class="p">,</span> <span class="s">"r+"</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">tty</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">perror</span><span class="p">(</span><span class="s">"/dev/tty"</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">user</span> <span class="o">=</span> <span class="n">mstack_alloc</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="mi">32</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">user</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">fclose</span><span class="p">(</span><span class="n">tty</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">fputs</span><span class="p">(</span><span class="s">"User: "</span><span class="p">,</span> <span class="n">tty</span><span class="p">);</span>
    <span class="n">fflush</span><span class="p">(</span><span class="n">tty</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">fgets</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="n">tty</span><span class="p">))</span>
        <span class="n">user</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">pass</span> <span class="o">=</span> <span class="n">mstack_alloc</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="mi">32</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">pass</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">fputs</span><span class="p">(</span><span class="s">"Password: "</span><span class="p">,</span> <span class="n">tty</span><span class="p">);</span>
        <span class="n">fflush</span><span class="p">(</span><span class="n">tty</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">fgets</span><span class="p">(</span><span class="n">pass</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="n">tty</span><span class="p">))</span>
            <span class="n">result</span> <span class="o">=</span> <span class="n">strcmp</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">pass</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">fclose</span><span class="p">(</span><span class="n">tty</span><span class="p">);</span>
    <span class="n">mstack_free</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">user</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Next here’s a little version of <code class="language-plaintext highlighter-rouge">calloc()</code> for the custom allocator. Hmm,
I wonder why this is called “naive”…</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">naive_calloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">mstack</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mstack_alloc</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">nmemb</span> <span class="o">*</span> <span class="n">size</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span>
        <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">nmemb</span> <span class="o">*</span> <span class="n">size</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Next up is a paranoid wrapper for <code class="language-plaintext highlighter-rouge">strtoul()</code> that defensively checks its
inputs. If it’s out of range of an <code class="language-plaintext highlighter-rouge">unsigned long</code>, it bails out. If
there’s trailing garbage, it bails out. If there’s no number at all, it
bails out. If you make prolonged eye contact, it bails out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">long</span>
<span class="nf">safe_strtoul</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">nptr</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">endptr</span><span class="p">,</span> <span class="kt">int</span> <span class="n">base</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">errno</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">n</span> <span class="o">=</span> <span class="n">strtoul</span><span class="p">(</span><span class="n">nptr</span><span class="p">,</span> <span class="n">endptr</span><span class="p">,</span> <span class="n">base</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">errno</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">perror</span><span class="p">(</span><span class="n">nptr</span><span class="p">);</span>
        <span class="n">exit</span><span class="p">(</span><span class="n">EXIT_FAILURE</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nptr</span> <span class="o">==</span> <span class="o">*</span><span class="n">endptr</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Expected an integer</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
        <span class="n">exit</span><span class="p">(</span><span class="n">EXIT_FAILURE</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isspace</span><span class="p">(</span><span class="o">**</span><span class="n">endptr</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Invalid character '%c'</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="o">**</span><span class="n">endptr</span><span class="p">);</span>
        <span class="n">exit</span><span class="p">(</span><span class="n">EXIT_FAILURE</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">n</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">main()</code> function parses the header using this wrapper and allocates
some zeroed memory:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">width</span> <span class="o">=</span> <span class="n">safe_strtoul</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="mi">10</span><span class="p">);</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">height</span> <span class="o">=</span> <span class="n">safe_strtoul</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">p</span><span class="p">,</span> <span class="mi">10</span><span class="p">);</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pixels</span> <span class="o">=</span> <span class="n">naive_calloc</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">pixels</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">fputs</span><span class="p">(</span><span class="s">"Not enough memory</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">stderr</span><span class="p">);</span>
        <span class="n">exit</span><span class="p">(</span><span class="n">EXIT_FAILURE</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Then there’s a command processing loop, also using <code class="language-plaintext highlighter-rouge">safe_strtoul()</code>. It
carefully checks bounds against <code class="language-plaintext highlighter-rouge">width</code> and <code class="language-plaintext highlighter-rouge">height</code>. Finally it writes
the image out in Netpbm P2 (.pgm) format.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">printf</span><span class="p">(</span><span class="s">"P2</span><span class="se">\n</span><span class="s">%lu %lu 255</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span>
            <span class="n">printf</span><span class="p">(</span><span class="s">"%d "</span><span class="p">,</span> <span class="n">pixels</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]);</span>
        <span class="n">putchar</span><span class="p">(</span><span class="sc">'\n'</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The vulnerability is in something I’ve shown above. Can you find it?</p>

<h3 id="exploiting-the-renderer">Exploiting the renderer</h3>

<p>Did you find it? If you’re on a platform with 64-bit <code class="language-plaintext highlighter-rouge">long</code>, here’s your
exploit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>V2 16 1152921504606846977
</code></pre></div></div>

<p>And here’s an exploit for 32-bit <code class="language-plaintext highlighter-rouge">long</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>V2 16 268435457
</code></pre></div></div>

<p>Here’s how it looks in action. The most obvious result is that the program
crashes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo V2 16 1152921504606846977 | ./mstack &gt; capture.txt
User: coolguy
Password: mysecret
Segmentation fault
</code></pre></div></div>

<p>Here are the initial contents of <code class="language-plaintext highlighter-rouge">capture.txt</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P2
16 1152921504606846977 255
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
109 121 115 101 99 114 101 116 10 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
</code></pre></div></div>

<p>Where did those junk numbers come from in the image data? Plug them into
an ASCII table and you’ll get “mysecret”. Despite allocating the image
with <code class="language-plaintext highlighter-rouge">naive_calloc()</code>, the password has found its way into the image! How
could this be?</p>

<p>What happened is that <code class="language-plaintext highlighter-rouge">width * height</code> overflows an <code class="language-plaintext highlighter-rouge">unsigned long</code>.
(Well, technically speaking, unsigned integers are defined <em>not</em> to
overflow in C, wrapping around instead, but it’s really the same thing.)
In <code class="language-plaintext highlighter-rouge">naive_calloc()</code>, the overflow results in a value of 16, so it only
allocates and clears 16 bytes. The requested allocation “succeeds” despite
<em>far</em> exceeding the available memory. The caller has been given a lot less
memory than expected, and the memory believed to have been allocated
contains a password.</p>
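<p>The wraparound is easy to check by hand. The 64-bit exploit height is
2<sup>60</sup> + 1, so multiplying by 16 gives 2<sup>64</sup> + 16, which
reduces to 16 modulo 2<sup>64</sup>. Here’s a small sketch of that
arithmetic. It is not part of the renderer: the name <code class="language-plaintext highlighter-rouge">wrapped_area()</code> is
made up for illustration, and it uses <code class="language-plaintext highlighter-rouge">uint64_t</code> so that it behaves the
same no matter how wide <code class="language-plaintext highlighter-rouge">long</code> happens to be.</p>

```c
#include <stdint.h>

/* Hypothetical helper: the product exactly as a 64-bit unsigned
 * multiply computes it, that is, reduced modulo 2^64. */
static uint64_t
wrapped_area(uint64_t width, uint64_t height)
{
    return width * height;
}
```

<p>Passing the 64-bit exploit header, <code class="language-plaintext highlighter-rouge">wrapped_area(16, 1152921504606846977)</code>,
yields 16. Truncating the product for the 32-bit header to 32 bits gives
16 as well, since 16 × 268435457 is 2<sup>32</sup> + 16.</p>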

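<p>The final part that writes the output never multiplies <code class="language-plaintext highlighter-rouge">width</code> by
<code class="language-plaintext highlighter-rouge">height</code>, so it has no overflow to test for. It uses a nested loop instead,
continuing along with the original, impossible image size.</p>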

<p>How do we fix this? Add an overflow check at the beginning of the
<code class="language-plaintext highlighter-rouge">naive_calloc()</code> function (making it no longer naive). This is what the
real <code class="language-plaintext highlighter-rouge">calloc()</code> does.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&amp;&amp;</span> <span class="n">size</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">1UL</span> <span class="o">/</span> <span class="n">nmemb</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>The frightening takeaway is that this check is <em>very</em> easy to forget. It’s
a subtle bug with potentially disastrous consequences.</p>
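<p>To see the check in context, here’s a minimal, self-contained sketch of
an allocator that performs it. This is an assumption-laden stand-in, not
the article’s <code class="language-plaintext highlighter-rouge">naive_calloc()</code>: the name <code class="language-plaintext highlighter-rouge">checked_calloc()</code> is hypothetical,
it sits on top of plain <code class="language-plaintext highlighter-rouge">malloc()</code>, and it takes <code class="language-plaintext highlighter-rouge">size_t</code> arguments, so it
compares against <code class="language-plaintext highlighter-rouge">SIZE_MAX</code> rather than <code class="language-plaintext highlighter-rouge">-1UL</code>.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the article's allocator: refuses any
 * request whose byte count would not fit in a size_t. */
static void *
checked_calloc(size_t nmemb, size_t size)
{
    if (nmemb && size > SIZE_MAX / nmemb)
        return 0;  /* nmemb * size would wrap around */
    void *p = malloc(nmemb * size);
    if (p)
        memset(p, 0, nmemb * size);
    return p;
}
```

<p>With this in place, a request such as
<code class="language-plaintext highlighter-rouge">checked_calloc(16, SIZE_MAX / 16 + 1)</code> fails cleanly instead of quietly
handing back a tiny buffer.</p>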

<p>In practice, this sort of program wouldn’t have sensitive data resident in
memory. Instead an attacker would target the program’s stack with those
<code class="language-plaintext highlighter-rouge">s</code> commands — specifically the <a href="/blog/2017/01/21/">return pointers</a> — and perform a ROP
attack against the application. With the exploit header above and a
platform where <code class="language-plaintext highlighter-rouge">long</code> is the same size as <code class="language-plaintext highlighter-rouge">size_t</code>, the program will behave
as if all available memory has been allocated to the image, so the <code class="language-plaintext highlighter-rouge">s</code>
command could be used to poke custom values <em>anywhere</em> in memory. This is
a much more complicated exploit, and it has to contend with ASLR and
a random stack gap, but it’s feasible.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Rolling Shutter Simulation in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/07/02/"/>
    <id>urn:uuid:</id>
    <updated>2017-07-02T18:35:16Z</updated>
    <category term="c"/><category term="media"/><category term="tutorial"/><category term="trick"/>
    <content type="html">
      <![CDATA[<p>The most recent <a href="https://www.youtube.com/watch?v=dNVtMmLlnoE">Smarter Every Day (#172)</a> explains a phenomenon
that results from <em>rolling shutter</em>. You’ve likely seen this effect in
some of your own digital photographs. When a CMOS digital camera
captures a picture, it reads one row of the sensor at a time. If the
subject of the picture is a fast-moving object (relative to the
camera), then the subject will change significantly while the image is
being captured, giving strange, unreal results:</p>

<p><a href="/img/rolling-shutter/rolling-shutter.jpg"><img src="/img/rolling-shutter/rolling-shutter-thumb.jpg" alt="" /></a></p>

<p>In the <em>Smarter Every Day</em> video, Destin illustrates the effect by
simulating rolling shutter using a short video clip. In each frame of
the video, a few additional rows are locked in place, showing the
effect in slow motion, making it easier to understand.</p>

<video src="https://nullprogram.s3.amazonaws.com/rolling-shutter/rolling-shutter-5.mp4" width="500" height="500" loop="loop" controls="controls" autoplay="autoplay">
</video>

<p>At the end of the video he thanks a friend for figuring out how to get
After Effects to simulate rolling shutter. After thinking about this
for a moment, I figured I could easily accomplish this myself with
just a bit of C, without any libraries. The video above this paragraph
is the result.</p>

<p>I <a href="/blog/2011/11/28/">previously described a technique</a> to edit and manipulate
video without any formal video editing tools. A unix pipeline is
sufficient for doing minor video editing, especially without sound.
The program at the front of the pipe decodes the video into a raw,
uncompressed format, such as YUV4MPEG or <a href="https://en.wikipedia.org/wiki/Netpbm_format">PPM</a>. The tools in
the middle losslessly manipulate this data to achieve the desired
effect (watermark, scaling, etc.). Finally, the tool at the end
encodes the video into a standard format.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ decode video.mp4 | xform-a | xform-b | encode out.mp4
</code></pre></div></div>

<p>For the “decode” program I’ll be using ffmpeg now that it’s <a href="https://lwn.net/Articles/650816/">back in
the Debian repositories</a>. You can throw a video in virtually any
format at it and it will write PPM frames to standard output. For the
encoder I’ll be using the <code class="language-plaintext highlighter-rouge">x264</code> command line program, though ffmpeg
could handle this part as well. Without any filters in the middle,
this example will just re-encode a video:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
    x264 -o output.mp4 /dev/stdin
</code></pre></div></div>

<p>The filter tools in the middle only need to read and write in the raw
image format. They’re a little bit like shaders, and they’re easy to
write. In this case, I’ll write a C program that simulates rolling
shutter. The filter could be written in any language that can read and
write binary data from standard input to standard output.</p>

<p><em>Update</em>: It appears that input PPM streams are a rather recent
feature of libavformat (a.k.a. lavf, used by <code class="language-plaintext highlighter-rouge">x264</code>). Support for PPM
input first appeared in libavformat 3.1 (released June 26th, 2016). If
you’re using an older version of libavformat, you’ll need to stick
<code class="language-plaintext highlighter-rouge">ppmtoy4m</code> in front of <code class="language-plaintext highlighter-rouge">x264</code> in the processing pipeline.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
    ppmtoy4m | \
    x264 -o output.mp4 /dev/stdin
</code></pre></div></div>

<h3 id="video-filtering-in-c">Video filtering in C</h3>

<p>In the past, my go-to for raw video data has been loose PPM frames and
YUV4MPEG streams (via <code class="language-plaintext highlighter-rouge">ppmtoy4m</code>). Fortunately, over the years a lot
of tools have gained the ability to manipulate streams of PPM images,
which is a much more convenient format. Despite being raw video data,
YUV4MPEG is still a fairly complex format with lots of options and
annoying colorspace concerns. <a href="http://netpbm.sourceforge.net/doc/ppm.html">PPM is simple RGB</a> without
complications. The header is just text:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P6
&lt;width&gt; &lt;height&gt;
&lt;maxdepth&gt;
&lt;width * height * 3 binary RGB data&gt;
</code></pre></div></div>

<p>The maximum depth is virtually always 255. A smaller value reduces the
image’s dynamic range without reducing the size. A larger value involves
byte-order issues (endian). For video frame data, the file will
typically look like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P6
1920 1080
255
&lt;frame RGB&gt;
</code></pre></div></div>

<p>Unfortunately the format is actually a little more flexible than this.
Except for the newline (LF, 0x0A) after the maximum depth, the
whitespace is arbitrary and comments starting with <code class="language-plaintext highlighter-rouge">#</code> are permitted.
Since the tools I’m using won’t produce comments, I’m going to ignore
that detail. I’ll also assume the maximum depth is always 255.</p>

<p>Here’s the structure I used to represent a PPM image, just one frame
of video. I’m using a <em>flexible array member</em> to pack the data at the
end of the structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">frame</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">width</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">height</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">data</span><span class="p">[];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Next a function to allocate a frame:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span>
<span class="nf">frame_create</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">width</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">height</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">+</span> <span class="n">width</span> <span class="o">*</span> <span class="n">height</span> <span class="o">*</span> <span class="mi">3</span><span class="p">);</span>
    <span class="n">f</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">=</span> <span class="n">width</span><span class="p">;</span>
    <span class="n">f</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">=</span> <span class="n">height</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
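<p>As written, <code class="language-plaintext highlighter-rouge">frame_create()</code> trusts its arguments: the size computation
can wrap around, and the result of <code class="language-plaintext highlighter-rouge">malloc()</code> is used unchecked. That’s fine
for a filter fed by a trusted decoder. For untrusted input, a more
defensive variant might look like the following sketch. The name
<code class="language-plaintext highlighter-rouge">frame_create_checked()</code> is hypothetical, and callers would need to handle
the null return.</p>

```c
#include <stdint.h>
#include <stdlib.h>

struct frame {
    size_t width;
    size_t height;
    unsigned char data[];
};

/* Hypothetical variant of frame_create(): returns a null pointer
 * instead of allocating when the size computation would wrap. */
static struct frame *
frame_create_checked(size_t width, size_t height)
{
    if (width && height > SIZE_MAX / width)
        return 0;  /* width * height wraps */
    size_t pixels = width * height;
    if (pixels > (SIZE_MAX - sizeof(struct frame)) / 3)
        return 0;  /* header + pixels * 3 wraps */
    struct frame *f = malloc(sizeof(*f) + pixels * 3);
    if (f) {
        f->width = width;
        f->height = height;
    }
    return f;
}
```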

<p>We’ll need a way to write the frames we’ve created.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">frame_write</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"P6</span><span class="se">\n</span><span class="s">%zu %zu</span><span class="se">\n</span><span class="s">255</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">width</span><span class="p">,</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">height</span><span class="p">);</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">f</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">,</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">*</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">height</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, a function to read a frame, reusing an existing buffer if
possible. The most complex part of the whole program is just parsing
the PPM header. The <code class="language-plaintext highlighter-rouge">%*c</code> in the <code class="language-plaintext highlighter-rouge">scanf()</code> specifically consumes the
line feed immediately following the maximum depth.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span>
<span class="nf">frame_read</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">scanf</span><span class="p">(</span><span class="s">"P6 %zu%zu%*d%*c"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">width</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">height</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">f</span> <span class="o">||</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">!=</span> <span class="n">width</span> <span class="o">||</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">!=</span> <span class="n">height</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
        <span class="n">f</span> <span class="o">=</span> <span class="n">frame_create</span><span class="p">(</span><span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">fread</span><span class="p">(</span><span class="n">f</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">,</span> <span class="n">width</span> <span class="o">*</span> <span class="n">height</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">stdin</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
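<p>To see what that format string consumes, here’s a small sketch that runs
the same conversions over a string with <code class="language-plaintext highlighter-rouge">sscanf()</code> and uses <code class="language-plaintext highlighter-rouge">%n</code> to
report where the pixel data would begin. The helper name
<code class="language-plaintext highlighter-rouge">ppm_header_length()</code> is made up for illustration and isn’t part of the
filter.</p>

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: parse a PPM header held in a string using the
 * same conversions as frame_read(). Returns the offset of the first
 * byte of pixel data, or -1 if the header does not match. */
static int
ppm_header_length(const char *buf, size_t *width, size_t *height)
{
    int n = -1;  /* left untouched if matching stops before %n */
    if (sscanf(buf, "P6 %zu%zu%*d%*c%n", width, height, &n) < 2)
        return -1;
    return n;
}
```

<p>Because <code class="language-plaintext highlighter-rouge">%*c</code> reads exactly one character, a first pixel byte that
happens to look like whitespace is left alone. A trailing space in the
format would instead swallow it.</p>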

<p>Since this program will only be part of a pipeline, I’m not worried
about checking the results of <code class="language-plaintext highlighter-rouge">fwrite()</code> and <code class="language-plaintext highlighter-rouge">fread()</code>. The process
will be killed by the shell if something goes wrong with the pipes.
However, if we’re out of video data and get an EOF, <code class="language-plaintext highlighter-rouge">scanf()</code> will
fail, indicating the EOF, which is normal and can be handled cleanly.</p>

<h4 id="an-identity-filter">An identity filter</h4>

<p>That’s all the infrastructure we need to build an identity filter that
passes frames through unchanged:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">frame</span> <span class="o">=</span> <span class="n">frame_read</span><span class="p">(</span><span class="n">frame</span><span class="p">)))</span>
        <span class="n">frame_write</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Processing a frame is just a matter of adding some stuff to the body of
the <code class="language-plaintext highlighter-rouge">while</code> loop.</p>
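<p>As a quick illustration of such “stuff”, here’s a sketch of a filter
step that produces a negative image. The function name
<code class="language-plaintext highlighter-rouge">frame_invert()</code> is hypothetical, and it assumes the maximum depth of 255
used throughout this article.</p>

```c
#include <stddef.h>

struct frame {
    size_t width;
    size_t height;
    unsigned char data[];
};

/* Hypothetical filter: invert every color channel in place. With a
 * maximum depth of 255, the inverse of a sample v is 255 - v. */
static void
frame_invert(struct frame *f)
{
    size_t n = f->width * f->height * 3;
    for (size_t i = 0; i < n; i++)
        f->data[i] = 255 - f->data[i];
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">frame_invert(frame)</code> just before <code class="language-plaintext highlighter-rouge">frame_write(frame)</code> in the
loop would turn the identity filter into a negative filter.</p>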

<h4 id="a-rolling-shutter-filter">A rolling shutter filter</h4>

<p>For the rolling shutter filter, in addition to the input frame we need
an image to hold the result of the rolling shutter. Each input frame
will be copied into the rolling shutter frame, but a little less will be
copied from each frame, locking a little bit more of the image in place.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">shutter_step</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">shutter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span> <span class="o">=</span> <span class="n">frame_read</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">f</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="cm">/* no input frames at all */</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">out</span> <span class="o">=</span> <span class="n">frame_create</span><span class="p">(</span><span class="n">f</span><span class="o">-&gt;</span><span class="n">width</span><span class="p">,</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">height</span><span class="p">);</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">shutter</span> <span class="o">&lt;</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">f</span> <span class="o">=</span> <span class="n">frame_read</span><span class="p">(</span><span class="n">f</span><span class="p">)))</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">offset</span> <span class="o">=</span> <span class="n">shutter</span> <span class="o">*</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">*</span> <span class="mi">3</span><span class="p">;</span>
        <span class="kt">size_t</span> <span class="n">length</span> <span class="o">=</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">*</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">-</span> <span class="n">offset</span><span class="p">;</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">out</span><span class="o">-&gt;</span><span class="n">data</span> <span class="o">+</span> <span class="n">offset</span><span class="p">,</span> <span class="n">f</span><span class="o">-&gt;</span><span class="n">data</span> <span class="o">+</span> <span class="n">offset</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span>
        <span class="n">frame_write</span><span class="p">(</span><span class="n">out</span><span class="p">);</span>
        <span class="n">shutter</span> <span class="o">+=</span> <span class="n">shutter_step</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">free</span><span class="p">(</span><span class="n">out</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">shutter_step</code> controls how many rows are captured per frame of
video. Generally capturing one row per frame is too slow for the
simulation. For a 1080p video, that’s 1,080 frames for the entire
simulation: 18 seconds at 60 FPS or 36 seconds at 30 FPS. If this
program were to accept command line arguments, controlling the shutter
rate would be one of the options.</p>
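<p>A sketch of what that option parsing could look like, assuming the rate
arrives as a single decimal argument. The helper name
<code class="language-plaintext highlighter-rouge">parse_shutter_step()</code> and the fall-back-on-bad-input behavior are
illustrative choices, not something the program above does.</p>

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a rows-per-frame argument. Returns
 * fallback unless arg is a plain positive decimal integer that fits
 * in an int. */
static int
parse_shutter_step(const char *arg, int fallback)
{
    if (!arg || *arg < '0' || *arg > '9')
        return fallback;  /* rejects empty strings, signs, and spaces */
    char *end;
    errno = 0;
    unsigned long n = strtoul(arg, &end, 10);
    if (errno || *end || n == 0 || n > INT_MAX)
        return fallback;
    return (int)n;
}
```

<p>In <code class="language-plaintext highlighter-rouge">main()</code>, given the usual <code class="language-plaintext highlighter-rouge">argc</code> and <code class="language-plaintext highlighter-rouge">argv</code> parameters, the
initializer for <code class="language-plaintext highlighter-rouge">shutter_step</code> could then become
<code class="language-plaintext highlighter-rouge">parse_shutter_step(argc &gt; 1 ? argv[1] : 0, 3)</code>.</p>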

<p>Putting it all together:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
    ./rolling-shutter | \
    x264 -o output.mp4 /dev/stdin
</code></pre></div></div>

<p>Here are some of the results for different shutter rates: 1, 3, 5, 8,
10, and 15 rows per frame. Feel free to right-click and “View Video”
to see the full resolution video.</p>

<div class="grid">
<video src="https://nullprogram.s3.amazonaws.com/rolling-shutter/rolling-shutter-1.mp4" width="300" height="300" controls="controls">
</video>
<video src="https://nullprogram.s3.amazonaws.com/rolling-shutter/rolling-shutter-3.mp4" width="300" height="300" controls="controls">
</video>
<video src="https://nullprogram.s3.amazonaws.com/rolling-shutter/rolling-shutter-5.mp4" width="300" height="300" controls="controls">
</video>
<video src="https://nullprogram.s3.amazonaws.com/rolling-shutter/rolling-shutter-8.mp4" width="300" height="300" controls="controls">
</video>
<video src="https://nullprogram.s3.amazonaws.com/rolling-shutter/rolling-shutter-10.mp4" width="300" height="300" controls="controls">
</video>
<video src="https://nullprogram.s3.amazonaws.com/rolling-shutter/rolling-shutter-15.mp4" width="300" height="300" controls="controls">
</video>
</div>

<h3 id="source-and-original-input">Source and original input</h3>

<p>This post contains the full source in parts, but here it is all together:</p>

<ul>
  <li><a href="/download/rshutter.c" class="download">rshutter.c</a></li>
</ul>

<p>Here’s the original video, filmed by my wife using her Nikon D5500, in
case you want to try it for yourself:</p>

<video src="https://nullprogram.s3.amazonaws.com/rolling-shutter/original.mp4" width="300" height="300" controls="controls">
</video>

<p>It took much longer to figure out the string-pulling contraption to
slowly spin the fan at a constant rate than it took to write the C
filter program.</p>

<h3 id="followup-links">Followup Links</h3>

<p>On Hacker News, <a href="https://news.ycombinator.com/item?id=14684793">morecoffee shared a video of the second order
effect</a> (<a href="http://antidom.com/fan.webm">direct link</a>), where the rolling shutter
speed changes over time.</p>

<p>A deeper analysis of rolling shutter: <a href="http://danielwalsh.tumblr.com/post/54400376441/playing-detective-with-rolling-shutter-photos"><em>Playing detective with rolling
shutter photos</em></a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Stack Clashing for Fun and Profit</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/06/21/"/>
    <id>urn:uuid:43402771-3340-3dff-c18f-7110caeedb7d</id>
    <updated>2017-06-21T05:28:56Z</updated>
    <category term="c"/><category term="posix"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p><em>Stack clashing</em> has been in the news lately due to <a href="https://blog.qualys.com/securitylabs/2017/06/19/the-stack-clash">some recently
discovered vulnerabilities</a> along with proof-of-concept
exploits. As the announcement itself notes, this is not a new issue,
though this appears to be the first time it’s been given this
particular name. I do know of one “good” use of stack clashing, where
it’s used for something productive rather than as part of an attack. In this
article I’ll explain how it works.</p>

<p>You can find the complete code for this article here, ready to run:</p>

<ul>
  <li><a href="https://github.com/skeeto/stack-clash-coroutine">https://github.com/skeeto/stack-clash-coroutine</a></li>
</ul>

<p>But first, what is a stack clash? Here’s a rough picture of the
typical way process memory is laid out. The stack starts at a high
memory address and grows downwards. Code and static data sit at low
memory, with a <code class="language-plaintext highlighter-rouge">brk</code> pointer growing upward to make small allocations.
In the middle is the heap, where large allocations and memory mappings
take place.</p>

<p><img src="/img/diagram/process-memory.svg" alt="" /></p>

<p>Below the stack is a slim <em>guard page</em> that divides the stack and the
region of memory reserved for the heap. Reading or writing to that
memory will trap, causing the program to crash or some special action
to be taken. The goal is to prevent the stack from growing into the
heap, which could cause all sorts of trouble, like security issues.</p>

<p>The problem is that this thin guard page isn’t enough. It’s possible to
put a large allocation on the stack, never read or write to it, and
completely skip over the guard page, such that the heap and stack
overlap without detection.</p>

<p>Once this happens, writes into the heap will change memory on the
stack and vice versa. If an attacker can cause the program to make
such a large allocation on the stack, then legitimate writes into
memory on the heap can manipulate local variables or <a href="/blog/2017/01/21/">return pointers,
changing the program’s control flow</a>. This can bypass buffer
overflow protections, such as stack canaries.</p>
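
<p>To make the arithmetic concrete, here’s a toy model of that skip. It isn’t taken from the advisory or from the repository above: the <code class="language-plaintext highlighter-rouge">layout</code> structure, the <code class="language-plaintext highlighter-rouge">classify()</code> function, and every address are invented for illustration. A real stack pointer is moved by compiler-generated code, not by a function like this.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: the addresses are made-up numbers, not values read
 * from a real process. */
enum landing { IN_STACK, IN_GUARD, PAST_GUARD };

struct layout {
    uintptr_t stack_low;   /* lowest valid stack address */
    uintptr_t guard_size;  /* bytes of guard directly below stack_low */
};

/* Where does the stack pointer land after reserving `alloc` bytes? */
static enum landing
classify(const struct layout *m, uintptr_t sp, uintptr_t alloc)
{
    assert(sp >= m->stack_low);     /* sp starts inside the stack */
    assert(alloc <= sp);            /* no wraparound in the toy model */
    uintptr_t new_sp = sp - alloc;  /* stacks grow downward */
    if (new_sp >= m->stack_low)
        return IN_STACK;
    if (new_sp >= m->stack_low - m->guard_size)
        return IN_GUARD;            /* touching memory here would trap */
    return PAST_GUARD;              /* jumped clean over the guard */
}
```

<p>With a 4096-byte guard, a small allocation stays in the stack, a slightly larger one lands in the guard and would trap on first touch, and a sufficiently large one ends up below the guard without ever touching it.</p>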

<h3 id="binary-trees-and-coroutines">Binary trees and coroutines</h3>

<p><img src="/img/diagram/binary-search-tree.svg" alt="" /></p>

<p>Now, I’m going to abruptly change topics to discuss binary search
trees. We’ll get back to stack clash in a bit. Suppose we have a
binary tree which we would like to iterate depth-first. For this
demonstration, here’s the C interface to the binary tree.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">left</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">right</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">value</span><span class="p">;</span>
<span class="p">};</span>

<span class="kt">void</span>  <span class="nf">tree_insert</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">**</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">);</span>
<span class="kt">char</span> <span class="o">*</span><span class="nf">tree_find</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">tree_visit</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">));</span>
<span class="kt">void</span>  <span class="nf">tree_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>An empty tree is the NULL pointer, hence the double-pointer for
insert. In the demonstration it’s an unbalanced search tree, but this
could very well be a balanced search tree with the addition of another
field on the structure.</p>
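
<p>The article doesn’t list the insert and find functions, so here’s a minimal sketch of how they could look for the unbalanced case. It assumes keys are ordered with <code class="language-plaintext highlighter-rouge">strcmp()</code>, that the strings are borrowed and not copied, and that inserting an existing key replaces its value. The code in the repository may differ on all three points.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct tree {
    struct tree *left;
    struct tree *right;
    char *key;
    char *value;
};

/* Assumptions: strcmp() ordering, borrowed strings, replace on duplicate. */
void
tree_insert(struct tree **t, char *k, char *v)
{
    while (*t) {
        int cmp = strcmp(k, (*t)->key);
        if (cmp == 0) {
            (*t)->value = v;   /* existing key: replace the value */
            return;
        }
        t = cmp < 0 ? &(*t)->left : &(*t)->right;
    }
    /* *t is now the NULL slot where the node belongs. For an empty
     * tree that slot is the caller's own root pointer, which is why
     * the function takes a double pointer. */
    struct tree *node = malloc(sizeof(*node));
    assert(node);              /* sketch: no recovery from failure */
    node->left = node->right = 0;
    node->key = k;
    node->value = v;
    *t = node;
}

char *
tree_find(struct tree *t, char *k)
{
    while (t) {
        int cmp = strcmp(k, t->key);
        if (cmp == 0)
            return t->value;
        t = cmp < 0 ? t->left : t->right;
    }
    return 0;   /* not found, including the empty tree */
}
```

<p>A caller starts with <code class="language-plaintext highlighter-rouge">struct tree *root = 0;</code> and passes <code class="language-plaintext highlighter-rouge">&amp;root</code> to every insert.</p>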

<p>For the traversal, first visit the root node, then traverse its left
tree, and finally traverse its right tree. It makes for a simple,
recursive definition — the sort of thing you’d teach a beginner.
Here’s a definition that accepts a callback, which the caller will use
to <em>visit</em> each key/value in the tree. This really is as simple as it
gets.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">tree_visit</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">))</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">f</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">);</span>
        <span class="n">tree_visit</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
        <span class="n">tree_visit</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately this isn’t so convenient for the caller, who has to
split off a callback function that <a href="/blog/2017/01/08/">lacks context</a>, then hand
over control to the traversal function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">printer</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%s = %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">print_tree</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">tree</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">tree_visit</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">printer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
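
<p>A common workaround for the missing context is an extra <code class="language-plaintext highlighter-rouge">void *</code> parameter that the traversal hands back to the callback untouched. The sketch below is a hypothetical variant, not part of the interface above: <code class="language-plaintext highlighter-rouge">tree_visit_ctx()</code>, <code class="language-plaintext highlighter-rouge">counter()</code>, and <code class="language-plaintext highlighter-rouge">tree_count()</code> are names made up for this example.</p>

```c
#include <assert.h>
#include <stddef.h>

struct tree {
    struct tree *left;
    struct tree *right;
    char *key;
    char *value;
};

/* Hypothetical variant of tree_visit() with a pass-through pointer. */
void
tree_visit_ctx(struct tree *t, void (*f)(char *, char *, void *), void *ctx)
{
    if (t) {
        f(t->key, t->value, ctx);
        tree_visit_ctx(t->left, f, ctx);
        tree_visit_ctx(t->right, f, ctx);
    }
}

/* Example callback: count visited nodes through the context pointer. */
static void
counter(char *k, char *v, void *ctx)
{
    (void)k;
    (void)v;
    assert(ctx);
    ++*(size_t *)ctx;
}

size_t
tree_count(struct tree *t)
{
    size_t n = 0;
    tree_visit_ctx(t, counter, &n);
    return n;
}
```

<p>This gets state in and out of the callback, but the caller still has to split its loop body into a separate function and give up control for the whole traversal.</p>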

<p>Usually it’s much nicer for the caller if instead it’s provided an
<em>iterator</em>, which the caller can invoke at will. Here’s an interface
for it, just two functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">int</span>             <span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">);</span>
</code></pre></div></div>

<p>The first constructs an iterator object, and the second one visits a
key/value pair each time it’s called. It returns 0 when traversal is
complete, automatically freeing any resources associated with the
iterator.</p>

<p>The caller now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">,</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">tree_iterator</span><span class="p">(</span><span class="n">tree</span><span class="p">);</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">tree_next</span><span class="p">(</span><span class="n">it</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">k</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">v</span><span class="p">))</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%s = %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">);</span>
</code></pre></div></div>

<p>Notice I haven’t defined <code class="language-plaintext highlighter-rouge">struct tree_it</code>. That’s because I’ve got
four different implementations, each taking a different approach. The
last one will use stack clashing.</p>

<h4 id="manual-state-tracking">Manual State Tracking</h4>

<p>With just the standard facilities provided by C, there’s some manual
bookkeeping that has to take place in order to convert the recursive
definition into an iterator. Depth-first traversal is a stack-oriented
process, and with recursion the stack is implicit in the call stack.
As an iterator, the traversal stack needs to be <a href="/blog/2016/11/13/">managed
explicitly</a>. The iterator needs to keep track of the path it
took so that it can backtrack, which means keeping track of parent
nodes as well as which branch was taken.</p>

<p>Here’s my little implementation, which, to keep things simple, has a
hard depth limit of 32. Its structure definition includes a stack of
node pointers, and 2 bits of information per visited node, stored
across a 64-bit integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">stack</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">state</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">nstack</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span>
<span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="p">));</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
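    <span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span> <span class="o">=</span> <span class="n">t</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>  <span class="cm">/* empty tree: nothing to visit */</span>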
    <span class="k">return</span> <span class="n">it</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The 2 bits track three different states for each visited node:</p>

<ol>
  <li>Visit the current node</li>
  <li>Traverse the left tree</li>
  <li>Traverse the right tree</li>
</ol>

<p>It works out to the following. Don’t worry too much about trying to
understand how this works. My point is to demonstrate that converting
the recursive definition into an iterator complicates the
implementation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">3u</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="n">shift</span><span class="p">);</span>
        <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">+=</span> <span class="mi">1ull</span> <span class="o">&lt;&lt;</span> <span class="n">shift</span><span class="p">;</span>
        <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">case</span> <span class="mi">0</span><span class="p">:</span>
                <span class="o">*</span><span class="n">k</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
                <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">)</span> <span class="p">{</span>
                    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">;</span>
                    <span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="p">(</span><span class="mi">3ull</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">shift</span> <span class="o">+</span> <span class="mi">2</span><span class="p">));</span>
                <span class="p">}</span>
                <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">)</span> <span class="p">{</span>
                    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">[</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">;</span>
                    <span class="n">it</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="p">(</span><span class="mi">3ull</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">shift</span> <span class="o">+</span> <span class="mi">2</span><span class="p">));</span>
                <span class="p">}</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
                <span class="n">it</span><span class="o">-&gt;</span><span class="n">nstack</span><span class="o">--</span><span class="p">;</span>
                <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
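
<p>The bit twiddling is easier to follow in isolation. Here’s a small sketch of the same 2-bits-per-slot packing, pulled out into helper functions. The names <code class="language-plaintext highlighter-rouge">state_get()</code>, <code class="language-plaintext highlighter-rouge">state_advance()</code>, and <code class="language-plaintext highlighter-rouge">state_clear()</code> are invented for this example, and the iterator above inlines these operations instead of calling helpers.</p>

```c
#include <assert.h>

/* Read the 2-bit field for stack slot `depth` (0..31). */
static int
state_get(unsigned long long state, int depth)
{
    assert(depth >= 0 && depth < 32);
    return (int)(3u & (state >> (depth * 2)));
}

/* Advance the field by one. It must not already be 3, or the carry
 * would spill into the neighboring slot. */
static unsigned long long
state_advance(unsigned long long state, int depth)
{
    assert(state_get(state, depth) < 3);
    return state + (1ull << (depth * 2));
}

/* Reset the field to 0, as is done for a freshly pushed child. */
static unsigned long long
state_clear(unsigned long long state, int depth)
{
    assert(depth >= 0 && depth < 32);
    return state & ~(3ull << (depth * 2));
}
```

<p>Thirty-two slots of 2 bits each fill a 64-bit integer exactly, which is where the depth limit of 32 comes from.</p>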

<p>Wouldn’t it be nice to keep the recursive definition while also
getting an iterator? There’s an exact solution to that: coroutines.</p>

<h4 id="coroutines">Coroutines</h4>

<p>C doesn’t come with coroutines, but there are a number of libraries
available. We can also build our own coroutines. One way to do that is
with <em>user contexts</em> (<code class="language-plaintext highlighter-rouge">&lt;ucontext.h&gt;</code>) provided by the X/Open System
Interfaces Extension (XSI), an extension to POSIX. This set of
functions allows programs to create their own call stacks and switch
between them. That’s the key ingredient for coroutines. Caveat: These
functions aren’t widely available, were removed from the standard in
POSIX.1-2008, and probably shouldn’t be used in new code.</p>

<p>Here’s my iterator structure definition.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _XOPEN_SOURCE 600
#include</span> <span class="cpf">&lt;ucontext.h&gt;</span><span class="cp">
</span>
<span class="k">struct</span> <span class="n">tree_it</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
    <span class="n">ucontext_t</span> <span class="n">coroutine</span><span class="p">;</span>
    <span class="n">ucontext_t</span> <span class="n">yield</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>It needs one context for the original stack and one context for the
iterator’s stack. Each time the iterator is invoked, the program
will switch to the other stack, find the next value, then switch back.
This process is called <em>yielding</em>. Values are passed between contexts
using the <code class="language-plaintext highlighter-rouge">k</code> (key) and <code class="language-plaintext highlighter-rouge">v</code> (value) fields on the iterator.</p>

<p>Before I get into initialization, here’s the actual traversal
coroutine. It’s nearly the same as the original recursive definition
except for the <code class="language-plaintext highlighter-rouge">swapcontext()</code>. This is the <em>yield</em>, pausing execution
and sending control back to the caller. The current context is saved
in the first argument, and the second argument becomes the current
context.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">coroutine</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="n">swapcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">);</span>
        <span class="n">coroutine</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
        <span class="n">coroutine</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>While the actual traversal is simple again, initialization is more
complicated. The first problem is that there’s no way to pass pointer
arguments to the coroutine. Technically only <code class="language-plaintext highlighter-rouge">int</code> arguments are
permitted. (All the online tutorials get this wrong.) To work around
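this problem, I smuggle the arguments in as global variables. This
would cause problems should two different threads try to create
iterators at the same time, even on different trees. Since the globals
aren’t read until the coroutine first runs, creating a second iterator
before advancing the first has the same problem.</p>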

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">tree_arg</span><span class="p">;</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">tree_it_arg</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">coroutine_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">coroutine</span><span class="p">(</span><span class="n">tree_arg</span><span class="p">,</span> <span class="n">tree_it_arg</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
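
<p>For contrast, here’s a sketch of the argument passing that <code class="language-plaintext highlighter-rouge">makecontext()</code> does specify: a count followed by that many <code class="language-plaintext highlighter-rouge">int</code> values. Everything in it (<code class="language-plaintext highlighter-rouge">entry()</code>, <code class="language-plaintext highlighter-rouge">run_with_ints()</code>, the 64 KiB stack size) is made up for the example and isn’t part of the iterator.</p>

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static int result;

/* Entry point taking two int arguments. */
static void
entry(int a, int b)
{
    result = a * 100 + b;
    /* Returning resumes co_ctx.uc_link, which is main_ctx. */
}

int
run_with_ints(int a, int b)
{
    size_t size = 64 * 1024;   /* assumed stack size for this sketch */
    void *stack = malloc(size);
    assert(stack);
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = size;
    co_ctx.uc_link = &main_ctx;
    /* makecontext() takes a void (*)(void), hence the cast. The 2 is
     * the number of int arguments that follow. */
    makecontext(&co_ctx, (void (*)(void))entry, 2, a, b);
    swapcontext(&main_ctx, &co_ctx);   /* run entry() to completion */
    free(stack);
    return result;
}
```

<p>Two <code class="language-plaintext highlighter-rouge">int</code> values aren’t guaranteed to hold a pointer, which is why the iterator falls back on globals.</p>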

<p>The stack has to be allocated manually, which I do with a call to
<code class="language-plaintext highlighter-rouge">malloc()</code>. Nothing <a href="/blog/2015/05/15/">fancy is needed</a>, though this means the new
stack won’t have a guard page. For the stack size, I use the suggested
value of <code class="language-plaintext highlighter-rouge">SIGSTKSZ</code>. The <code class="language-plaintext highlighter-rouge">makecontext()</code> function is what creates the
new context from scratch, but the new context must first be
initialized with <code class="language-plaintext highlighter-rouge">getcontext()</code>, even though that particular snapshot
won’t actually be used.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span>
<span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="p">));</span>
    <span class="cm">/* getcontext() first: it isn't guaranteed to preserve the fields set below */</span>
    <span class="n">getcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">);</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">SIGSTKSZ</span><span class="p">);</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_size</span> <span class="o">=</span> <span class="n">SIGSTKSZ</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">.</span><span class="n">uc_link</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">;</span>
    <span class="n">makecontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">,</span> <span class="n">coroutine_init</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">tree_arg</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">tree_it_arg</span> <span class="o">=</span> <span class="n">it</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">it</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice I gave it a function pointer, a lot like I’m starting a new
thread. This is no coincidence. There’s a lot of similarity between
coroutines and multiple threads, as you’ll soon see.</p>

<p>Finally the iterator function itself. Since NULL isn’t a valid key, it
initializes the key to NULL before yielding to the iterator context.
If the iterator has no more nodes to visit, it doesn’t set the key,
which can be detected when control returns.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">swapcontext</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">k</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">;</span>
        <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">.</span><span class="n">uc_stack</span><span class="p">.</span><span class="n">ss_sp</span><span class="p">);</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That’s all it takes to create and operate a coroutine in C, provided
you’re on a system with these XSI extensions.</p>
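
<p>Stripped of the tree, the same machinery fits in a few dozen lines. The following is a sketch of a hypothetical generator that yields 1, 2, 3 and then finishes. The <code class="language-plaintext highlighter-rouge">gen</code> names, the <code class="language-plaintext highlighter-rouge">done</code> flag, and the 64 KiB stack size are choices made for this example only, and it has only been reasoned through against glibc’s behavior.</p>

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

struct gen {
    int value;
    int done;
    ucontext_t coroutine;
    ucontext_t yield;
};

static struct gen *gen_arg;   /* smuggled argument, not thread-safe */

static void
gen_body(void)
{
    struct gen *g = gen_arg;
    for (int i = 1; i <= 3; i++) {
        g->value = i;
        swapcontext(&g->coroutine, &g->yield);   /* yield to the caller */
    }
    g->done = 1;
    /* Returning resumes uc_link, which is g->yield. */
}

struct gen *
gen_create(void)
{
    size_t size = 64 * 1024;   /* assumed stack size for this sketch */
    struct gen *g = malloc(sizeof(*g));
    assert(g);
    getcontext(&g->coroutine);
    g->coroutine.uc_stack.ss_sp = malloc(size);
    assert(g->coroutine.uc_stack.ss_sp);
    g->coroutine.uc_stack.ss_size = size;
    g->coroutine.uc_link = &g->yield;
    makecontext(&g->coroutine, gen_body, 0);
    g->done = 0;
    return g;
}

/* Returns 1 and stores the next value, or returns 0 and frees g. */
int
gen_next(struct gen *g, int *out)
{
    gen_arg = g;   /* only read the first time the coroutine runs */
    swapcontext(&g->yield, &g->coroutine);
    if (!g->done) {
        *out = g->value;
        return 1;
    }
    free(g->coroutine.uc_stack.ss_sp);
    free(g);
    return 0;
}
```

<p>Setting the global in <code class="language-plaintext highlighter-rouge">gen_next()</code> instead of at creation means several generators can be created before any is advanced, though it’s still unsafe across threads. The explicit <code class="language-plaintext highlighter-rouge">done</code> flag plays the role of the NULL key check above.</p>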

<h4 id="semaphores">Semaphores</h4>

<p>Instead of a coroutine, we could just use actual threads and a couple
of semaphores to synchronize them. This is a heavy implementation and
also probably shouldn’t be used in practice, but at least it relies only
on standard POSIX threads and semaphores.</p>

<p>Here’s the structure definition:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
    <span class="n">sem_t</span> <span class="n">visitor</span><span class="p">;</span>
    <span class="n">sem_t</span> <span class="n">main</span><span class="p">;</span>
    <span class="n">pthread_t</span> <span class="kr">thread</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The main thread will wait on one semaphore and the iterator thread
will wait on the other. This <a href="/blog/2017/02/14/">should sound very familiar</a>.</p>

<p>The actual traversal function looks the same, but with <code class="language-plaintext highlighter-rouge">sem_post()</code>
and <code class="language-plaintext highlighter-rouge">sem_wait()</code> as the yield.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">visit</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">);</span>
        <span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">);</span>
        <span class="n">visit</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
        <span class="n">visit</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

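<p>Again there’s a separate function to get things started, this time
the thread’s entry point. It waits for the first request, runs the
traversal, then wakes the main thread one final time with <code class="language-plaintext highlighter-rouge">it-&gt;k</code>
still null to signal the end.</p>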

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">thread_entrance</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">arg</span><span class="p">;</span>
    <span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">);</span>
    <span class="n">visit</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">t</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Creating the iterator only requires initializing the semaphores and
creating the thread:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span>
<span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="p">));</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">sem_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">sem_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">pthread_create</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="kr">thread</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">thread_entrance</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">it</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The iterator function looks just like the coroutine version.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">);</span>
    <span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">k</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">;</span>
        <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">pthread_join</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="kr">thread</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">sem_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">main</span><span class="p">);</span>
        <span class="n">sem_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">visitor</span><span class="p">);</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Overall, this is almost identical to the coroutine version.</p>
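
<p>To see the handoff in isolation, here’s a minimal sketch of the same
two-semaphore pattern driving a trivial counter instead of a tree
traversal. The names <code class="language-plaintext highlighter-rouge">counter_it</code>, <code class="language-plaintext highlighter-rouge">counter_start()</code>, and
<code class="language-plaintext highlighter-rouge">counter_next()</code> are hypothetical, invented for this illustration, and
like the code above it ignores errors from the semaphore calls.</p>

```c
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical generator that yields 1..limit, one value per call. */
struct counter_it {
    int value;          /* last value produced, 0 when exhausted */
    int limit;
    sem_t producer;     /* the generator thread sleeps on this */
    sem_t consumer;     /* the caller sleeps on this */
    pthread_t thread;
};

static void *
counter_entrance(void *arg)
{
    struct counter_it *it = arg;
    sem_wait(&it->producer);            /* wait for the first request */
    for (int i = 1; i <= it->limit; i++) {
        it->value = i;
        sem_post(&it->consumer);        /* "yield" the value */
        sem_wait(&it->producer);        /* sleep until asked again */
    }
    sem_post(&it->consumer);            /* final wake-up, value still 0 */
    return 0;
}

/* Returns 0 on success, -1 if the thread could not be created. */
int
counter_start(struct counter_it *it, int limit)
{
    it->value = 0;
    it->limit = limit;
    sem_init(&it->producer, 0, 0);
    sem_init(&it->consumer, 0, 0);
    if (pthread_create(&it->thread, 0, counter_entrance, it)) {
        sem_destroy(&it->producer);
        sem_destroy(&it->consumer);
        return -1;
    }
    return 0;
}

/* Returns the next value, or 0 once the sequence is exhausted.
 * Must not be called again after it has returned 0. */
int
counter_next(struct counter_it *it)
{
    it->value = 0;
    sem_post(&it->producer);            /* wake the generator */
    sem_wait(&it->consumer);            /* sleep until it yields */
    if (!it->value) {
        pthread_join(it->thread, 0);
        sem_destroy(&it->producer);
        sem_destroy(&it->consumer);
    }
    return it->value;
}
```

<p>Only one of the two threads is ever runnable at a time, which is what
makes this behave like a coroutine rather than ordinary concurrency.
Build with <code class="language-plaintext highlighter-rouge">-pthread</code>.</p>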

<h4 id="coroutines-using-stack-clashing">Coroutines using stack clashing</h4>

<p>Finally I can tie this back into the topic at hand. Without either XSI
extensions or Pthreads, we can (usually) create coroutines by abusing
<code class="language-plaintext highlighter-rouge">setjmp()</code> and <code class="language-plaintext highlighter-rouge">longjmp()</code>. Technically this violates two of C’s
rules and relies on undefined behavior, but it generally works. This
<a href="http://fanf.livejournal.com/105413.html">is not my own invention</a>, and it dates back to at least 2010.</p>

<p>From the very beginning, C has provided a crude “exception” mechanism
that allows the stack to be abruptly unwound back to a previous state.
It’s a sort of non-local goto. Call <code class="language-plaintext highlighter-rouge">setjmp()</code> to capture an opaque
<code class="language-plaintext highlighter-rouge">jmp_buf</code> object to be used in the future. This function returns 0
the first time. Hand that object to <code class="language-plaintext highlighter-rouge">longjmp()</code> later, even in a
different function, and <code class="language-plaintext highlighter-rouge">setjmp()</code> will return again, this time with a
non-zero value.</p>
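
<p>As a small illustration of the conventional, well-defined use, here’s
a sketch of an error bail-out from a nested call. The names <code class="language-plaintext highlighter-rouge">on_error</code>,
<code class="language-plaintext highlighter-rouge">parse_digit()</code>, and <code class="language-plaintext highlighter-rouge">digit_sum()</code> are hypothetical, made up for this
example.</p>

```c
#include <setjmp.h>

static jmp_buf on_error;    /* hypothetical name for the saved jump target */

/* Deeper in the call chain: bail out straight to the setjmp() site. */
static void
parse_digit(int c)
{
    if (c < '0' || c > '9')
        longjmp(on_error, 1);   /* never returns */
}

/* Returns the sum of the decimal digits in s, or -1 on a non-digit. */
int
digit_sum(const char *s)
{
    int sum = 0;
    if (setjmp(on_error))
        return -1;              /* second return, by way of longjmp() */
    for (; *s; s++) {
        parse_digit(*s);
        sum += *s - '0';
    }
    return sum;
}
```

<p>The jump only ever goes <em>up</em> the stack, to a frame that is still
live, which is the one case the standard blesses. Note that no local
variable modified after <code class="language-plaintext highlighter-rouge">setjmp()</code> is read after the jump, since such
values would be indeterminate unless declared <code class="language-plaintext highlighter-rouge">volatile</code>.</p>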

<p>It’s technically unsuitable for coroutines because the jump is a
one-way trip. The unwound stack invalidates any <code class="language-plaintext highlighter-rouge">jmp_buf</code> that was
created after the target of the jump. In practice, though, you can
still use these jumps, which is one rule being broken.</p>

<p>That’s where stack clashing comes into play. In order for it to be a
proper coroutine, it needs to have its own stack. But how can we do
that with these primitive C utilities? <strong>Extend the stack to overlap
the heap, call <code class="language-plaintext highlighter-rouge">setjmp()</code> to capture a coroutine on it, then return.</strong>
Generally we can get away with using <code class="language-plaintext highlighter-rouge">longjmp()</code> to return to this
heap-allocated stack.</p>

<p>Here’s my iterator definition for this one. Like the XSI context
struct, this has two <code class="language-plaintext highlighter-rouge">jmp_buf</code> “contexts.” The <code class="language-plaintext highlighter-rouge">stack</code> holds the
iterator’s stack buffer so that it can be freed, and the <code class="language-plaintext highlighter-rouge">gap</code> field
will be used to prevent the optimizer from spoiling our plans.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">k</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">stack</span><span class="p">;</span>
    <span class="k">volatile</span> <span class="kt">char</span> <span class="o">*</span><span class="n">gap</span><span class="p">;</span>
    <span class="kt">jmp_buf</span> <span class="n">coroutine</span><span class="p">;</span>
    <span class="kt">jmp_buf</span> <span class="n">yield</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The coroutine looks familiar again. This time the yield is performed
with <code class="language-plaintext highlighter-rouge">setjmp()</code> and <code class="language-plaintext highlighter-rouge">longjmp()</code>, just like <code class="language-plaintext highlighter-rouge">swapcontext()</code>. Remember
that <code class="language-plaintext highlighter-rouge">setjmp()</code> returns twice, hence the branch. The <code class="language-plaintext highlighter-rouge">longjmp()</code> never
returns.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">coroutine</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">;</span>
        <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">setjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">))</span>
            <span class="n">longjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="n">coroutine</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">left</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
        <span class="n">coroutine</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">right</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Next is the tricky part to cause the stack clash. First, allocate the
new stack with <code class="language-plaintext highlighter-rouge">malloc()</code> so that we can get its address. Then use a
local variable on the stack to determine how much the stack needs to
grow in order to overlap with the allocation. Taking the difference
between these pointers is illegal as far as the language is concerned,
making this the second rule I’m breaking. I can <a href="/blog/2017/05/03/">imagine an
implementation</a> where the stack and heap are in two separate
kinds of memory, and it would be meaningless to take the difference. I
don’t have to imagine very hard, because this is exactly how
it used to work on the 8086 with its <a href="https://en.wikipedia.org/wiki/X86_memory_segmentation">segmented memory
architecture</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span>
<span class="nf">tree_iterator</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="p">));</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">STACK_SIZE</span><span class="p">);</span>
    <span class="kt">char</span> <span class="n">marker</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">gap</span><span class="p">[</span><span class="o">&amp;</span><span class="n">marker</span> <span class="o">-</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span> <span class="o">-</span> <span class="n">STACK_SIZE</span><span class="p">];</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">gap</span> <span class="o">=</span> <span class="n">gap</span><span class="p">;</span> <span class="c1">// prevent optimization</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">setjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">))</span>
        <span class="n">coroutine_init</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">it</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">it</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’m using a variable-length array (VLA) named <code class="language-plaintext highlighter-rouge">gap</code> to indirectly
control the stack pointer, moving it over the heap. I’m assuming the
stack grows downward, since otherwise the sign would be wrong.</p>
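
<p>That assumption can be probed informally at run time. The sketch below
compares the addresses of locals in two nested frames after converting
them to <code class="language-plaintext highlighter-rouge">uintptr_t</code>, which avoids subtracting unrelated pointers but
is still implementation-defined. The function names are hypothetical,
and the result is only a hint, since a compiler is free to lay out
frames however it likes.</p>

```c
#include <stdint.h>

/* Is a local in this (callee) frame at a lower address than the
 * caller's local? */
static int
callee_is_lower(volatile char *caller_local)
{
    volatile char callee_local = 0;
    return (uintptr_t)&callee_local < (uintptr_t)caller_local;
}

/* Returns 1 if the stack appears to grow downward, 0 otherwise. */
int
stack_grows_down(void)
{
    volatile char caller_local = 0;
    /* Call through a volatile pointer to discourage inlining, which
     * would put both locals in the same frame. */
    int (*volatile probe)(volatile char *) = callee_is_lower;
    return probe(&caller_local);
}
```

<p>A check like this could guard the stack clash trick, refusing to
proceed when the answer is 0 rather than computing a negative VLA
length.</p>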

<p>The compiler is smart and will notice I’m not actually using <code class="language-plaintext highlighter-rouge">gap</code>,
and it’s happy to throw it away. In fact, it’s vitally important that
I <em>don’t</em> touch it since the guard page, along with a bunch of
unmapped memory, is actually somewhere in the middle of that array. I
only want the array for its side effect, but that side effect isn’t
officially supported, which means the optimizer doesn’t need to
consider it in its decisions. To inhibit the optimizer, I store the
array’s address where someone might potentially look at it, meaning
the array has to exist.</p>

<p>Finally, the iterator function looks just like the others, again.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">tree_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">tree_it</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">k</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">setjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">yield</span><span class="p">))</span>
        <span class="n">longjmp</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">coroutine</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">k</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">k</span><span class="p">;</span>
        <span class="o">*</span><span class="n">v</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">v</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">);</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And that’s it: a nasty hack using a stack clash to create a context
for a <code class="language-plaintext highlighter-rouge">setjmp()</code>+<code class="language-plaintext highlighter-rouge">longjmp()</code> coroutine.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Building and Installing Software in $HOME</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/06/19/"/>
    <id>urn:uuid:ae490550-a3b8-3b8f-4338-c2aba7306c8f</id>
    <updated>2017-06-19T02:34:39Z</updated>
    <category term="linux"/><category term="tutorial"/><category term="debian"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>For more than 5 years now I’ve kept a private “root” filesystem within
my home directory under <code class="language-plaintext highlighter-rouge">$HOME/.local/</code>. Within are the standard
<code class="language-plaintext highlighter-rouge">/usr</code> directories, such as <code class="language-plaintext highlighter-rouge">bin/</code>, <code class="language-plaintext highlighter-rouge">include/</code>, <code class="language-plaintext highlighter-rouge">lib/</code>, etc.,
containing my own software, libraries, and man pages. These are
first-class citizens, indistinguishable from the system-installed
programs and libraries. With one exception (setuid programs), none of
this requires root privileges.</p>

<p>Installing software in $HOME serves two important purposes, both of
which are indispensable to me on a regular basis.</p>

<ul>
  <li><strong>No root access</strong>: Sometimes I’m using a system administered by
someone else, and I don’t have root access.</li>
</ul>

<p>This prevents me from installing packaged software myself through the
system’s package manager. Building and installing the software myself in
my home directory, without involvement from the system administrator,
neatly works around this issue. As a software developer, it’s already
perfectly normal for me to build and run custom software, and this is
just an extension of that behavior.</p>

<p>In the most desperate situation, all I need from the sysadmin is a
decent C compiler and at least a minimal POSIX environment. I can
<a href="/blog/2016/11/17/">bootstrap anything I might need</a>, both libraries and
programs, including a better C compiler along the way. This is one
major strength of open source software.</p>

<p>I have noticed one alarming trend: Both GCC (since 4.8) and Clang are
written in C++, so it’s becoming less and less reasonable to bootstrap
a C++ compiler from a C compiler, or even from a C++ compiler that’s
more than a few years old. So you may also need your sysadmin to
supply a fairly recent C++ compiler if you want to bootstrap an
environment that includes C++. I’ve had to avoid some C++ software
(such as CMake) for this reason.</p>

<ul>
  <li><strong>Custom software builds</strong>: Even if I <em>am</em> root, I may still want to
install software not available through the package manager, a version
not available in the package manager, or a version with custom
patches.</li>
</ul>

<p>In theory this is what <code class="language-plaintext highlighter-rouge">/usr/local</code> is all about. It’s typically the
location for software not managed by the system’s package manager.
However, I think it’s cleaner to put this in <code class="language-plaintext highlighter-rouge">$HOME/.local</code>, so long
as other system users don’t need it.</p>

<p>For example, I have an installation of each version of Emacs from
24.3 (the oldest version worth supporting) through the latest stable
release, each suffixed with its version number, under <code class="language-plaintext highlighter-rouge">$HOME/.local</code>.
This is useful for quickly running a test suite under different
releases.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/skeeto/elfeed
$ cd elfeed/
$ make EMACS=emacs24.3 clean test
...
$ make EMACS=emacs25.2 clean test
...
</code></pre></div></div>

<p>Another example is NetHack, which I prefer to play with a couple of
custom patches (<a href="https://bilious.alt.org/?11">Menucolors</a>, <a href="https://gist.github.com/skeeto/11fed852dbfe9889a5fce80e9f6576ac">wchar</a>). The install to
<code class="language-plaintext highlighter-rouge">$HOME/.local</code> <a href="https://gist.github.com/skeeto/5cb9d5e774ce62655aff3507cb806981">is also captured as a patch</a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf nethack-343-src.tar.gz
$ cd nethack-3.4.3/
$ patch -p1 &lt; ~/nh343-menucolor.diff
$ patch -p1 &lt; ~/nh343-wchar.diff
$ patch -p1 &lt; ~/nh343-home-install.diff
$ sh sys/unix/setup.sh
$ make -j$(nproc) install
</code></pre></div></div>

<p>Normally NetHack wants to be setuid (e.g. run as the “games” user) in
order to restrict access to high scores, saves, and bones — saved levels
where a player died, to be inserted randomly into other players’ games.
This prevents cheating, but requires root to set up. Fortunately, when I
install NetHack in my home directory, this isn’t a feature I actually
care about, so I can ignore it.</p>

<p><a href="/blog/2017/06/15/">Mutt</a> is in a similar situation, since it wants to install a
special setgid program (<code class="language-plaintext highlighter-rouge">mutt_dotlock</code>) that synchronizes mailbox
access. All MUAs need something like this.</p>

<p>Everything described below is relevant to basically any modern
unix-like system: Linux, BSD, etc. I personally install software in
$HOME across a variety of systems and, fortunately, it mostly works
the same way everywhere. This is probably in large part due to
everyone standardizing around the GCC and GNU binutils interfaces,
even if the system compiler is actually LLVM/Clang.</p>

<h3 id="configuring-for-home-installs">Configuring for $HOME installs</h3>

<p>Out of the box, installing things in <code class="language-plaintext highlighter-rouge">$HOME/.local</code> won’t do anything
useful. You need to set up some environment variables in your shell
configuration (e.g. <code class="language-plaintext highlighter-rouge">.profile</code>, <code class="language-plaintext highlighter-rouge">.bashrc</code>, etc.) to tell various
programs, such as your shell, about it. The most obvious variable is
$PATH:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<p>Notice I put it in the front of the list. This is because I want my
home directory programs to override system programs with the same
name. For what other reason would I install a program with the same
name if not to override the system program?</p>
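
<p>The ordering matters because lookup is strictly left to right and the
first hit wins. Here’s a rough sketch of that search in C. The helper
<code class="language-plaintext highlighter-rouge">find_in_path()</code> is hypothetical, and it simplifies what a real shell
does: it skips empty entries (which POSIX treats as the current
directory) and doesn’t check that the match is a regular file.</p>

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: search the colon-separated list `path` for an
 * executable called `name`, left to right. On success the first match
 * is written to `out` and 0 is returned. Otherwise returns -1 and the
 * contents of `out` are unspecified.
 */
int
find_in_path(const char *path, const char *name, char *out, size_t len)
{
    while (*path) {
        size_t n = strcspn(path, ":");      /* length of this entry */
        if (n > 0) {
            int r = snprintf(out, len, "%.*s/%s", (int)n, path, name);
            if (r > 0 && (size_t)r < len && access(out, X_OK) == 0)
                return 0;                   /* first hit wins */
        }
        path += n;
        if (*path == ':')
            path++;                         /* step over the separator */
    }
    return -1;
}
```

<p>Calling it with <code class="language-plaintext highlighter-rouge">getenv("PATH")</code> and a program name shows which copy
would run. With <code class="language-plaintext highlighter-rouge">$HOME/.local/bin</code> at the front, the home directory
build is found before the search ever reaches the system directories.</p>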

<p>In the simplest situation this is good enough, but in practice you’ll
probably need to set a few more things. If you install libraries in
your home directory and expect to use them just as if they were
installed on the system, you’ll need to tell the compiler where else
to look for those headers and libraries, both for C and C++.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">C_INCLUDE_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/include
<span class="nb">export </span><span class="nv">CPLUS_INCLUDE_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/include
<span class="nb">export </span><span class="nv">LIBRARY_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/lib
</code></pre></div></div>

<p>The first two are like the <code class="language-plaintext highlighter-rouge">-I</code> compiler option and the third is like
the <code class="language-plaintext highlighter-rouge">-L</code> linker option, except you <em>usually</em> won’t need to use them
explicitly. Unfortunately <code class="language-plaintext highlighter-rouge">LIBRARY_PATH</code> doesn’t override the system
library paths, so in some cases, you will need to explicitly set
<code class="language-plaintext highlighter-rouge">-L</code>. Otherwise you will still end up linking against the system library
rather than the custom packaged version. I really wish GCC and Clang
didn’t behave this way.</p>

<p>Some software uses <code class="language-plaintext highlighter-rouge">pkg-config</code> to determine its compiler and linker
flags, and your home directory will contain some of the needed
information. So set that up too:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PKG_CONFIG_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/lib/pkgconfig
</code></pre></div></div>

<h4 id="run-time-linker">Run-time linker</h4>

<p>Finally, when you install libraries in your home directory, the run-time
dynamic linker will need to know where to find them. There are three
ways to deal with this:</p>

<ol>
  <li>The <a href="https://web.archive.org/web/20090312014334/http://blogs.sun.com/rie/entry/tt_ld_library_path_tt">crude, easy way</a>: <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>.</li>
  <li>The elegant, difficult way: ELF runpath.</li>
  <li>Screw it, just statically link the bugger. (Not always possible.)</li>
</ol>

<p>For the crude way, point the run-time linker at your <code class="language-plaintext highlighter-rouge">lib/</code> and you’re
done:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/lib
</code></pre></div></div>

<p>However, this is like using a shotgun to kill a fly. If you install a
library in your home directory that is also installed on the system,
and then run a system program, it may be linked against <em>your</em> library
rather than the library installed on the system as was originally
intended. This could have detrimental effects.</p>

<p>The precision method is to set the ELF “runpath” value. It’s like a
per-binary <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>. The run-time linker searches this path ahead of
the system’s default library locations (though after <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>
itself, if that’s set), and it only has an effect on that particular
program/library. This also applies to <code class="language-plaintext highlighter-rouge">dlopen()</code>.</p>

<p>Some software will configure the runpath by default in its build
system, but often you need to configure this yourself. The simplest way
is to set the <code class="language-plaintext highlighter-rouge">LD_RUN_PATH</code> environment variable when building software.
Another option is to manually pass <code class="language-plaintext highlighter-rouge">-rpath</code> options to the linker via
<code class="language-plaintext highlighter-rouge">LDFLAGS</code>. It’s used directly like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Wl,-rpath=$HOME/.local/lib -o foo bar.o baz.o -lquux
</code></pre></div></div>

<p>Verify with <code class="language-plaintext highlighter-rouge">readelf</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d foo | grep runpath
Library runpath: [/home/username/.local/lib]
</code></pre></div></div>

<p>ELF supports a special <code class="language-plaintext highlighter-rouge">$ORIGIN</code> “variable” set to the binary’s
location. This allows the program and associated libraries to be
installed anywhere without changes, so long as they have the same
relative position to each other. (Note the quotes to prevent shell
interpolation.)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Wl,-rpath='$ORIGIN/../lib' -o foo bar.o baz.o -lquux
</code></pre></div></div>

<p>There is one situation where <code class="language-plaintext highlighter-rouge">runpath</code> won’t work: when you want a
system-installed program to find a home directory library with
<code class="language-plaintext highlighter-rouge">dlopen()</code> — e.g. as an extension to that program. You either need to
ensure it uses a relative or absolute path (i.e. the argument to
<code class="language-plaintext highlighter-rouge">dlopen()</code> contains a slash) or you must use <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>.</p>

<p>Personally, I always use the <a href="https://www.jwz.org/doc/worse-is-better.html">Worse is Better</a> <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>
shotgun. Occasionally it’s caused some annoying issues, but the vast
majority of the time it gets the job done with little fuss. This is
just my personal development environment, after all, not a production
server.</p>

<h4 id="manual-pages">Manual pages</h4>

<p>Another potentially tricky issue is man pages. When a program or
library installs a man page in your home directory, it would certainly
be nice to access it with <code class="language-plaintext highlighter-rouge">man &lt;topic&gt;</code> just like it was installed on
the system. Fortunately, Debian and Debian-derived systems, using a
mechanism I haven’t yet figured out, discover home directory man pages
automatically without any assistance. No configuration needed.</p>

<p>It’s more complicated on other systems, such as the BSDs. You’ll need to
set the <code class="language-plaintext highlighter-rouge">MANPATH</code> variable to include <code class="language-plaintext highlighter-rouge">$HOME/.local/share/man</code>. It’s
unset by default and it overrides the system settings, which means you
need to manually include the system paths. The <code class="language-plaintext highlighter-rouge">manpath</code> program can
help with this … if it’s available.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">MANPATH</span><span class="o">=</span><span class="nv">$HOME</span>/.local/share/man:<span class="si">$(</span>manpath<span class="si">)</span>
</code></pre></div></div>

<p>I haven’t figured out a portable way to deal with this issue, so I
mostly ignore it.</p>

<h3 id="how-to-install-software-in-home">How to install software in $HOME</h3>

<p>While I’ve <a href="/blog/2017/03/30/">pooh-poohed autoconf</a> in the past, the standard
<code class="language-plaintext highlighter-rouge">configure</code> script usually makes it trivial to build and install
software in $HOME. The key ingredient is the <code class="language-plaintext highlighter-rouge">--prefix</code> option:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
</code></pre></div></div>

<p>Most of the time it’s that simple! If you’re linking against your own
libraries and want to use <code class="language-plaintext highlighter-rouge">runpath</code>, it’s a little more complicated:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./configure --prefix=$HOME/.local \
              LDFLAGS="-Wl,-rpath=$HOME/.local/lib"
</code></pre></div></div>

<p>For <a href="https://cmake.org/">CMake</a>, there’s <code class="language-plaintext highlighter-rouge">CMAKE_INSTALL_PREFIX</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..
</code></pre></div></div>

<p>The CMake builds I’ve seen use ELF runpath by default, and no further
configuration may be required to make that work. I’m sure that’s not
always the case, though.</p>

<p>Some software is just a single, static, standalone binary with
<a href="/blog/2016/11/15/">everything baked in</a>. It doesn’t need to be given a prefix, and
installation is as simple as copying the binary into place. For example,
<a href="https://github.com/skeeto/enchive">Enchive</a> works like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/skeeto/enchive
$ cd enchive/
$ make
$ cp enchive ~/.local/bin
</code></pre></div></div>

<p>Some software uses its own unique configuration interface. I can respect
that, but it does add some friction for users who now have something
additional and non-transferable to learn. I demonstrated a NetHack build
above, which has a configuration much more involved than it really
should be. Another example is LuaJIT, which uses <code class="language-plaintext highlighter-rouge">make</code> variables that
must be provided consistently on every invocation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tar xzf LuaJIT-2.0.5.tar.gz
$ cd LuaJIT-2.0.5/
$ make -j$(nproc) PREFIX=$HOME/.local
$ make PREFIX=$HOME/.local install
</code></pre></div></div>

<p>(You <em>can</em> use the “install” target to both build and install, but I
wanted to illustrate the repetition of <code class="language-plaintext highlighter-rouge">PREFIX</code>.)</p>

<p>Some libraries aren’t so smart about <code class="language-plaintext highlighter-rouge">pkg-config</code> and need some
handholding — for example, <a href="https://www.gnu.org/software/ncurses/">ncurses</a>. I mention it because
it’s required for both Vim and Emacs, among many others, so I’m often
building it myself. It ignores <code class="language-plaintext highlighter-rouge">--prefix</code> and needs to be told a
second time where to install things:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./configure --prefix=$HOME/.local \
              --enable-pc-files \
              --with-pkg-config-libdir=$PKG_CONFIG_PATH
</code></pre></div></div>

<p>Another issue is that a whole lot of software has been hardcoded for
ncurses 5.x (i.e. <code class="language-plaintext highlighter-rouge">ncurses5-config</code>), and it requires hacks/patching
to make it behave properly with ncurses 6.x. I’ve avoided ncurses 6.x
for this reason.</p>

<h3 id="learning-through-experience">Learning through experience</h3>

<p>I could go on and on like this, discussing the quirks for the various
libraries and programs that I use. Over the years I’ve gotten used to
many of these issues, committing the solutions to memory.
Unfortunately, even within the same version of a piece of software,
the quirks can change <a href="https://www.debian.org/News/2017/20170617.en.html">between major operating system
releases</a>, so I’m continuously learning my way around new
issues. It’s really given me an appreciation for all the hard work
that package maintainers put into customizing and maintaining software
builds to <a href="https://www.debian.org/doc/manuals/maint-guide/">fit properly into a larger ecosystem</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>The Adversarial Implementation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/05/03/"/>
    <id>urn:uuid:e6f370f9-1d35-3295-3bd5-74ae20c52a0e</id>
    <updated>2017-05-03T17:51:53Z</updated>
    <category term="c"/><category term="python"/><category term="lang"/>
    <content type="html">
      <![CDATA[<p>When <a href="/blog/2017/03/30/">coding against a standard</a>, whether it’s a programming
language specification or an open API with multiple vendors, a common
concern is the conformity of a particular construct to the standard.
This cannot be determined simply by experimentation, since a piece of
code may work correctly due only to the specifics of a particular
implementation. It works <em>today</em> with <em>this</em> implementation, but it
may not work <em>tomorrow</em> or with a <em>different</em> implementation.
Sometimes an implementation will warn about the use of non-standard
behavior, but this isn’t always the case.</p>

<p>When I’m reasoning about whether or not something is allowed, I like to
imagine an <em>adversarial implementation</em>. If the standard allows some
freedom, this implementation takes an imaginative or unique approach. It
chooses <a href="/blog/2016/05/30/">non-obvious interpretations</a> with possibly unexpected,
but valid, results. This is nearly the opposite of <a href="https://groups.google.com/forum/m/#!msg/boring-crypto/48qa1kWignU/o8GGp2K1DAAJ">djb’s hypothetical
boringcc</a>, though some of the ideas are similar.</p>

<p>Many argue that <a href="http://yarchive.net/comp/linux/gcc.html">this is already the case</a> with modern C and C++
optimizing compilers. Compiler writers are already <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">creative with the
standard</a> in order to squeeze out more performance, even if it’s
at odds with the programmer’s actual intentions. The most prominent
example in C and C++ is <em>strict aliasing</em>, where the optimizer is
deliberately blinded to certain kinds of aliasing because the standard
allows it to be, eliminating some (possibly important) loads. This
happens despite the compiler’s ability to trivially prove that two
particular objects really do alias.</p>

<p>I want to be clear that I’m not talking about the <a href="http://www.catb.org/jargon/html/N/nasal-demons.html">nasal demons</a>
kind of creativity. That’s not a helpful thought experiment. What I
mean is this: <strong>Can I imagine a conforming implementation that breaks
any assumptions made by the code?</strong></p>

<p>In practice, compilers typically have to bridge multiple
specifications: the language standard, the <a href="/blog/2016/11/17/">platform ABI</a>, and
the operating system interface (process startup, syscalls, etc.). This
really ties their hands on how creative they can be with any one of the
specifications. Depending on the situation, the imaginary adversarial
implementation isn’t necessarily running on any particular platform.
If our program is expected to have a long life, useful for many years
to come, we should avoid making too many assumptions about future
computers and imagine an adversarial compiler with few limitations.</p>

<h3 id="c-example">C example</h3>

<p>Take this bit of C:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">foo</span><span class="p">));</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">printf</code> function is variadic, and it relies entirely on the format
string in order to correctly handle all its arguments. The <code class="language-plaintext highlighter-rouge">%d</code>
specifier means that its matching argument is of type <code class="language-plaintext highlighter-rouge">int</code>. The result
of the <code class="language-plaintext highlighter-rouge">sizeof</code> operator is an integer of type <code class="language-plaintext highlighter-rouge">size_t</code>, which has a
different sign and may even be a different size.</p>

<p>Typically this code will work just fine. An <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">size_t</code> are
generally passed the same way, the actual value probably fits in an
<code class="language-plaintext highlighter-rouge">int</code>, and two’s complement means the signedness isn’t an issue due to
the value being positive. From the <code class="language-plaintext highlighter-rouge">printf</code> point of view, it
typically can’t detect that the type is wrong, so everything works by
chance. In fact, it’s hard to imagine a real situation where this
wouldn’t work fine.</p>

<p>However, this is still undefined behavior, a scenario where a creative
adversarial implementation can break things. In this case there are a
few options for an adversarial implementation:</p>

<ol>
  <li>Arguments of type <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">size_t</code> are passed differently, so
<code class="language-plaintext highlighter-rouge">printf</code> will load the argument from the wrong place.</li>
  <li>The implementation doesn’t use two’s complement and even small
positive values have different bit representations.</li>
  <li>The type of <code class="language-plaintext highlighter-rouge">foo</code> is given crazy padding for arbitrary reasons that
makes it so large it doesn’t fit in an <code class="language-plaintext highlighter-rouge">int</code>.</li>
</ol>

<p>What’s interesting about #1 is that <em>this has actually happened</em>. For
example, here’s a C source file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">);</span>

<span class="kt">float</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">int</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">foo</span><span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And in another source file:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">x</span><span class="p">;</span>  <span class="c1">// ignore x</span>
    <span class="k">return</span> <span class="n">y</span> <span class="o">*</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The type of argument <code class="language-plaintext highlighter-rouge">x</code> differs between the prototype and the
definition, which is undefined behavior. However, since this argument
is ignored, this code will still work correctly on many different
real-world computers, particularly where <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">int</code> arguments
are passed the same way (i.e. on the stack).</p>

<p>However, in 2003 the x86-64 CPU arrived with its new System V ABI.
Floating point and integer arguments were now passed differently, and
the types of preceding arguments mattered when deciding which register
to use. Some constructs that worked fine, by chance, prior to 2003 would
soon stop working due to what may have seemed like an adversarial
implementation years before.</p>
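
<p>Returning to the <code class="language-plaintext highlighter-rouge">printf</code> example, here’s a minimal sketch of a
conforming version. Since C99 the <code class="language-plaintext highlighter-rouge">z</code> length modifier matches
<code class="language-plaintext highlighter-rouge">size_t</code>, so <code class="language-plaintext highlighter-rouge">%zu</code> leaves an adversarial implementation nothing to
exploit. The <code class="language-plaintext highlighter-rouge">struct foo</code> type and the <code class="language-plaintext highlighter-rouge">format_size</code> helper are
hypothetical, there only to give <code class="language-plaintext highlighter-rouge">sizeof</code> an operand.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

/* Hypothetical type, only here so sizeof has an operand. */
struct foo {
    char data[24];
};

/* Format a size_t into buf using the matching %zu specifier.
 * Returns the number of characters snprintf would have written. */
int
format_size(char *buf, size_t len, size_t value)
{
    return snprintf(buf, len, "%zu", value);
}
</code></pre></div></div>

<p>A call looks like <code class="language-plaintext highlighter-rouge">format_size(buf, sizeof(buf), sizeof(struct foo))</code>.
For a pre-C99 library without <code class="language-plaintext highlighter-rouge">%zu</code>, casting the value to
<code class="language-plaintext highlighter-rouge">unsigned long</code> and printing it with <code class="language-plaintext highlighter-rouge">%lu</code> is the usual fallback.</p>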

<h3 id="python-example">Python example</h3>

<p>Let’s look at some Python. This snippet opens a file a million times
without closing any handles.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000000</span><span class="p">):</span>
    <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"/dev/null"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span>
</code></pre></div></div>

<p>Assuming you have a <code class="language-plaintext highlighter-rouge">/dev/null</code>, this code will work fine without
throwing any exceptions on CPython, the most widely used Python
implementation. CPython uses a deterministic reference counting scheme,
and the handle is automatically closed as soon as its variable falls out
of scope. It’s like having an invisible <code class="language-plaintext highlighter-rouge">f.close()</code> at the end of the
block.</p>

<p>However, this code is incorrect. The deterministic handle closing is an
implementation behavior, <a href="https://docs.python.org/3/reference/datamodel.html">not part of the specification</a>. The
operating system limits the number of files a process can have open at
once, and there’s a risk that this resource will run out even though
none of those handles are reachable. Imagine an adversarial Python
implementation trying to break this code. It could sufficiently delay
garbage collection, or even <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20100809-00/?p=13203">have infinite memory</a>, omitting
garbage collection altogether.</p>

<p>Like before, such an implementation eventually did come about: PyPy, a
Python implementation written in Python with a JIT compiler. It uses (by
default) something closer to mark-and-sweep, not reference counting, and
those handles <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/NondeterministicGCII">are left open</a> until the next collection.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt;&gt;&gt;&gt; for i in range(1, 1000000):
....     f = open("/dev/null", "r")
.... 
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 2, in &lt;module&gt;
IOError: [Errno 24] Too many open files: '/dev/null'
</code></pre></div></div>

<h3 id="a-tool-for-understanding-specifications">A tool for understanding specifications</h3>

<p>This fits right in with a broader method of self-improvement:
Occasionally put yourself in the implementor’s shoes. Think about what
it would take to correctly implement what your code relies on, whether
that’s the language itself or the APIs that you call. On reflection, you may find
that some of those things that <em>seem</em> cheap may not be. Your
assumptions may be reasonable, but not guaranteed. (Though it may be
that “reasonable” is perfectly sufficient for your situation.)</p>

<p>An adversarial implementation is one that challenges an assumption
you’ve taken for granted by turning it on its head.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Two Games with Monte Carlo Tree Search</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/04/27/"/>
    <id>urn:uuid:b6f77cb1-01df-3714-4ba0-1859614364da</id>
    <updated>2017-04-27T21:27:50Z</updated>
    <category term="c"/><category term="ai"/><category term="game"/>
    <content type="html">
      <![CDATA[<p><em>Update 2020: A DOS build of Connect Four <a href="https://www.youtube.com/watch?v=K00BylbOQUo">was featured on GET OFF MY
LAWN</a>.</em></p>

<p><a href="https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/">Monte Carlo tree search</a> (MCTS) is the most impressive game
artificial intelligence I’ve ever used. At its core it simulates a
large number of games (<em>playouts</em>), starting from the current game
state, using random moves for each player. Then it simply picks the
move where it won most often. This description is sufficient to spot
one of its most valuable features: <strong>MCTS requires no knowledge of
strategy or effective play</strong>. The game’s rules — enough to simulate
the game — are all that’s needed to allow the AI to make decent moves.
Expert knowledge still makes for a stronger AI, but, for many games,
it’s unnecessary to construct a decent opponent.</p>

<p>A second valuable feature is that it’s easy to parallelize. Unlike
<a href="/blog/2011/08/24/">alpha-beta pruning</a>, which doesn’t mix well with parallel
searches of a Minimax tree, Monte Carlo simulations are practically
independent and can be run in parallel.</p>

<p>Finally, the third valuable feature is that the search can be stopped
at any time. The completion of any single simulation is as good a
stopping point as any. It could be due to a time limit, a memory
limit, or both. In general, the algorithm <em>converges</em> to a best move
rather than suddenly discovering it. The good moves are identified
quickly, and further simulations work to choose among them. More
simulations make for better moves, with exponentially diminishing
returns. Contrasted with Minimax, stopping early has the risk that the
good moves were never explored at all.</p>
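
<p>The core idea described above, random playouts behind each move followed
by picking the move that won most often, can be sketched in a few dozen
lines. This is only “flat” Monte Carlo: it has no tree and no weighting
towards promising moves, so it is not full MCTS. The toy stone-taking game,
the xorshift generator, and every name here are assumptions chosen for
illustration, not code from either engine.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Toy game (an assumption for illustration): players alternate taking
 * 1-3 stones from a pile, and whoever takes the last stone wins. */
#define MAX_TAKE 3

/* xorshift64: a small PRNG; the state must be nonzero. */
static uint64_t
xorshift64(uint64_t *s)
{
    uint64_t x = *s;
    x ^= x &lt;&lt; 13;
    x ^= x &gt;&gt; 7;
    x ^= x &lt;&lt; 17;
    return *s = x;
}

/* Play random moves until the pile is empty. "to_move" is 0 or 1.
 * Requires stones &gt;= 1. Returns the player who took the last stone. */
static int
random_playout(int stones, int to_move, uint64_t *rng)
{
    for (;;) {
        int limit = stones &lt; MAX_TAKE ? stones : MAX_TAKE;
        int take = 1 + (int)((xorshift64(rng) &gt;&gt; 32) % (uint64_t)limit);
        stones -= take;
        if (stones == 0) {
            return to_move;
        }
        to_move = !to_move;
    }
}

/* Flat Monte Carlo: run "playouts" random games behind each legal move
 * for player 0 and return the move (1-3) that won most often.
 * Requires stones &gt;= 1. */
int
monte_carlo_move(int stones, int playouts, uint64_t *rng)
{
    int best_move = 0;
    int best_wins = -1;
    int limit = stones &lt; MAX_TAKE ? stones : MAX_TAKE;
    for (int take = 1; take &lt;= limit; take++) {
        int wins = 0;
        for (int i = 0; i &lt; playouts; i++) {
            int left = stones - take;
            /* Taking the last stone wins outright; otherwise the
             * opponent (player 1) moves next. */
            int winner = left == 0 ? 0 : random_playout(left, 1, rng);
            wins += winner == 0;
        }
        if (wins &gt;= best_wins) {  /* ties favor the larger take */
            best_wins = wins;
            best_move = take;
        }
    }
    return best_move;
}
</code></pre></div></div>

<p>Nothing in <code class="language-plaintext highlighter-rouge">monte_carlo_move</code> knows any strategy. It only needs the rules
in order to finish a game, which is the property that makes the approach so
easy to apply to a new game.</p>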

<p>To try out MCTS myself, I wrote two games employing it:</p>

<ul>
  <li><a href="https://github.com/skeeto/connect4"><strong>Connect Four</strong></a> [<a href="https://github.com/skeeto/connect4/releases/download/1.0/connect4.exe">.exe x64</a>, 173kB]</li>
  <li><a href="https://github.com/skeeto/yavalath"><strong>Yavalath</strong></a>      [<a href="https://github.com/skeeto/yavalath/releases/download/1.0/yavalath.exe">.exe x64</a>, 174kB]</li>
</ul>

<p>They’re both written in C, for both unix-like and Windows, and should
be <a href="/blog/2017/03/30/">easy to build</a>. <strong>I challenge you to beat them both.</strong> The
Yavalath AI is easier to beat due to having blind spots, which I’ll
discuss below. The Connect Four AI is more difficult and will likely
take a number of tries.</p>

<h3 id="connect-four">Connect Four</h3>

<p><a href="/img/mcts/connect4.png"><img src="/img/mcts/connect4-thumb.png" alt="" /></a></p>

<p>MCTS works very well with Connect Four, and only requires modest
resources: 32MB of memory to store the results of random playouts, and
500,000 game simulations. With a few tweaks, it can even be run in
DOSBox. It stops when it hits either of those limits. In theory,
increasing both would make for stronger moves, but in practice I can’t
detect any difference. It’s like <a href="https://curiosity-driven.org/pi-approximation">computing pi with Monte Carlo</a>,
where eventually it just runs out of precision to make any more
progress.</p>

<p>Based on my simplified description above, you might wonder why it needs
all that memory. Not only does MCTS need to track its win/loss ratio for
each available move from the current state, it tracks the win/loss ratio
for moves in the states behind those moves. A large chunk of the game
tree is kept in memory to track all of the playout results. This is why
MCTS needs a lot more memory than Minimax, which can discard branches
that have been searched.</p>

<p><img src="/img/mcts/tree.svg" alt="" /></p>

<p>A convenient property of this tree is that the branch taken in the
actual game can be re-used in a future search. The root of the tree
becomes the node representing the taken game state, which has already
seen a number of playouts. Even better, MCTS is weighted towards
exploring good moves over bad moves, and good moves are more likely to
be taken in the real game. In general, a significant portion of the tree
gets to be reused in a future search.</p>

<p>I’m going to skip most of the details of the algorithm itself and focus
on my implementation. Other articles do a better job at detailing the
algorithm than I could.</p>

<p>My Connect Four engine doesn’t use dynamic allocation for this tree (or
at all). Instead it manages a static buffer — an array of tree nodes,
each representing a game state. All nodes are initially chained together
into a linked list of free nodes. As the tree is built, nodes are pulled
off the free list and linked together into a tree. When the game
advances to the next state, nodes on unreachable branches are added back
to the free list.</p>

<p>If at any point the free list is empty when a new node is needed, the
current search aborts. This is the out-of-memory condition, and no more
searching can be performed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Connect Four is normally a 7 by 6 grid. */</span>
<span class="cp">#define CONNECT4_WIDTH  7
#define CONNECT4_HEIGHT 6
</span>
<span class="k">struct</span> <span class="n">connect4_node</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">next</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>      <span class="c1">// "pointer" to next node</span>
    <span class="kt">uint32_t</span> <span class="n">playouts</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>  <span class="c1">// number of playouts</span>
    <span class="kt">float</span>    <span class="n">score</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>     <span class="c1">// pseudo win/loss ratio</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Rather than native C pointers, the structure uses 32-bit indexes into
the master array. This saves a lot of memory on 64-bit systems, and the
structure is the same size no matter the pointer size of the host. The
<code class="language-plaintext highlighter-rouge">next</code> field points to the next state for the nth move. Since 0 is a
valid index, -1 represents null (<code class="language-plaintext highlighter-rouge">CONNECT4_NULL</code>).</p>
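
<p>As a rough sketch of how such a pool might work, here is a free list
threaded through a static array using 32-bit indexes. The names, the pool
size, and the choice to reuse <code class="language-plaintext highlighter-rouge">next[0]</code> as the free-list link are all
assumptions for illustration; the real engine may organize this differently.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

#define POOL_SIZE 1024            /* hypothetical capacity */
#define POOL_NULL ((uint32_t)-1)  /* 0 is a valid index, so -1 is null */
#define WIDTH     7               /* children per node */

struct node {
    uint32_t next[WIDTH];  /* child indexes; next[0] doubles as free link */
};

struct node pool[POOL_SIZE];
uint32_t free_head = POOL_NULL;

/* Chain every node into the free list. */
void
pool_init(void)
{
    for (uint32_t i = 0; i &lt; POOL_SIZE; i++) {
        pool[i].next[0] = i + 1 &lt; POOL_SIZE ? i + 1 : POOL_NULL;
    }
    free_head = 0;
}

/* Pop a node off the free list with all children cleared, or return
 * POOL_NULL when the pool is exhausted (the out-of-memory condition). */
uint32_t
pool_alloc(void)
{
    uint32_t i = free_head;
    if (i != POOL_NULL) {
        free_head = pool[i].next[0];
        for (int c = 0; c &lt; WIDTH; c++) {
            pool[i].next[c] = POOL_NULL;
        }
    }
    return i;
}

/* Return a node and everything below it to the free list. */
void
pool_free_tree(uint32_t i)
{
    if (i == POOL_NULL) {
        return;
    }
    for (int c = 0; c &lt; WIDTH; c++) {
        pool_free_tree(pool[i].next[c]);
    }
    pool[i].next[0] = free_head;
    free_head = i;
}
</code></pre></div></div>

<p>When the game advances, calling <code class="language-plaintext highlighter-rouge">pool_free_tree</code> on each sibling of the
branch actually taken puts the unreachable nodes back into circulation.</p>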

<p>Each column is a potential move, so there are <code class="language-plaintext highlighter-rouge">CONNECT4_WIDTH</code>
possible moves at any given state. Each move has a floating point
score and a total number of playouts through that move. In my
implementation, <strong>the search can also halt due to an overflow in a
playout counter</strong>. The search can no longer be tracked in this
representation, so it has to stop. This generally only happens when
the game is nearly over and it’s grinding away on a small number of
possibilities.</p>

<p>Note that the actual game state (piece positions) is not tracked in the
node structure. That’s because it’s implicit. We know the state of the
game at the root, and simulating the moves while descending the tree
will keep track of the board state at the current node. That’s more
memory savings.</p>

<p>The state itself is a pair of bitboards, one for each player. Each
position on the grid gets a bit on each bitboard. The bitboard is very
fast to manipulate, and win states are checked with just a handful of
bit operations. My intention was to make playouts as fast as possible.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">connect4_ai</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>         <span class="c1">// game state at root (bitboard)</span>
    <span class="kt">uint64_t</span> <span class="n">rng</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>           <span class="c1">// random number generator state</span>
    <span class="kt">uint32_t</span> <span class="n">nodes_available</span><span class="p">;</span>  <span class="c1">// total number of nodes available</span>
    <span class="kt">uint32_t</span> <span class="n">nodes_allocated</span><span class="p">;</span>  <span class="c1">// number of nodes in the tree</span>
    <span class="kt">uint32_t</span> <span class="n">root</span><span class="p">;</span>             <span class="c1">// "pointer" to root node</span>
    <span class="kt">uint32_t</span> <span class="n">free</span><span class="p">;</span>             <span class="c1">// "pointer" to free list</span>
    <span class="kt">int</span> <span class="n">turn</span><span class="p">;</span>                  <span class="c1">// whose turn (0 or 1) at the root?</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">nodes_available</code> and <code class="language-plaintext highlighter-rouge">nodes_allocated</code> are not necessary for
correctness nor speed. They’re useful for diagnostics and debugging.</p>
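
<p>To give a feel for the “handful of bit operations” behind a bitboard win
check, here is a sketch using a common Connect Four layout: seven bits per
column, with the top bit of each column left empty so shifts never wrap
into a neighboring column. This layout and the function names are
assumptions; the engine’s actual encoding may differ.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

#define WIDTH  7
#define HEIGHT 6
#define STRIDE (HEIGHT + 1)  /* one spare, always-zero bit per column */

/* Bit for (col, row), with row 0 at the bottom. */
uint64_t
bit_at(int col, int row)
{
    return (uint64_t)1 &lt;&lt; (col * STRIDE + row);
}

/* Nonzero if one player's board has four in a row in any direction. */
int
has_four(uint64_t b)
{
    static const int shifts[] = {
        1,           /* vertical */
        STRIDE,      /* horizontal */
        STRIDE - 1,  /* diagonal, down and to the right */
        STRIDE + 1   /* diagonal, up and to the right */
    };
    for (int i = 0; i &lt; 4; i++) {
        int s = shifts[i];
        uint64_t pairs = b &amp; (b &gt;&gt; s);      /* two adjacent pieces */
        if (pairs &amp; (pairs &gt;&gt; (2 * s))) {  /* two adjacent pairs */
            return 1;
        }
    }
    return 0;
}
</code></pre></div></div>

<p>Each direction costs two shifts and two ANDs, with no loop over the
board, which is what keeps playouts cheap.</p>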

<p>All the functions that operate on these two structures are
straightforward, except for <code class="language-plaintext highlighter-rouge">connect4_playout</code>, a recursive function
which implements the bulk of MCTS. Depending on the state of the node
it’s at, it does one of two things:</p>

<ul>
  <li>
    <p>If there are unexplored moves (<code class="language-plaintext highlighter-rouge">playouts == 0</code>), it randomly chooses
an unplayed move, allocates exactly one node for the state behind that
move, and simulates the rest of the game in a loop, without recursion
or allocating any more nodes.</p>
  </li>
  <li>
    <p>If all moves have been explored at least once, it uses an upper
confidence bound (UCB1) to randomly choose a move, weighted towards
both moves that are under-explored and moves which are strongest.
Striking that balance is one of the challenges. It recurses into that
next state, then updates the node with the result as it propagates
back to the root.</p>
  </li>
</ul>

<p>That’s pretty much all there is to it.</p>
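
<p>For the second case, the textbook UCB1 rule scores each move as its mean
result plus an exploration bonus that shrinks as the move accumulates
playouts. The sketch below is the plain deterministic form, which may not
match the engine’s exact selection. The function name, the <code class="language-plaintext highlighter-rouge">wins</code> and
<code class="language-plaintext highlighter-rouge">valid</code> arrays, and the exploration constant <code class="language-plaintext highlighter-rouge">c</code> are assumptions for
illustration.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;math.h&gt;
#include &lt;stdint.h&gt;

#define WIDTH 7

/* Pick the legal move with the highest UCB1 value:
 *   wins/playouts + c * sqrt(ln(total playouts) / playouts)
 * Assumes every legal move has at least one playout. "valid" marks
 * the legal moves. Returns -1 if there are none. */
int
ucb1_select(const float wins[WIDTH], const uint32_t playouts[WIDTH],
            const int valid[WIDTH], double c)
{
    double total = 0;
    for (int i = 0; i &lt; WIDTH; i++) {
        if (valid[i]) {
            total += playouts[i];
        }
    }

    int best = -1;
    double best_value = 0;
    for (int i = 0; i &lt; WIDTH; i++) {
        if (!valid[i]) {
            continue;
        }
        double mean = wins[i] / playouts[i];
        double bonus = c * sqrt(log(total) / playouts[i]);
        if (best == -1 || mean + bonus &gt; best_value) {
            best = i;
            best_value = mean + bonus;
        }
    }
    return best;
}
</code></pre></div></div>

<p>A larger <code class="language-plaintext highlighter-rouge">c</code> favors under-explored moves and a smaller one favors the
current leaders, which is the balance mentioned above. On many unix-like
systems this needs to be linked with <code class="language-plaintext highlighter-rouge">-lm</code>.</p>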

<h3 id="yavalath">Yavalath</h3>

<p><a href="/img/mcts/yavalath.png"><img src="/img/mcts/yavalath-thumb.png" alt="" /></a></p>

<p><a href="http://cambolbro.com/games/yavalath/">Yavalath</a> is a <a href="http://www.genetic-programming.org/hc2012/Browne-Paper-3-Yavalath-07.pdf">board game invented by a computer
program</a>. It’s a pretty fascinating story. The strategy is
disproportionately deep relative to its dead simple rules: Get four
in a row without first getting three in a row. The game revolves around
forced moves.</p>

<p>The engine is structured almost identically to the Connect Four engine.
It uses 32-bit indexes instead of pointers. The game state is a pair of
bitboards, with end-game masks <a href="/blog/2016/11/15/">computed at compile time via
metaprogramming</a>. The AI allocates the tree from a single, massive
buffer — multiple GBs in this case, dynamically scaled to the available
physical memory. And the core MCTS function is nearly identical.</p>

<p>One important difference is that identical game states — states where
the pieces on the board are the same, but the node was reached through
a different series of moves — are coalesced into a single state in the
tree. This state deduplication is done through a hash table. This
saves on memory and allows multiple different paths through the game
tree to share playouts. It comes at a cost of including the game state
in the node (so it can be identified in the hash table) and reference
counting the nodes (since they might have more than one parent).</p>
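
<p>A minimal sketch of that kind of deduplication, assuming an
open-addressing table keyed on the two bitboards, might look like this. The
table size, the hash function, and all names are hypothetical, and it
leaves out removal and the reference counting mentioned above.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

#define TABLE_SIZE 4096u              /* hypothetical, a power of two */
#define SLOT_EMPTY ((uint32_t)-1)

struct slot {
    uint64_t state[2];  /* bitboards identifying the position */
    uint32_t node;      /* tree node index, or SLOT_EMPTY */
};

static struct slot table[TABLE_SIZE];

/* Mark every slot as unused. */
void
table_clear(void)
{
    for (uint32_t i = 0; i &lt; TABLE_SIZE; i++) {
        table[i].node = SLOT_EMPTY;
    }
}

/* Mix both bitboards into a 32-bit hash (an arbitrary choice). */
static uint32_t
hash_state(const uint64_t state[2])
{
    uint64_t h = state[0] * 0x9e3779b97f4a7c15u;
    h ^= h &gt;&gt; 32;
    h += state[1];
    h *= 0x9e3779b97f4a7c15u;
    h ^= h &gt;&gt; 32;
    return (uint32_t)h;
}

/* Return the node already registered for this position. If there is
 * none, register "fresh" for it and return that instead. Returns
 * SLOT_EMPTY if the table is full. */
uint32_t
table_intern(const uint64_t state[2], uint32_t fresh)
{
    uint32_t i = hash_state(state) &amp; (TABLE_SIZE - 1);
    for (uint32_t probes = 0; probes &lt; TABLE_SIZE; probes++) {
        struct slot *s = &amp;table[i];
        if (s-&gt;node == SLOT_EMPTY) {
            s-&gt;state[0] = state[0];
            s-&gt;state[1] = state[1];
            s-&gt;node = fresh;
            return fresh;
        }
        if (s-&gt;state[0] == state[0] &amp;&amp; s-&gt;state[1] == state[1]) {
            return s-&gt;node;
        }
        i = (i + 1) &amp; (TABLE_SIZE - 1);  /* linear probing */
    }
    return SLOT_EMPTY;
}
</code></pre></div></div>

<p>Two move orders that reach the same position produce the same pair of
bitboards, so the second arrival gets back the first arrival’s node and the
playouts are shared.</p>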

<p>Unfortunately the AI has blind spots, and once you learn to spot them it
becomes easy to beat consistently. It can’t spot certain kinds of forced
moves, so it always falls for the same tricks. The <em>official</em> Yavalath
AI is slightly stronger than mine, but has a similar blindness. I think
MCTS just isn’t quite a good fit for Yavalath.</p>

<p><strong>The AI’s blindness is caused by <em>shallow traps</em></strong>, a common problem
for MCTS. It’s what makes MCTS a poor fit for Chess. A shallow trap is
a branch in the game tree where the game will abruptly end in a small
number of turns. If the random tree search doesn’t luckily stumble
upon a trap during its random traversal, it can’t take it into account
in its final decision. A skilled player will lead the game towards one
of these traps, and the AI will blunder along, not realizing what’s
happened until it’s too late.</p>

<p>I almost feel bad for it when this happens. If you watch the memory
usage and number of playouts, once it falls into a trap, you’ll see it
using almost no memory while performing a ton of playouts. It’s
desperately, frantically searching for a way out of the trap. But it’s
too late, little AI.</p>

<h3 id="another-tool-in-the-toolbelt">Another Tool in the Toolbelt</h3>

<p>I’m really happy to have sunk a couple weekends into playing with MCTS.
It’s not always a great fit, as seen with Yavalath, but it’s a really
neat algorithm. Now that I’ve wrapped my head around it, I’ll be ready
to use it should I run into an appropriate problem in the future.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>How to Write Portable C Without Complicating Your Build</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/03/30/"/>
    <id>urn:uuid:e1651834-8033-3bfa-6eaf-00bc38a0584a</id>
    <updated>2017-03-30T04:06:58Z</updated>
    <category term="c"/><category term="posix"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re writing a non-GUI C application intended to run on a
number of operating systems: Linux, the various BSDs, macOS, <a href="https://en.wikipedia.org/wiki/Illumos">classical
unix</a>, and perhaps even something as exotic as Windows. It might
sound like a rather complicated problem. These operating systems have
slightly different interfaces (or <em>very</em> different in one case), and they
run different variants of the standard unix tools — a problem for
portable builds.</p>

<p>With some up-front attention to detail, this is actually not terribly
difficult. <strong>Unix-like systems are probably the least diverse and least
buggy they’ve ever been.</strong> Writing portable code is really just a matter
of <strong>coding to the standards</strong> and ignoring extensions unless
<em>absolutely</em> necessary. Knowing what’s standard and what’s extension is
the tricky part, but I’ll explain how to find this information.</p>

<p>You might be tempted to reach for an <a href="https://undeadly.org/cgi?action=article;sid=20170930133438">overly complicated</a> solution
such as GNU Autoconf. Sure, it creates a configure script with the
familiar, conventional interface. This has real value. But do you
<em>really</em> need to run a single-threaded gauntlet of hundreds of
feature/bug tests for things that sometimes worked incorrectly in some
weird unix variant back in the 1990s? On a machine with many cores
(parallel build, <code class="language-plaintext highlighter-rouge">-j</code>), this may very well be the slowest part of the
whole build process.</p>

<p>For example, the configure script for Emacs checks that the compiler
supplies <code class="language-plaintext highlighter-rouge">stdlib.h</code>, <code class="language-plaintext highlighter-rouge">string.h</code>, and <code class="language-plaintext highlighter-rouge">getenv</code> — things that were
standardized nearly 30 years ago. It also checks for a slew of POSIX
functions that have been standard since 2001.</p>

<p>There’s a much easier solution: Document that the application requires,
say, C99 and POSIX.1-2001. It’s the responsibility of the person
building the application to supply these implementations, so there’s no
reason to waste time testing for it.</p>

<h3 id="how-to-code-to-the-standards">How to code to the standards</h3>

<p>Suppose there’s some function you want to use, but you’re not sure if
it’s standard or an extension. Or maybe you don’t know what standard it
comes from. Luckily the man pages document this stuff very well,
especially on Linux. Check the friendly “CONFORMING TO” section. For
example, look at <a href="https://manpages.debian.org/jessie/manpages-dev/getenv.3.en.html">getenv(3)</a>. Here’s what that section has to
say:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFORMING TO
    getenv(): SVr4, POSIX.1-2001, 4.3BSD, C89, C99.

    secure_getenv() is a GNU extension.
</code></pre></div></div>

<p>This says this function comes from the original C standard. It’s <em>always</em>
available on anything that claims to be a C implementation. The man page
also documents <code class="language-plaintext highlighter-rouge">secure_getenv()</code>, which is a GNU extension: to be avoided
in anything intended to be portable.</p>

<p>What about <a href="https://manpages.debian.org/jessie/manpages-dev/sleep.3.en.html">sleep(3)</a>?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFORMING TO
    POSIX.1-2001.
</code></pre></div></div>

<p>This function isn’t part of standard C, but it’s available on any system
claiming to implement POSIX.1-2001 (the POSIX standard from 2001). If the
program needs to run on an operating system not implementing this POSIX
standard (i.e. Windows), you’ll need to call an alternative function,
probably inside a different <code class="language-plaintext highlighter-rouge">#if .. #endif</code> branch. More on this in a
moment.</p>

<p>If you’re coding to POSIX, you <a href="http://pubs.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_02.html"><em>must</em> define the <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code>
feature test macro</a> to the standard you intend to use prior to
any system header includes:</p>

<blockquote>
  <p>A POSIX-conforming application should ensure that the feature test
macro <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code> is defined before inclusion of any header.</p>
</blockquote>

<p>For example, to properly access POSIX.1-2001 functions in your
application, define <code class="language-plaintext highlighter-rouge">_POSIX_C_SOURCE</code> to <code class="language-plaintext highlighter-rouge">200112L</code>. With this defined,
it’s safe to assume access to all of C and everything from that standard
of POSIX. You can do this at the top of your sources, but I personally
like the tidiness of a global <code class="language-plaintext highlighter-rouge">config.h</code> that gets included before
everything.</p>
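
<p>A minimal sketch of that arrangement, collapsed into a single file so
it can be compiled on its own. The <code class="language-plaintext highlighter-rouge">#ifndef</code> guard is my own assumption: it
lets whoever builds the program choose a different version with
<code class="language-plaintext highlighter-rouge">-D_POSIX_C_SOURCE=...</code>. The helper <code class="language-plaintext highlighter-rouge">stream_is_tty()</code> is hypothetical and
only exists to call two functions that are POSIX but not standard C.</p>

```c
/* --- config.h (hypothetical): included before any system header --- */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* request POSIX.1-2001 declarations */
#endif

/* --- application source --- */
#include <stdio.h>   /* fileno(): POSIX, not ISO C */
#include <unistd.h>  /* isatty(): POSIX */

/* Returns 1 if the stream is connected to a terminal, otherwise 0. */
int
stream_is_tty(FILE *f)
{
    return isatty(fileno(f));
}
```

<p>Under a strict mode such as <code class="language-plaintext highlighter-rouge">-std=c99</code>, some C libraries hide
<code class="language-plaintext highlighter-rouge">fileno()</code> unless the feature test macro is defined first, which is
exactly the situation the macro is meant to handle.</p>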

<h3 id="how-to-create-a-portable-build">How to create a portable build</h3>

<p>So you’ve written clean, portable C to the standards. How do you build
this application? The natural choice is <code class="language-plaintext highlighter-rouge">make</code>. It’s available
everywhere and it’s part of POSIX.</p>

<p>Again, the tricky part is teasing apart the standard from the extension.
I’m a long-time sinner in this regard, having far too often written
Makefiles that depend on GNU Make extensions. This is a real pain when
building programs on systems without the GNU utilities. I’ve been making
amends (and <a href="https://marc.info/?l=openbsd-bugs&amp;m=148815538325392&amp;w=2">finding</a> some <a href="https://marc.info/?l=openbsd-bugs&amp;m=148734102504016&amp;w=2">bugs</a> as a result).</p>

<p>No implementation makes the division clear in its documentation, and
especially don’t bother looking at the GNU Make manual. Your best
resource is <a href="http://pubs.opengroup.org/onlinepubs/009695399/utilities/make.html">the standard itself</a>. If you’re already familiar with
<code class="language-plaintext highlighter-rouge">make</code>, coding to the standard is largely a matter of <em>unlearning</em> the
various extensions you know.</p>

<p>Outside of <a href="/blog/2016/04/30/">some hacks</a>, this means you don’t get conditionals
(<code class="language-plaintext highlighter-rouge">if</code>, <code class="language-plaintext highlighter-rouge">else</code>, etc.). With some practice, both with sticking to portable
code and writing portable Makefiles, you’ll find that you <em>don’t really
need them</em>. Following the macro conventions will cover most situations.
For example:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CC</code>: the C compiler program</li>
  <li><code class="language-plaintext highlighter-rouge">CFLAGS</code>: flags to pass to the C compiler</li>
  <li><code class="language-plaintext highlighter-rouge">LDFLAGS</code>: flags to pass to the linker (via the C compiler)</li>
  <li><code class="language-plaintext highlighter-rouge">LDLIBS</code>: libraries to pass to the linker</li>
</ul>

<p>You don’t need to do anything weird with the assignments. The user
invoking <code class="language-plaintext highlighter-rouge">make</code> can override them easily. For example, here’s part of a
Makefile:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CC     = c99
CFLAGS = -Wall -Wextra -Os
</code></pre></div></div>

<p>But the user wants to use <code class="language-plaintext highlighter-rouge">clang</code>, and their system needs to explicitly
link <code class="language-plaintext highlighter-rouge">-lsocket</code> (e.g. Solaris). The user can override the macro
definitions on the command line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make CC=clang LDLIBS=-lsocket
</code></pre></div></div>

<p>The same rules apply to the programs you invoke from the Makefile. Read
the standards documents and ignore your system’s man pages so as to avoid
accidentally using an extension. It’s especially valuable to learn <a href="http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html">the
Bourne shell language</a> and avoid any accidental bashisms in your
Makefiles and scripts. The <code class="language-plaintext highlighter-rouge">dash</code> shell is good for testing your scripts.</p>

<p>Makefiles conforming to the standard will, unfortunately, be more verbose
than those taking advantage of a particular implementation. If you know
how to code Bourne shell — which is not terribly difficult to learn —
then you might even consider hand-writing a <code class="language-plaintext highlighter-rouge">configure</code> script to
generate the Makefile (a la metaprogramming). This gives you a more
flexible language with conditionals, and, being generated, redundancy in
the Makefile no longer matters.</p>

<p>As someone who frequently dabbles with BSD systems, my life has gotten a
lot easier since learning to write portable Makefiles and scripts.</p>

<h3 id="but-what-about-windows">But what about Windows</h3>

<p>It’s the elephant in the room and I’ve avoided talking about it so far.
If you want to <a href="/blog/2016/06/13/">build with Visual Studio’s command line tools</a> —
something I do on occasion — build portability goes out the window.
Visual Studio has <code class="language-plaintext highlighter-rouge">nmake.exe</code>, which nearly conforms to POSIX <code class="language-plaintext highlighter-rouge">make</code>.
However, without the standard unix utilities and with the completely
foreign compiler interface for <code class="language-plaintext highlighter-rouge">cl.exe</code>, there’s absolutely no hope of
writing a Makefile portable to this situation.</p>

<p>The nice alternative is MinGW(-w64) with MSYS or Cygwin supplying the
unix utilities, though it has <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">the problem</a> of linking against
<code class="language-plaintext highlighter-rouge">msvcrt.dll</code>. Another option is a separate Makefile dedicated to
<code class="language-plaintext highlighter-rouge">nmake.exe</code> and the Visual Studio toolchain. Good luck defining a
correctly working “clean” target with <code class="language-plaintext highlighter-rouge">del.exe</code>.</p>

<p>My preferred approach lately is an amalgamation build (as seen in
<a href="https://github.com/skeeto/enchive">Enchive</a>): Carefully concatenate all the application’s sources
into one giant source file. First concatenate all the headers in the
right order, followed by all the C files. Use <code class="language-plaintext highlighter-rouge">sed</code> to remove any local
includes. You can do this all on a unix system with the nice utilities,
then point <code class="language-plaintext highlighter-rouge">cl.exe</code> at the amalgamation for the Visual Studio build.
It’s not very useful for actual development (i.e. you don’t want to edit
the amalgamation), but that’s what MinGW-w64 resolves.</p>

<p>What about all those POSIX functions? You’ll need to find Win32
replacements on MSDN. I prefer to do this by abstracting those
operating system calls. For example, compare POSIX <code class="language-plaintext highlighter-rouge">sleep(3)</code> and <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx">Win32
<code class="language-plaintext highlighter-rouge">Sleep()</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if defined(_WIN32)
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">my_sleep</span><span class="p">(</span><span class="kt">int</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Sleep</span><span class="p">(</span><span class="n">s</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">);</span>  <span class="c1">// TODO: handle overflow, maybe</span>
<span class="p">}</span>

<span class="cp">#else </span><span class="cm">/* __unix__ */</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">my_sleep</span><span class="p">(</span><span class="kt">int</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sleep</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>  <span class="c1">// TODO: fix signal interruption</span>
<span class="p">}</span>
<span class="cp">#endif
</span></code></pre></div></div>

<p>Then the rest of the program calls <code class="language-plaintext highlighter-rouge">my_sleep()</code>. There’s another example
in <a href="/blog/2017/03/01/">the OpenMP article</a> with <code class="language-plaintext highlighter-rouge">pwrite(2)</code> and <code class="language-plaintext highlighter-rouge">WriteFile()</code>. This
demonstrates that supporting a bunch of different unix-like systems is
really easy, but introducing Windows portability adds a disproportionate
amount of complexity.</p>
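
<p>For what it’s worth, here’s one way the two TODO comments might be
addressed. It’s a sketch rather than the final word: I’ve assumed an
<code class="language-plaintext highlighter-rouge">unsigned</code> parameter and an <code class="language-plaintext highlighter-rouge">int</code> return value, which differ from the
version above, and the chunk size on the Windows side is arbitrary.</p>

```c
#if defined(_WIN32)
#include <windows.h>

/* Sleep for s seconds. Always returns 0. */
int
my_sleep(unsigned s)
{
    /* Sleep() takes milliseconds as a 32-bit DWORD, so sleep in
     * chunks small enough that the multiplication cannot overflow. */
    while (s > 0) {
        unsigned chunk = s > 1000000 ? 1000000 : s;
        Sleep((DWORD)chunk * 1000);
        s -= chunk;
    }
    return 0;
}

#else /* POSIX */
#include <unistd.h>

/* Sleep for s seconds. Always returns 0. */
int
my_sleep(unsigned s)
{
    /* sleep() returns the number of unslept seconds when a signal
     * handler interrupts it, so keep going until nothing remains. */
    while (s > 0) {
        s = sleep(s);
    }
    return 0;
}
#endif
```

<p>Since <code class="language-plaintext highlighter-rouge">sleep()</code> reports the remainder in whole seconds, a restarted
sleep may run slightly long. That’s usually acceptable for this sort
of coarse delay.</p>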

<h4 id="caveat-paths-and-filenames">Caveat: paths and filenames</h4>

<p>There’s one major complication with filenames for applications portable
to Windows. In the unix world, filenames are null-terminated bytestrings.
Typically these are Unicode strings encoded as UTF-8, but it’s not
necessarily so. The kernel just sees bytestrings. A bytestring doesn’t
necessarily have a formal Unicode representation, which can be a problem
for <a href="https://www.python.org/dev/peps/pep-0383/">languages that want filenames to be Unicode strings</a>
(<a href="http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html">also</a>).</p>

<p>On Windows, filenames are somewhere between UCS-2 and UTF-16, but end up
being neither. They’re really null-terminated unsigned 16-bit integer
arrays. It’s <em>almost</em> UTF-16 except that Windows allows unpaired
surrogates. This means Windows filenames <em>also</em> don’t have a formal
Unicode representation, but in a completely different way than unix. Some
<a href="https://simonsapin.github.io/wtf-8/">heroic efforts have gone into working around this issue</a>.</p>
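
<p>To make “almost UTF-16” concrete, here’s a small sketch that checks
whether a 16-bit string is well-formed UTF-16. The function name
<code class="language-plaintext highlighter-rouge">utf16_is_valid()</code> is my own invention, and it uses <code class="language-plaintext highlighter-rouge">uint16_t</code> instead of
<code class="language-plaintext highlighter-rouge">wchar_t</code> so that it behaves the same on any system. A Windows filename
is allowed to fail this check.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the null-terminated 16-bit string is well-formed
 * UTF-16, or 0 if it contains an unpaired surrogate. */
int
utf16_is_valid(const uint16_t *s)
{
    for (size_t i = 0; s[i]; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate.
             * Reading s[i + 1] is safe since s[i] is not the terminator. */
            if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return 0;
            }
            i++;  /* skip the low half of the pair */
        } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            return 0;  /* low surrogate without a preceding high one */
        }
    }
    return 1;
}
```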

<p>As a result, it’s highly non-trivial to correctly support all possible
filenames on both systems in the same program, <em>especially</em> when they’re
passed as command line arguments.</p>

<h3 id="summary">Summary</h3>

<p>The key points are:</p>

<ol>
  <li>Document the standards your application requires and strictly stick
to them.</li>
  <li>Ignore the vendor documentation if it doesn’t clearly delineate
extensions.</li>
</ol>

<p>This was all a discussion of non-GUI applications, and I didn’t really
touch on libraries. Many libraries are simple to access in the build
(just add it to <code class="language-plaintext highlighter-rouge">LDLIBS</code>), but some libraries — GUIs in particular — are
particularly complicated to manage portably and will require a more
complex solution (pkg-config, CMake, Autoconf, etc.).</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>OpenMP and pwrite()</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/03/01/"/>
    <id>urn:uuid:dfdf8ca6-51aa-3a15-6bf0-98b39f20652a</id>
    <updated>2017-03-01T21:22:24Z</updated>
    <category term="c"/><category term="posix"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>The most common way I introduce multi-threading to <a href="/blog/2015/07/10/">small C
programs</a> is with OpenMP (Open Multi-Processing). It’s typically
used as compiler pragmas to parallelize computationally expensive
loops — iterations are processed by different threads in some
arbitrary order.</p>

<p>Here’s an example that computes the <a href="/blog/2011/11/28/">frames of a video</a> in
parallel. Despite being computed out of order, each frame is written
in order to a large buffer, then written to standard output all at
once at the end.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_frames</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">output</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">beta</span> <span class="o">=</span> <span class="n">DEFAULT_BETA</span><span class="p">;</span>

<span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="o">&amp;</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">output</span><span class="p">);</span>
</code></pre></div></div>

<p>Adding OpenMP to this program is much simpler than introducing
low-level threading semantics with, say, Pthreads. With care, there’s
often no need for explicit thread synchronization. It’s also fairly
well supported by many vendors, even Microsoft (up to OpenMP 2.0), so
a multi-threaded OpenMP program is quite portable without <code class="language-plaintext highlighter-rouge">#ifdef</code>.</p>

<p>There’s real value in this pragma API: <strong>The above example would still
compile and run correctly even when OpenMP isn’t available.</strong> The
pragma is ignored and the program just uses a single core like it
normally would. It’s a slick fallback.</p>
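
<p>Here’s a self-contained sketch of that fallback. The function names
are made up for the example. Built with OpenMP enabled (<code class="language-plaintext highlighter-rouge">-fopenmp</code> on
GCC and Clang) the loop is spread across threads. Built without it, the
same source compiles to an ordinary serial loop, though some compilers
will warn about the unrecognized pragma.</p>

```c
#ifdef _OPENMP
#include <omp.h>  /* only needed for omp_get_max_threads() */
#endif

/* Fill out[i] with i squared. Each iteration writes only its own
 * slot, so no synchronization is needed between threads. */
void
fill_squares(long *out, int n)
{
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; i++) {
        out[i] = (long)i * i;
    }
}

/* Upper bound on the threads the loop may use. The _OPENMP macro is
 * defined by the compiler only when OpenMP is enabled. */
int
max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

<p>Note that calling into the OpenMP runtime, as <code class="language-plaintext highlighter-rouge">max_threads()</code> does,
is the one place that needs an <code class="language-plaintext highlighter-rouge">#ifdef</code>. The pragma itself does not.</p>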

<p>When a program really <em>does</em> require synchronization there’s
<code class="language-plaintext highlighter-rouge">omp_lock_t</code> (mutex lock) and the expected set of functions to operate
on them. This doesn’t have the nice fallback, so I don’t like to use
it. Instead, I prefer <code class="language-plaintext highlighter-rouge">#pragma omp critical</code>. It nicely maintains the
OpenMP-unsupported fallback.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="cp">#pragma omp critical
</span>    <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="p">}</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This would append the output to some output file in an arbitrary
order. The critical section <a href="/blog/2016/08/03/">prevents interleaving of
outputs</a>.</p>

<p>There are a couple of problems with this example:</p>

<ol>
  <li>
    <p>Only one thread can write at a time. If the write takes too long,
other threads will queue up behind the critical section and wait.</p>
  </li>
  <li>
    <p>The output frames will be out of order, which is probably
inconvenient for consumers. If the output is seekable this can be
solved with <code class="language-plaintext highlighter-rouge">lseek()</code>, but that only makes the critical section
even more important.</p>
  </li>
</ol>

<p>There’s an easy fix for both, one that also eliminates the need for a critical
section: POSIX <code class="language-plaintext highlighter-rouge">pwrite()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">pwrite</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">off_t</span> <span class="n">offset</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s like <code class="language-plaintext highlighter-rouge">write()</code> but has an offset parameter. Unlike <code class="language-plaintext highlighter-rouge">lseek()</code>
followed by a <code class="language-plaintext highlighter-rouge">write()</code>, multiple threads and processes can, in
parallel, safely write to the same file descriptor at different file
offsets. The catch is that <strong>the output must be a file, not a pipe</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span> <span class="o">*</span> <span class="n">i</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s no critical section, the writes can interleave, and the output
is in order.</p>
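
<p>To see why ordering stops mattering, here’s a small single-threaded
sketch that deliberately writes records in <em>reverse</em> and then reads
them back in ascending order. The record format and the function names
are assumptions for the demo. I’ve requested <code class="language-plaintext highlighter-rouge">_XOPEN_SOURCE 600</code> because
<code class="language-plaintext highlighter-rouge">pwrite()</code> and <code class="language-plaintext highlighter-rouge">pread()</code> are XSI extensions in POSIX.1-2001.</p>

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600  /* POSIX.1-2001 plus XSI, for pwrite()/pread() */
#endif
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define RECORD_SIZE 4  /* three digits and a newline */

/* Format record i into buf, which must hold at least 16 bytes. */
static void
make_record(char *buf, int i)
{
    snprintf(buf, 16, "%03d\n", i % 1000);
}

/* Write n records to fd, last record first, each at its own offset.
 * Returns 1 on success, 0 on failure. */
int
write_reversed(int fd, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        char rec[16];
        make_record(rec, i);
        off_t offset = (off_t)RECORD_SIZE * i;
        if (pwrite(fd, rec, RECORD_SIZE, offset) != RECORD_SIZE) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if the n records stored in fd are in ascending order. */
int
records_in_order(int fd, int n)
{
    for (int i = 0; i < n; i++) {
        char want[16], got[16];
        make_record(want, i);
        off_t offset = (off_t)RECORD_SIZE * i;
        if (pread(fd, got, RECORD_SIZE, offset) != RECORD_SIZE) {
            return 0;
        }
        if (memcmp(want, got, RECORD_SIZE) != 0) {
            return 0;
        }
    }
    return 1;
}
```

<p>Neither function touches the descriptor’s file position, which is
what makes the same calls safe to issue from several threads at once.</p>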

<p>If you’re concerned about standard output not being seekable (it often
isn’t), keep in mind that it will work just fine when invoked like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./compute_frames &gt; frames.ppm
</code></pre></div></div>
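
<p>If you’d rather detect the problem up front than have <code class="language-plaintext highlighter-rouge">pwrite()</code> fail
later, one option is to probe the descriptor with <code class="language-plaintext highlighter-rouge">lseek()</code>, which fails
with <code class="language-plaintext highlighter-rouge">ESPIPE</code> on a pipe, FIFO, or socket. This is only a sketch, and
<code class="language-plaintext highlighter-rouge">fd_is_seekable()</code> is a name I made up. What happens on a terminal varies
between systems, so don’t rely on it there.</p>

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if fd has a file position that can be queried, 0 if it
 * is a pipe, FIFO, or socket, and -1 on any other error. */
int
fd_is_seekable(int fd)
{
    if (lseek(fd, 0, SEEK_CUR) != (off_t)-1) {
        return 1;
    }
    return errno == ESPIPE ? 0 : -1;
}
```

<p>A program could call this on <code class="language-plaintext highlighter-rouge">STDOUT_FILENO</code> at startup and fall back
to buffering the whole output when it returns zero.</p>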

<h3 id="windows-portability">Windows Portability</h3>

<p>I talked about OpenMP being really portable, then used POSIX
functions. Fortunately the Win32 <code class="language-plaintext highlighter-rouge">WriteFile()</code> function has an
“overlapped” parameter that works just like <code class="language-plaintext highlighter-rouge">pwrite()</code>. Typically
rather than call either directly, I’d wrap the write like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">out</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">written</span><span class="p">;</span>
    <span class="n">OVERLAPPED</span> <span class="n">offset</span> <span class="o">=</span> <span class="p">{.</span><span class="n">Offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">};</span>
    <span class="k">return</span> <span class="n">WriteFile</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">offset</span><span class="p">);</span>
<span class="p">}</span>

<span class="cp">#else </span><span class="cm">/* POSIX */</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="n">offset</span><span class="p">)</span> <span class="o">==</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#endif
</span></code></pre></div></div>

<p>Except for switching to <code class="language-plaintext highlighter-rouge">write_frame()</code>, the OpenMP part remains
untouched.</p>

<h3 id="real-world-example">Real World Example</h3>

<p>Here’s an example in a real program:</p>

<p><a href="https://gist.github.com/skeeto/d7e17bb2aa40907a3405c3933cb1f936" class="download">julia.c</a></p>

<p>Notice because of <code class="language-plaintext highlighter-rouge">pwrite()</code> there’s no piping directly into
<code class="language-plaintext highlighter-rouge">ppmtoy4m</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./julia &gt; output.ppm
$ ppmtoy4m -F 60:1 &lt; output.ppm &gt; output.y4m
$ x264 -o output.mp4 output.y4m
</code></pre></div></div>

<p><a href="/video/?v=julia-256" class="download">output.mp4</a></p>

<video src="https://skeeto.s3.amazonaws.com/share/julia-256.mp4" controls="" loop="" crossorigin="anonymous">
</video>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Asynchronous Requests from Emacs Dynamic Modules</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/02/14/"/>
    <id>urn:uuid:00a59e4f-268c-343f-e6c6-bb23cde265de</id>
    <updated>2017-02-14T02:30:00Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="c"/><category term="linux"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>A few months ago I had a discussion with Vladimir Kazanov about his
<a href="https://github.com/vkazanov/toy-orgfuse">Orgfuse</a> project: a Python script that exposes an Emacs
Org-mode document as a <a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace">FUSE filesystem</a>. It permits other
programs to navigate the structure of an Org-mode document through the
standard filesystem APIs. I suggested that, with the new dynamic
modules in Emacs 25, Emacs <em>itself</em> could serve a FUSE filesystem. In
fact, support for FUSE services in general could be a package of his
own.</p>

<p>So that’s what he did: <a href="https://github.com/vkazanov/elfuse"><strong>Elfuse</strong></a>. It’s an old joke that
Emacs is an operating system, and here it is handling system calls.</p>

<p>However, there’s a tricky problem to solve, an issue also present in <a href="/blog/2016/11/05/">my
joystick module</a>. Both modules handle asynchronous events —
filesystem requests or joystick events — but Emacs runs the event loop
and owns the main thread. The external events somehow need to feed
into the main event loop. It’s even more difficult with FUSE because
FUSE <em>also</em> wants control of its own thread for its own event loop.
This requires Elfuse to spawn a dedicated FUSE thread and negotiate a
request/response hand-off.</p>

<p>When a filesystem request or joystick event arrives, how does Emacs
know to handle it? The simple and obvious solution is to poll the
module from a timer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">queue</span> <span class="n">requests</span><span class="p">;</span>

<span class="n">emacs_value</span>
<span class="nf">Frequest_next</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">emacs_value</span> <span class="n">next</span> <span class="o">=</span> <span class="n">Qnil</span><span class="p">;</span>
    <span class="n">queue_lock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">queue_length</span><span class="p">(</span><span class="n">requests</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">request</span> <span class="o">=</span> <span class="n">queue_pop</span><span class="p">(</span><span class="n">requests</span><span class="p">,</span> <span class="n">env</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fin_empty</span><span class="p">,</span> <span class="n">request</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">queue_unlock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And then ask Emacs to check the module every, say, 10ms:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">request--poll</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">next</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">when</span> <span class="nv">next</span>
      <span class="p">(</span><span class="nv">request-handle</span> <span class="nv">next</span><span class="p">))))</span>

<span class="p">(</span><span class="nv">run-at-time</span> <span class="mi">0</span> <span class="mf">0.01</span> <span class="nf">#'</span><span class="nv">request--poll</span><span class="p">)</span>
</code></pre></div></div>

<p>Blocking directly on the module’s event pump with Emacs’ thread would
prevent Emacs from doing important things like, you know, <em>being a
text editor</em>. The timer allows it to handle its own events
uninterrupted. It gets the job done, but it’s far from perfect:</p>

<ol>
  <li>
    <p>It imposes an arbitrary latency on handling requests. Up to a full
poll period could pass before a request is handled.</p>
  </li>
  <li>
    <p>Polling the module 100 times per second is inefficient. Unless you
really enjoy recharging your laptop, that’s no good.</p>
  </li>
</ol>

<p>The poll period is a sliding trade-off between latency and battery
life. If only there was some mechanism to, ahem, <em>signal</em> the Emacs
thread, informing it that a request is waiting…</p>

<h3 id="sigusr1">SIGUSR1</h3>

<p>Emacs Lisp programs can handle the POSIX SIGUSR1 and SIGUSR2 signals,
which is exactly the mechanism we need. The interface is a “key”
binding on <code class="language-plaintext highlighter-rouge">special-event-map</code>, the keymap that handles these kinds of
events. When the signal arrives, Emacs queues it up for the main event
loop.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">define-key</span> <span class="nv">special-event-map</span> <span class="nv">[sigusr1]</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
    <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">request-handle</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">))))</span>
</code></pre></div></div>

<p>The module blocks on its own thread on its own event pump. When a
request arrives, it queues the request, rings the bell for Emacs to
come handle it (<code class="language-plaintext highlighter-rouge">raise()</code>), and waits on a semaphore. For illustration
purposes, assume the module reads requests from and writes responses
to a file descriptor, like a socket.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">event_fd</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">request</span> <span class="n">request</span><span class="p">;</span>
<span class="n">sem_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">sem</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="cm">/* Blocking read for request event */</span>
    <span class="n">read</span><span class="p">(</span><span class="n">event_fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">event</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">event</span><span class="p">));</span>

    <span class="cm">/* Put request on the queue */</span>
    <span class="n">queue_lock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="n">queue_push</span><span class="p">(</span><span class="n">requests</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">);</span>
    <span class="n">queue_unlock</span><span class="p">(</span><span class="n">requests</span><span class="p">);</span>
    <span class="n">raise</span><span class="p">(</span><span class="n">SIGUSR1</span><span class="p">);</span>  <span class="c1">// TODO: Should raise() go inside the lock?</span>

    <span class="cm">/* Wait for Emacs */</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">sem_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">sem</span><span class="p">))</span>
        <span class="p">;</span>

    <span class="cm">/* Reply with Emacs' response */</span>
    <span class="n">write</span><span class="p">(</span><span class="n">event_fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">request</span><span class="p">.</span><span class="n">response</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">response</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">sem_wait()</code> is in a loop because signals will wake it up
prematurely. In fact, it may even wake up due to its own signal on the
line before. This is the only way this particular use of <code class="language-plaintext highlighter-rouge">sem_wait()</code>
might fail, so there’s no need to check <code class="language-plaintext highlighter-rouge">errno</code>.</p>
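
<p>If that assumption ever stops holding, say because the semaphore
might be destroyed or misconfigured elsewhere, a more defensive loop
retries only on <code class="language-plaintext highlighter-rouge">EINTR</code>. This is a minimal sketch, and the name
<code class="language-plaintext highlighter-rouge">sem_wait_retry()</code> is my own invention, not something from Elfuse:</p>

```c
#include <errno.h>
#include <semaphore.h>

/* Hypothetical helper: wait on a semaphore, retrying only when a
 * signal handler interrupted the call (EINTR). Any other failure is
 * reported to the caller instead of spinning forever. */
static int
sem_wait_retry(sem_t *sem)
{
    for (;;) {
        if (sem_wait(sem) == 0)
            return 0;   /* semaphore was decremented */
        if (errno != EINTR)
            return -1;  /* a real error, e.g. EINVAL */
    }
}
```

<p>The module thread would call this in place of the bare
<code class="language-plaintext highlighter-rouge">while</code> loop and bail out on a non-zero result.</p>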

<p>If there are multiple module threads making requests to the same
global queue, the lock is necessary to protect the queue. The
semaphore is only for blocking the thread until Emacs has finished
writing its particular response. Each thread has its own semaphore.</p>
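
<p>The queue itself has been left abstract so far. As a rough sketch of
what could sit behind it, here is a fixed-capacity ring buffer guarded by
a mutex. Everything in it is an assumption for illustration: the capacity,
the field names, and the choice to fold locking into push and pop rather
than expose separate lock and unlock calls as the snippets above do.</p>

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 64  /* arbitrary capacity for this sketch */

struct queue {
    pthread_mutex_t lock;
    void *items[QUEUE_CAP];
    size_t head;  /* index of the oldest item */
    size_t len;   /* number of items stored */
};

static void
queue_init(struct queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = 0;
    q->len = 0;
}

/* Append an item. Returns 1 on success, 0 if the queue is full. */
static int
queue_push(struct queue *q, void *item)
{
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->len < QUEUE_CAP) {
        size_t tail = (q->head + q->len) % QUEUE_CAP;
        q->items[tail] = item;
        q->len++;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Remove and return the oldest item, or NULL if the queue is empty. */
static void *
queue_pop(struct queue *q)
{
    void *item = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->len > 0) {
        item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->len--;
    }
    pthread_mutex_unlock(&q->lock);
    return item;
}
```

<p>Module threads push pointers to their own request structs, and the
Emacs thread pops them one at a time. The per-request semaphore stays
separate from this lock.</p>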

<p>When Emacs is done writing the response, it releases the module thread
by incrementing the semaphore. It might look something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">emacs_value</span>
<span class="nf">Frequest_complete</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">request</span> <span class="o">*</span><span class="n">request</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">request</span><span class="p">)</span>
        <span class="n">sem_post</span><span class="p">(</span><span class="o">&amp;</span><span class="n">request</span><span class="o">-&gt;</span><span class="n">sem</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">Qnil</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The top-level handler dispatches to the specific request handler,
calling <code class="language-plaintext highlighter-rouge">request-complete</code> above when it’s done.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">request-handle</span> <span class="p">(</span><span class="nv">next</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">condition-case</span> <span class="nv">e</span>
      <span class="p">(</span><span class="nv">cl-ecase</span> <span class="p">(</span><span class="nv">request-type</span> <span class="nv">next</span><span class="p">)</span>
        <span class="p">(</span><span class="ss">:open</span>  <span class="p">(</span><span class="nv">request-handle-open</span>  <span class="nv">next</span><span class="p">))</span>
        <span class="p">(</span><span class="ss">:close</span> <span class="p">(</span><span class="nv">request-handle-close</span> <span class="nv">next</span><span class="p">))</span>
        <span class="p">(</span><span class="ss">:read</span>  <span class="p">(</span><span class="nv">request-handle-read</span>  <span class="nv">next</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">error</span> <span class="p">(</span><span class="nv">request-respond-as-error</span> <span class="nv">next</span> <span class="nv">e</span><span class="p">)))</span>
  <span class="p">(</span><span class="nv">request-complete</span> <span class="nv">next</span><span class="p">))</span>
</code></pre></div></div>

<p>This SIGUSR1+semaphore mechanism is roughly how Elfuse currently
processes requests.</p>

<h3 id="making-it-work-on-windows">Making it work on Windows</h3>

<p>Windows doesn’t have POSIX signals like SIGUSR1. This isn’t a problem for Elfuse since
Windows doesn’t have FUSE either. Nor does it matter for Joymacs since
XInput isn’t event-driven and always requires polling. But someday
someone will need this mechanism for a dynamic module on Windows.</p>

<p>Fortunately there’s a solution: <em>input language change</em> events,
<code class="language-plaintext highlighter-rouge">WM_INPUTLANGCHANGE</code>. It’s also on <code class="language-plaintext highlighter-rouge">special-event-map</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">define-key</span> <span class="nv">special-event-map</span> <span class="nv">[language-change]</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
    <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">request-handle</span> <span class="p">(</span><span class="nv">request-next</span><span class="p">))))</span>
</code></pre></div></div>

<p>Instead of <code class="language-plaintext highlighter-rouge">raise()</code> (or <code class="language-plaintext highlighter-rouge">pthread_kill()</code>), broadcast the window event
with <code class="language-plaintext highlighter-rouge">PostMessage()</code>. Outside of invoking the <code class="language-plaintext highlighter-rouge">language-change</code> key
binding, Emacs will ignore the event because WPARAM is 0 — it doesn’t
belong to any particular window. We don’t <em>really</em> want to change the
input language, after all.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PostMessageA</span><span class="p">(</span><span class="n">HWND_BROADCAST</span><span class="p">,</span> <span class="n">WM_INPUTLANGCHANGE</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Naturally you’ll also need to replace the POSIX threading primitives
with the Windows versions (<code class="language-plaintext highlighter-rouge">CreateThread()</code>, <code class="language-plaintext highlighter-rouge">CreateSemaphore()</code>,
etc.). With a bit of abstraction in the right places, it should be
pretty easy to support both POSIX and Windows in these asynchronous
dynamic module events.</p>
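
<p>As one example of that abstraction, here’s a sketch of a semaphore
wrapper that compiles against either API. The
<code class="language-plaintext highlighter-rouge">xsem</code> names are hypothetical and exist only for this
illustration. The Windows branch uses
<code class="language-plaintext highlighter-rouge">CreateSemaphoreA()</code>, <code class="language-plaintext highlighter-rouge">ReleaseSemaphore()</code>, and
<code class="language-plaintext highlighter-rouge">WaitForSingleObject()</code>. I’ve only exercised the POSIX branch, so
treat the other half as untested.</p>

```c
#ifdef _WIN32
#  include <windows.h>
#  include <limits.h>
typedef HANDLE xsem;
#else
#  include <errno.h>
#  include <semaphore.h>
typedef sem_t xsem;
#endif

/* Hypothetical wrapper (xsem_*), not part of Emacs or Elfuse.
 * Every function returns 0 on success and -1 on failure. */

/* Create a semaphore with an initial count of zero. */
static int
xsem_init(xsem *s)
{
#ifdef _WIN32
    *s = CreateSemaphoreA(NULL, 0, LONG_MAX, NULL);
    return *s ? 0 : -1;
#else
    return sem_init(s, 0, 0);
#endif
}

/* Increment the semaphore, releasing one waiter. */
static int
xsem_post(xsem *s)
{
#ifdef _WIN32
    return ReleaseSemaphore(*s, 1, NULL) ? 0 : -1;
#else
    return sem_post(s);
#endif
}

/* Block until the semaphore can be decremented. */
static int
xsem_wait(xsem *s)
{
#ifdef _WIN32
    return WaitForSingleObject(*s, INFINITE) == WAIT_OBJECT_0 ? 0 : -1;
#else
    int r;
    do {
        r = sem_wait(s);  /* signals may interrupt the wait */
    } while (r != 0 && errno == EINTR);
    return r;
#endif
}
```

<p>The notification side (<code class="language-plaintext highlighter-rouge">raise()</code> versus
<code class="language-plaintext highlighter-rouge">PostMessageA()</code>) could be hidden behind a similar one-function
shim.</p>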

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Manual Control Flow Guard in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/01/21/"/>
    <id>urn:uuid:f185405a-3e30-3612-7a21-6d4ec450519d</id>
    <updated>2017-01-21T22:44:15Z</updated>
    <category term="c"/><category term="linux"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p>Recent versions of Windows have a new exploit mitigation feature
called <a href="http://sjc1-te-ftp.trendmicro.com/assets/wp/exploring-control-flow-guard-in-windows10.pdf"><em>Control Flow Guard</em></a> (CFG). Before an indirect function
call — e.g. function pointers and virtual functions — the target
address is checked against a table of valid call addresses. If the
address isn’t the entry point of a known function, then the program is
aborted.</p>

<p>If an application has a buffer overflow vulnerability, an attacker may
use it to overwrite a function pointer and, by the call through that
pointer, control the execution flow of the program. This is one way to
initiate a <a href="https://skeeto.s3.amazonaws.com/share/p15-coffman.pdf"><em>Return Oriented Programming</em></a> (ROP) attack, where
the attacker constructs <a href="https://github.com/JonathanSalwan/ROPgadget">a chain of <em>gadget</em> addresses</a> — a
gadget being a couple of instructions followed by a return
instruction, all in the original program — using the indirect call as
the starting point. The execution then flows from gadget to gadget so
that the program does what the attacker wants it to do, all without
the attacker supplying any code.</p>

<p>The two most widely practiced ROP attack mitigation techniques today
are <em>Address Space Layout Randomization</em> (ASLR) and <em>stack
protectors</em>. The former randomizes the base address of executable
images (programs, shared libraries) so that process memory layout is
unpredictable to the attacker. The addresses in the ROP attack chain
depend on the run-time memory layout, so the attacker must also find
and exploit an <a href="https://github.com/torvalds/linux/blob/4c9eff7af69c61749b9eb09141f18f35edbf2210/Documentation/sysctl/kernel.txt#L373">information leak</a> to bypass ASLR.</p>

<p>For stack protectors, the compiler allocates a <em>canary</em> on the stack
above other stack allocations and sets the canary to a per-thread
random value. If a buffer overflows to overwrite the function return
pointer, the canary value will also be overwritten. Before the
function returns by the return pointer, it checks the canary. If the
canary doesn’t match the known value, the program is aborted.</p>

<p><img src="/img/cfg/canary.svg" alt="" /></p>

<p>CFG works similarly — performing a check prior to passing control to
the address in a pointer — except that instead of checking a canary,
it checks the target address itself. This is a lot more sophisticated,
and, unlike a stack canary, essentially requires coordination by the
platform. The check must be informed of all valid call targets,
whether from the main program or from shared libraries.</p>

<p>While not (yet?) widely deployed, a worthy mention is <a href="http://clang.llvm.org/docs/SafeStack.html">Clang’s
SafeStack</a>. Each thread gets <em>two</em> stacks: a “safe stack” for
return pointers and other safely-accessed values, and an “unsafe
stack” for buffers and such. Buffer overflows will corrupt other
buffers but will not overwrite return pointers, limiting the effect of
their damage.</p>

<h3 id="an-exploit-example">An exploit example</h3>

<p>Consider this trivial C program, <code class="language-plaintext highlighter-rouge">demo.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
    <span class="n">gets</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Hello, %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It reads a name into a buffer and prints it back out with a greeting.
While trivial, it’s far from innocent. That naive call to <code class="language-plaintext highlighter-rouge">gets()</code>
doesn’t check the bounds of the buffer, introducing an exploitable
buffer overflow. It’s so obvious that both the compiler and linker
will yell about it.</p>

<p>For simplicity, suppose the program also contains a dangerous
function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">self_destruct</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"**** GO BOOM! ****"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The attacker can use the buffer overflow to call this dangerous
function.</p>

<p>To make this attack simpler for the sake of the article, assume the
program isn’t using ASLR (e.g. without <code class="language-plaintext highlighter-rouge">-fpie</code>/<code class="language-plaintext highlighter-rouge">-pie</code>, or with
<code class="language-plaintext highlighter-rouge">-fno-pie</code>/<code class="language-plaintext highlighter-rouge">-no-pie</code>). For this particular example, I’ll also
explicitly disable buffer overflow protections (e.g. <code class="language-plaintext highlighter-rouge">_FORTIFY_SOURCE</code>
and stack protectors).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fno-stack-protector \
      -o demo demo.c
</code></pre></div></div>

<p>First, find the address of <code class="language-plaintext highlighter-rouge">self_destruct()</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -a demo | grep self_destruct
46: 00000000004005c5  10 FUNC  GLOBAL DEFAULT 13 self_destruct
</code></pre></div></div>

<p>This is on x86-64, so it’s a 64-bit address. The size of the <code class="language-plaintext highlighter-rouge">name</code>
buffer is 8 bytes, and peeking at the assembly I see an extra 8 bytes
allocated above, so there’s 16 bytes to fill, then 8 bytes to
overwrite the return pointer with the address of <code class="language-plaintext highlighter-rouge">self_destruct</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo -ne 'xxxxxxxxyyyyyyyy\xc5\x05\x40\x00\x00\x00\x00\x00' &gt; boom
$ ./demo &lt; boom
Hello, xxxxxxxxyyyyyyyy?@.
**** GO BOOM! ****
Segmentation fault
</code></pre></div></div>

<p>With this input I’ve successfully exploited the buffer overflow to
divert control to <code class="language-plaintext highlighter-rouge">self_destruct()</code>. When <code class="language-plaintext highlighter-rouge">main</code> tries to return into
libc, it instead jumps to the dangerous function, and then crashes
when that function tries to return — though, presumably, the system
would have self-destructed already. Turning on the stack protector
stops this exploit.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fstack-protector \
      -o demo demo.c
$ ./demo &lt; boom
Hello, xxxxxxxxyyyyyyyy?@.
*** stack smashing detected ***: ./demo terminated
======= Backtrace: =========
... lots of backtrace stuff ...
</code></pre></div></div>

<p>The stack protector successfully blocks the exploit. To get around
this, I’d have to either guess the canary value or discover an
information leak that reveals it.</p>

<p>The stack protector transformed the program into something that looks
like the following:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">__canary</span> <span class="o">=</span> <span class="n">__get_thread_canary</span><span class="p">();</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
    <span class="n">gets</span><span class="p">(</span><span class="n">name</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Hello, %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">__canary</span> <span class="o">!=</span> <span class="n">__get_thread_canary</span><span class="p">())</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, it’s not actually possible to implement the stack protector
within C. Buffer overflows are undefined behavior, and a canary is
only affected by a buffer overflow, allowing the compiler to optimize
it away.</p>

<h3 id="function-pointers-and-virtual-functions">Function pointers and virtual functions</h3>

<p>After the attacker successfully self-destructed the last computer,
upper management has mandated password checks before all
self-destruction procedures. Here’s what it looks like now:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">self_destruct</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">password</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="s">"12345"</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">puts</span><span class="p">(</span><span class="s">"**** GO BOOM! ****"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The password is hardcoded, and it’s the kind of thing an idiot would
have on his luggage, but assume it’s actually unknown to the attacker.
Especially since, as I’ll show shortly, it won’t matter. Upper
management has also mandated stack protectors, so assume that’s
enabled from here on.</p>

<p>Additionally, the program has evolved a bit, and now <a href="/blog/2014/10/21/">uses a function
pointer for polymorphism</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">greeter</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">greet</span><span class="p">)(</span><span class="k">struct</span> <span class="n">greeter</span> <span class="o">*</span><span class="p">);</span>
<span class="p">};</span>

<span class="kt">void</span>
<span class="nf">greet_hello</span><span class="p">(</span><span class="k">struct</span> <span class="n">greeter</span> <span class="o">*</span><span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Hello, %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">g</span><span class="o">-&gt;</span><span class="n">name</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">greet_aloha</span><span class="p">(</span><span class="k">struct</span> <span class="n">greeter</span> <span class="o">*</span><span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"Aloha, %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">g</span><span class="o">-&gt;</span><span class="n">name</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s now a greeter object and the function pointer makes its
behavior polymorphic. Think of it as a hand-coded virtual function for
C. Here’s the new (contrived) <code class="language-plaintext highlighter-rouge">main</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">greeter</span> <span class="n">greeter</span> <span class="o">=</span> <span class="p">{.</span><span class="n">greet</span> <span class="o">=</span> <span class="n">greet_hello</span><span class="p">};</span>
    <span class="n">gets</span><span class="p">(</span><span class="n">greeter</span><span class="p">.</span><span class="n">name</span><span class="p">);</span>
    <span class="n">greeter</span><span class="p">.</span><span class="n">greet</span><span class="p">(</span><span class="o">&amp;</span><span class="n">greeter</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(In a real program, something else provides <code class="language-plaintext highlighter-rouge">greeter</code> and picks its
own function pointer for <code class="language-plaintext highlighter-rouge">greet</code>.)</p>

<p>Rather than overwriting the return pointer, the attacker has the
opportunity to overwrite the function pointer on the struct. Let’s
reconstruct the exploit like before.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -a demo | grep self_destruct
54: 00000000004006a5  10 FUNC  GLOBAL DEFAULT  13 self_destruct
</code></pre></div></div>

<p>We don’t know the password, but we <em>do</em> know (from peeking at the
disassembly) that the password check is 16 bytes. The attack should
instead jump 16 bytes into the function, skipping over the check
(0x4006a5 + 16 = 0x4006b5).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo -ne 'xxxxxxxx\xb5\x06\x40\x00\x00\x00\x00\x00' &gt; boom
$ ./demo &lt; boom
**** GO BOOM! ****
</code></pre></div></div>
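
<p>The same payload can be assembled in C, which makes the byte order
explicit: 8 filler bytes to cover <code class="language-plaintext highlighter-rouge">name</code>, then the target address
as 8 little-endian bytes, matching how x86-64 stores a pointer in memory.
This is only a sketch, and <code class="language-plaintext highlighter-rouge">build_payload()</code> is a made-up helper
name.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write `pad` filler bytes ('x') followed by
 * `addr` encoded as 8 little-endian bytes. Returns the payload
 * length. `out` must have room for pad + 8 bytes. */
static size_t
build_payload(unsigned char *out, size_t pad, uint64_t addr)
{
    memset(out, 'x', pad);
    for (int i = 0; i < 8; i++)
        out[pad + i] = (unsigned char)(addr >> (8 * i));
    return pad + 8;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">build_payload(buf, 8, 0x4006a5 + 16)</code> and writing the
16 resulting bytes to a file should reproduce the <code class="language-plaintext highlighter-rouge">boom</code> file
above. Since <code class="language-plaintext highlighter-rouge">gets()</code> stops at a newline, this only works when
none of the address bytes is <code class="language-plaintext highlighter-rouge">0x0a</code>.</p>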

<p>Neither the stack protector nor the password was of any help. The
stack protector only protects the <em>return</em> pointer, not the function
pointer on the struct.</p>

<p><strong>This is where the Control Flow Guard comes into play.</strong> With CFG
enabled, the compiler inserts a check before calling the <code class="language-plaintext highlighter-rouge">greet()</code>
function pointer. It must point to the beginning of a known function,
otherwise it will abort just like the stack protector. Since the
middle of <code class="language-plaintext highlighter-rouge">self_destruct()</code> isn’t the <em>beginning</em> of a function, it
would abort if this exploit is attempted.</p>

<p>However, I’m on Linux and there’s no CFG on Linux (yet?). So I’ll
implement it myself, with manual checks.</p>

<h3 id="function-address-bitmap">Function address bitmap</h3>

<p>As described in the PDF linked at the top of this article, CFG on
Windows is implemented using a bitmap. Each bit in the bitmap
represents 8 bytes of memory. If those 8 bytes contain the beginning
of a function, the bit will be set to one. Checking a pointer means
checking its associated bit in the bitmap.</p>

<p>For my CFG, I’ve decided to keep the same 8-byte resolution: the
bottom three bits of the target address will be dropped. The next 24
bits will be used to index into the bitmap. All other bits in the
pointer will be ignored. A 24-bit bit index means the bitmap will only
be 2MB.</p>

<p>These 24 bits are perfectly sufficient for 32-bit systems, but it means
on 64-bit systems there may be false positives: some addresses will
not represent the start of a function, but will have their bit set
to 1. This is acceptable, especially because only functions known to
be targets of indirect calls will be registered in the table, reducing
the false positive rate.</p>

<p>Note: Relying on <a href="/blog/2016/05/30/">the bits of a pointer cast to an integer is
unspecified</a> and isn’t portable, but this implementation will
work fine anywhere I would care to use it.</p>

<p>Here are the CFG parameters. I’ve made them macros so that they can
easily be tuned at compile-time. The <code class="language-plaintext highlighter-rouge">cfg_bits</code> is the integer type
backing the bitmap array. The <code class="language-plaintext highlighter-rouge">CFG_RESOLUTION</code> is the number of bits
dropped, so “3” is a granularity of 8 bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">cfg_bits</span><span class="p">;</span>
<span class="cp">#define CFG_RESOLUTION  3
#define CFG_BITS        24
</span></code></pre></div></div>

<p>Given a function pointer <code class="language-plaintext highlighter-rouge">f</code>, this macro extracts the bitmap index.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define CFG_INDEX(f) \
    (((uintptr_t)(f) &gt;&gt; CFG_RESOLUTION) &amp; ((1UL &lt;&lt; CFG_BITS) - 1))
</span></code></pre></div></div>
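<p>As a quick sanity check on the arithmetic, the following sketch applies the same shift and mask to the two addresses from the exploit above. The function names <code class="language-plaintext highlighter-rouge">cfg_index</code>, <code class="language-plaintext highlighter-rouge">cfg_same_bit</code>, and <code class="language-plaintext highlighter-rouge">cfg_index_demo</code> are assumptions made for this illustration and are not part of the CFG code in this article.</p>

```c
#include <assert.h>
#include <stdint.h>

#define CFG_RESOLUTION  3
#define CFG_BITS        24

/* Hypothetical function form of CFG_INDEX, taking a raw address. */
unsigned long
cfg_index(uintptr_t addr)
{
    return (unsigned long)((addr >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1));
}

/* Returns nonzero when two addresses map to the same bitmap bit. */
int
cfg_same_bit(uintptr_t a, uintptr_t b)
{
    return cfg_index(a) == cfg_index(b);
}

/* Worked values for the addresses used in the exploit. */
void
cfg_index_demo(void)
{
    /* Start of self_destruct() and the target 16 bytes into it. */
    assert(cfg_index(0x4006a5) == 0x800d4);
    assert(cfg_index(0x4006b5) == 0x800d6);

    /* Addresses exactly 2^(24+3) bytes apart alias to the same bit. */
    assert(cfg_same_bit(0x4006a0, 0x4006a0 + ((uintptr_t)1 << 27)));
}
```

<p>The start of <code class="language-plaintext highlighter-rouge">self_destruct()</code> and the address 16 bytes into it land on different bits, which is why registering the former does not validate the latter. Any two addresses exactly 128MB apart share a bit, which is where the false positives on 64-bit systems come from.</p>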

<p>The CFG bitmap is just an array of integers. Zero it to initialize.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">cfg</span> <span class="p">{</span>
    <span class="n">cfg_bits</span> <span class="n">bitmap</span><span class="p">[(</span><span class="mi">1UL</span> <span class="o">&lt;&lt;</span> <span class="n">CFG_BITS</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">cfg_bits</span><span class="p">)</span> <span class="o">*</span> <span class="n">CHAR_BIT</span><span class="p">)];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Functions are manually registered in the bitmap using
<code class="language-plaintext highlighter-rouge">cfg_register()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">cfg_register</span><span class="p">(</span><span class="k">struct</span> <span class="n">cfg</span> <span class="o">*</span><span class="n">cfg</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="n">CFG_INDEX</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">z</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">cfg_bits</span><span class="p">)</span> <span class="o">*</span> <span class="n">CHAR_BIT</span><span class="p">;</span>
    <span class="n">cfg</span><span class="o">-&gt;</span><span class="n">bitmap</span><span class="p">[</span><span class="n">i</span> <span class="o">/</span> <span class="n">z</span><span class="p">]</span> <span class="o">|=</span> <span class="mi">1UL</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">i</span> <span class="o">%</span> <span class="n">z</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Because functions are registered at run-time, it’s fully compatible
with ASLR. If ASLR is enabled, the bitmap will be a little different
each run. On the same note, it may be worth XORing each bitmap element
with a random, run-time value — along the same lines as the stack
canary value — to make it harder for an attacker to manipulate the
bitmap should a vulnerability provide the ability to overwrite it.
Alternatively the bitmap could be switched to read-only (e.g.
<code class="language-plaintext highlighter-rouge">mprotect()</code>) once everything is registered.</p>
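<p>The read-only option might look something like the sketch below. It assumes Linux, and it assumes the bitmap is allocated with <code class="language-plaintext highlighter-rouge">mmap()</code> so that it is page-aligned, since <code class="language-plaintext highlighter-rouge">mprotect()</code> operates on whole pages. The helpers <code class="language-plaintext highlighter-rouge">cfg_alloc()</code>, <code class="language-plaintext highlighter-rouge">cfg_seal()</code>, and <code class="language-plaintext highlighter-rouge">cfg_is_registered()</code> are hypothetical names introduced only for this example.</p>

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

typedef unsigned long cfg_bits;
#define CFG_RESOLUTION  3
#define CFG_BITS        24
#define CFG_INDEX(f) \
    (((uintptr_t)(f) >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))

struct cfg {
    cfg_bits bitmap[(1UL << CFG_BITS) / (sizeof(cfg_bits) * CHAR_BIT)];
};

/* Hypothetical helper: a zeroed, page-aligned bitmap straight from mmap(). */
struct cfg *
cfg_alloc(void)
{
    void *p = mmap(NULL, sizeof(struct cfg), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

void
cfg_register(struct cfg *cfg, void *f)
{
    assert(cfg != NULL);
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    cfg->bitmap[i / z] |= 1UL << (i % z);
}

/* Hypothetical helper: call once, after the last cfg_register(). Any
 * later write to the bitmap faults instead of quietly adding a target. */
int
cfg_seal(struct cfg *cfg)
{
    return mprotect(cfg, sizeof(*cfg), PROT_READ);
}

/* Non-aborting variant of the check, present only to make the sketch
 * easy to exercise. Returns 1 if the bit for f is set, 0 otherwise. */
int
cfg_is_registered(const struct cfg *cfg, void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    return (int)((cfg->bitmap[i / z] >> (i % z)) & 1);
}
```

<p>Reads still work after sealing, so the check itself is unaffected. The trade-off is that nothing can be registered afterwards without first making the bitmap writable again.</p>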

<p>And finally, the check function, used immediately before indirect
calls. It ensures <code class="language-plaintext highlighter-rouge">f</code> was previously passed to <code class="language-plaintext highlighter-rouge">cfg_register()</code>
(except for false positives, as discussed). Since it will be invoked
often, it needs to be fast and simple.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">cfg_check</span><span class="p">(</span><span class="k">struct</span> <span class="n">cfg</span> <span class="o">*</span><span class="n">cfg</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="n">CFG_INDEX</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">z</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">cfg_bits</span><span class="p">)</span> <span class="o">*</span> <span class="n">CHAR_BIT</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">((</span><span class="n">cfg</span><span class="o">-&gt;</span><span class="n">bitmap</span><span class="p">[</span><span class="n">i</span> <span class="o">/</span> <span class="n">z</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="n">i</span> <span class="o">%</span> <span class="n">z</span><span class="p">))</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">))</span>
        <span class="n">abort</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And that’s it! Now augment <code class="language-plaintext highlighter-rouge">main</code> to make use of it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">cfg</span> <span class="n">cfg</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">cfg_register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cfg</span><span class="p">,</span> <span class="n">self_destruct</span><span class="p">);</span>  <span class="c1">// to prove this works</span>
    <span class="n">cfg_register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cfg</span><span class="p">,</span> <span class="n">greet_hello</span><span class="p">);</span>
    <span class="n">cfg_register</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cfg</span><span class="p">,</span> <span class="n">greet_aloha</span><span class="p">);</span>

    <span class="k">struct</span> <span class="n">greeter</span> <span class="n">greeter</span> <span class="o">=</span> <span class="p">{.</span><span class="n">greet</span> <span class="o">=</span> <span class="n">greet_hello</span><span class="p">};</span>
    <span class="n">gets</span><span class="p">(</span><span class="n">greeter</span><span class="p">.</span><span class="n">name</span><span class="p">);</span>
    <span class="n">cfg_check</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cfg</span><span class="p">,</span> <span class="n">greeter</span><span class="p">.</span><span class="n">greet</span><span class="p">);</span>
    <span class="n">greeter</span><span class="p">.</span><span class="n">greet</span><span class="p">(</span><span class="o">&amp;</span><span class="n">greeter</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And now attempting the exploit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./demo &lt; boom
Aborted
</code></pre></div></div>

<p>Normally <code class="language-plaintext highlighter-rouge">self_destruct()</code> wouldn’t be registered since it’s not a
legitimate target of an indirect call, but the exploit <em>still</em> didn’t
work because it called into the middle of <code class="language-plaintext highlighter-rouge">self_destruct()</code>, which
isn’t a valid address in the bitmap. The check aborts the program
before it can be exploited.</p>

<p>In a real application I would have a <a href="/blog/2016/12/23/">global <code class="language-plaintext highlighter-rouge">cfg</code> bitmap</a> for
the whole program, and define <code class="language-plaintext highlighter-rouge">cfg_check()</code> in a header as an <code class="language-plaintext highlighter-rouge">inline</code>
function.</p>

<p>Despite being possible to implement in straight C without the help of the
toolchain, it would be far less cumbersome and error-prone to let the
compiler and platform handle Control Flow Guard. That’s the right
place to implement it.</p>

<p><em>Update</em>: Ted Unangst pointed out <a href="http://www.tedunangst.com/inks/l/849">OpenBSD performing a similar
check</a> in its mbuf library. Instead of a bitmap, the function
pointer is replaced with an index into an array of registered function
pointers. That approach is cleaner, more efficient, completely
portable, and has no false positives.</p>
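<p>A minimal sketch of that index-based idea, applied to the greeter example, might look like the following. This is not OpenBSD’s code. The table size, the <code class="language-plaintext highlighter-rouge">greet_register()</code>, <code class="language-plaintext highlighter-rouge">greet_lookup()</code>, and <code class="language-plaintext highlighter-rouge">greet_call()</code> names, and the layout of <code class="language-plaintext highlighter-rouge">struct greeter</code> are all assumptions made for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct greeter;
typedef void (*greet_fn)(struct greeter *);

struct greeter {
    char name[8];
    size_t greet;  /* index into greet_table, not a raw function pointer */
};

static greet_fn greet_table[4];
static size_t greet_count;

/* Register a callback and return its index. In this sketch a full table
 * is treated as a programming error. */
size_t
greet_register(greet_fn f)
{
    assert(greet_count < sizeof(greet_table) / sizeof(greet_table[0]));
    greet_table[greet_count] = f;
    return greet_count++;
}

/* Returns the callback for idx, or NULL if idx was never handed out. */
greet_fn
greet_lookup(size_t idx)
{
    return idx < greet_count ? greet_table[idx] : NULL;
}

/* Indirect call through the table; an unknown index aborts. */
void
greet_call(struct greeter *g)
{
    greet_fn f = greet_lookup(g->greet);
    if (!f)
        abort();
    f(g);
}

/* Sample callback; name may lack a terminator, so bound the print. */
void
greet_hello(struct greeter *g)
{
    printf("Hello, %.8s!\n", g->name);
}
```

<p>An overflow of <code class="language-plaintext highlighter-rouge">name</code> can still clobber the index, but the worst outcome is selecting a different registered function or hitting the <code class="language-plaintext highlighter-rouge">abort()</code>. An arbitrary address such as the middle of <code class="language-plaintext highlighter-rouge">self_destruct()</code> is never reachable this way.</p>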

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>C Closures as a Library</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/01/08/"/>
    <id>urn:uuid:a5f897bc-0510-3164-a949-fcb848d9279b</id>
    <updated>2017-01-08T22:45:38Z</updated>
    <category term="c"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>A common idiom in C is the callback function pointer, either to
deliver information (i.e. a <em>visitor</em> or <em>handler</em>) or to customize
the function’s behavior (e.g. a comparator). Examples of the latter in
the C standard library are <code class="language-plaintext highlighter-rouge">qsort()</code> and <code class="language-plaintext highlighter-rouge">bsearch()</code>, each requiring a
comparator function in order to operate on arbitrary types.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">qsort</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
           <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">));</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">bsearch</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span>
              <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
              <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">));</span>
</code></pre></div></div>

<p>A problem with these functions is that there’s no way to pass context
to the callback. The callback may need information beyond the two
element pointers when making its decision, or to <a href="/blog/2016/09/05/">update a
result</a>. For example, suppose I have a structure representing a
two-dimensional coordinate, and a coordinate distance function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">coord</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">y</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">float</span>
<span class="nf">distance</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dx</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">x</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">x</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">dy</span> <span class="o">=</span> <span class="n">a</span><span class="o">-&gt;</span><span class="n">y</span> <span class="o">-</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">y</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">dx</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">+</span> <span class="n">dy</span> <span class="o">*</span> <span class="n">dy</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I have an array of coordinates and I want to sort them based on
their distance from some target, the comparator needs to know the
target. However, the <code class="language-plaintext highlighter-rouge">qsort()</code> interface has no way to directly pass
this information. Instead it has to be passed by another means, such
as a global variable.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">coord_cmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dist_a</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">dist_b</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&lt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&gt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And its usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">size_t</span> <span class="n">ncoords</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">coords</span> <span class="o">=</span> <span class="cm">/* ... */</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">coord</span> <span class="n">current_target</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="c1">// ...</span>
    <span class="n">target</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">;</span>
    <span class="nf">qsort</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">coord_cmp</span><span class="p">);</span>
</code></pre></div></div>

<p>Potential problems are that it’s neither thread-safe nor re-entrant.
Two different threads cannot use this comparator <a href="/blog/2014/10/12/">at the same
time</a>. Also, on some platforms and configurations, repeatedly
accessing a global variable in a comparator <a href="/blog/2016/12/23/">may have a significant
cost</a>. A common workaround for thread safety is to make the
global variable thread-local by allocating it in thread-local storage
(TLS):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">_Thread_local</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>       <span class="c1">// C11</span>
<span class="kr">__thread</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>            <span class="c1">// GCC and Clang</span>
<span class="kr">__declspec</span><span class="p">(</span><span class="kr">thread</span><span class="p">)</span> <span class="k">struct</span> <span class="n">coord</span> <span class="o">*</span><span class="n">target</span><span class="p">;</span>  <span class="c1">// Visual Studio</span>
</code></pre></div></div>

<p>This makes the comparator thread-safe. However, it’s still not
re-entrant (usually unimportant) and accessing thread-local variables
on some platforms is even more expensive — which is the situation for
Pthreads TLS, though not a problem for native x86-64 TLS.</p>

<p>Modern libraries usually provide some sort of “user data” pointer — a
generic pointer that is passed to the callback function as an
additional argument. For example, the GNU C Library has long had
<code class="language-plaintext highlighter-rouge">qsort_r()</code>: <em>re-entrant</em> qsort.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">qsort_r</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
           <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">),</span>
           <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>The new comparator looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">coord_cmp_r</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">dist_a</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">dist_b</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">target</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&lt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">dist_a</span> <span class="o">&gt;</span> <span class="n">dist_b</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">else</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And its usage:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">;</span>
    <span class="n">qsort_r</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">coord_cmp_r</span><span class="p">,</span> <span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>User data arguments are thread-safe, re-entrant, performant, and
perfectly portable. They completely and cleanly solve the entire
problem with virtually no drawbacks. If every library did this, there
would be nothing left to discuss and this article would be boring.</p>

<h3 id="the-closure-solution">The closure solution</h3>

<p>In order to make things more interesting, suppose you’re stuck calling
a function in some old library that takes a callback but doesn’t
support a user data argument. A global variable is insufficient, and
the thread-local storage solution isn’t viable for one reason or
another. What do you do?</p>

<p>The core problem is that a function pointer is just an address, and
it’s the same address no matter the context for any particular
callback. On any particular call, the callback has three ways to
distinguish this call from other calls. These align with the three
solutions above:</p>

<ol>
  <li>Inspect some global state: the <strong>global variable solution</strong>. The
caller will change this state for some other calls.</li>
  <li>Query its unique thread ID: the <strong>thread-local storage solution</strong>.
Calls on different threads will have different thread IDs.</li>
  <li>Examine a context argument: the <strong>user pointer solution</strong>.</li>
</ol>

<p>A wholly different approach is to <strong>use a unique function pointer for
each callback</strong>. The callback could then inspect its own address to
differentiate itself from other callbacks. Imagine defining multiple
instances of <code class="language-plaintext highlighter-rouge">coord_cmp</code> each getting their context from a different
global variable. Using a unique copy of <code class="language-plaintext highlighter-rouge">coord_cmp</code> on each thread for
each usage would be both re-entrant and thread-safe, and wouldn’t
require TLS.</p>
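<p>To picture the “multiple instances” idea before any code is generated at run time, here is a rough static version using a macro to stamp out comparators. The macro name <code class="language-plaintext highlighter-rouge">DEFINE_COORD_CMP</code>, the numbered <code class="language-plaintext highlighter-rouge">target_N</code> globals, and the <code class="language-plaintext highlighter-rouge">dist2()</code> helper are assumptions for this sketch. It compares squared distances, which order the same way as <code class="language-plaintext highlighter-rouge">distance()</code> while avoiding <code class="language-plaintext highlighter-rouge">sqrtf()</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct coord {
    float x;
    float y;
};

/* Squared distance: same ordering as distance(), no libm required. */
float
dist2(const struct coord *a, const struct coord *b)
{
    float dx = a->x - b->x;
    float dy = a->y - b->y;
    return dx * dx + dy * dy;
}

/* Stamp out one comparator per slot, each bound to its own global.
 * Each instance has a distinct address, so the address selects the
 * context. */
#define DEFINE_COORD_CMP(n)                                  \
    const struct coord *target_##n;                          \
    int                                                      \
    coord_cmp_##n(const void *a, const void *b)              \
    {                                                        \
        assert(target_##n != NULL);                          \
        float da = dist2(a, target_##n);                     \
        float db = dist2(b, target_##n);                     \
        return (da > db) - (da < db);                        \
    }

DEFINE_COORD_CMP(0)
DEFINE_COORD_CMP(1)
```

<p>Each slot can be handed to a different thread or a different nested <code class="language-plaintext highlighter-rouge">qsort()</code> call. The obvious limitation is that the number of slots is fixed at compile time, which is what generating the functions at run time removes.</p>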

<p>Taking this idea further, I’d like to <strong>generate these new functions
on demand at run time</strong> akin to a JIT compiler. This can be done as a
library, mostly agnostic to the implementation of the callback. Here’s
an example of what its usage will be like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">closure_create</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">nargs</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">closure_destroy</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The callback to be converted into a closure is <code class="language-plaintext highlighter-rouge">f</code> and the number of
arguments it takes is <code class="language-plaintext highlighter-rouge">nargs</code>. A new closure is allocated and returned
as a function pointer. This closure takes <code class="language-plaintext highlighter-rouge">nargs - 1</code> arguments, and
it will call the original callback with the additional argument
<code class="language-plaintext highlighter-rouge">userdata</code>.</p>

<p>So, for example, this code uses a closure to convert <code class="language-plaintext highlighter-rouge">coord_cmp_r</code>
into a function suitable for <code class="language-plaintext highlighter-rouge">qsort()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">closure</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="n">closure</span> <span class="o">=</span> <span class="n">closure_create</span><span class="p">(</span><span class="n">coord_cmp_r</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">current_target</span><span class="p">);</span>

<span class="n">qsort</span><span class="p">(</span><span class="n">coords</span><span class="p">,</span> <span class="n">ncoords</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">closure</span><span class="p">);</span>

<span class="n">closure_destroy</span><span class="p">(</span><span class="n">closure</span><span class="p">);</span>
</code></pre></div></div>

<p><strong>Caveat</strong>: This API is <em>utterly insufficient</em> for any sort of
portability. The number of arguments isn’t nearly enough information
for the library to generate a closure. For practically every
architecture and ABI, it’s going to depend on the types of each of
those arguments. On x86-64 with the System V ABI — where I’ll be
implementing this — this argument will only count integer/pointer
arguments. To find out what it takes to do this properly, see the
<a href="https://www.gnu.org/software/libjit/">libjit</a> documentation.</p>

<h3 id="memory-design">Memory design</h3>

<p>This implementation will be for x86-64 Linux, though the high level
details will be the same for any program running in virtual memory. My
closures will span exactly two consecutive pages (typically 8kB),
though it’s possible to use exactly one page depending on the desired
trade-offs. I need two pages because each page will
have different protections.</p>

<p><img src="/img/diagram/closure-pages.svg" alt="" /></p>

<p>Native code — the <em>thunk</em> — lives in the upper page. The user data
pointer and callback function pointer live at the high end of the
lower page. The two pointers could really be anywhere in the lower
page, and they’re only at the end for aesthetic reasons. The thunk
code will be identical for all closures of the same number of
arguments.</p>

<p>The upper page will be executable and the lower page will be writable.
This allows new pointers to be set without writing to executable thunk
memory. In the future I expect operating systems to enforce W^X
(“write xor execute”), and this code will already be compliant.
Alternatively, the pointers could be “baked in” with the thunk page
and immutable, but since creating a closure requires two system calls, I
figure it’s better that the pointers be mutable and the closure object
reusable.</p>

<p>The address for the closure itself will be the upper page, being what
other functions will call. The thunk will load the user data pointer
from the lower page as an additional argument, then jump to the
actual callback function also given by the lower page.</p>

<h3 id="thunk-assembly">Thunk assembly</h3>

<p>The x86-64 thunk assembly for a 2-argument closure calling a
3-argument callback looks like this:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">user:</span>  <span class="kd">dq</span> <span class="mi">0</span>
<span class="nl">func:</span>  <span class="kd">dq</span> <span class="mi">0</span>
<span class="c1">;; --- page boundary here ---</span>
<span class="nl">thunk2:</span>
        <span class="nf">mov</span>  <span class="nb">rdx</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">user</span><span class="p">]</span>
        <span class="nf">jmp</span>  <span class="p">[</span><span class="nv">rel</span> <span class="nv">func</span><span class="p">]</span>
</code></pre></div></div>

<p>As a reminder, the integer/pointer argument register order for the
System V ABI calling convention is: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">r8</code>,
<code class="language-plaintext highlighter-rouge">r9</code>. The third argument is passed through <code class="language-plaintext highlighter-rouge">rdx</code>, so the user pointer
is loaded into this register. Then it jumps to the callback address
with the original arguments still in place, plus the new argument. The
<code class="language-plaintext highlighter-rouge">user</code> and <code class="language-plaintext highlighter-rouge">func</code> values are loaded <em>RIP-relative</em> (<code class="language-plaintext highlighter-rouge">rel</code>) to the
address of the code. The thunk is using the callback address (its own
address) to determine the context.</p>

<p>The assembled machine code for the thunk is just 13 bytes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">thunk2</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1">// mov  rdx, [rel user]</span>
    <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x15</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
    <span class="c1">// jmp  [rel func]</span>
    <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
<span class="p">};</span>
</code></pre></div></div>

<p>All <code class="language-plaintext highlighter-rouge">closure_create()</code> has to do is allocate two pages, copy this
buffer into the upper page, adjust the protections, and return the
address of the thunk. Since <code class="language-plaintext highlighter-rouge">closure_create()</code> will work for <code class="language-plaintext highlighter-rouge">nargs</code>
number of arguments, there will actually be 6 slightly different
thunks, one for each of the possible register arguments (<code class="language-plaintext highlighter-rouge">rdi</code> through
<code class="language-plaintext highlighter-rouge">r9</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">thunk</span><span class="p">[</span><span class="mi">6</span><span class="p">][</span><span class="mi">13</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x3d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x15</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x0d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x4C</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x05</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span> <span class="p">{</span>
        <span class="mh">0x4C</span><span class="p">,</span> <span class="mh">0x8b</span><span class="p">,</span> <span class="mh">0x0d</span><span class="p">,</span> <span class="mh">0xe9</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span>
        <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0x25</span><span class="p">,</span> <span class="mh">0xeb</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">,</span> <span class="mh">0xff</span>
    <span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Given a closure pointer returned from <code class="language-plaintext highlighter-rouge">closure_create()</code>, here are the
setter functions for setting the closure’s two pointers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">closure_set_data</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">**</span><span class="n">p</span> <span class="o">=</span> <span class="n">closure</span><span class="p">;</span>
    <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">closure_set_function</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">**</span><span class="n">p</span> <span class="o">=</span> <span class="n">closure</span><span class="p">;</span>
    <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In <code class="language-plaintext highlighter-rouge">closure_create()</code>, allocation is done with an anonymous <code class="language-plaintext highlighter-rouge">mmap()</code>,
just like in <a href="/blog/2015/03/19/">my JIT compiler</a>. It’s initially mapped writable in
order to copy the thunk, then the thunk page is set to executable.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">closure_create</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">nargs</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">page_size</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGESIZE</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">page_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">==</span> <span class="n">MAP_FAILED</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="kt">void</span> <span class="o">*</span><span class="n">closure</span> <span class="o">=</span> <span class="n">p</span> <span class="o">+</span> <span class="n">page_size</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">thunk</span><span class="p">[</span><span class="n">nargs</span> <span class="o">-</span> <span class="mi">1</span><span class="p">],</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">thunk</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">page_size</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>

    <span class="n">closure_set_function</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span>
    <span class="n">closure_set_data</span><span class="p">(</span><span class="n">closure</span><span class="p">,</span> <span class="n">userdata</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">closure</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Destroying a closure is done by computing the lower page address and
calling <code class="language-plaintext highlighter-rouge">munmap()</code> on it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">closure_destroy</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">closure</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">page_size</span> <span class="o">=</span> <span class="n">sysconf</span><span class="p">(</span><span class="n">_SC_PAGESIZE</span><span class="p">);</span>
    <span class="n">munmap</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">closure</span> <span class="o">-</span> <span class="n">page_size</span><span class="p">,</span> <span class="n">page_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And that’s it! You can see the entire demo here:</p>

<ul>
  <li><a href="/download/closure-demo.c" class="download">closure-demo.c</a></li>
</ul>

<p>It’s a lot simpler for x86-64 than it is for x86, where there’s no
RIP-relative addressing and arguments are passed on the stack. The
arguments must all be copied back onto the stack, above the new
argument, and it cannot be a tail call since the stack has to be fixed
before returning. Here’s what the thunk looks like for a 2-argument
closure:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">data:</span>	<span class="kd">dd</span> <span class="mi">0</span>
<span class="nl">func:</span>	<span class="kd">dd</span> <span class="mi">0</span>
<span class="c1">;; --- page boundary here ---</span>
<span class="nl">thunk2:</span>
        <span class="nf">call</span> <span class="nv">.rip2eax</span>
<span class="nl">.rip2eax:</span>
        <span class="nf">pop</span> <span class="nb">eax</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">eax</span> <span class="o">-</span> <span class="mi">13</span><span class="p">]</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">esp</span> <span class="o">+</span> <span class="mi">12</span><span class="p">]</span>
        <span class="nf">push</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">esp</span> <span class="o">+</span> <span class="mi">12</span><span class="p">]</span>
        <span class="nf">call</span> <span class="p">[</span><span class="nb">eax</span> <span class="o">-</span> <span class="mi">9</span><span class="p">]</span>
        <span class="nf">add</span> <span class="nb">esp</span><span class="p">,</span> <span class="mi">12</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Exercise for the reader: Port the closure demo to a different
architecture or to the Windows x64 ABI.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Relocatable Global Data on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/23/"/>
    <id>urn:uuid:56be19e0-ce9a-3f37-dc85-578f397ed3e1</id>
    <updated>2016-12-23T22:50:51Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Relocatable code — program code that executes correctly from any
properly-aligned address — is an essential feature for shared
libraries. Otherwise all of a system’s shared libraries would need to
coordinate their virtual load addresses. Loading programs and
libraries to random addresses is also a valuable security feature:
Address Space Layout Randomization (ASLR). But how does a compiler
generate code for a function that accesses a global variable if that
variable’s address isn’t known at compile time?</p>

<p>Consider this simple C code sample.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function needs the base address of <code class="language-plaintext highlighter-rouge">values</code> in order to
dereference it for <code class="language-plaintext highlighter-rouge">values[x]</code>. The easiest way to find out how this
works, especially without knowing where to start, is to compile the
code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian
Jessie).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -Os -fPIC get_value.c
</code></pre></div></div>

<p>I optimized for size (<code class="language-plaintext highlighter-rouge">-Os</code>) to make the disassembly easier to follow.
Next, disassemble this pre-linked code with <code class="language-plaintext highlighter-rouge">objdump</code>. Alternatively I
could have asked for the compiler’s assembly output with <code class="language-plaintext highlighter-rouge">-S</code>, but
this will be good reverse engineering practice.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -d -Mintel get_value.o
0000000000000000 &lt;get_value&gt;:
   0:   83 ff 03                cmp    edi,0x3
   3:   0f 57 c0                xorps  xmm0,xmm0
   6:   77 0e                   ja     16 &lt;get_value+0x16&gt;
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
   f:   89 ff                   mov    edi,edi
  11:   f3 0f 10 04 b8          movss  xmm0,DWORD PTR [rax+rdi*4]
  16:   c3                      ret
</code></pre></div></div>

<p>There are a couple of interesting things going on, but let’s start
from the beginning.</p>

<ol>
  <li>
    <p><a href="https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-secure.pdf">The ABI</a> specifies that the first integer/pointer argument
(the 32-bit integer <code class="language-plaintext highlighter-rouge">x</code>) is passed through the <code class="language-plaintext highlighter-rouge">edi</code> register. The
function compares <code class="language-plaintext highlighter-rouge">x</code> to 3, to satisfy <code class="language-plaintext highlighter-rouge">x &lt; 4</code>.</p>
  </li>
  <li>
    <p>The ABI specifies that floating point values are returned through
the <a href="/blog/2015/07/10/">SSE2 SIMD register</a> <code class="language-plaintext highlighter-rouge">xmm0</code>. It’s cleared by XORing the
register with itself — the conventional way to clear registers on
x86 — setting up for a return value of <code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>It then uses the result of the previous comparison to perform a
jump, <code class="language-plaintext highlighter-rouge">ja</code> (“jump if above”). That is, jump to the relative address
specified by the jump’s operand if the first operand to <code class="language-plaintext highlighter-rouge">cmp</code>
(<code class="language-plaintext highlighter-rouge">edi</code>) is above the second operand (<code class="language-plaintext highlighter-rouge">0x3</code>) as <em>unsigned</em> values.
Its cousin, <code class="language-plaintext highlighter-rouge">jg</code> (“jump if greater”), is for signed values. If <code class="language-plaintext highlighter-rouge">x</code>
is outside the array bounds, it jumps straight to <code class="language-plaintext highlighter-rouge">ret</code>, returning
<code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>If <code class="language-plaintext highlighter-rouge">x</code> was in bounds, it uses a <code class="language-plaintext highlighter-rouge">lea</code> (“load effective address”) to
load <em>something</em> into the 64-bit <code class="language-plaintext highlighter-rouge">rax</code> register. This is the
complicated bit, and I’ll start by giving the answer: The value
loaded into <code class="language-plaintext highlighter-rouge">rax</code> is the address of the <code class="language-plaintext highlighter-rouge">values</code> array. More on
this in a moment.</p>
  </li>
  <li>
    <p>Finally it uses <code class="language-plaintext highlighter-rouge">x</code> as an index into the address in <code class="language-plaintext highlighter-rouge">rax</code>. The <code class="language-plaintext highlighter-rouge">movss</code>
(“move scalar single-precision”) instruction loads a 32-bit float
into the first lane of <code class="language-plaintext highlighter-rouge">xmm0</code>, where the caller expects to find the
return value. This is all preceded by a <code class="language-plaintext highlighter-rouge">mov edi, edi</code> which
<a href="/blog/2016/03/31/"><em>looks</em> like a hotpatch nop</a>, but it isn’t. x86-64 always uses
64-bit registers for addressing, meaning it uses <code class="language-plaintext highlighter-rouge">rdi</code> not <code class="language-plaintext highlighter-rouge">edi</code>.
All 32-bit register assignments clear the upper 32 bits, and so
this <code class="language-plaintext highlighter-rouge">mov</code> zero-extends <code class="language-plaintext highlighter-rouge">edi</code> into <code class="language-plaintext highlighter-rouge">rdi</code>. This is in case of the
unlikely event that the caller left garbage in those upper bits.</p>
  </li>
</ol>

<h3 id="clearing-xmm0">Clearing <code class="language-plaintext highlighter-rouge">xmm0</code></h3>

<p>The first interesting part: <code class="language-plaintext highlighter-rouge">xmm0</code> is cleared even when its first lane
is loaded with a value. There are two reasons to do this.</p>

<p>The obvious reason is that the alternative requires additional
instructions, and I told GCC to optimize for size. It would need
either an extra <code class="language-plaintext highlighter-rouge">ret</code> or an unconditional <code class="language-plaintext highlighter-rouge">jmp</code> over the “else” branch.</p>

<p>The less obvious reason is that it breaks a <em>data dependency</em>. For
over 20 years now, x86 micro-architectures have employed an
optimization technique called <a href="https://en.wikipedia.org/wiki/Register_renaming">register renaming</a>. <em>Architectural
registers</em> (<code class="language-plaintext highlighter-rouge">rax</code>, <code class="language-plaintext highlighter-rouge">edi</code>, etc.) are just temporary names for
underlying <em>physical registers</em>. This disconnect allows for more
aggressive out-of-order execution. Two instructions sharing an
architectural register can be executed independently so long as there
are no data dependencies between these instructions.</p>

<p>For example, take this assembly sample. It assembles to 9 bytes of
machine code.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>This reads a 32-bit value from the address stored in <code class="language-plaintext highlighter-rouge">rcx</code>, then
assigns <code class="language-plaintext highlighter-rouge">ecx</code> and uses <code class="language-plaintext highlighter-rouge">cl</code> (the lowest byte of <code class="language-plaintext highlighter-rouge">rcx</code>) in a shift
operation. Without register renaming, the shift couldn’t be performed
until the load in the first instruction completed. However, the second
instruction is a 32-bit assignment, which, as I mentioned before, also
clears the upper 32 bits of <code class="language-plaintext highlighter-rouge">rcx</code>, wiping the unused parts of the
register.</p>

<p>So after the second instruction, it’s guaranteed that the value in
<code class="language-plaintext highlighter-rouge">rcx</code> has no dependencies on code that comes before it. Because of
this, it’s likely a different physical register will be used for the
second and third instructions, allowing these instructions to be
executed out of order, <em>before</em> the load. Ingenious!</p>

<p>Compare it to this example, where the second instruction assigns to
<code class="language-plaintext highlighter-rouge">cl</code> instead of <code class="language-plaintext highlighter-rouge">ecx</code>. This assembles to just 6 bytes.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">cl</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>The result is 3 bytes smaller, but since it’s not a 32-bit assignment,
the upper bits of <code class="language-plaintext highlighter-rouge">rcx</code> still hold the original register contents.
This creates a false dependency and may prevent out-of-order
execution, reducing performance.</p>

<p>By clearing <code class="language-plaintext highlighter-rouge">xmm0</code>, instructions in <code class="language-plaintext highlighter-rouge">get_value</code> involving <code class="language-plaintext highlighter-rouge">xmm0</code> have
the opportunity to be executed prior to instructions in the caller
that use <code class="language-plaintext highlighter-rouge">xmm0</code>.</p>

<h3 id="rip-relative-addressing">RIP-relative addressing</h3>

<p>Going back to the instruction that computes the address of <code class="language-plaintext highlighter-rouge">values</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
</code></pre></div></div>

<p>Normally load/store addresses are absolute, based off an address
either in a general purpose register, or at some hard-coded base
address. The latter is not an option in relocatable code. With
<em>RIP-relative addressing</em> that’s still the case, but the register with
the absolute address is <code class="language-plaintext highlighter-rouge">rip</code>, the instruction pointer. This
addressing mode was introduced in x86-64 to make relocatable code more
efficient.</p>

<p>That means this instruction copies the instruction pointer (pointing
to the next instruction) into <code class="language-plaintext highlighter-rouge">rax</code>, plus a 32-bit displacement,
currently zero. This isn’t the right way to encode a displacement of
zero (unless you <em>want</em> a larger instruction). That’s because the
displacement will be filled in later by the linker. The compiler adds
a <em>relocation entry</em> to the object file so that the linker knows how
to do this.</p>

<p>On platforms that <a href="/blog/2016/11/17/">use ELF</a> we can inspect relocations with
<code class="language-plaintext highlighter-rouge">readelf</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rela.text' at offset 0x270 contains 1 entries:
  Offset          Info           Type       Sym. Value
00000000000b  000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4
</code></pre></div></div>

<p>The relocation type is <code class="language-plaintext highlighter-rouge">R_X86_64_PC32</code>. In the <a href="http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_amd64.pdf">AMD64 Architecture
Processor Supplement</a>, this is defined as “S + A - P”.</p>

<ul>
  <li>
    <p>S: Represents the value of the symbol whose index resides in the
relocation entry.</p>
  </li>
  <li>
    <p>A: Represents the addend used to compute the value of the
relocatable field.</p>
  </li>
  <li>
    <p>P: Represents the place of the storage unit being relocated.</p>
  </li>
</ul>

<p>The symbol, S, is <code class="language-plaintext highlighter-rouge">.rodata</code> — the final address for this object file’s
portion of <code class="language-plaintext highlighter-rouge">.rodata</code> (where <code class="language-plaintext highlighter-rouge">values</code> resides). The addend, A, is <code class="language-plaintext highlighter-rouge">-4</code>
since the instruction pointer points at the <em>next</em> instruction. That
is, this will be relative to four bytes after the relocation offset.
Finally, the address of the relocation, P, is the address of the last four
bytes of the <code class="language-plaintext highlighter-rouge">lea</code> instruction. These values are all known at
link-time, so no run-time support is necessary.</p>

<p>Being “S - P” (overall), this will be the displacement between these
two addresses: the 32-bit value is relative. It’s relocatable so long
as these two parts of the binary (code and data) maintain a fixed
distance from each other. The binary is relocated as a whole, so this
assumption holds.</p>
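
<p>To make the arithmetic concrete, here’s a small sketch that evaluates
“S + A - P” for made-up addresses. The function name <code class="language-plaintext highlighter-rouge">reloc_pc32</code> and the
addresses mentioned below are my own assumptions for illustration, not
something the linker exposes.</p>

```c
#include <stdint.h>

// Sketch of the R_X86_64_PC32 computation: S + A - P.
//   s: final address of the symbol (here, this object's .rodata)
//   a: addend from the relocation entry (-4 in the listing above)
//   p: final address of the 32-bit field being patched
// Addresses are assumed to fit in int64_t and the result in int32_t.
static int32_t
reloc_pc32(uint64_t s, int64_t a, uint64_t p)
{
    int64_t displacement = (int64_t)s + a - (int64_t)p;
    return (int32_t)displacement;
}
```

<p>With <code class="language-plaintext highlighter-rouge">.rodata</code> hypothetically at 0x402000 and the relocation at
0x40100b, this yields 0xff1: the distance from the next instruction
(0x40100f) to the data. Shifting both addresses by the same amount
leaves the result unchanged.</p>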

<h3 id="32-bit-relocation">32-bit relocation</h3>

<p>Since RIP-relative addressing wasn’t introduced until x86-64, how did
this all work on x86? Again, let’s just see what the compiler does.
Add the <code class="language-plaintext highlighter-rouge">-m32</code> flag for a 32-bit target, and <code class="language-plaintext highlighter-rouge">-fomit-frame-pointer</code> to
make it simpler for explanatory purposes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 &lt;get_value&gt;:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   d9 ee                   fldz
   6:   e8 fc ff ff ff          call   7 &lt;get_value+0x7&gt;
   b:   81 c1 02 00 00 00       add    ecx,0x2
  11:   83 f8 03                cmp    eax,0x3
  14:   77 09                   ja     1f &lt;get_value+0x1f&gt;
  16:   dd d8                   fstp   st(0)
  18:   d9 84 81 00 00 00 00    fld    DWORD PTR [ecx+eax*4+0x0]
  1f:   c3                      ret

Disassembly of section .text.__x86.get_pc_thunk.cx:

00000000 &lt;__x86.get_pc_thunk.cx&gt;:
   0:   8b 0c 24                mov    ecx,DWORD PTR [esp]
   3:   c3                      ret
</code></pre></div></div>

<p>Hmm, this one includes an extra function.</p>

<ol>
  <li>
    <p>In this calling convention, arguments are passed on the stack. The
first instruction loads the argument, <code class="language-plaintext highlighter-rouge">x</code>, into <code class="language-plaintext highlighter-rouge">eax</code>.</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">fldz</code> instruction clears the x87 floating point return
register, just like clearing <code class="language-plaintext highlighter-rouge">xmm0</code> in the x86-64 version.</p>
  </li>
  <li>
    <p>Next it calls <code class="language-plaintext highlighter-rouge">__x86.get_pc_thunk.cx</code>. The call pushes the
instruction pointer, <code class="language-plaintext highlighter-rouge">eip</code>, onto the stack. This function reads
that value off the stack into <code class="language-plaintext highlighter-rouge">ecx</code> and returns. In other words,
calling this function copies <code class="language-plaintext highlighter-rouge">eip</code> into <code class="language-plaintext highlighter-rouge">ecx</code>. It’s setting up to
load data at an address relative to the code. Notice the function
name starts with two underscores, a name which is reserved for
exactly these sorts of implementation purposes.</p>
  </li>
  <li>
    <p>Next a 32-bit displacement is added to <code class="language-plaintext highlighter-rouge">ecx</code>. In this case it’s
<code class="language-plaintext highlighter-rouge">2</code>, but, like before, this is actually going to be filled in later by
the linker.</p>
  </li>
  <li>
    <p>Then it’s just like before: a branch to optionally load a value.
The floating point load (<code class="language-plaintext highlighter-rouge">fld</code>) is another relocation.</p>
  </li>
</ol>

<p>Let’s look at the relocations. There are three this time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
 Offset     Info    Type        Sym.Value  Sym. Name
00000007  00000e02 R_386_PC32    00000000   __x86.get_pc_thunk.cx
0000000d  00000f0a R_386_GOTPC   00000000   _GLOBAL_OFFSET_TABLE_
0000001b  00000709 R_386_GOTOFF  00000000   .rodata
</code></pre></div></div>

<p>The first relocation is the call-site for the thunk. The thunk has
external linkage and may be merged with a matching thunk in another
object file, and so may be relocated. (Clang inlines its thunk.) Calls
are relative, so its type is <code class="language-plaintext highlighter-rouge">R_386_PC32</code>: a code-relative
displacement just like on x86-64.</p>

<p>The next is of type <code class="language-plaintext highlighter-rouge">R_386_GOTPC</code> and sets the second operand in that
<code class="language-plaintext highlighter-rouge">add ecx</code>. It’s defined as “GOT + A - P” where “GOT” is the address of
the Global Offset Table — a table of addresses of the binary’s
relocated objects. Since <code class="language-plaintext highlighter-rouge">values</code> is static, the GOT won’t actually
hold an address for it, but the relative address of the GOT itself
will be useful.</p>

<p>The final relocation is of type <code class="language-plaintext highlighter-rouge">R_386_GOTOFF</code>. This is defined as
“S + A - GOT”. Another displacement between two addresses. This is the
displacement in the load, <code class="language-plaintext highlighter-rouge">fld</code>. Ultimately the load adds these last
two relocations together, canceling the GOT:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  (GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P
</code></pre></div></div>

<p>So the GOT isn’t relevant in this case. It’s just a mechanism for
constructing a custom relocation type.</p>
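
<p>The cancellation can be checked with a short sketch. The function
names and the section addresses passed in are hypothetical. The offsets
0x0b and 0x0d and the in-place addends 2 and 0 come from the listings
above: the thunk returns the address of the <code class="language-plaintext highlighter-rouge">add</code> at 0x0b, and its
immediate sits two bytes later at 0x0d.</p>

```c
#include <stdint.h>

// R_386_GOTPC: GOT + A - P (unsigned arithmetic wraps modulo 2^32)
static uint32_t
reloc_gotpc(uint32_t got, uint32_t a, uint32_t p)
{
    return got + a - p;
}

// R_386_GOTOFF: S + A - GOT
static uint32_t
reloc_gotoff(uint32_t s, uint32_t a, uint32_t got)
{
    return s + a - got;
}

// Base address the fld uses before eax*4 is added, given assumed final
// addresses for get_value's code, the GOT, and this object's .rodata.
static uint32_t
effective_base(uint32_t text, uint32_t got, uint32_t rodata)
{
    uint32_t ecx = text + 0x0b;               // thunk: address of the add
    ecx += reloc_gotpc(got, 2, text + 0x0d);  // add ecx, imm32
    return ecx + reloc_gotoff(rodata, 0, got);  // fld displacement
}
```

<p>Whatever value is chosen for the GOT, the result is the address of
<code class="language-plaintext highlighter-rouge">.rodata</code>.</p>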

<h3 id="branch-optimization">Branch optimization</h3>

<p>Notice in the x86 version the thunk is called before checking the
argument. What if it’s most likely that <code class="language-plaintext highlighter-rouge">x</code> will be out of bounds of
the array, and the function usually returns zero? That means it’s
usually wasting its time calling the thunk. Without profile-guided
optimization the compiler probably won’t know this.</p>

<p>The typical way to provide such a compiler hint is with a pair of
macros, <code class="language-plaintext highlighter-rouge">likely()</code> and <code class="language-plaintext highlighter-rouge">unlikely()</code>. With GCC and Clang, these would
be defined to use <code class="language-plaintext highlighter-rouge">__builtin_expect</code>. Compilers without this sort of
feature would have macros that do nothing instead. So I gave it a
shot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define likely(x)    __builtin_expect((x),1)
#define unlikely(x)  __builtin_expect((x),0)
</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">unlikely</span><span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately this makes no difference even in the latest version of
GCC. In Clang it changes branch fall-through (for <a href="http://www.agner.org/optimize/microarchitecture.pdf">static branch
prediction</a>), but still always calls the thunk. It seems
compilers <a href="https://ewontfix.com/18/">have difficulty</a> with <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54232">optimizing relocatable
code</a> on x86.</p>
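
<p>For completeness, here’s a sketch of the do-nothing fallback mentioned
above for compilers lacking <code class="language-plaintext highlighter-rouge">__builtin_expect</code>. The feature test on
<code class="language-plaintext highlighter-rouge">__GNUC__</code> and the <code class="language-plaintext highlighter-rouge">!!</code> normalization are common conventions I’m
assuming here, not part of the experiment above.</p>

```c
// Use the builtin where available; otherwise the macros are no-ops.
#if defined(__GNUC__) || defined(__clang__)
#  define likely(x)    __builtin_expect(!!(x), 1)
#  define unlikely(x)  __builtin_expect(!!(x), 0)
#else
#  define likely(x)    (x)
#  define unlikely(x)  (x)
#endif

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

float get_value(unsigned x)
{
    // The hint only affects code generation, never the result.
    return unlikely(x < 4) ? values[x] : 0.0f;
}
```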

<h3 id="x86-64-isnt-just-about-more-memory">x86-64 isn’t just about more memory</h3>

<p>It’s commonly understood that the advantage of 64-bit versus 32-bit
systems is processes having access to more than 4GB of memory. But as
this shows, there’s more to it than that. Even programs that don’t
need that much memory can really benefit from newer features like
RIP-relative addressing.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Portable Structure Access with Member Offset Constants</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/22/"/>
    <id>urn:uuid:81ff4064-17f1-3a9b-a5ec-61acb03385b9</id>
    <updated>2016-11-22T12:55:29Z</updated>
    <category term="c"/><category term="posix"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you need to write a C program to access a long sequence of
structures from a binary file in a specified format. These structures
have different lengths and contents, but share a common header
identifying each one’s type and size. Here’s the definition of that header
(no padding):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">event</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">time</span><span class="p">;</span>   <span class="c1">// unix epoch (microseconds)</span>
    <span class="kt">uint32_t</span> <span class="n">size</span><span class="p">;</span>   <span class="c1">// including this header (bytes)</span>
    <span class="kt">uint16_t</span> <span class="n">source</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">type</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">size</code> member is used to find the offset of the next structure in
the file without knowing anything else about the current structure.
Just add <code class="language-plaintext highlighter-rouge">size</code> to the offset of the current structure.</p>

<p>The <code class="language-plaintext highlighter-rouge">type</code> member indicates what kind of data follows this structure.
The program is likely to <code class="language-plaintext highlighter-rouge">switch</code> on this value.</p>

<p>The actual structures might look something like this (in the spirit of
<a href="http://openxcom.org/">X-COM</a>). Note how each structure begins with <code class="language-plaintext highlighter-rouge">struct event</code> as
header. All angles are expressed using <a href="https://en.wikipedia.org/wiki/Binary_scaling">binary scaling</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EVENT_TYPE_OBSERVER            10
#define EVENT_TYPE_UFO_SIGHTING        20
#define EVENT_TYPE_SUSPICIOUS_SIGNAL   30
</span>
<span class="k">struct</span> <span class="n">observer</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">latitude</span><span class="p">;</span>   <span class="c1">// binary scaled angle</span>
    <span class="kt">uint32_t</span> <span class="n">longitude</span><span class="p">;</span>  <span class="c1">//</span>
    <span class="kt">uint16_t</span> <span class="n">source_id</span><span class="p">;</span>  <span class="c1">// later used for event source</span>
    <span class="kt">uint16_t</span> <span class="n">name_size</span><span class="p">;</span>  <span class="c1">// not including null terminator</span>
    <span class="kt">char</span> <span class="n">name</span><span class="p">[];</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">ufo_sighting</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">azimuth</span><span class="p">;</span>    <span class="c1">// binary scaled angle</span>
    <span class="kt">uint32_t</span> <span class="n">elevation</span><span class="p">;</span>  <span class="c1">//</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">suspicious_signal</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">event</span> <span class="n">event</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">num_channels</span><span class="p">;</span>
    <span class="kt">uint16_t</span> <span class="n">sample_rate</span><span class="p">;</span>  <span class="c1">// Hz</span>
    <span class="kt">uint32_t</span> <span class="n">num_samples</span><span class="p">;</span>  <span class="c1">// per channel</span>
    <span class="kt">int16_t</span> <span class="n">samples</span><span class="p">[];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>If all integers are stored in little endian byte order (least
significant byte first), there’s a strong temptation to lay the
structures directly over the data. After all, this will work correctly
on most computers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">event</span> <span class="n">header</span><span class="p">;</span>
<span class="n">fread</span><span class="p">(</span><span class="o">&amp;</span><span class="n">header</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">header</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">header</span><span class="p">.</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This code will not work correctly when:</p>

<ol>
  <li>
    <p>The host machine doesn’t use little endian byte order, though this
is now uncommon. Sometimes developers will attempt to detect the
byte order at compile time and use the preprocessor to byte-swap if
needed. This is a mistake.</p>
  </li>
  <li>
    <p>The host machine has different alignment requirements and so
introduces additional padding to the structure. Sometimes this can
be resolved with a non-standard <a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.4/gcc/Structure_002dPacking-Pragmas.html"><code class="language-plaintext highlighter-rouge">#pragma pack</code></a>.</p>
  </li>
</ol>
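
<p>The padding issue can be observed directly. This sketch repeats the
<code class="language-plaintext highlighter-rouge">observer</code> definition and reports how many bytes the compiler appends
after the last fixed member. The helper name is hypothetical. On x86-64
I’d expect four bytes, since the structure takes the 8-byte alignment of
its <code class="language-plaintext highlighter-rouge">uint64_t</code> member, but that’s an assumption about the usual ABI, and
other targets may differ.</p>

```c
#include <stddef.h>
#include <stdint.h>

struct event {
    uint64_t time;
    uint32_t size;
    uint16_t source;
    uint16_t type;
};

struct observer {
    struct event event;
    uint32_t latitude;
    uint32_t longitude;
    uint16_t source_id;
    uint16_t name_size;
    char name[];
};

// Bytes between the start of name[] and the end of the structure as
// the compiler sees it. Anything nonzero is host-specific padding
// that does not exist in the file format.
static size_t
observer_tail_padding(void)
{
    return sizeof(struct observer) - offsetof(struct observer, name);
}
```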

<h3 id="integer-extraction-functions">Integer extraction functions</h3>

<p>Fortunately it’s easy to write fast, correct, portable code for this
situation. First, define some functions to extract little endian
integers from an octet buffer (<code class="language-plaintext highlighter-rouge">uint8_t</code>). These will work correctly
regardless of the host’s alignment and byte order.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint16_t</span>
<span class="nf">extract_u16le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint16_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint16_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint64_t</span>
<span class="nf">extract_u64le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">7</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The big endian version is identical, but with shifts in reverse order.</p>
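
<p>As a quick illustration of that reversal, here are the 16-bit
extractors side by side, applied to the same two bytes. The
<code class="language-plaintext highlighter-rouge">extract_u16be</code> name and the sample buffer are my own additions for
this sketch.</p>

```c
#include <stdint.h>

static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)((uint16_t)buf[1] << 8 |
                      (uint16_t)buf[0] << 0);
}

// Same bytes, shifts in the opposite order.
static inline uint16_t
extract_u16be(const uint8_t *buf)
{
    return (uint16_t)((uint16_t)buf[0] << 8 |
                      (uint16_t)buf[1] << 0);
}

// Hypothetical sample: a leading pad byte makes the read misaligned,
// which is harmless since only single bytes are ever loaded.
static const uint8_t sample[3] = {0xff, 0x34, 0x12};
```

<p>Reading at <code class="language-plaintext highlighter-rouge">sample + 1</code> gives 0x1234 as little endian and 0x3412 as
big endian, regardless of the host’s own byte order.</p>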

<p>A common concern is that these functions are a lot less efficient than
they could be. On x86 where alignment is very relaxed, each could be
implemented as a single load instruction. However, on GCC 4.x and
earlier, <code class="language-plaintext highlighter-rouge">extract_u32le</code> compiles to something like this:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32le:</span>
        <span class="nf">movzx</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">3</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">eax</span><span class="p">,</span> <span class="mi">24</span>
        <span class="nf">mov</span>     <span class="nb">edx</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">movzx</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">2</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">eax</span><span class="p">,</span> <span class="mi">16</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">movzx</span>   <span class="nb">edx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">movzx</span>   <span class="nb">edx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
        <span class="nf">sal</span>     <span class="nb">edx</span><span class="p">,</span> <span class="mi">8</span>
        <span class="nf">or</span>      <span class="nb">eax</span><span class="p">,</span> <span class="nb">edx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>It’s tempting to fix the problem with the following definition:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Note: Don't do this.</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32le</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="o">*</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s unportable, it’s undefined behavior, and worst of all, it <a href="http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html">might
not work correctly even on x86</a>. Fortunately I have some great
news. On GCC 5.x and above, the correct definition compiles to the
desired, fast version. It’s the best of both worlds.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32le:</span>
        <span class="nf">mov</span>     <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>It’s even smart about the big endian version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span>
<span class="nf">extract_u32be</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
           <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Is compiled to exactly what you’d want:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32be:</span>
        <span class="nf">mov</span>     <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">bswap</span>   <span class="nb">eax</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Or, even better, if your system supports <code class="language-plaintext highlighter-rouge">movbe</code> (<code class="language-plaintext highlighter-rouge">gcc -mmovbe</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">extract_u32be:</span>
        <span class="nf">movbe</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Unfortunately, Clang/LLVM is <em>not</em> this smart as of 3.9, but I’m
betting it will eventually learn how to do this, too.</p>

<h3 id="member-offset-constants">Member offset constants</h3>

<p>For this next technique, that <code class="language-plaintext highlighter-rouge">struct event</code> from above need not
actually be in the source. It’s purely documentation. Instead, let’s
define the structure in terms of <em>member offset constants</em> — a term I
just made up for this article. I’ve included the integer types as part
of the name to aid in their correct use.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14
</span></code></pre></div></div>

<p>Given a buffer, the integer extraction functions, and these offsets,
structure members can be plucked out on demand.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="kt">uint64_t</span> <span class="n">time</span>   <span class="o">=</span> <span class="n">extract_u64le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U64LE_TIME</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="n">size</span>   <span class="o">=</span> <span class="n">extract_u32le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U32LE_SIZE</span><span class="p">);</span>
<span class="kt">uint16_t</span> <span class="n">source</span> <span class="o">=</span> <span class="n">extract_u16le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U16LE_SOURCE</span><span class="p">);</span>
<span class="kt">uint16_t</span> <span class="n">type</span>   <span class="o">=</span> <span class="n">extract_u16le</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="n">EVENT_U16LE_TYPE</span><span class="p">);</span>
</code></pre></div></div>

<p>On x86 with GCC 5.x, each member access will be inlined and compiled
to a one-instruction extraction. As far as performance is concerned,
it’s identical to using a structure overlay, but this time the C code
is clean and portable. A slight downside is the lack of type checking
on member access: it’s easy to mismatch the types and accidentally
read garbage.</p>
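
<p>One way to soften that downside, sketched here as a suggestion rather
than something I’ve benchmarked, is to pair each offset with a tiny
accessor so the extraction width is written down exactly once. The
<code class="language-plaintext highlighter-rouge">event_*</code> function names and the sample bytes are hypothetical.</p>

```c
#include <stdint.h>

#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14

static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)((uint16_t)buf[1] << 8 | (uint16_t)buf[0] << 0);
}

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 | (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 | (uint32_t)buf[0] <<  0;
}

static inline uint64_t
extract_u64le(const uint8_t *buf)
{
    return (uint64_t)extract_u32le(buf + 4) << 32 | extract_u32le(buf);
}

// Hypothetical accessors: each binds one offset to one extractor.
static inline uint64_t event_time(const uint8_t *buf)
{
    return extract_u64le(buf + EVENT_U64LE_TIME);
}

static inline uint32_t event_size(const uint8_t *buf)
{
    return extract_u32le(buf + EVENT_U32LE_SIZE);
}

static inline uint16_t event_source(const uint8_t *buf)
{
    return extract_u16le(buf + EVENT_U16LE_SOURCE);
}

static inline uint16_t event_type(const uint8_t *buf)
{
    return extract_u16le(buf + EVENT_U16LE_TYPE);
}

// Made-up header bytes for a 24-byte UFO sighting event.
static const uint8_t sample_event[16] = {
    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,  // time   = 1
    0x18, 0x00, 0x00, 0x00,                          // size   = 24
    0x07, 0x00,                                      // source = 7
    0x14, 0x00,                                      // type   = 20
};
```

<p>Callers then write <code class="language-plaintext highlighter-rouge">event_type(buf)</code> and can’t pair an offset with the
wrong extractor by accident.</p>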

<h3 id="memory-mapping-and-iteration">Memory mapping and iteration</h3>

<p>There’s a real advantage to memory mapping the input file and using
its contents directly. On a system with a huge virtual address space,
such as x86-64 or AArch64, this memory is almost “free.” Already being
backed by a file, paging out this memory costs nothing (i.e. it’s
discarded). The input file can comfortably be much larger than
physical memory without straining the system.</p>

<p>Unportable structure overlay can take advantage of memory mapping this
way, but has the previously-described issues. An approach with member
offset constants will take advantage of it just as well, all while
remaining clean and portable.</p>

<p>I like to wrap the memory mapping code into a simple interface, which
makes porting to non-POSIX platforms, such as Windows, easier. Caveat:
This won’t work with files whose size exceeds the available contiguous
virtual memory of the system — a real problem for 32-bit systems.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/stat.h&gt;</span><span class="cp">
</span>
<span class="kt">uint8_t</span> <span class="o">*</span>
<span class="nf">map_file</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="kt">size_t</span> <span class="o">*</span><span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">O_RDONLY</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

    <span class="k">struct</span> <span class="n">stat</span> <span class="n">stat</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fstat</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">stat</span><span class="p">)</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="o">*</span><span class="n">length</span> <span class="o">=</span> <span class="n">stat</span><span class="p">.</span><span class="n">st_size</span><span class="p">;</span>  <span class="c1">// TODO: possible overflow</span>
    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o">*</span><span class="n">length</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">p</span> <span class="o">!=</span> <span class="n">MAP_FAILED</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">unmap_file</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
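<p>The <code class="language-plaintext highlighter-rouge">TODO</code> above is about <code class="language-plaintext highlighter-rouge">st_size</code> being wider than <code class="language-plaintext highlighter-rouge">size_t</code>, which
is exactly the 32-bit case from the caveat. Here is a minimal sketch of a
range check, assuming a hypothetical helper named <code class="language-plaintext highlighter-rouge">size_from_off()</code> and
using <code class="language-plaintext highlighter-rouge">int64_t</code> as a stand-in for <code class="language-plaintext highlighter-rouge">off_t</code>. It is one way to close the gap,
not necessarily the best fit for every platform.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a file size (as reported by st_size) in
 * *out only if it is non-negative and representable as a size_t.
 * Returns 1 on success, or 0 if the file cannot be mapped in one piece.
 * int64_t stands in for off_t here; that is an assumption. */
int
size_from_off(int64_t st_size, size_t *out)
{
    assert(out);
    if (st_size < 0 || (uint64_t)st_size > SIZE_MAX)
        return 0;
    *out = (size_t)st_size;
    return 1;
}
```

<p>In <code class="language-plaintext highlighter-rouge">map_file()</code> this would replace the plain assignment to <code class="language-plaintext highlighter-rouge">*length</code>,
closing the descriptor and returning 0 when the check fails.</p>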

<p>Next, here’s an example that iterates over all the structures in
<code class="language-plaintext highlighter-rouge">input_file</code>, in this case counting each. The <code class="language-plaintext highlighter-rouge">size</code> member is
extracted in order to stride to the next structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">length</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">data</span> <span class="o">=</span> <span class="n">map_file</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">length</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">data</span><span class="p">)</span>
    <span class="n">FATAL</span><span class="p">();</span>

<span class="kt">size_t</span> <span class="n">event_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">data</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">&lt;</span> <span class="n">data</span> <span class="o">+</span> <span class="n">length</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">event_count</span><span class="o">++</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">extract_u32le</span><span class="p">(</span><span class="n">p</span> <span class="o">+</span> <span class="n">EVENT_U32LE_SIZE</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">size</span> <span class="o">&gt;</span> <span class="n">length</span> <span class="o">-</span> <span class="p">(</span><span class="n">p</span> <span class="o">-</span> <span class="n">data</span><span class="p">))</span>
        <span class="n">FATAL</span><span class="p">();</span>  <span class="c1">// invalid size</span>
    <span class="n">p</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"I see %zu events.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">event_count</span><span class="p">);</span>

<span class="n">unmap_file</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span>
</code></pre></div></div>

<p>This is the basic structure for navigating this kind of data. A deeper
dive would involve a <code class="language-plaintext highlighter-rouge">switch</code> inside the loop, extracting the relevant
members for whatever use is needed.</p>
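<p>As a rough sketch of that deeper dive, here is the same loop with a
<code class="language-plaintext highlighter-rouge">switch</code> on a type member. The layout is hypothetical: the offsets, the
type codes, the <code class="language-plaintext highlighter-rouge">struct tally</code>, and <code class="language-plaintext highlighter-rouge">tally_events()</code> are all made up for
illustration, and <code class="language-plaintext highlighter-rouge">extract_u32le()</code> is redefined so the example stands on
its own. It also rejects records smaller than the assumed header, so a
zero size cannot stall the loop.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only. */
#define EVENT_U32LE_TYPE   0   /* offset of the type member */
#define EVENT_U32LE_SIZE   4   /* offset of the size member */
#define EVENT_HEADER_SIZE  8   /* smallest valid record */

#define EVENT_TYPE_KEY     1   /* hypothetical type codes */
#define EVENT_TYPE_MOUSE   2

struct tally {
    size_t key, mouse, other;
};

/* Read a little-endian 32-bit integer one byte at a time. */
static uint32_t
extract_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] <<  0 | (uint32_t)p[1] <<  8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Count records by type. Returns 1 on success, 0 on a malformed record. */
int
tally_events(const uint8_t *data, size_t length, struct tally *t)
{
    assert(data && t);
    t->key = t->mouse = t->other = 0;
    size_t off = 0;
    while (off < length) {
        size_t remaining = length - off;
        if (remaining < EVENT_HEADER_SIZE)
            return 0;  /* truncated header */
        const uint8_t *p = data + off;
        uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
        if (size < EVENT_HEADER_SIZE || size > remaining)
            return 0;  /* invalid size */
        switch (extract_u32le(p + EVENT_U32LE_TYPE)) {
        case EVENT_TYPE_KEY:   t->key++;   break;
        case EVENT_TYPE_MOUSE: t->mouse++; break;
        default:               t->other++; break;
        }
        off += size;
    }
    return 1;
}
```

<p>Tracking an offset rather than a pointer keeps every bounds check in
unsigned arithmetic, which I find easier to reason about.</p>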

<p>Fast, correct, simple. Pick three.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Baking Data with Serialization</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/15/"/>
    <id>urn:uuid:365d1301-72b9-39d1-8023-20fb83e046ab</id>
    <updated>2016-11-15T05:27:53Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Suppose you want to bake binary data directly into a program’s
executable. It could be image pixel data (PNG, BMP, JPEG), a text
file, or some sort of complex data structure. Perhaps the purpose is
to build a single executable with no extraneous data files — easier to
install and manage, though harder to modify. Or maybe you’re lazy and
don’t want to worry about handling the various complications and
errors that arise when reading external data: Where to find it, and
what to do if you can’t find it or can’t read it. This article is
about two different approaches I’ve used a number of times for C
programs.</p>

<h3 id="the-linker-approach">The linker approach</h3>

<p>The simpler, less portable option is to have the linker do it. Both
the GNU linker and the <a href="http://www.airs.com/blog/archives/38">gold linker</a> (ELF only) can create
object files from arbitrary files using the <code class="language-plaintext highlighter-rouge">--format</code> (<code class="language-plaintext highlighter-rouge">-b</code>) option
set to <code class="language-plaintext highlighter-rouge">binary</code> (raw data). It’s combined with <code class="language-plaintext highlighter-rouge">--relocatable</code> (<code class="language-plaintext highlighter-rouge">-r</code>)
to make it linkable with the rest of the program. MinGW supports all
of this, too, so it’s fairly portable so long as you stick to GNU
Binutils.</p>

<p>For example, to create an object file, <code class="language-plaintext highlighter-rouge">my_msg.o</code>, with the
contents of the text file <code class="language-plaintext highlighter-rouge">my_msg.txt</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ld -r -b binary -o my_msg.o my_msg.txt
</code></pre></div></div>

<p>(<em>Update</em>: <a href="/blog/2019/11/15/">You probably also want to use <code class="language-plaintext highlighter-rouge">-z noexecstack</code></a>.)</p>

<p>The object file will have three symbols, each named after the input
file. Unfortunately there’s no control over the symbol names, section
(.data), alignment, or protections (e.g. read-only). You’re completely
at the whim of the linker, short of objcopy tricks.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nm my_msg.o
000000000000000e D _binary_my_msg_txt_end
000000000000000e A _binary_my_msg_txt_size
0000000000000000 D _binary_my_msg_txt_start
</code></pre></div></div>

<p>To access these in C, declare them as global variables like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_end</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>
</code></pre></div></div>

<p>The size symbol, <code class="language-plaintext highlighter-rouge">_binary_my_msg_txt_size</code>, is misleading. The “A”
from nm means it’s an absolute symbol, not relocated. It doesn’t refer
to an integer that holds the size of the raw data. The value of the
symbol itself is the size of the data. That is, take the address of it
and cast it to an integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">&amp;</span><span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>
</code></pre></div></div>

<p>Alternatively — and this is my own preference — just subtract the
other two symbols. It’s cleaner and easier to understand.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">_binary_my_msg_txt_end</span> <span class="o">-</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">;</span>
</code></pre></div></div>

<p>Here’s the “Hello, world” for this approach (<code class="language-plaintext highlighter-rouge">hello.c</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_end</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">_binary_my_msg_txt_end</span> <span class="o">-</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">;</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">_binary_my_msg_txt_start</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The program has to use <code class="language-plaintext highlighter-rouge">fwrite()</code> rather than <code class="language-plaintext highlighter-rouge">fputs()</code> because the
data won’t necessarily be null-terminated. That is, unless a null is
intentionally put at the end of the text file itself.</p>

<p>And for the build:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat my_msg.txt
Hello, world!
$ ld -r -b binary -o my_msg.o my_msg.txt
$ gcc -o hello hello.c my_msg.o
$ ./hello
Hello, world!
</code></pre></div></div>

<p>If this was binary data, such as an image file, the program would
instead read the array as if it were a memory mapped file. In fact,
that’s what it really is: the raw data memory mapped by the loader
before the program started.</p>

<h4 id="how-about-a-data-structure-dump">How about a data structure dump?</h4>

<p>This could be taken further to dump out some kinds of data structures.
For example, this program (<code class="language-plaintext highlighter-rouge">table_gen.c</code>) fills out a table of the
first 90 Fibonacci numbers and dumps it to standard output.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="cp">#define TABLE_SIZE 90
</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[</span><span class="n">TABLE_SIZE</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">};</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TABLE_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">2</span><span class="p">];</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">table</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Build and run this intermediate helper program as part of the overall
build.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen &gt; table.bin
$ ld -r -b binary -o table.o table.bin
</code></pre></div></div>

<p>And then the main program (<code class="language-plaintext highlighter-rouge">print_fib.c</code>) might look like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">_binary_table_bin_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">_binary_table_bin_end</span><span class="p">[];</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">start</span> <span class="o">=</span> <span class="n">_binary_table_bin_start</span><span class="p">;</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">end</span>   <span class="o">=</span> <span class="n">_binary_table_bin_end</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="n">start</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">end</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%lld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="o">*</span><span class="n">x</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, there are some good reasons not to use this feature in this
way:</p>

<ol>
  <li>
    <p>The format of <code class="language-plaintext highlighter-rouge">table.bin</code> is specific to the host architecture
(byte order, size, padding, etc.). If the host is the same as the
target then this isn’t a problem, but it will prohibit
cross-compilation.</p>
  </li>
  <li>
    <p>The linker has no information about the alignment requirements of
the data. To the linker it’s just a byte buffer. In the final
program the <code class="language-plaintext highlighter-rouge">long long</code> array will not necessarily be aligned properly
for its type, meaning the above program might crash. The Right Way
is to never dereference the data directly but rather <code class="language-plaintext highlighter-rouge">memcpy()</code> it
into a properly-aligned variable, just as if the data was an
unaligned buffer.</p>
  </li>
  <li>
    <p>The data structure cannot use any pointers. Pointer values are
meaningless to other processes and will be no different than
garbage.</p>
  </li>
</ol>
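<p>To illustrate the second point, here is a minimal sketch of the
<code class="language-plaintext highlighter-rouge">memcpy()</code> approach. The helper name <code class="language-plaintext highlighter-rouge">load_ll()</code> is my own invention for this
example, and it assumes the linker symbols are declared as
<code class="language-plaintext highlighter-rouge">unsigned char</code> arrays rather than <code class="language-plaintext highlighter-rouge">long long</code> arrays. It addresses
alignment only; the byte order and size concerns from the first point
still apply.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: read the i-th long long from a byte buffer of
 * unknown alignment. The copy lands in a properly-aligned local, so the
 * buffer itself is never dereferenced as a long long. */
long long
load_ll(const unsigned char *buf, size_t i)
{
    assert(buf);
    long long v;
    memcpy(&v, buf + i * sizeof(v), sizeof(v));
    return v;
}
```

<p>Compilers typically turn a fixed-size <code class="language-plaintext highlighter-rouge">memcpy()</code> like this into a plain load
where the target allows unaligned access, so it usually costs nothing.</p>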

<h3 id="towards-a-more-portable-approach">Towards a more portable approach</h3>

<p>There’s an easy way to address all three of these problems <em>and</em>
eliminate the reliance on GNU linkers: serialize the data into C code.
<em>It’s metaprogramming, baby.</em></p>

<p>In the Fibonacci example, change the <code class="language-plaintext highlighter-rouge">fwrite()</code> in <code class="language-plaintext highlighter-rouge">table_gen.c</code> to
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">printf</span><span class="p">(</span><span class="s">"int table_size = %d;</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">TABLE_SIZE</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"long long table[] = {</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TABLE_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"    %lldLL,</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"};</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
</code></pre></div></div>

<p>The output of the program becomes text:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">table_size</span> <span class="o">=</span> <span class="mi">90</span><span class="p">;</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mi">1LL</span><span class="p">,</span>
    <span class="mi">1LL</span><span class="p">,</span>
    <span class="mi">2LL</span><span class="p">,</span>
    <span class="mi">3LL</span><span class="p">,</span>
    <span class="cm">/* ... */</span>
    <span class="mi">1779979416004714189LL</span><span class="p">,</span>
    <span class="mi">2880067194370816120LL</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>And <code class="language-plaintext highlighter-rouge">print_fib.c</code> is changed to:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">int</span> <span class="n">table_size</span><span class="p">;</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[];</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">table_size</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%lld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Putting it all together:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen &gt; table.c
$ gcc -std=c99 -o print_fib print_fib.c table.c
</code></pre></div></div>

<p>Any C compiler and linker could do all of this, no problem, making it
more portable. The intermediate metaprogram isn’t a barrier to
cross-compilation. It would be compiled for the host (typically identified
through <code class="language-plaintext highlighter-rouge">HOST_CC</code>) and the rest is compiled for the target (e.g.
<code class="language-plaintext highlighter-rouge">CC</code>).</p>

<p>The output of <code class="language-plaintext highlighter-rouge">table_gen.c</code> isn’t dependent on any architecture,
making it cross-compiler friendly. There are also no alignment
problems because it’s all visible to the compiler. The type system isn’t
being undermined.</p>

<h3 id="dealing-with-pointers">Dealing with pointers</h3>

<p>The Fibonacci example doesn’t address the pointer problem — it has no
pointers to speak of. So let’s step it up to a trie using the <a href="/blog/2016/11/13/">trie
from the previous article</a>. As a reminder, here it is:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define TRIE_ALPHABET_SIZE  4
#define TRIE_TERMINAL_FLAG  (1U &lt;&lt; 0)
</span>
<span class="k">struct</span> <span class="n">trie</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">next</span><span class="p">[</span><span class="n">TRIE_ALPHABET_SIZE</span><span class="p">];</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Dumping these structures out raw would definitely be useless since
they’re almost entirely pointer data. So instead, fill out an array of
these structures, referencing the array itself to build up the
pointers (later filled in by either the linker or the loader). This
code uses the <a href="/blog/2016/11/13/">in-place breadth-first traversal technique</a> from
the previous article.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">trie_serialize</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"struct trie %s[] = {</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">head</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"    {{"</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TRIE_ALPHABET_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">comma</span> <span class="o">=</span> <span class="n">i</span> <span class="o">?</span> <span class="s">", "</span> <span class="o">:</span> <span class="s">""</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
                <span class="cm">/* Add child to the queue. */</span>
                <span class="n">tail</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
                <span class="n">tail</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
                <span class="n">next</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
                <span class="cm">/* Print the pointer to the child. */</span>
                <span class="n">printf</span><span class="p">(</span><span class="s">"%s%s + %zu"</span><span class="p">,</span> <span class="n">comma</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="o">++</span><span class="n">count</span><span class="p">);</span>
            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                <span class="n">printf</span><span class="p">(</span><span class="s">"%s0"</span><span class="p">,</span> <span class="n">comma</span><span class="p">);</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"}, 0, 0, %u},</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">TRIE_TERMINAL_FLAG</span><span class="p">);</span>
        <span class="n">head</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">p</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"};</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
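<p>If it’s not obvious that an array initializer may point back into its own
array, here is a much smaller sketch of the same idea. The
<code class="language-plaintext highlighter-rouge">struct node</code>, the <code class="language-plaintext highlighter-rouge">list</code> array, and <code class="language-plaintext highlighter-rouge">list_sum()</code> are hypothetical stand-ins
for generated output, not part of the trie code. Each <code class="language-plaintext highlighter-rouge">list + N</code> is an
address constant, so it’s allowed in a static initializer.</p>

```c
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

/* Hypothetical stand-in for generated output: a three-element linked
 * list whose links are expressed as offsets into the array itself. */
struct node list[] = {
    {list + 1, 10},
    {list + 2, 20},
    {0, 30},
};

/* Follow the links from n and sum the values along the way. */
int
list_sum(const struct node *n)
{
    int sum = 0;
    for (; n; n = n->next)
        sum += n->value;
    return sum;
}
```

<p>No code runs at startup to build these links. The pointers are already in
place by the time <code class="language-plaintext highlighter-rouge">main()</code> is entered.</p>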

<p>Remember that list of strings from before?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AAAAA
ABCD
CAA
CAD
CDBD
</code></pre></div></div>

<p>Which looks like this?</p>

<p><img src="/img/trie/trie.svg" alt="" /></p>

<p>That serializes to this C code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">trie</span> <span class="n">root</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">3</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">6</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">10</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">13</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">14</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This trie can be immediately used at program startup without
initialization, and it can even have new nodes inserted into it. It’s
not without its downsides, particularly because it’s a trie:</p>

<ol>
  <li>
    <p>It’s <em>really</em> going to blow up the size of the binary, especially
when it holds lots of strings. These nodes are anything but
compact.</p>
  </li>
  <li>
    <p>If the code is compiled to be position-independent (<code class="language-plaintext highlighter-rouge">-fPIC</code>), each
of those nodes is going to hold multiple dynamic relocations,
further exploding the size of the binary and <a href="/blog/2016/10/27/">preventing the trie
from being shared between processes</a>. It’s 24 bytes per
relocation on x86-64. This will also slow down program startup.
With just a few thousand strings, the simple test program was
taking 5x longer to start (25ms instead of 5ms) than with an empty
trie.</p>
  </li>
  <li>
    <p>Even without being position-independent, the linker will have to
resolve all the compile-time relocations. I was able to overwhelm
the linker and run it out of memory with just some tens of thousands
of strings. This would make for a decent linker stress test.</p>
  </li>
</ol>

<p>This technique obviously doesn’t scale well with trie data. You’re
better off baking in the flat string list and building the trie at run
time — though you <em>could</em> compute the exact number of needed nodes at
compile time and statically allocate them (in .bss). I’ve personally
had much better luck with <a href="https://github.com/skeeto/yavalath">other sorts of lookup tables</a>.
It’s a useful tool for the C programmer’s toolbelt.</p>
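
<p>As a rough sketch of the run-time alternative mentioned above, assuming
a four-symbol alphabet (A through D) and a node count already known at
build time. The names <code class="language-plaintext highlighter-rouge">node_pool</code>, <code class="language-plaintext highlighter-rouge">POOL_SIZE</code>, <code class="language-plaintext highlighter-rouge">pool_insert</code>, and
<code class="language-plaintext highlighter-rouge">build_trie</code> are hypothetical, and the node layout is simplified for
illustration rather than taken from the test program:</p>

```c
#include <assert.h>
#include <stddef.h>

#define ALPHABET_SIZE 4          /* assumed alphabet: 'A' through 'D' */
#define POOL_SIZE     16         /* assumed to be computed at build time */

struct node {
    struct node *next[ALPHABET_SIZE];
    int terminal;
};

/* Zero-initialized static storage lands in .bss: it takes no space in
 * the binary image and carries no relocations. */
static struct node node_pool[POOL_SIZE];
static size_t node_count = 1;    /* node_pool[0] is the root */

/* The flat string list that would be baked into the binary. */
static const char *const keys[] = {"AAAAA", "ABCD", "CAA", "CAD", "CDBD"};

/* Insert one key, drawing new nodes from the pool. Returns 0 if the
 * pool is exhausted or the key has a symbol outside the alphabet. */
static int
pool_insert(const char *s)
{
    struct node *t = node_pool;
    for (; *s; s++) {
        int i = *s - 'A';
        if (i < 0 || i >= ALPHABET_SIZE)
            return 0;
        if (!t->next[i]) {
            if (node_count == POOL_SIZE)
                return 0;
            t->next[i] = &node_pool[node_count++];
        }
        t = t->next[i];
    }
    t->terminal = 1;
    return 1;
}

/* Build the whole trie at startup. Returns the number of nodes in
 * use, or 0 on failure. */
static size_t
build_trie(void)
{
    for (size_t k = 0; k < sizeof(keys) / sizeof(keys[0]); k++)
        if (!pool_insert(keys[k]))
            return 0;
    return node_count;
}
```

<p>For these five keys the pool of 16 nodes is filled exactly. The cost
moves from link time and load time to a short loop at startup.</p>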

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Zero-allocation Trie Traversal</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/13/"/>
    <id>urn:uuid:38dd798b-9e27-3109-590c-3a8482f634a7</id>
    <updated>2016-11-13T06:03:24Z</updated>
    <category term="c"/><category term="compsci"/>
    <content type="html">
      <![CDATA[<p>As part of a demonstration in <a href="/blog/2016/11/15/">an upcoming article</a>, I wrote a
simple <a href="https://en.wikipedia.org/wiki/Trie">trie</a> implementation. A trie is a search tree where the
keys are sequences of symbols (i.e. strings). Strings with a common
prefix share an initial path down the trie, and the keys themselves
are stored implicitly by the structure of the trie. It’s commonly used
as a sorted set or, when values are associated with nodes, an
associative array.</p>

<p>This wasn’t my first time writing a trie. The curse of programming in
C is rewriting the same data structures and algorithms over and over.
It’s the problem C++ templates are intended to solve. This rewriting
isn’t always bad since each implementation is typically customized for
its specific use, often resulting in greater performance and a smaller
resource footprint.</p>

<p>Every time I’ve rewritten a trie, my implementation is a little bit
better than the last. This time around I discovered an approach for
traversing, both depth-first and breadth-first, an arbitrarily-sized
trie without memory allocation. I’m definitely not the first to
discover something like this. There’s <a href="https://xlinux.nist.gov/dads/HTML/SchorrWaiteGraphMarking.html">Deutsch-Schorr-Waite pointer
reversal</a> for binary graphs (1965) — which I originally learned
from reading the <a href="http://t3x.org/s9fes/">Scheme 9 from Outer Space</a> garbage collector
source — and <a href="http://www.geeksforgeeks.org/morris-traversal-for-preorder/">Morris in-order traversal</a> (1979) for binary
trees. The former requires two extra tag bits per node and the latter
requires no changes to the node structure at all.</p>
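
<p>For reference, here is a minimal sketch of Morris in-order traversal on
a plain binary tree. The <code class="language-plaintext highlighter-rouge">struct bnode</code> type and both function names are
made up for this illustration. It needs no extra fields, though it does
temporarily borrow null right pointers as links back to the in-order
successor and restores them before it finishes:</p>

```c
#include <assert.h>
#include <stddef.h>

struct bnode {
    struct bnode *left, *right;
    int value;
};

/* Visit every node in order, writing values into out[]. Returns the
 * number of values written. No stack and no recursion are used. */
static size_t
morris_inorder(struct bnode *root, int *out)
{
    size_t n = 0;
    struct bnode *cur = root;
    while (cur) {
        if (!cur->left) {
            out[n++] = cur->value;
            cur = cur->right;
        } else {
            /* Find the in-order predecessor of cur. */
            struct bnode *pre = cur->left;
            while (pre->right && pre->right != cur)
                pre = pre->right;
            if (!pre->right) {
                pre->right = cur;   /* create link, descend left */
                cur = cur->left;
            } else {
                pre->right = NULL;  /* remove link, visit, go right */
                out[n++] = cur->value;
                cur = cur->right;
            }
        }
    }
    return n;
}

/* Hypothetical demo: a small tree holding 1 through 5. Returns 1 if
 * the values come out in order and the tree is left unchanged. */
static int
morris_demo(void)
{
    struct bnode n1 = {NULL, NULL, 1};
    struct bnode n3 = {NULL, NULL, 3};
    struct bnode n5 = {NULL, NULL, 5};
    struct bnode n2 = {&n1, &n3, 2};
    struct bnode n4 = {&n2, &n5, 4};
    int out[5];

    if (morris_inorder(&n4, out) != 5)
        return 0;
    for (int i = 0; i < 5; i++)
        if (out[i] != i + 1)
            return 0;
    /* The borrowed right pointers must have been restored. */
    return !n1.right && !n3.right && !n5.right;
}
```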

<h3 id="whats-a-trie">What’s a trie?</h3>

<p>But before I go further, some background. A trie can come in many
shapes and sizes, but in the simple case each node of a trie has as
many pointers as its alphabet has symbols. For illustration purposes,
imagine a trie for strings over an alphabet of only four characters:
A, B, C, and D. Each node is essentially four pointers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define TRIE_ALPHABET_SIZE  4
#define TRIE_STATIC_INIT    {.flags = 0}
#define TRIE_TERMINAL_FLAG  (1U &lt;&lt; 0)
</span>
<span class="k">struct</span> <span class="n">trie</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">next</span><span class="p">[</span><span class="n">TRIE_ALPHABET_SIZE</span><span class="p">];</span>
    <span class="kt">unsigned</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>It includes a <code class="language-plaintext highlighter-rouge">flags</code> field, where a single bit tracks whether or not
a node is terminal — that is, a key terminates at this node. Terminal
nodes are not necessarily leaf nodes, which is the case when one key
is a prefix of another key. I could instead have used a 1-bit
bit-field (e.g. <code class="language-plaintext highlighter-rouge">int is_terminal : 1;</code>) but I don’t like bit-fields.</p>

<p>A trie with the following keys, inserted in any order:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AAAAA
ABCD
CAA
CAD
CDBD
</code></pre></div></div>

<p>Looks like this (terminal nodes illustrated as small black squares):</p>

<p><img src="/img/trie/trie.svg" alt="" /></p>

<p>The root of the trie is the empty string, and each child represents a
trie prefixed with one of the symbols from the alphabet. This is a
nice recursive definition, and it’s tempting to write recursive
functions to process it. For example, here’s a recursive insertion
function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">trie_insert_recursive</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">TRIE_TERMINAL_FLAG</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span> <span class="o">-</span> <span class="sc">'A'</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">]));</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">trie</span><span class="p">)</span><span class="n">TRIE_STATIC_INIT</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">trie_insert_recursive</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the string is empty (<code class="language-plaintext highlighter-rouge">!*s</code>), mark the current node as terminal.
Otherwise recursively insert the substring under the appropriate
child. That’s a tail call, and any optimizing compiler would turn
this call into a jump back to the beginning of the function
(tail-call optimization), reusing the stack frame as if it were a
simple loop.</p>

<p>If that’s not good enough, such as when optimization is disabled for
debugging and the recursive definition is blowing the stack, this is
trivial to convert to a safe, iterative function. I prefer this
version anyway.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">trie_insert</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span> <span class="n">s</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span> <span class="o">-</span> <span class="sc">'A'</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">]));</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
            <span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">trie</span><span class="p">)</span><span class="n">TRIE_STATIC_INIT</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">TRIE_TERMINAL_FLAG</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finding a particular prefix in the trie iteratively is also easy. This
would be used to narrow the trie to a chosen prefix before iterating
over the keys (e.g. find all strings matching a prefix).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span>
<span class="nf">trie_find</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span> <span class="n">s</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span> <span class="o">-</span> <span class="sc">'A'</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
            <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
        <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">t</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
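
<p>Since <code class="language-plaintext highlighter-rouge">trie_find</code> returns a node for any prefix, an exact-key lookup
also has to check the terminal flag. A minimal sketch, where
<code class="language-plaintext highlighter-rouge">trie_contains</code> is a hypothetical helper added for illustration and keys
are assumed to use only the symbols A through D:</p>

```c
#include <assert.h>
#include <stdlib.h>

#define TRIE_ALPHABET_SIZE  4
#define TRIE_STATIC_INIT    {.flags = 0}
#define TRIE_TERMINAL_FLAG  (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};

/* trie_insert() and trie_find() are repeated from the article so that
 * this sketch compiles on its own. */
int
trie_insert(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i]) {
            t->next[i] = malloc(sizeof(*t->next[i]));
            if (!t->next[i])
                return 0;
            *t->next[i] = (struct trie)TRIE_STATIC_INIT;
        }
        t = t->next[i];
    }
    t->flags |= TRIE_TERMINAL_FLAG;
    return 1;
}

struct trie *
trie_find(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i])
            return NULL;
        t = t->next[i];
    }
    return t;
}

/* Hypothetical helper: non-zero only if s was inserted as a complete
 * key, not merely as a prefix of some other key. */
int
trie_contains(struct trie *t, const char *s)
{
    struct trie *n = trie_find(t, s);
    return n && (n->flags & TRIE_TERMINAL_FLAG);
}
```

<p>After inserting only <code class="language-plaintext highlighter-rouge">CAA</code> and <code class="language-plaintext highlighter-rouge">CAD</code>, <code class="language-plaintext highlighter-rouge">trie_find</code> returns a node for
<code class="language-plaintext highlighter-rouge">CA</code> because it is a prefix, but <code class="language-plaintext highlighter-rouge">trie_contains</code> reports that it is not
a key.</p>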

<p>Depth-first traversal is <em>stack-oriented</em>. The stack represents the
path through the graph, and each new vertex is pushed into this stack
as it’s visited. A recursive traversal function can implicitly use the
call stack for storing this information, so no additional data
structure is needed.</p>

<p>The downside is that the call is no longer tail-recursive, so a deep
trie will blow the stack. Also, the caller needs to provide a callback
function because the stack cannot unwind to return a value: The stack
has important state on it. Here’s a typedef for the callback.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="nf">void</span> <span class="p">(</span><span class="o">*</span><span class="n">trie_visitor</span><span class="p">)(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">);</span>
</code></pre></div></div>

<p>And here’s the recursive depth-first traversal function. The top-level
caller passes the same buffer for <code class="language-plaintext highlighter-rouge">buf</code> and <code class="language-plaintext highlighter-rouge">bufend</code>, which must have
room for the longest key plus its null terminator. The visited key will
be written to this buffer and passed to the visitor.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">trie_dfs_recursive</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span>
                   <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span>
                   <span class="kt">char</span> <span class="o">*</span><span class="n">bufend</span><span class="p">,</span>
                   <span class="n">trie_visitor</span> <span class="n">v</span><span class="p">,</span>
                   <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">TRIE_TERMINAL_FLAG</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">bufend</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">v</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">arg</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TRIE_ALPHABET_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="o">*</span><span class="n">bufend</span> <span class="o">=</span> <span class="sc">'A'</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
            <span class="n">trie_dfs_recursive</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">buf</span><span class="p">,</span> <span class="n">bufend</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">arg</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
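
<p>A usage sketch may help here. The visitor <code class="language-plaintext highlighter-rouge">collect</code> and the driver
<code class="language-plaintext highlighter-rouge">dfs_demo</code> are hypothetical names, and the key buffer is sized for the
longest example key plus a null terminator. Because children are visited
in alphabet order, the keys come out sorted regardless of insertion
order:</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TRIE_ALPHABET_SIZE  4
#define TRIE_STATIC_INIT    {.flags = 0}
#define TRIE_TERMINAL_FLAG  (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};

typedef void (*trie_visitor)(const char *key, void *arg);

/* trie_insert() and trie_dfs_recursive() are repeated from the
 * article so that this sketch compiles on its own. */
int
trie_insert(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i]) {
            t->next[i] = malloc(sizeof(*t->next[i]));
            if (!t->next[i])
                return 0;
            *t->next[i] = (struct trie)TRIE_STATIC_INIT;
        }
        t = t->next[i];
    }
    t->flags |= TRIE_TERMINAL_FLAG;
    return 1;
}

void
trie_dfs_recursive(struct trie *t, char *buf, char *bufend,
                   trie_visitor v, void *arg)
{
    if (t->flags & TRIE_TERMINAL_FLAG) {
        *bufend = 0;
        v(buf, arg);
    }
    for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
        if (t->next[i]) {
            *bufend = 'A' + i;
            trie_dfs_recursive(t->next[i], buf, bufend + 1, v, arg);
        }
    }
}

/* Hypothetical visitor: append each key plus a newline to the string
 * that arg points at, which is assumed to be large enough. */
static void
collect(const char *key, void *arg)
{
    char *out = arg;
    strcat(out, key);
    strcat(out, "\n");
}

/* Hypothetical driver: insert the example keys out of order, then
 * return every key found by the traversal, or NULL on failure. */
static const char *
dfs_demo(void)
{
    static const char *const keys[] = {"CDBD", "CAA", "AAAAA", "CAD", "ABCD"};
    static struct trie root = TRIE_STATIC_INIT;
    static char out[64];
    char buf[6];  /* longest key is 5 symbols, plus null terminator */

    out[0] = 0;
    for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
        if (!trie_insert(&root, keys[i]))
            return NULL;
    trie_dfs_recursive(&root, buf, buf, collect, out);
    return out;
}
```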

<h4 id="heap-allocated-traversal-stack">Heap-allocated Traversal Stack</h4>

<p>Moving the traversal stack to the heap would eliminate the stack
overflow problem and it would allow control to return to the caller.
This is going to be a lot of code for an article, but bear with me.</p>

<p>First define an iterator object. The stack will need two pieces of
information: which node did we come from (<code class="language-plaintext highlighter-rouge">p</code>) and through which
pointer (<code class="language-plaintext highlighter-rouge">i</code>). When a node has been exhausted, this will allow return
to the parent. The <code class="language-plaintext highlighter-rouge">root</code> field tracks when traversal is complete.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">trie_iter</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">root</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">bufend</span><span class="p">;</span>
    <span class="k">struct</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="p">}</span> <span class="o">*</span><span class="n">stack</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>A special value of -1 in <code class="language-plaintext highlighter-rouge">i</code> means it’s the first visit for this node
and its key should be reported to the caller if it’s terminal.</p>

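<p>The iterator is initialized with <code class="language-plaintext highlighter-rouge">trie_iter_init</code>. The <code class="language-plaintext highlighter-rouge">max</code> argument
sizes both the stack and the key buffer, so it must be at least one more
than the length of the longest key: the stack holds one entry per node on
the current path, root included, and the buffer needs room for the null
terminator. A more elaborate implementation could automatically grow the
stack to accommodate (e.g. realloc()), but I’m keeping it as simple as
possible.</p>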

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">trie_iter_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie_iter</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">max</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">root</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">)</span> <span class="o">*</span> <span class="n">max</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">bufend</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">max</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">i</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">trie_iter_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie_iter</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="p">);</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">free</span><span class="p">(</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">);</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And finally the complicated part. This uses the allocated stack to
explore the trie in a loop until it hits a terminal, at which point it
returns. A further call continues the traversal from where it left
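off. It’s like a hand-coded <a href="https://en.wikipedia.org/wiki/Generator_(computer_programming)">generator</a>. With the way it’s
written, the caller is obligated to follow through with the entire
iteration before destroying the iterator, since <code class="language-plaintext highlighter-rouge">it-&gt;stack</code> is moved in
place and only points back at the start of its allocation once
iteration completes. This would be easy to correct.</p>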

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">trie_iter_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie_iter</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">current</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">p</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">i</span><span class="o">++</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* Return result if terminal node. */</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">TRIE_TERMINAL_FLAG</span><span class="p">)</span> <span class="p">{</span>
                <span class="o">*</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">bufend</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
                <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">continue</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">==</span> <span class="n">TRIE_ALPHABET_SIZE</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* End of current node. */</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">current</span> <span class="o">==</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">)</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// back at root, done</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="o">--</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">bufend</span><span class="o">--</span><span class="p">;</span>
            <span class="k">continue</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="cm">/* Push on next child node. */</span>
            <span class="o">*</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">bufend</span> <span class="o">=</span> <span class="sc">'A'</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="o">++</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">bufend</span><span class="o">++</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="n">current</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">i</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is <em>much</em> nicer for the caller since there’s no inversion of control.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">trie_iter</span> <span class="n">it</span><span class="p">;</span>
<span class="n">trie_iter_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">trie_root</span><span class="p">,</span> <span class="n">KEY_MAX</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">trie_iter_next</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="p">))</span> <span class="p">{</span>
    <span class="c1">// ... do something with it.buf ...</span>
<span class="p">}</span>
<span class="n">trie_iter_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">it</span><span class="p">);</span>
</code></pre></div></div>

<p>There are a few downsides to this:</p>

<ol>
  <li>
    <p>Initialization could fail (not checked in the example) since it
allocates memory.</p>
  </li>
  <li>
    <p>Either the caller has to keep track of the maximum key length, or
the iterator grows the stack automatically, which would mean
iteration could fail at any point in the middle.</p>
  </li>
  <li>
    <p>In order to destroy the trie, it needs to be traversed: Freeing
memory first requires allocating memory. If the program is out of
memory, it cannot destroy the trie to clean up before handling the
situation, nor to make more memory available. It’s not good for
resilience.</p>
  </li>
</ol>

<p>Wouldn’t it be nice to traverse the trie without memory allocation?</p>

<h3 id="modifying-the-trie">Modifying the Trie</h3>

<p>Rather than allocate a separate stack, the stack can be allocated
across the individual nodes of the trie. Remember those <code class="language-plaintext highlighter-rouge">p</code> and <code class="language-plaintext highlighter-rouge">i</code>
fields from before? Put them on the trie.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">trie_v2</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">next</span><span class="p">[</span><span class="n">TRIE_ALPHABET_SIZE</span><span class="p">];</span>
    <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p><img src="/img/trie/trie_v2.svg" alt="" /></p>

<p>This automatically scales with the size of the trie, so there will
always be enough of this stack. With the stack “pre-allocated” like
this, traversal requires no additional memory allocation.</p>

<p>The iterator itself becomes a little simpler. It cannot fail and it
doesn’t need a destructor.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">trie_v2_iter</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">current</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
<span class="p">};</span>

<span class="kt">void</span>
<span class="nf">trie_v2_iter_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie_v2_iter</span> <span class="o">*</span><span class="n">it</span><span class="p">,</span> <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">i</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">current</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The iteration function itself is almost identical to before. Rather
than increment a stack pointer, it uses <code class="language-plaintext highlighter-rouge">p</code> to chain the nodes as a
linked list.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">trie_v2_iter_next</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie_v2_iter</span> <span class="o">*</span><span class="n">it</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">current</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">current</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">it</span><span class="o">-&gt;</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">i</span><span class="o">++</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* Return result if terminal node. */</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">TRIE_TERMINAL_FLAG</span><span class="p">)</span> <span class="p">{</span>
                <span class="o">*</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
                <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">continue</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">==</span> <span class="n">TRIE_ALPHABET_SIZE</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* End of current node. */</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">p</span><span class="p">)</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">current</span> <span class="o">=</span> <span class="n">current</span><span class="o">-&gt;</span><span class="n">p</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span><span class="o">--</span><span class="p">;</span>
            <span class="k">continue</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="cm">/* Push on next child node. */</span>
            <span class="o">*</span><span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">=</span> <span class="sc">'A'</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">buf</span><span class="o">++</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">current</span> <span class="o">=</span> <span class="n">current</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="n">current</span><span class="p">;</span>
            <span class="n">it</span><span class="o">-&gt;</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">i</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>

    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>During traversal the iteration pointers look something like this:</p>

<p><img src="/img/trie/trie_v2-dfs.svg" alt="" /></p>

<p>This is not without its downsides:</p>

<ol>
  <li>
    <p>Traversal is neither re-entrant nor thread-safe. It’s not possible to
run multiple in-place iterators side by side on the same trie since
they’ll clobber each other.</p>
  </li>
  <li>
    <p>It uses more memory — O(n) rather than O(max-key-length) — and sits
on this extra memory for its entire lifetime.</p>
  </li>
</ol>

<h4 id="breadth-first-traversal">Breadth-first Traversal</h4>

<p>The same technique can be used for breadth-first search, which is
<em>queue-oriented</em> rather than stack-oriented. The <code class="language-plaintext highlighter-rouge">p</code> pointers are
instead chained into a queue, with a <code class="language-plaintext highlighter-rouge">head</code> and <code class="language-plaintext highlighter-rouge">tail</code> pointer
variable for each end. As each node is visited, its children are
pushed into the queue linked list.</p>

<p>This isn’t good for visiting keys by name. <code class="language-plaintext highlighter-rouge">buf</code> was itself a stack
and played nicely with depth-first traversal, but there’s no easy way
to build up a key in a buffer breadth-first. So instead here’s a
function to destroy a trie breadth-first.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">trie_v2_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">t</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
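    <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="p">)</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>  <span class="cm">/* clear any stale link on a childless root */</span>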
    <span class="k">while</span> <span class="p">(</span><span class="n">head</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TRIE_ALPHABET_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">next</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
                <span class="n">tail</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
                <span class="n">tail</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="k">struct</span> <span class="n">trie_v2</span> <span class="o">*</span><span class="n">dead</span> <span class="o">=</span> <span class="n">head</span><span class="p">;</span>
        <span class="n">head</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">p</span><span class="p">;</span>
        <span class="n">free</span><span class="p">(</span><span class="n">dead</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>During its traversal the <code class="language-plaintext highlighter-rouge">p</code> pointers link up like so:</p>

<p><img src="/img/trie/trie_v2-bfs.svg" alt="" /></p>
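
<p>The same queue threading works without freeing anything. As a rough
sketch, here is a hypothetical <code class="language-plaintext highlighter-rouge">trie_v2_count</code> (not part of the original
code) that counts nodes breadth-first using <code class="language-plaintext highlighter-rouge">p</code> as the queue link. The
alphabet size of 26 is an assumption for illustration.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

#define TRIE_ALPHABET_SIZE 26  /* assumed: A-Z */

struct trie_v2 {
    struct trie_v2 *next[TRIE_ALPHABET_SIZE];
    struct trie_v2 *p;
    int i;
    unsigned flags;
};

/* Hypothetical helper: count nodes breadth-first, no allocation. */
size_t
trie_v2_count(struct trie_v2 *t)
{
    size_t count = 0;
    struct trie_v2 *head = t;
    struct trie_v2 *tail = t;
    if (t)
        t-&gt;p = NULL;  /* the root starts as the only queue entry */
    while (head) {
        for (int i = 0; i &lt; TRIE_ALPHABET_SIZE; i++) {
            struct trie_v2 *next = head-&gt;next[i];
            if (next) {
                next-&gt;p = NULL;  /* new end of the queue */
                tail-&gt;p = next;  /* append after the current tail */
                tail = next;
            }
        }
        count++;
        head = head-&gt;p;  /* pop the front of the queue */
    }
    return count;
}
</code></pre></div></div>

<p>Since nothing is freed, the <code class="language-plaintext highlighter-rouge">p</code> chain is left behind in visit order
once the function returns.</p>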

<h3 id="further-research">Further Research</h3>

<p>In my real code there’s also a flag to indicate the node’s allocation
type: static or heap. This allows a trie to be composed of nodes from
both kinds of allocations while still safe to destroy. It might also
be useful to pack a reference counter into this space so that a node
could be shared by more than one trie.</p>

<p>For a production implementation it may be worth packing <code class="language-plaintext highlighter-rouge">i</code> into the
<code class="language-plaintext highlighter-rouge">flags</code> field since it only needs a few bits, even with larger
alphabets. Also, I bet, as in Deutsch-Schorr-Waite, the <code class="language-plaintext highlighter-rouge">p</code> field
could be eliminated and instead one of the child pointers is
temporarily reversed. With these changes, this technique would fit
into the original <code class="language-plaintext highlighter-rouge">struct trie</code> without changes, eliminating the extra
memory usage.</p>
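
<p>Here’s a rough sketch of how packing <code class="language-plaintext highlighter-rouge">i</code> into <code class="language-plaintext highlighter-rouge">flags</code> might
look. The bit layout, the macro names, and the two helper functions are
all assumptions made for illustration. Because <code class="language-plaintext highlighter-rouge">i</code> starts at -1, it’s
stored biased by one so that it fits in an unsigned bit field.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define TRIE_ALPHABET_SIZE 26          /* assumed: A-Z */
#define TRIE_TERMINAL_FLAG (1U &lt;&lt; 0)   /* assumed bit position */
#define TRIE_ITER_SHIFT    8           /* hypothetical layout */
#define TRIE_ITER_BITS     5           /* biased index 0..28 fits in 5 bits */
#define TRIE_ITER_MASK     (((1U &lt;&lt; TRIE_ITER_BITS) - 1) &lt;&lt; TRIE_ITER_SHIFT)

/* Read the iteration index out of flags. After the final
 * post-increment it can reach TRIE_ALPHABET_SIZE + 1. */
static int
trie_flags_get_i(unsigned flags)
{
    return (int)((flags &amp; TRIE_ITER_MASK) &gt;&gt; TRIE_ITER_SHIFT) - 1;
}

/* Return flags with the iteration index replaced, other bits untouched. */
static unsigned
trie_flags_set_i(unsigned flags, int i)
{
    unsigned biased = (unsigned)(i + 1);
    return (flags &amp; ~TRIE_ITER_MASK) | (biased &lt;&lt; TRIE_ITER_SHIFT);
}
</code></pre></div></div>

<p>A larger alphabet only needs a wider <code class="language-plaintext highlighter-rouge">TRIE_ITER_BITS</code>.</p>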

<p>Update: Over on Hacker News, <a href="https://news.ycombinator.com/item?id=12943339">psi-squared has interesting
suggestions</a> such as leaving the traversal pointers intact,
particularly in the case of a breadth-first search, which, until the
next trie modification, allows for concurrent follow-up traversals.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Emacs, Dynamic Modules, and Joysticks</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/05/"/>
    <id>urn:uuid:c53305bb-4770-3a7f-934c-31eea37d38eb</id>
    <updated>2016-11-05T04:01:51Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="c"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Two months ago Emacs 25 was released and introduced a <a href="http://diobla.info/blog-archive/modules-tut.html">new dynamic
module feature</a>. Emacs can now load shared libraries built
against Emacs’ module API, defined in <a href="http://git.savannah.gnu.org/cgit/emacs.git/tree/src/emacs-module.h?h=emacs-25.1">emacs-module.h</a>. What’s
interesting about this API is that it doesn’t require linking against
Emacs or any sort of library. Instead, at run time Emacs supplies the
module’s initialization function with function pointers for the entire
API.</p>

<p>As a demonstration, in this article I’ll build an Emacs joystick
interface (Linux only) using a dynamic module. It will allow Emacs to
read events from any joystick on the system. All the source code is
here:</p>

<ul>
  <li><a href="https://github.com/skeeto/joymacs">https://github.com/skeeto/joymacs</a></li>
</ul>

<p>It includes a calibration interface (<code class="language-plaintext highlighter-rouge">M-x joydemo</code>) within Emacs:</p>

<p><a href="/img/joymacs/joymacs.png"><img src="/img/joymacs/joymacs-thumb.png" alt="" /></a></p>

<p>Currently, Emacs’ emacs-module.h header is the entirety of the module
documentation. It’s a bit thin and leaves ambiguities that require
some reading of the Emacs source code. Even reading the source, it’s
not clear which behaviors are a reliable part of the interface. For
example, if there’s a pending non-local exit, it’s safe for a function
to return <code class="language-plaintext highlighter-rouge">NULL</code> since the return value is never inspected (Emacs
25.1), but will this always be the case? While mistakes are
unforgiving (a hard crash), the API is mostly intuitive and it’s been
pretty easy to feel my way around it.</p>

<p><em>Update</em>: Philipp Stephani has <a href="https://phst.github.io/emacs-modules">written thorough, reliable module
documentation</a>.</p>

<h3 id="dynamic-module-types">Dynamic Module Types</h3>

<p>All Emacs values — integers, floats, cons cells, vectors, strings,
etc. — are represented as the polymorphic, pointer-valued type,
<code class="language-plaintext highlighter-rouge">emacs_value</code>. Despite being a pointer, <code class="language-plaintext highlighter-rouge">NULL</code> is not a valid value,
as convenient as that would be. The API includes functions for
creating and extracting the fundamental types: integers, floats,
strings. Almost all other object types can only be accessed by making
Lisp function calls to regular Emacs functions from the module.</p>

<p>Modules also introduce a brand new Emacs object type: a <em>user
pointer</em>. These are <a href="/blog/2013/12/30/">non-readable</a>, opaque pointer values
returned by modules, typically representing a handle to some resource,
be it a memory block, database connection, or a joystick. These
objects include a finalizer function pointer — which, surprisingly, is
not permitted to be NULL — and their lifetime is managed by Emacs’
garbage collector.</p>

<p>User pointers are a somewhat dangerous feature since there’s little to
stop Emacs Lisp code from misusing them. A Lisp program can take a
user pointer from one module and pass it to a function in a different
module. Since it’s just a pointer, there’s no way to type check it. At
best, a module could maintain a table of all its live pointers,
checking all user pointer arguments against the table before
dereferencing. But I don’t expect this to be normal practice.</p>
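
<p>As a rough sketch of that table idea, a module could record each
pointer it hands out and check incoming pointers against the table. The
fixed-size array and the <code class="language-plaintext highlighter-rouge">registry_*</code> names are hypothetical, and nothing
here comes from the Emacs module API.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

#define REGISTRY_MAX 64  /* hypothetical limit on live handles */

static void *registry[REGISTRY_MAX];

/* Record a live pointer. Returns 1 on success, 0 if ptr is NULL
 * or the table is full. */
static int
registry_add(void *ptr)
{
    if (!ptr)
        return 0;
    for (size_t i = 0; i &lt; REGISTRY_MAX; i++) {
        if (!registry[i]) {
            registry[i] = ptr;
            return 1;
        }
    }
    return 0;
}

/* Returns 1 if ptr was recorded by this module and not yet removed. */
static int
registry_contains(const void *ptr)
{
    if (!ptr)
        return 0;
    for (size_t i = 0; i &lt; REGISTRY_MAX; i++)
        if (registry[i] == ptr)
            return 1;
    return 0;
}

/* Forget a pointer, for example from its finalizer. */
static void
registry_remove(const void *ptr)
{
    if (!ptr)
        return;
    for (size_t i = 0; i &lt; REGISTRY_MAX; i++)
        if (registry[i] == ptr)
            registry[i] = NULL;
}
</code></pre></div></div>

<p>A module function would call <code class="language-plaintext highlighter-rouge">registry_contains</code> on each extracted user
pointer and signal an error when it returns 0. A linear scan is fine for
a handful of handles.</p>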

<h3 id="module-initialization">Module Initialization</h3>

<p>After loading the module through the platform’s mechanism, the first
thing Emacs does is check for the symbol <code class="language-plaintext highlighter-rouge">plugin_is_GPL_compatible</code>.
While tacky, this is not surprising given the culture around Emacs.</p>

<p>Next it calls <code class="language-plaintext highlighter-rouge">emacs_module_init()</code>, passing it a runtime structure
that holds the first function pointer. From this, the module can get a
Lisp environment and start doing Emacs things, such as binding module
functions to Lisp symbols.</p>

<p>Here’s a complete “Hello, world!” example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"emacs-module.h"</span><span class="cp">
</span>
<span class="kt">int</span> <span class="n">plugin_is_GPL_compatible</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">emacs_module_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">emacs_runtime</span> <span class="o">*</span><span class="n">ert</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span> <span class="o">=</span> <span class="n">ert</span><span class="o">-&gt;</span><span class="n">get_environment</span><span class="p">(</span><span class="n">ert</span><span class="p">);</span>
    <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"message"</span><span class="p">);</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="n">hi</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!"</span><span class="p">;</span>
    <span class="n">emacs_value</span> <span class="n">string</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">hi</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">hi</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">env</span><span class="o">-&gt;</span><span class="n">funcall</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">string</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In a real module, it’s common to create function objects for native
functions, then fetch the <code class="language-plaintext highlighter-rouge">fset</code> symbol and make a Lisp call on it to
bind the newly-created function object to a name. You’ll see this in
action later.</p>

<h3 id="joystick-api">Joystick API</h3>

<p>The joystick API will closely resemble <a href="https://www.kernel.org/doc/Documentation/input/joystick-api.txt">Linux’s own joystick API</a>,
making for a fairly thin wrapper. It’s so thin that Emacs <em>almost</em>
doesn’t even need a dynamic module. This is because, on Linux,
joysticks are just files under <code class="language-plaintext highlighter-rouge">/dev/input/</code>. Want to see the input
events on the first joystick? Just read <code class="language-plaintext highlighter-rouge">/dev/input/js0</code>. So Plan 9.</p>

<p>Emacs already knows how to read files, but these virtual files are a
little <em>too</em> special for that. The header <code class="language-plaintext highlighter-rouge">linux/joystick.h</code> defines a
<code class="language-plaintext highlighter-rouge">struct js_event</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">js_event</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">time</span><span class="p">;</span>  <span class="cm">/* event timestamp in milliseconds */</span>
    <span class="kt">int16_t</span> <span class="n">value</span><span class="p">;</span>
    <span class="kt">uint8_t</span> <span class="n">type</span><span class="p">;</span>
    <span class="kt">uint8_t</span> <span class="n">number</span><span class="p">;</span> <span class="cm">/* axis/button number */</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The idea is to read from the joystick device into this structure. The
first several reads are initialization events that define the axes and
buttons of the joystick and their initial state. Further events are
queued up for the file descriptor. This all means that the file can’t
just be opened each time joystick input is needed. It has to be held
open for the duration, and is typically configured non-blocking.</p>
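
<p>As a minimal sketch of such a read, assuming a descriptor opened with
<code class="language-plaintext highlighter-rouge">O_NONBLOCK</code>: the hypothetical helper <code class="language-plaintext highlighter-rouge">js_poll</code> below (not part of
joymacs) tries to read one event and reports whether one was queued.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;errno.h&gt;
#include &lt;unistd.h&gt;
#include &lt;linux/joystick.h&gt;

/* Hypothetical helper: try to read one event from a non-blocking
 * joystick descriptor. Returns 1 if *e was filled, 0 if no event
 * is queued right now, and -1 on error or end of file. */
static int
js_poll(int fd, struct js_event *e)
{
    ssize_t r = read(fd, e, sizeof(*e));
    if (r == (ssize_t)sizeof(*e))
        return 1;
    if (r == -1 &amp;&amp; (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return -1;
}
</code></pre></div></div>

<p>The initialization events carry the <code class="language-plaintext highlighter-rouge">JS_EVENT_INIT</code> bit in <code class="language-plaintext highlighter-rouge">type</code>, so
masking it off leaves <code class="language-plaintext highlighter-rouge">JS_EVENT_BUTTON</code> or <code class="language-plaintext highlighter-rouge">JS_EVENT_AXIS</code>.</p>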

<p>The Emacs package will be called <code class="language-plaintext highlighter-rouge">joymacs</code> and there will be three
functions:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">joymacs-open</span> <span class="nv">N</span><span class="p">)</span>
<span class="p">(</span><span class="nv">joymacs-close</span> <span class="nv">JOYSTICK</span><span class="p">)</span>
<span class="p">(</span><span class="nv">joymacs-read</span> <span class="nv">JOYSTICK</span> <span class="nv">EVENT-VECTOR</span><span class="p">)</span>
</code></pre></div></div>

<h4 id="joymacs-open">joymacs-open</h4>

<p>The <code class="language-plaintext highlighter-rouge">joymacs-open</code> function will take an integer, opening the Nth
joystick (<code class="language-plaintext highlighter-rouge">/dev/input/jsN</code>). It will create a file descriptor for the
joystick device, returning it as a user pointer. Think of it as a sort
of “joystick handle.” Now, it <em>could</em> instead return the file
descriptor as an integer, but the user pointer has two significant
benefits:</p>

<ol>
  <li>
    <p><strong>The resource will be garbage collected.</strong> If the caller loses
track of a file descriptor returned as an integer, the joystick
device will be held open until Emacs shuts down, using up one of
Emacs’ file descriptors. By putting it in a user pointer, the
garbage collector will have the module release the file
descriptor if the user loses track of it.</p>
  </li>
  <li>
    <p><strong>It should be difficult for the user to make a dangerous call.</strong>
Emacs Lisp can’t create user pointers — they only come from modules
— and so the module is less likely to get passed the wrong thing.
In the case of <code class="language-plaintext highlighter-rouge">joymacs-close</code>, the module will be calling
<code class="language-plaintext highlighter-rouge">close(2)</code> on the argument. We definitely don’t want to make that
system call on file descriptors owned by Emacs. Further, since user
pointers are mutable, the module can ensure it doesn’t call
<code class="language-plaintext highlighter-rouge">close(2)</code> twice.</p>
  </li>
</ol>

<p>Here’s the implementation for <code class="language-plaintext highlighter-rouge">joymacs-open</code>. I’ll go over each part
in detail.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">emacs_value</span>
<span class="nf">joymacs_open</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ptr</span><span class="p">;</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">n</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">extract_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">buflen</span> <span class="o">=</span> <span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"/dev/input/js%d"</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">O_RDONLY</span> <span class="o">|</span> <span class="n">O_NONBLOCK</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">emacs_value</span> <span class="n">signal</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"file-error"</span><span class="p">);</span>
        <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">buflen</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_signal</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">signal</span><span class="p">,</span> <span class="n">message</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fin_close</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">fd</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The C function name doesn’t matter to Emacs. It’s <code class="language-plaintext highlighter-rouge">static</code> because it
doesn’t even matter if the function is visible to Emacs. Emacs will get the
function pointer later as part of initialization.</p>

<p>This is the prototype for all functions callable by Emacs Lisp,
regardless of their arity. It has four arguments:</p>

<ol>
  <li>
    <p>It gets an environment, <code class="language-plaintext highlighter-rouge">env</code>, through which to call back into
Emacs.</p>
  </li>
  <li>
    <p>It gets <code class="language-plaintext highlighter-rouge">n</code>, the number of arguments. This is guaranteed to be the
correct number of arguments, as specified later when creating the
function object, so only variadic functions need to inspect this
argument.</p>
  </li>
  <li>
    <p>The Lisp arguments are passed as an array of values, <code class="language-plaintext highlighter-rouge">args</code>.
There’s no type declaration when declaring a function object, so
these may be of the wrong type. I’ll go over how to deal with this.</p>
  </li>
  <li>
    <p>Finally, it gets an arbitrary pointer, supplied at function object
creation time. This allows the module to create closures, but will
usually be ignored.</p>
  </li>
</ol>

<p>The first thing the function does is extract its integer argument.
This is actually an <code class="language-plaintext highlighter-rouge">intmax_t</code>, but I don’t think anyone has that many
USB ports. An <code class="language-plaintext highlighter-rouge">int</code> will suffice.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">extract_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
</code></pre></div></div>

<p>As for not underestimating fools, what if the user passed a value that
isn’t an integer? Will the world come crashing down? Fortunately Emacs
checks that in <code class="language-plaintext highlighter-rouge">extract_integer</code> and, if there’s a mismatch, sets a
pending error signal in the environment. This is really great because
checking types directly in the module is a <em>real pain in the ass</em>. So,
before committing to anything further, such as opening a file, I check
for this signal and bail out early if necessary. In Emacs 25.1 it’s
safe to return NULL since the return value will be completely ignored,
but I’d rather hedge my bets.</p>

<p>By the way, the <code class="language-plaintext highlighter-rouge">nil</code> here is a global variable set in initialization.
You don’t just get that for free!</p>

<p>The next step is opening the joystick device, read-only and
non-blocking. Non-blocking mode is vital because a later read would
otherwise hang Emacs whenever there are no events (well, unless the
read happened to be interrupted by a POSIX signal).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
    <span class="kt">int</span> <span class="n">buflen</span> <span class="o">=</span> <span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"/dev/input/js%d"</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">O_RDONLY</span> <span class="o">|</span> <span class="n">O_NONBLOCK</span><span class="p">);</span>
</code></pre></div></div>
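<p>The 64-byte buffer is comfortably larger than the prefix plus any
<code class="language-plaintext highlighter-rouge">int</code>, so <code class="language-plaintext highlighter-rouge">sprintf</code> can’t overflow here. As a minimal sketch of how
the same step could be written with an explicit bound, here is a
hypothetical helper, <code class="language-plaintext highlighter-rouge">js_path</code>, which is not part of joymacs:</p>

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of joymacs): format the joystick device
 * path into buf. Returns the string length, or -1 if the path would not
 * fit in size bytes including the terminator. */
static int
js_path(char *buf, size_t size, int id)
{
    int len = snprintf(buf, size, "/dev/input/js%d", id);
    if (len < 0 || (size_t)len >= size)
        return -1;
    return len;
}
```

<p>The returned length plays the same role as <code class="language-plaintext highlighter-rouge">buflen</code> above, so it could
be handed directly to <code class="language-plaintext highlighter-rouge">make_string</code>.</p>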

<p>If the joystick fails to open (e.g. it doesn’t exist, or the user
lacks permission), manually set an error signal for a non-local exit.
I chose the <code class="language-plaintext highlighter-rouge">file-error</code> signal and I’m just using the filename as the
signal data.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">emacs_value</span> <span class="n">signal</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"file-error"</span><span class="p">);</span>
        <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">buflen</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_signal</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">signal</span><span class="p">,</span> <span class="n">message</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Otherwise create the user pointer. No need to allocate any memory;
just stuff the file descriptor in the pointer itself. If the user
mistakenly passes it to another module, that module will sure be in for
a surprise when it tries to dereference it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">return</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fin_close</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>
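<p>To see the pointer-stuffing trick in isolation, here’s a small sketch
with two hypothetical helpers, <code class="language-plaintext highlighter-rouge">fd_to_ptr</code> and <code class="language-plaintext highlighter-rouge">ptr_to_fd</code>. The
conversions are implementation-defined in C, so this assumes a mainstream
platform where an <code class="language-plaintext highlighter-rouge">int</code> survives the round trip through <code class="language-plaintext highlighter-rouge">intptr_t</code>:</p>

```c
#include <stdint.h>

/* Hypothetical helper: pack a file descriptor into a pointer-sized
 * value. Nothing is allocated; the pointer is never dereferenced. */
static void *
fd_to_ptr(int fd)
{
    return (void *)(intptr_t)fd;
}

/* Hypothetical helper: recover the file descriptor from the pointer. */
static int
ptr_to_fd(void *p)
{
    return (int)(intptr_t)p;
}
```

<p>The sentinel value -1 round-trips the same way, which is what lets it
mark a handle as already closed.</p>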

<p>The <code class="language-plaintext highlighter-rouge">fin_close()</code> function is defined as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">fin_close</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">fdptr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">fdptr</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The garbage collector will call this function when the user pointer
becomes unreachable. If the user closes it early with <code class="language-plaintext highlighter-rouge">joymacs-close</code>, that function
will set the user pointer to -1, an invalid file descriptor, so that
it doesn’t get closed a second time here.</p>

<h4 id="joymacs-close">joymacs-close</h4>

<p>Here’s <code class="language-plaintext highlighter-rouge">joymacs-close</code>, which is a bit simpler.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">emacs_value</span>
<span class="nf">joymacs_close</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ptr</span><span class="p">;</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">n</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">set_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Again, it starts by extracting its argument, relying on Emacs to do
the check:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
</code></pre></div></div>

<p>If the user pointer hasn’t been closed yet, then close it and strip
out the file descriptor to prevent further closes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">set_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>
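<p>The close-once idea doesn’t depend on the module API. As a rough
sketch, with a hypothetical <code class="language-plaintext highlighter-rouge">struct handle</code> standing in for the user
pointer’s storage:</p>

```c
#include <unistd.h>

/* Hypothetical stand-in for the storage behind the user pointer. */
struct handle {
    int fd;  /* -1 once the descriptor has been closed */
};

/* Close the descriptor at most once. Returns 1 if this call closed it
 * and 0 if it had already been closed. */
static int
handle_close(struct handle *h)
{
    if (h->fd == -1)
        return 0;
    close(h->fd);
    h->fd = -1;  /* sentinel: later calls become no-ops */
    return 1;
}
```

<p>Both the explicit close and the finalizer can then go through the same
check, and whichever runs second does nothing.</p>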

<h4 id="joymacs-read">joymacs-read</h4>

<p>The <code class="language-plaintext highlighter-rouge">joymacs-read</code> function is doing something a little unusual for an
Emacs Lisp function. It takes two arguments: the joystick handle and a
5-element vector. Instead of returning the event in some
representation, it fills the vector with the event details. There are
two reasons for this:</p>

<ol>
  <li>
    <p>The API has no function for creating vectors … though the module
<em>could</em> get the <code class="language-plaintext highlighter-rouge">make-vector</code> function and call it to create a
vector.</p>
  </li>
  <li>
    <p>The idiom for event pumps is for the caller to supply a buffer to
the pump. This has better performance by avoiding lots of
unnecessary allocations, especially since events tend to be
message-like objects with a short, well-defined extent.</p>
  </li>
</ol>

<p>Here’s the full definition:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">emacs_value</span>
<span class="nf">joymacs_read</span><span class="p">(</span><span class="n">emacs_env</span> <span class="o">*</span><span class="n">env</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">n</span><span class="p">,</span> <span class="n">emacs_value</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">n</span><span class="p">;</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">ptr</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="p">(</span><span class="kt">intptr_t</span><span class="p">)</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">get_user_ptr</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_check</span><span class="p">(</span><span class="n">env</span><span class="p">)</span> <span class="o">!=</span> <span class="n">emacs_funcall_exit_return</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">js_event</span> <span class="n">e</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">e</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">e</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">errno</span> <span class="o">==</span> <span class="n">EAGAIN</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* No more events. */</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* An actual read error (joystick unplugged, etc.). */</span>
        <span class="n">emacs_value</span> <span class="n">signal</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"file-error"</span><span class="p">);</span>
        <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">error</span> <span class="o">=</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">);</span>
        <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">error</span><span class="p">);</span>
        <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">error</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_signal</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">signal</span><span class="p">,</span> <span class="n">message</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="cm">/* Fill out event vector. */</span>
        <span class="n">emacs_value</span> <span class="n">v</span> <span class="o">=</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">emacs_value</span> <span class="n">type</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">&amp;</span> <span class="n">JS_EVENT_BUTTON</span> <span class="o">?</span> <span class="n">button</span> <span class="o">:</span> <span class="n">axis</span><span class="p">;</span>
        <span class="n">emacs_value</span> <span class="n">value</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">type</span> <span class="o">==</span> <span class="n">button</span><span class="p">)</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">value</span> <span class="o">?</span> <span class="n">t</span> <span class="o">:</span> <span class="n">nil</span><span class="p">;</span>
        <span class="k">else</span>
            <span class="n">value</span> <span class="o">=</span>  <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_float</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">value</span> <span class="o">/</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">INT16_MAX</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">time</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">type</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">number</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">&amp;</span> <span class="n">JS_EVENT_INIT</span> <span class="o">?</span> <span class="n">t</span> <span class="o">:</span> <span class="n">nil</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As before, extract the first argument and check for a signal. Then
call <code class="language-plaintext highlighter-rouge">read(2)</code> to get an event. If the read fails with <code class="language-plaintext highlighter-rouge">EAGAIN</code>, it’s
not a real failure. There are just no more events, so return nil.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">struct</span> <span class="n">js_event</span> <span class="n">e</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">e</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">e</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">errno</span> <span class="o">==</span> <span class="n">EAGAIN</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* No more events. */</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>If the read failed with something else — perhaps the joystick was
unplugged — signal an error. The <code class="language-plaintext highlighter-rouge">strerror(3)</code> string is used for the
signal data.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* An actual read error (joystick unplugged, etc.). */</span>
        <span class="n">emacs_value</span> <span class="n">signal</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"file-error"</span><span class="p">);</span>
        <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">error</span> <span class="o">=</span> <span class="n">strerror</span><span class="p">(</span><span class="n">errno</span><span class="p">);</span>
        <span class="n">emacs_value</span> <span class="n">message</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_string</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">error</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">error</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">non_local_exit_signal</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">signal</span><span class="p">,</span> <span class="n">message</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">nil</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>Otherwise fill out the event vector. If the second argument isn’t a
vector, or if it’s too short, the signal will automatically get raised
by Emacs. The module can keep plowing through the <code class="language-plaintext highlighter-rouge">vec_set()</code> calls
safely since it’s not committing to anything.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="cm">/* Fill out event vector. */</span>
        <span class="n">emacs_value</span> <span class="n">v</span> <span class="o">=</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">emacs_value</span> <span class="n">type</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">&amp;</span> <span class="n">JS_EVENT_BUTTON</span> <span class="o">?</span> <span class="n">button</span> <span class="o">:</span> <span class="n">axis</span><span class="p">;</span>
        <span class="n">emacs_value</span> <span class="n">value</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">type</span> <span class="o">==</span> <span class="n">button</span><span class="p">)</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">value</span> <span class="o">?</span> <span class="n">t</span> <span class="o">:</span> <span class="n">nil</span><span class="p">;</span>
        <span class="k">else</span>
            <span class="n">value</span> <span class="o">=</span>  <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_float</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">value</span> <span class="o">/</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">INT16_MAX</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">time</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">type</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_integer</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">number</span><span class="p">));</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">vec_set</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">&amp;</span> <span class="n">JS_EVENT_INIT</span> <span class="o">?</span> <span class="n">t</span> <span class="o">:</span> <span class="n">nil</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
</code></pre></div></div>

<p>The Linux event struct has four fields and the function fills out five
values of the vector. This is because the <code class="language-plaintext highlighter-rouge">type</code> field has a bit flag
indicating initialization events. This is split out into an extra
t/nil value. It also normalizes axis values and converts button values
into t/nil, which makes more sense for Emacs Lisp. The event itself is
returned since it’s a truthy value and it’s convenient for the caller.</p>
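
<p>As a rough sketch of the flag splitting described above, outside of
Emacs entirely: the constants below carry the same values as
<code class="language-plaintext highlighter-rouge">JS_EVENT_BUTTON</code>, <code class="language-plaintext highlighter-rouge">JS_EVENT_AXIS</code>, and <code class="language-plaintext highlighter-rouge">JS_EVENT_INIT</code>, but are redefined
under made-up names so the example builds anywhere. <code class="language-plaintext highlighter-rouge">struct decoded</code>
and <code class="language-plaintext highlighter-rouge">decode_type()</code> are hypothetical and not part of joymacs.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as JS_EVENT_BUTTON, JS_EVENT_AXIS, and JS_EVENT_INIT. */
#define EV_BUTTON 0x01
#define EV_AXIS   0x02
#define EV_INIT   0x80

struct decoded {
    int  kind;  /* EV_BUTTON or EV_AXIS, with the init bit stripped */
    bool init;  /* true when the init flag was set on the event */
};

/* Split the packed type field into an event kind and an init flag. */
static struct decoded
decode_type(uint8_t type)
{
    struct decoded d;
    d.kind = type & ~EV_INIT;
    d.init = (type & EV_INIT) != 0;
    return d;
}
```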

<p>The astute programmer might notice that the negative side of the axis
could go just below -1.0, since <code class="language-plaintext highlighter-rouge">INT16_MIN</code> has one extra value over
<code class="language-plaintext highlighter-rouge">INT16_MAX</code> (two’s complement). It doesn’t seem to be documented, but
the joystick drivers I’ve seen never exactly return <code class="language-plaintext highlighter-rouge">INT16_MIN</code>, so
this is in fact the correct way to normalize it.</p>
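
<p>To make the arithmetic concrete, here is a minimal sketch of the same
division in isolation. <code class="language-plaintext highlighter-rouge">normalize()</code> matches the expression used by the
module; <code class="language-plaintext highlighter-rouge">normalize_clamped()</code> is a hypothetical variant, not something
joymacs does, for anyone who would rather guarantee the range than rely
on driver behavior.</p>

```c
#include <stdint.h>

/* Plain division, as in the module: INT16_MIN lands just below -1.0
 * (about -1.00003), while every other input stays within [-1.0, 1.0]. */
static double
normalize(int16_t v)
{
    return v / (double)INT16_MAX;
}

/* Hypothetical alternative that clamps the one out-of-range input. */
static double
normalize_clamped(int16_t v)
{
    double x = normalize(v);
    return x < -1.0 ? -1.0 : x;
}
```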

<h4 id="initialization">Initialization</h4>

<p><em>Update 2021</em>: In a previous version of this article, I talked about
interning symbols during initialization so that they do not need to be
re-interned each time the module is called. This <a href="https://github.com/skeeto/joymacs/issues/1">no longer works</a>,
and it was probably never intended to work in the first place. The
lesson is simple: <strong>Do not reuse Emacs objects between module calls.</strong></p>

<p>First grab the <code class="language-plaintext highlighter-rouge">fset</code> symbol since this function will be needed to bind
names to the module’s functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">emacs_value</span> <span class="n">fset</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"fset"</span><span class="p">);</span>
</code></pre></div></div>

<p>Using <code class="language-plaintext highlighter-rouge">fset</code>, bind the functions. The second and third arguments to
<code class="language-plaintext highlighter-rouge">make_function</code> are the minimum and maximum number of arguments, which
<a href="/blog/2014/01/04/">may look familiar</a>. The last argument is that closure pointer
I mentioned at the beginning.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">emacs_value</span> <span class="n">args</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
    <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"joymacs-open"</span><span class="p">);</span>
    <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">make_function</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">joymacs_open</span><span class="p">,</span> <span class="n">doc</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">env</span><span class="o">-&gt;</span><span class="n">funcall</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">fset</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">args</span><span class="p">);</span>
</code></pre></div></div>

<p>If the module is to be loaded with <code class="language-plaintext highlighter-rouge">require</code> like any other package,
it needs to provide: <code class="language-plaintext highlighter-rouge">(provide 'joymacs)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">emacs_value</span> <span class="n">provide</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"provide"</span><span class="p">);</span>
    <span class="n">emacs_value</span> <span class="n">joymacs</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">intern</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="s">"joymacs"</span><span class="p">);</span>
    <span class="n">env</span><span class="o">-&gt;</span><span class="n">funcall</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">provide</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">joymacs</span><span class="p">);</span>
</code></pre></div></div>

<p>And that’s it!</p>

<p>The source repository now includes a port to Windows (XInput). If
you’re on Linux or Windows, have Emacs 25 with modules enabled, and a
joystick is plugged in, then <code class="language-plaintext highlighter-rouge">make run</code> in the repository should bring
up Emacs running a joystick calibration demonstration. The module
can’t poke at Emacs when events are ready, so instead there’s a timer
that polls the module for events.</p>

<p>I’d like to someday see an Emacs Lisp game well-suited for a joystick.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>An Array of Pointers vs. a Multidimensional Array</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/27/"/>
    <id>urn:uuid:d1302ff9-f958-3486-134d-01c8ab84aa51</id>
    <updated>2016-10-27T21:01:33Z</updated>
    <category term="c"/><category term="linux"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In a C program, suppose I have a table of color names of similar
length. There are two straightforward ways to construct this table.
The most common would be an array of <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The other is a two-dimensional <code class="language-plaintext highlighter-rouge">char</code> array.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The initializers are identical, and the syntax by which these tables
are used is the same, but the underlying data structures are very
different. For example, suppose I had a lookup() function that
searches the table for a particular color.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lookup</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">color</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">ncolors</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ncolors</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">color</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Thanks to array decay — array arguments are implicitly converted to
pointers (§6.9.1-10) — it doesn’t matter if the table is <code class="language-plaintext highlighter-rouge">char
colors[][7]</code> or <code class="language-plaintext highlighter-rouge">char *colors[]</code>. The shared syntax is a little misleading, though, because
the compiler generates different code depending on the type.</p>
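
<p>A small sketch of that point: the two functions below have
character-for-character identical bodies apart from the table name, and
both compile and return the same answers. The names <code class="language-plaintext highlighter-rouge">lookup_ptr()</code>
and <code class="language-plaintext highlighter-rouge">lookup_2d()</code> are mine, chosen so both tables can live in one file.</p>

```c
#include <string.h>

static char *colors_ptr[] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

static char colors_2d[][7] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

/* colors_ptr[i] is already a char *, loaded from the pointer array. */
static int
lookup_ptr(const char *color)
{
    int ncolors = sizeof(colors_ptr) / sizeof(colors_ptr[0]);
    for (int i = 0; i < ncolors; i++)
        if (strcmp(colors_ptr[i], color) == 0)
            return i;
    return -1;
}

/* colors_2d[i] is a char [7] that decays to a pointer at the call. */
static int
lookup_2d(const char *color)
{
    int ncolors = sizeof(colors_2d) / sizeof(colors_2d[0]);
    for (int i = 0; i < ncolors; i++)
        if (strcmp(colors_2d[i], color) == 0)
            return i;
    return -1;
}
```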

<h3 id="memory-layout">Memory Layout</h3>

<p>Here’s what <code class="language-plaintext highlighter-rouge">colors_ptr</code>, a <em>jagged array</em>, typically looks like in
memory.</p>

<p><img src="/img/colortab/pointertab.png" alt="" /></p>

<p>The array of six pointers will point into the program’s string table,
usually stored in a separate page. The strings aren’t in any
particular order and will be interspersed with the program’s other
string constants. The type of the expression <code class="language-plaintext highlighter-rouge">colors_ptr[n]</code> is <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<p>On x86-64, suppose the base of the table is in <code class="language-plaintext highlighter-rouge">rax</code>, the index of the
string I want to retrieve is <code class="language-plaintext highlighter-rouge">rcx</code>, and I want to put the string’s
address back into <code class="language-plaintext highlighter-rouge">rax</code>. It’s one load instruction.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Contrast this with <code class="language-plaintext highlighter-rouge">colors_2d</code>: six 7-byte elements in a row. No
pointers or addresses. Only strings.</p>

<p><img src="/img/colortab/arraytab.png" alt="" /></p>

<p>The strings are in their defined order, packed together. The type of
the expression <code class="language-plaintext highlighter-rouge">colors_2d[n]</code> is <code class="language-plaintext highlighter-rouge">char [7]</code>, an array rather than a
pointer. If this was a large table used by a hot function, it would
have friendlier cache characteristics — both in locality and
predictability.</p>
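
<p>The difference in layout is visible from C itself through <code class="language-plaintext highlighter-rouge">sizeof</code>
and pointer arithmetic. This is only an illustrative sketch, and
<code class="language-plaintext highlighter-rouge">offset_2d()</code> is a hypothetical helper: the two-dimensional table is 42
contiguous bytes with a fixed 7-byte stride, while the pointer table is
six pointers whose targets live elsewhere.</p>

```c
#include <stddef.h>

static char *colors_ptr[] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

static char colors_2d[][7] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

/* Byte distance from the start of colors_2d to element n: always n * 7,
 * since each element is a char [7] stored inline. */
static ptrdiff_t
offset_2d(int n)
{
    return (char *)colors_2d[n] - (char *)colors_2d;
}
```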

<p>In the same scenario as before with x86-64, it takes two instructions to
put the string’s address in <code class="language-plaintext highlighter-rouge">rax</code>, but neither is a load.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">imul</span>  <span class="nb">rcx</span><span class="p">,</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">7</span>
<span class="nf">add</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rcx</span>
</code></pre></div></div>

<p>In this particular case, the generated code can be slightly improved
by increasing the string size to 8 (e.g. <code class="language-plaintext highlighter-rouge">char colors_2d[][8]</code>). The
multiply turns into a simple shift and the ALU no longer needs to be
involved, cutting it to one instruction. This looks like a load due to
the LEA (Load Effective Address), but it’s not.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">lea</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<h3 id="relocation">Relocation</h3>

<p>There’s another factor to consider: relocation. Nearly every process
running on a modern system takes advantage of a security feature
called Address Space Layout Randomization (ASLR). The virtual address
of code and data is randomized at process load time. For shared
libraries, it’s not just a security feature, it’s essential to their
basic operation. Libraries cannot possibly coordinate their preferred
load addresses with every other library on the system, and so must be
relocatable.</p>

<p>If the program is compiled with GCC or Clang configured for position
independent code — <code class="language-plaintext highlighter-rouge">-fPIC</code> (for libraries) or <code class="language-plaintext highlighter-rouge">-fpie</code> + <code class="language-plaintext highlighter-rouge">-pie</code> (for
programs) — extra work has to be done to support <code class="language-plaintext highlighter-rouge">colors_ptr</code>. Those
are all addresses in the pointer array, but the compiler doesn’t know
what those addresses will be. The compiler fills the elements with
temporary values and adds six relocation entries to the binary, one
for each element. The loader will fill out the array at load time.</p>

<p>However, <code class="language-plaintext highlighter-rouge">colors_2d</code> doesn’t have any addresses other than the address
of the table itself. The loader doesn’t need to be involved with each
of its elements. Score another point for the two-dimensional array.</p>

<p>On x86-64, in both cases the table itself typically doesn’t need a
relocation entry because it will be <em>RIP-relative</em> (in the <a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">small code
model</a>). That is, code that uses the table will be at a fixed
offset from the table no matter where the program is loaded. It won’t
need to be looked up using the Global Offset Table (GOT).</p>

<p>In case you’re ever reading compiler output, in Intel syntax the
assembly for putting the table’s RIP-relative address in <code class="language-plaintext highlighter-rouge">rax</code> looks
like so:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; NASM:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">address</span><span class="p">]</span>
<span class="c1">;; Some others:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rip</span> <span class="o">+</span> <span class="nv">address</span><span class="p">]</span>
</code></pre></div></div>

<p>Or in AT&amp;T syntax:</p>

<pre><code class="language-gas">lea    address(%rip), %rax
</code></pre>

<h3 id="virtual-memory">Virtual Memory</h3>

<p>Besides (trivially) more work for the loader, there’s another
consequence to relocations: Pages containing relocations are not
shared between processes (except after fork()). When loading a
program, the loader doesn’t copy programs and libraries to memory so
much as it memory maps their binaries with copy-on-write semantics. If
another process is running with the same binaries loaded (e.g.
libc.so), they’ll share the same physical memory so long as those
pages haven’t been modified by either process. Modifying the page
creates a unique copy for that process.</p>

<p>Relocations modify parts of the loaded binary, so these pages aren’t
shared. This means <code class="language-plaintext highlighter-rouge">colors_2d</code> has the possibility of being shared
between processes, but <code class="language-plaintext highlighter-rouge">colors_ptr</code> (and its entire page) definitely
will not be. Shucks.</p>

<p>This is one of the reasons why the Procedure Linkage Table (PLT)
exists. The PLT is an array of function stubs for shared library
functions, such as those in the C standard library. Sure, the loader
<em>could</em> go through the program and fill out the address of every
library function call, but this would modify lots and lots of code
pages, creating a unique copy of large parts of the program. Instead,
the dynamic linker <a href="https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html">lazily supplies jump addresses</a> for PLT
function stubs, one per accessed library function.</p>

<p>However, as I’ve written it above, it’s unlikely that even <code class="language-plaintext highlighter-rouge">colors_2d</code>
will be shared. It’s still missing an important ingredient: const.</p>

<h3 id="const">Const</h3>

<p>They say <a href="/blog/2016/07/25/">const isn’t for optimization</a> but, darnit, this
situation keeps coming up. Since <code class="language-plaintext highlighter-rouge">colors_ptr</code> and <code class="language-plaintext highlighter-rouge">colors_2d</code> are both
global, writable arrays, the compiler puts them in the same writable
data section of the program, and, in my test program, they end up
right next to each other in the same page. The other relocations doom
<code class="language-plaintext highlighter-rouge">colors_2d</code> to being a local copy.</p>

<p>Fortunately it’s trivial to fix by adding a const:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>Writing to this memory is now undefined behavior, so the compiler is
free to put it in read-only memory (<code class="language-plaintext highlighter-rouge">.rodata</code>), separate from the
dirty relocations. On my system, this is close enough to the code to
wind up in executable memory.</p>

<p>Note, the equivalent for <code class="language-plaintext highlighter-rouge">colors_ptr</code> requires two const qualifiers,
one for the array and another for the strings. (Obviously the const
doesn’t apply to the loader.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="k">const</span> <span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>String literals are already effectively const, though the C
specification (unlike C++) doesn’t actually define them to be this
way. But, like setting your relationship status on Facebook, declaring
it makes it official.</p>

<h3 id="its-just-micro-optimization">It’s just micro-optimization</h3>

<p>These little details are all deep down the path of micro-optimization
and will rarely ever matter in practice, but perhaps you learned
something broader from all this. This stuff fascinates me.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Small-Size Optimization in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/07/"/>
    <id>urn:uuid:1e249621-40bb-39f9-7e47-17fbe37c9fa4</id>
    <updated>2016-10-07T01:43:12Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I’ve worked on many programs that frequently require small,
short-lived buffers for use as a temporary workspace, perhaps to
construct a string or array. In C this is often accomplished with
arrays of <a href="/blog/2016/10/02/">automatic storage duration</a> (i.e. allocated on the
stack). This is dirt cheap — much cheaper than a heap allocation —
and, unlike a typical general-purpose allocator, involves no thread
contention. However, the catch is that there may be no hard bound on the
buffer. For correctness, the scratch space must scale appropriately to
match its input. Whatever arbitrary buffer size I pick may be too small.</p>

<p>A widespread extension to C is the alloca() pseudo-function. It’s like
malloc(), but allocates memory on the stack, just like an automatic
variable. The allocation is automatically freed when the function (not
its scope!) exits, even with a longjmp() or other non-local exit.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">alloca</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>Besides its portability issues, the most dangerous property is the
<strong>complete lack of error detection</strong>. If <code class="language-plaintext highlighter-rouge">size</code> is too large, the
program simply crashes, <em>or worse</em>.</p>

<p>For example, suppose I have an intern() function that finds or creates
the canonical representation/storage for a particular string. My
program needs to intern a string composed of multiple values, and will
construct a temporary string to do so.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>

<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I expect the vast majority of these <code class="language-plaintext highlighter-rouge">prefix</code> strings to be very small,
perhaps on the order of 10 to 80 bytes, and this function will handle
them extremely efficiently. But should this function get passed a huge
<code class="language-plaintext highlighter-rouge">prefix</code>, perhaps by a malicious actor, the program will misbehave
without warning.</p>

<p>A portable alternative to alloca() is variable-length arrays (VLA),
introduced in C99. Arrays with automatic storage duration need not
have a fixed, compile-time size. It’s just like alloca(), having
exactly <strong>the same dangers</strong>, but at least it’s properly scoped. It
was rejected for inclusion in C++11 due to this danger.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buffer</span><span class="p">[</span><span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">];</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s a middle-ground to this, using neither VLAs nor alloca().
Suppose the function always allocates a small, fixed size buffer —
essentially a free operation — but only uses this buffer if it’s large
enough for the job. If it’s not, a normal heap allocation is made with
malloc().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">temp</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">temp</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">&gt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">temp</span><span class="p">))</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">)))</span>
            <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">buffer</span> <span class="o">!=</span> <span class="n">temp</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the function can now detect allocation errors, this version has
an error condition. Though, intern() itself would presumably return
NULL for its own allocation errors, so this is probably transparent to
the caller.</p>

<p>We’ve now entered the realm of <em>small-size optimization</em>. The vast
majority of cases are small and will therefore be very fast, but we
haven’t given up on the odd large case either. In fact, it’s been made
a little bit <em>worse</em> (via the unnecessary small allocation), selling
it out to make the common case fast. That’s sound engineering.</p>

<p>Visual Studio has a pair of functions that <em>nearly</em> automate this
solution: _malloca() and _freea(). It’s like alloca(), but
allocations beyond a certain threshold go on the heap. This allocation
is freed with _freea(), which does nothing in the case of a stack
allocation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">_malloca</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">_freea</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>I said “nearly” because Microsoft screwed it up: instead of returning
NULL on failure, it generates a stack overflow structured exception
(for a heap allocation failure).</p>

<p>I haven’t tried it yet, but I bet something similar to malloca() /
freea() could be implemented using a couple of macros.</p>
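
<p>Here’s a rough, untuned sketch of what such macros might look like. It
assumes a platform that provides the non-standard alloca() through
<code class="language-plaintext highlighter-rouge">alloca.h</code>, and every name in it (<code class="language-plaintext highlighter-rouge">SMALL_ALLOC</code>, <code class="language-plaintext highlighter-rouge">small_free</code>,
<code class="language-plaintext highlighter-rouge">SMALL_MAX</code>, <code class="language-plaintext highlighter-rouge">SMALL_HDR</code>) is made up for illustration. A small header in
front of each block records whether it came from the heap, and failure
is reported with a null pointer.</p>

```c
#include <alloca.h>  /* non-standard; assumed available (glibc, BSDs) */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_MAX 1024  /* hypothetical threshold, in bytes */
#define SMALL_HDR 16    /* header size; assumed to preserve alignment */

/* Record where the block came from, then step past the header. */
static void *
small_tag(void *raw, int on_heap)
{
    if (!raw)
        return 0;
    *(int *)raw = on_heap;
    return (char *)raw + SMALL_HDR;
}

/* A statement macro, since alloca() must run in the caller's frame
 * and should not appear inside a function call's argument list. */
#define SMALL_ALLOC(ptr, n)                              \
    do {                                                 \
        size_t n_ = (n);                                 \
        if (n_ <= SMALL_MAX) {                           \
            void *raw_ = alloca(n_ + SMALL_HDR);         \
            (ptr) = small_tag(raw_, 0);                  \
        } else if (n_ <= SIZE_MAX - SMALL_HDR) {         \
            void *raw_ = malloc(n_ + SMALL_HDR);         \
            (ptr) = small_tag(raw_, 1);                  \
        } else {                                         \
            (ptr) = 0;  /* size computation overflows */ \
        }                                                \
    } while (0)

/* Releases heap blocks; does nothing for stack blocks. */
static void
small_free(void *p)
{
    if (p) {
        char *raw = (char *)p - SMALL_HDR;
        if (*(int *)raw)
            free(raw);
    }
}

/* Returns 1 if n bytes landed on the heap, 0 if on the stack,
 * and -1 if the allocation failed. */
static int
small_demo(size_t n)
{
    char *buf;
    SMALL_ALLOC(buf, n);
    if (!buf)
        return -1;
    memset(buf, 0, n);
    int on_heap = *(int *)(buf - SMALL_HDR);
    small_free(buf);
    return on_heap;
}
```

<p>The stack allocation lives until the calling function returns, so this
sketch shouldn’t be used inside a loop.</p>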

<h3 id="toward-structured-small-size-optimization">Toward Structured Small-Size Optimization</h3>

<p>CppCon 2016 was a couple of weeks ago, and I’ve begun catching up on the
talks. I don’t like developing in C++, but I always learn new,
interesting concepts from this conference, many of which apply
directly to C. I look forward to Chandler Carruth’s talks the most,
having learned so much from his past talks. I recommend all of these:</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=fHNmRkzxHWs">Efficiency with Algorithms, Performance with Data Structures</a> (2014)</li>
  <li><a href="https://www.youtube.com/watch?v=nXaxk27zwlk">Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!</a> (2015)</li>
  <li><a href="https://www.youtube.com/watch?v=vElZc6zSIXM">High Performance Code 201: Hybrid Data Structures</a> (2016)</li>
  <li><a href="https://www.youtube.com/watch?v=FnGCDLhaxKU">Understanding Compiler Optimization</a> (2015)</li>
  <li><a href="https://www.youtube.com/watch?v=eR34r7HOU14">Optimizing the Emergent Structures of C++</a> (2013)</li>
</ul>

<p>After writing this article, I saw Nicholas Ormrod’s talk, <a href="https://www.youtube.com/watch?v=kPR8h4-qZdk">The strange
details of std::string at Facebook</a>, which is also highly
relevant.</p>

<p>Chandler’s talk this year was the one on hybrid data structures. I’d
already been mulling over small-size optimization for months, and the
first 5–10 minutes of his talk showed me I was on the right track. In
his talk he describes LLVM’s SmallVector class (among others), which
is basically a small-size-optimized version of std::vector, which, due
to constraints on iterators under std::move() semantics, can’t itself
be small-size optimized.</p>

<p>I picked up a new trick from this talk, which I’ll explain in C’s
terms. Suppose I have a dynamically growing buffer “vector” of <code class="language-plaintext highlighter-rouge">long</code>
values. I can keep pushing values into the buffer, doubling the
storage in size each time it fills. I’ll call this one “simple.”</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_simple</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Initialization is obvious. Though for easy overflow checks, and for
another reason I’ll explain later, I’m going to require the starting
size, <code class="language-plaintext highlighter-rouge">hint</code>, to be a power of two. It returns 1 on success and 0 on
error.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_simple_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">hint</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">hint</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">hint</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">hint</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// power of 2</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">hint</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Pushing is straightforward, using realloc() when the buffer fills,
returning 0 for integer overflow or allocation failure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_simple_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// overflow</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// out of memory</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
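
<p>The overflow test is easier to reason about in isolation. This sketch
pulls it into a hypothetical helper, <code class="language-plaintext highlighter-rouge">grow_size()</code>, which isn’t part of
the vector above. It relies on the count being a power of two: doubling
it either stays in range or wraps to exactly zero, so a zero check
catches the wraparound, and the division catches the byte count
overflowing.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stores count * 2 in *out and returns 1 if a
 * buffer of that many elements, each elem_size bytes, still fits in a
 * size_t. Returns 0 on overflow. */
static int
grow_size(size_t count, size_t elem_size, size_t *out)
{
    assert(count && (count & (count - 1)) == 0);  // power of 2
    size_t doubled = count * 2;  // unsigned: wraps to 0, never undefined
    if (!doubled || elem_size > SIZE_MAX / doubled)
        return 0;
    *out = doubled;
    return 1;
}
```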

<p>Next, cleaning up. I hadn’t thought about this before, but if
the compiler manages to inline vec_simple_free(), that NULL pointer
assignment will probably get optimized out, possibly <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">even in the face
of a use-after-free bug</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_simple_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// trap use-after-free bugs</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And finally an example of its use (without checking for errors).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">example</span><span class="p">(</span><span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">vec_simple</span> <span class="n">v</span><span class="p">;</span>
    <span class="n">vec_simple_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="kt">long</span> <span class="n">n</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">n</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">arg</span><span class="p">))</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">vec_simple_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
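    <span class="kt">long</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// ... process vector, storing the answer in result ...</span>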
    <span class="n">vec_simple_free</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
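
<p>For comparison, here’s a sketch of the same function with the error
checks included. The vector functions are repeated so the listing
stands alone. The “processing” step (summing the values), the
<code class="language-plaintext highlighter-rouge">-1</code> error return, and the <code class="language-plaintext highlighter-rouge">countdown()</code> generator are all invented for
the example.</p>

```c
#include <assert.h>
#include <stdlib.h>

struct vec_simple {
    size_t size;
    size_t count;
    long *values;
};

static int
vec_simple_init(struct vec_simple *v, size_t hint)
{
    assert(hint && (hint & (hint - 1)) == 0);  // power of 2
    v->size = hint;
    v->count = 0;
    v->values = malloc(sizeof(v->values[0]) * v->size);
    return !!v->values;
}

static int
vec_simple_push(struct vec_simple *v, long x)
{
    if (v->count == v->size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = v->size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow
        void *new_values = realloc(v->values, new_size * value_size);
        if (!new_values)
            return 0; // out of memory
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}

static void
vec_simple_free(struct vec_simple *v)
{
    free(v->values);
    v->values = 0;
}

/* Collects the positive values produced by f and returns their sum,
 * or -1 if memory runs out (an assumed convention). */
long
example_checked(long (*f)(void *), void *arg)
{
    struct vec_simple v;
    if (!vec_simple_init(&v, 16))
        return -1;
    long result = 0;
    long n;
    while ((n = f(arg)) > 0) {
        if (!vec_simple_push(&v, n)) {
            result = -1;
            goto done;
        }
    }
    for (size_t i = 0; i < v.count; i++)
        result += v.values[i];
done:
    vec_simple_free(&v);
    return result;
}

/* Hypothetical generator: yields *arg, *arg - 1, ..., 1, then 0. */
static long
countdown(void *arg)
{
    long *n = arg;
    return *n > 0 ? (*n)-- : 0;
}
```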

<p>If the common case is only a handful of <code class="language-plaintext highlighter-rouge">long</code> values, and this
function is called frequently, we’re doing a lot of heap allocation
that could be avoided. Wouldn’t it be nice to put all that on the
stack?</p>

<h3 id="applying-small-size-optimization">Applying Small-Size Optimization</h3>

<p>Modify the struct to add this <code class="language-plaintext highlighter-rouge">temp</code> field. It’s probably obvious what
I’m getting at here. This is essentially the technique in SmallVector.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_small</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">temp</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">values</code> field is initially pointed at the small buffer. Notice
that unlike the “simple” vector above, this initialization function
cannot fail. It’s one less thing for the caller to check. It also
doesn’t take a <code class="language-plaintext highlighter-rouge">hint</code> since the buffer size is fixed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_small_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Pushing gets a little more complicated. If it’s the first time the
buffer has grown, the realloc() has to be done “manually” with
malloc() and memcpy().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_small_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// overflow</span>

        <span class="kt">void</span>  <span class="o">*</span><span class="n">new_values</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* First time heap allocation. */</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">new_values</span><span class="p">)</span>
                <span class="n">memcpy</span><span class="p">(</span><span class="n">new_values</span><span class="p">,</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">));</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// out of memory</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, only call free() if the buffer was actually allocated on the
heap.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_small_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">!=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If 99% of these vectors never exceed 16 elements, then 99% of the time
the heap isn’t touched. That’s <em>much</em> better than before. The 1% case
is still covered, too, at what is probably an insignificant cost.</p>
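
<p>To see both cases, here’s a small self-contained sketch. It repeats the
functions above and adds two hypothetical helpers, <code class="language-plaintext highlighter-rouge">vec_small_on_heap()</code>
and <code class="language-plaintext highlighter-rouge">fill_small()</code>, which report whether a vector of a given length ever
left its embedded buffer.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct vec_small {
    size_t size;
    size_t count;
    long *values;
    long temp[16];
};

static void
vec_small_init(struct vec_small *v)
{
    v->size = sizeof(v->temp) / sizeof(v->temp[0]);
    v->count = 0;
    v->values = v->temp;
}

static int
vec_small_push(struct vec_small *v, long x)
{
    if (v->count == v->size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = v->size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow
        void *new_values;
        if (v->temp == v->values) {
            /* First time heap allocation. */
            new_values = malloc(new_size * value_size);
            if (new_values)
                memcpy(new_values, v->temp, sizeof(v->temp));
        } else {
            new_values = realloc(v->values, new_size * value_size);
        }
        if (!new_values)
            return 0; // out of memory
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}

static void
vec_small_free(struct vec_small *v)
{
    if (v->values != v->temp)
        free(v->values);
    v->values = 0;
}

/* Hypothetical helper: has the vector spilled onto the heap? */
static int
vec_small_on_heap(const struct vec_small *v)
{
    return v->values != v->temp;
}

/* Pushes n values. Returns 1 if that needed the heap, 0 if the
 * embedded buffer sufficed, and -1 on allocation failure. */
static int
fill_small(size_t n)
{
    struct vec_small v;
    vec_small_init(&v);
    for (size_t i = 0; i < n; i++) {
        if (!vec_small_push(&v, (long)i)) {
            vec_small_free(&v);
            return -1;
        }
    }
    assert(v.count == n);
    int spilled = vec_small_on_heap(&v);
    vec_small_free(&v);
    return spilled;
}
```

<p>One caveat with this layout: while <code class="language-plaintext highlighter-rouge">values</code> points at <code class="language-plaintext highlighter-rouge">temp</code>, copying
the struct by value leaves the copy pointing into the original, so it’s
best passed around by pointer.</p>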

<p>An important difference from SmallVector is that it parameterizes the
small buffer’s size through a template parameter. In C we’re stuck with
fixed sizes or macro hacks. Or are we?</p>

<h3 id="using-a-caller-provided-buffer">Using a Caller-Provided Buffer</h3>

<p>This time remove the temporary buffer, making it look like the simple
vector from before.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_flex</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The user will provide the initial buffer, which will presumably be an
adjacent, stack-allocated array, but whose size is under the user’s
control.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_flex_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">init</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// we need that low bit!</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">nmemb</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// power of 2</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">nmemb</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">init</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The power of two size, greater than one, means the size will always be
an even number. Why is this important? There’s one piece of
information missing from the struct: Is the buffer currently heap
allocated or not? That’s just one bit of information, but adding just
one more bit to the struct will typically pad it out by another 31 or 63
bits. What a waste! Since I’m not using the lowest bit of the
size (always being an even number), I can smuggle it in there. Hence
the <code class="language-plaintext highlighter-rouge">nmemb | 1</code>, the 1 indicating that it’s not heap allocated.</p>
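
<p>The packing can be sketched with three tiny helpers. These names
(<code class="language-plaintext highlighter-rouge">VEC_STACK_BIT</code>, <code class="language-plaintext highlighter-rouge">size_pack()</code>, <code class="language-plaintext highlighter-rouge">size_capacity()</code>, <code class="language-plaintext highlighter-rouge">size_on_stack()</code>)
are hypothetical and only illustrate the bit manipulation. The vector
code itself does the same thing inline.</p>

```c
#include <assert.h>
#include <stddef.h>

#define VEC_STACK_BIT ((size_t)1)  /* hypothetical name for the flag */

/* Combine a power-of-two capacity with the "not on the heap" flag. */
static size_t
size_pack(size_t capacity, int on_stack)
{
    assert(capacity > 1 && (capacity & (capacity - 1)) == 0);
    return capacity | (on_stack ? VEC_STACK_BIT : 0);
}

/* Recover the real capacity by clearing the flag bit. */
static size_t
size_capacity(size_t packed)
{
    return packed & ~VEC_STACK_BIT;
}

/* Recover the flag: nonzero while the caller's buffer is in use. */
static int
size_on_stack(size_t packed)
{
    return (int)(packed & VEC_STACK_BIT);
}
```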

<p>When pushing, the <code class="language-plaintext highlighter-rouge">actual_size</code> is extracted by clearing the bottom
bit (<code class="language-plaintext highlighter-rouge">size &amp; ~1</code>) and the indicator bit is extracted with a one-bit mask
(<code class="language-plaintext highlighter-rouge">size &amp; 1</code>). Once the buffer moves to the heap, the bit is cleared
simply by storing the new size without setting it again.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_flex_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">actual_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// clear bottom bit</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">actual_size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">actual_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* overflow */</span>

        <span class="kt">void</span> <span class="o">*</span><span class="n">new_values</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* First time heap allocation. */</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">new_values</span><span class="p">)</span>
                <span class="n">memcpy</span><span class="p">(</span><span class="n">new_values</span><span class="p">,</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">actual_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* out of memory */</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Only free() when it’s been allocated, like before.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_flex_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">))</span>
        <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And here’s what it looks like in action.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">example</span><span class="p">(</span><span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">vec_flex</span> <span class="n">v</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">buffer</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="n">vec_flex_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="kt">long</span> <span class="n">n</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">n</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">arg</span><span class="p">))</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">vec_flex_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
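    <span class="kt">long</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// ... process vector, storing the answer in result ...</span>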
    <span class="n">vec_flex_free</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you were to log all vector sizes as part of profiling, and the
assumption about their typical small number of elements was correct,
you could easily tune the array size in each case to remove the vast
majority of vector heap allocations.</p>

<p>Now that I’ve learned this optimization trick, I’ll be looking out for
good places to apply it. It’s also a good reason for me to stop
abusing VLAs.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>The Vulgarness of Abbreviated Function Templates</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/02/"/>
    <id>urn:uuid:048f746a-de7f-3357-8409-cfd531363726</id>
    <updated>2016-10-02T23:59:59Z</updated>
    <category term="c"/><category term="cpp"/><category term="rant"/><category term="lang"/>
    <content type="html">
      <![CDATA[<p>The <code class="language-plaintext highlighter-rouge">auto</code> keyword has been a part of C and C++ since the very
beginning, originally as one of the four <em>storage class specifiers</em>:
<code class="language-plaintext highlighter-rouge">auto</code>, <code class="language-plaintext highlighter-rouge">register</code>, <code class="language-plaintext highlighter-rouge">static</code>, and <code class="language-plaintext highlighter-rouge">extern</code>. An <code class="language-plaintext highlighter-rouge">auto</code> variable has
“automatic storage duration,” meaning it is automatically allocated at
the beginning of its scope and deallocated at the end. It’s the
default storage class for any variable without external linkage or
without <code class="language-plaintext highlighter-rouge">static</code> storage, so the vast majority of variables in a
typical C program are automatic.</p>

<p>In C and C++ <em>prior to C++11</em>, the following definitions are
equivalent because the <code class="language-plaintext highlighter-rouge">auto</code> is implied.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">square</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">x2</span> <span class="o">=</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x2</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">square</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">auto</span> <span class="kt">int</span> <span class="n">x2</span> <span class="o">=</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x2</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As a holdover from <em>really</em> old school C, unspecified types in C are
implicitly <code class="language-plaintext highlighter-rouge">int</code>, and even today you can get away with weird stuff
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* C only */</span>
<span class="n">square</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">auto</span> <span class="n">x2</span> <span class="o">=</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x2</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>By “get away with” I mean in terms of the compiler accepting this as
valid input. Your co-workers, on the other hand, may become violent.</p>

<p>Like <code class="language-plaintext highlighter-rouge">register</code>, as a storage class <code class="language-plaintext highlighter-rouge">auto</code> is an historical artifact
without direct practical use in modern code. However, as a <em>concept</em>
it’s indispensable for the specification. In practice, automatic
storage means the variable lives on “the” stack (or <a href="http://clang.llvm.org/docs/SafeStack.html">one of the
stacks</a>), but the specifications make no mention of a
stack. In fact, the word “stack” doesn’t appear even once. Instead
it’s all described in terms of “automatic storage,” rightfully leaving
the details to the implementations. A stack is the most sensible
approach the vast majority of the time, particularly because it’s both
thread-safe and re-entrant.</p>

<h3 id="c11-type-inference">C++11 Type Inference</h3>

<p>One of the major changes in C++11 was repurposing the <code class="language-plaintext highlighter-rouge">auto</code> keyword,
moving it from a storage class specifier to a <em>type specifier</em>. In
C++11, the compiler <strong>infers the type of an <code class="language-plaintext highlighter-rouge">auto</code> variable from its
initializer</strong>. In C++14, it’s also permitted for a function’s return
type, inferred from the <code class="language-plaintext highlighter-rouge">return</code> statement.</p>

<p>This new specifier is very useful in idiomatic C++ with its
ridiculously complex types. Transient variables, such as variables
bound to iterators in a loop, don’t need a redundant type
specification. It keeps code <em>DRY</em> (“Don’t Repeat Yourself”). Also,
templates are easier to write, since it makes the compiler do more of the
work. The necessary type information is already semantically present,
and the compiler is a lot better at dealing with it.</p>

<p>With this change, the following is valid in both C and C++11, and, by
<em>sheer coincidence</em>, has the same meaning, but for entirely different
reasons.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">square</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">auto</span> <span class="n">x2</span> <span class="o">=</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x2</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In C the type is implied as <code class="language-plaintext highlighter-rouge">int</code>, and in C++11 the type is inferred
from the type of <code class="language-plaintext highlighter-rouge">x * x</code>, which, in this case, is <code class="language-plaintext highlighter-rouge">int</code>. The prior
example with <code class="language-plaintext highlighter-rouge">auto int x2</code>, valid in C++98 and C++03, is no longer
valid in C++11 since <code class="language-plaintext highlighter-rouge">auto</code> and <code class="language-plaintext highlighter-rouge">int</code> are redundant type specifiers.</p>

<p>Occasionally I wish I had something like <code class="language-plaintext highlighter-rouge">auto</code> in C. If I’m writing a
<code class="language-plaintext highlighter-rouge">for</code> loop from 0 to <code class="language-plaintext highlighter-rouge">n</code>, I’d like the loop variable to be the same
type as <code class="language-plaintext highlighter-rouge">n</code>, even if I decide to change the type of <code class="language-plaintext highlighter-rouge">n</code> in the future.
For example,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">foo</span> <span class="o">*</span><span class="n">foo</span> <span class="o">=</span> <span class="n">foo_create</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">foo</span><span class="o">-&gt;</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
    <span class="cm">/* ... */</span><span class="p">;</span>
</code></pre></div></div>

<p>The loop variable <code class="language-plaintext highlighter-rouge">i</code> should be the same type as <code class="language-plaintext highlighter-rouge">foo-&gt;n</code>. If I decide
to change the type of <code class="language-plaintext highlighter-rouge">foo-&gt;n</code> in the struct definition, I’d have to
find and update every loop. The idiomatic C solution is to <code class="language-plaintext highlighter-rouge">typedef</code>
the integer, using the new type both in the struct and in loops, but I
don’t think that’s much better.</p>
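
<p>For reference, here’s a minimal sketch of that <code class="language-plaintext highlighter-rouge">typedef</code> idiom. The names <code class="language-plaintext highlighter-rouge">foo_index</code> and <code class="language-plaintext highlighter-rouge">foo_sum</code> and the <code class="language-plaintext highlighter-rouge">values</code> member are hypothetical, made up only for illustration.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/* Hypothetical index type: change it here and every loop follows. */
typedef size_t foo_index;

struct foo {
    foo_index n;     /* number of valid entries in values[] */
    int values[8];   /* hypothetical payload */
};

/* Sum the first foo-&gt;n values. The loop variable shares the typedef,
 * so it always matches the type of foo-&gt;n. */
static long
foo_sum(const struct foo *foo)
{
    long sum = 0;
    for (foo_index i = 0; i &lt; foo-&gt;n; i++)
        sum += foo-&gt;values[i];
    return sum;
}
</code></pre></div></div>

<p>It works, but every loop now depends on a name defined somewhere else, which is the extra indirection I’d rather avoid.</p>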

<h3 id="abbreviated-function-templates">Abbreviated Function Templates</h3>

<p>Why is all this important? Well, I was recently reviewing some C++ and
came across this odd specimen. I’d never seen anything like it before.
Notice the use of <code class="language-plaintext highlighter-rouge">auto</code> for the parameter types.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">set_odd</span><span class="p">(</span><span class="k">auto</span> <span class="n">first</span><span class="p">,</span> <span class="k">auto</span> <span class="n">last</span><span class="p">,</span> <span class="k">const</span> <span class="k">auto</span> <span class="o">&amp;</span><span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">bool</span> <span class="n">toggle</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">first</span> <span class="o">!=</span> <span class="n">last</span><span class="p">;</span> <span class="n">first</span><span class="o">++</span><span class="p">,</span> <span class="n">toggle</span> <span class="o">=</span> <span class="o">!</span><span class="n">toggle</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">toggle</span><span class="p">)</span>
            <span class="o">*</span><span class="n">first</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Given the other uses of <code class="language-plaintext highlighter-rouge">auto</code> as a type specifier, this kind of makes
sense, right? The compiler infers the type from the input argument.
But, as you should often do, put yourself in the compiler’s shoes for
a moment. Given this function definition in isolation, can you
generate any code? Nope. The compiler needs to see the call site
before it can infer the type. Even more, different call sites may use
different types. That <strong>sounds an awful lot like a template</strong>, eh?</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">V</span><span class="p">&gt;</span>
<span class="kt">void</span>
<span class="nf">set_odd</span><span class="p">(</span><span class="n">T</span> <span class="n">first</span><span class="p">,</span> <span class="n">T</span> <span class="n">last</span><span class="p">,</span> <span class="k">const</span> <span class="n">V</span> <span class="o">&amp;</span><span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">bool</span> <span class="n">toggle</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">first</span> <span class="o">!=</span> <span class="n">last</span><span class="p">;</span> <span class="n">first</span><span class="o">++</span><span class="p">,</span> <span class="n">toggle</span> <span class="o">=</span> <span class="o">!</span><span class="n">toggle</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">toggle</span><span class="p">)</span>
            <span class="o">*</span><span class="n">first</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is <strong>a proposed feature called <em>abbreviated function
templates</em></strong>, part of <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4361.pdf"><em>C++ Extensions for Concepts</em></a>. It’s
intended to be shorthand for the template version of the function. GCC
4.9 implements it as an extension, which is why the author was unaware
of its unofficial status. In March 2016 it was established that
<a href="http://honermann.net/blog/2016/03/06/why-concepts-didnt-make-cxx17/">abbreviated function templates <strong>would <em>not</em> be part of
C++17</strong></a>, but may still appear in a future revision.</p>

<p>Personally, I find this use of <code class="language-plaintext highlighter-rouge">auto</code> to be vulgar. It overloads the
keyword with a third definition. This isn’t unheard of — <code class="language-plaintext highlighter-rouge">static</code> also
serves a number of unrelated purposes — but while similar to the
second form of <code class="language-plaintext highlighter-rouge">auto</code> (type inference), this proposed third form is
very different in its semantics (far more complex) and overhead
(potentially very costly). I’m glad it’s been rejected so far.
Templates better reflect the nature of this sort of code.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Linux System Calls, Error Numbers, and In-Band Signaling</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/23/"/>
    <id>urn:uuid:ee8b3af5-ce09-3f9a-ef9c-0d95807bf95e</id>
    <updated>2016-09-23T01:07:40Z</updated>
    <category term="linux"/><category term="x86"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Today I got an e-mail asking about a previous article on <a href="/blog/2015/05/15/">creating
threads on Linux using raw system calls</a> (specifically x86-64).
The questioner was looking to use threads in a program without any
libc dependency. However, he was concerned about checking for mmap(2)
errors when allocating the thread’s stack. The <a href="http://man7.org/linux/man-pages/man2/mmap.2.html">mmap(2) man
page</a> says it returns -1 (a.k.a. <code class="language-plaintext highlighter-rouge">MAP_FAILED</code>) on error and sets
errno. But how do you check errno without libc?</p>

<p>As a reminder here’s what the (unoptimized) assembly looks like.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">stack_create:</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">0</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">STACK_SIZE</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nv">PROT_WRITE</span> <span class="o">|</span> <span class="nv">PROT_READ</span>
    <span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span> <span class="nv">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="nv">MAP_PRIVATE</span> <span class="o">|</span> <span class="nv">MAP_GROWSDOWN</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_mmap</span>
    <span class="nf">syscall</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>As usual, the system call return value is in <code class="language-plaintext highlighter-rouge">rax</code>, which becomes the
return value for <code class="language-plaintext highlighter-rouge">stack_create()</code>. Again, its C prototype would look
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>If you were to, say, intentionally botch the arguments to force an
error, you might notice that the system call isn’t returning -1, but
other negative values. What gives?</p>

<p>The trick is that <strong>errno is a C concept</strong>. That’s why it’s documented
as <a href="http://man7.org/linux/man-pages/man3/errno.3.html">errno(3)</a> — the 3 means it belongs to C. Just think about
how messy this thing is: it’s a thread-local value living in the
application’s address space. The kernel rightfully has nothing to do
with it. Instead, the mmap(2) wrapper in libc assigns errno (if
needed) after the system call returns. This is how <a href="http://man7.org/linux/man-pages/man2/intro.2.html"><em>all</em> system calls
through libc work</a>, even with the <a href="http://man7.org/linux/man-pages/man2/syscall.2.html">syscall(2)
wrapper</a>.</p>

<p>So how does the kernel report the error? It’s an old-fashioned return
value. If you have any doubts, take it straight from the horse’s
mouth: <a href="http://lxr.free-electrons.com/source/mm/mmap.c?v=4.6#L1143">mm/mmap.c:do_mmap()</a>. Here’s a sample of return
statements.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">len</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

<span class="cm">/* Careful about overflows.. */</span>
<span class="n">len</span> <span class="o">=</span> <span class="n">PAGE_ALIGN</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">len</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>

<span class="cm">/* offset overflow? */</span>
<span class="k">if</span> <span class="p">((</span><span class="n">pgoff</span> <span class="o">+</span> <span class="p">(</span><span class="n">len</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">))</span> <span class="o">&lt;</span> <span class="n">pgoff</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EOVERFLOW</span><span class="p">;</span>

<span class="cm">/* Too many mappings? */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">mm</span><span class="o">-&gt;</span><span class="n">map_count</span> <span class="o">&gt;</span> <span class="n">sysctl_max_map_count</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
</code></pre></div></div>

<p>It’s returning the negated error number. Simple enough.</p>

<p>If you think about it a moment, you might notice a complication: This
is a form of in-band signaling. On success, mmap(2) returns a memory
address. All those negative error numbers are potentially addresses
that a caller might want to map. How can we tell the difference?</p>

<p>1) None of the possible error numbers align on a page boundary, so
   they’re not actually valid return values. NULL <em>does</em> lie on a page
   boundary, which is one reason why it’s not used as an error return
   value for mmap(2). The other is that you might actually want to map
   NULL, for better <a href="https://blogs.oracle.com/ksplice/entry/much_ado_about_null_exploiting1">or worse</a>.</p>

<p>2) Those low negative values lie in a region of virtual memory
   reserved exclusively for the kernel (sometimes called “<a href="https://linux-mm.org/HighMemory">low
   memory</a>”). On x86-64, any address with the most significant
   bit set (i.e. the sign bit of a signed integer) is one of these
   addresses. Processes aren’t allowed to map these addresses, and so
   mmap(2) will never return such a value on success.</p>

<p>So what’s a clean, safe way to go about checking for error values?
It’s a lot easier to read <a href="https://www.musl-libc.org/">musl</a> than glibc, so let’s take a
peek at how musl does it in its own mmap: <a href="https://git.musl-libc.org/cgit/musl/tree/src/mman/mmap.c?h=v1.1.15">src/mman/mmap.c</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">off</span> <span class="o">&amp;</span> <span class="n">OFF_MASK</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">errno</span> <span class="o">=</span> <span class="n">EINVAL</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">MAP_FAILED</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">&gt;=</span> <span class="n">PTRDIFF_MAX</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">errno</span> <span class="o">=</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">MAP_FAILED</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">MAP_FIXED</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">__vm_wait</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">return</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">syscall</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="n">start</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="n">off</span><span class="p">);</span>
</code></pre></div></div>

<p>Hmm, it looks like it’s returning the result directly. What happened
to setting errno? Well, syscall() is actually a macro that runs the
result through __syscall_ret().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define syscall(...) __syscall_ret(__syscall(__VA_ARGS__))
</span></code></pre></div></div>

<p>Looking a little deeper: <a href="https://git.musl-libc.org/cgit/musl/tree/src/internal/syscall_ret.c?h=v1.1.15">src/internal/syscall_ret.c</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">__syscall_ret</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">r</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">errno</span> <span class="o">=</span> <span class="o">-</span><span class="n">r</span><span class="p">;</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Bingo. As documented, if the value falls within that “high” (unsigned)
range of negative values for <em>any</em> system call, it’s an error number.</p>

<p>Getting back to the original question, we could employ this same check
in the assembly code. However, since this is an anonymous memory map
with a kernel-selected address, <strong>there’s only one possible error:
ENOMEM</strong> (12). This error happens if the maximum number of memory maps
has been reached, or if there’s no contiguous region available for the
4MB stack. The check will only need to test the result against -12.</p>
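
<p>As a rough sketch, the general check could look like this in C for a caller without libc. It’s modeled on the musl range test above. The helper names <code class="language-plaintext highlighter-rouge">raw_is_error</code> and <code class="language-plaintext highlighter-rouge">raw_errno</code> are hypothetical, and it assumes the raw system call result has already been stored in a <code class="language-plaintext highlighter-rouge">long</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical helper: non-zero if a raw Linux system call return
 * value lies in the error range, i.e. -4095 through -1. */
static int
raw_is_error(long ret)
{
    return (unsigned long)ret &gt; -4096UL;
}

/* Hypothetical helper: the positive error number (e.g. 12 for ENOMEM)
 * for an error return, or 0 when the call succeeded. */
static int
raw_errno(long ret)
{
    return raw_is_error(ret) ? (int)-ret : 0;
}
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">stack_create()</code> in particular, by the reasoning above the only value to watch for is -12, so the full range test is optional there.</p>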

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Modifying the Middle of a zlib Stream</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/09/"/>
    <id>urn:uuid:804f0de4-93c7-3a70-d0d5-2b3b7192491f</id>
    <updated>2016-09-09T03:37:03Z</updated>
    <category term="c"/><category term="compression"/>
    <content type="html">
      <![CDATA[<p>I recently ran into a problem where I needed to modify bytes at the
beginning of an existing <a href="http://www.zlib.net/">zlib</a> stream. My program creates a
file in a format I do not control, and the file format has a header
indicating the total, uncompressed data size, followed immediately by
the data. The tricky part is that the <strong>header and data are zlib
compressed together</strong>, and I don’t know how much data there will be
until I’ve collected it all. Sometimes it’s many gigabytes. I don’t
know how to fill out the header when I start, and I can’t rewrite it
when I’m done since it’s compressed in the zlib stream … <em>or so I
thought</em>.</p>

<svg version="1.1" height="50" width="600">
  <rect fill="#dfd" width="149" height="48" x="1" y="1" stroke="black" stroke-width="2" />
  <rect fill="#ddf" width="449" height="48" x="150" y="1" stroke="black" stroke-width="2" />
  <text x="75" y="25" text-anchor="middle" dominant-baseline="central" font-size="22px" font-family="sans-serif">
    nelem
  </text>
  <text x="170" y="25" text-anchor="start" dominant-baseline="central" font-size="22px" font-family="sans-serif">
    samples[nelem]
  </text>
</svg>

<p>My original solution was not to compress anything until it gathered
the entirety of the data. The input would get concatenated into a huge
buffer, then finally compressed and written out. It’s not ideal,
because the program uses a lot more memory than it theoretically could,
especially if the data is highly compressible. It would be far better
to compress the data as it arrives and somehow update the header
later.</p>

<p>My first thought was to ask zlib to leave the header uncompressed,
then enable compression (<code class="language-plaintext highlighter-rouge">deflateParams()</code>) for the data. I’d work out
the magic offset and overwrite the uncompressed header bytes once I’m
done. There are two major issues with this, and I’ll address each:</p>

<ul>
  <li>
    <p>zlib includes a checksum (<a href="https://en.wikipedia.org/wiki/Adler-32">adler32</a>) at the end of the
data, and editing the stream would cause a mismatch. This is fairly
easy to correct thanks to adler32’s properties.</p>
  </li>
  <li>
    <p>zlib is an LZ77-family compressor and <a href="/blog/2014/11/22/">compression comes from
back-references</a> into past (and sometimes future) bytes of
decompressed output. Up to 32kB of data following the header could
reference bytes in the header as a dictionary. I would need to ask
zlib not to reference these bytes. Fortunately the zlib API is
intentionally designed for this, though for different purposes.</p>
  </li>
</ul>

<h3 id="fixing-the-checksum">Fixing the checksum</h3>

<p>Ignoring the second problem for a moment, I could fix the checksum by
computing it myself. When I overwrite my uncompressed header bytes, I
could also overwrite the checksum at the end of the compressed stream.
For illustration, here’s a simple example implementation of adler32
(from Wikipedia):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define MOD_ADLER 65521
</span>
<span class="kt">uint32_t</span>
<span class="nf">example_adler32</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">%</span> <span class="n">MOD_ADLER</span><span class="p">;</span>
        <span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="n">b</span> <span class="o">+</span> <span class="n">a</span><span class="p">)</span> <span class="o">%</span> <span class="n">MOD_ADLER</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">|</span> <span class="n">a</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you think about this for a moment, you may notice this puts me back
at square one. If I don’t know the header, then I don’t know the
checksum value at the end of the header, going into the data buffer.
I’d need to buffer all the data to compute the checksum. Fortunately
adler32 has the nice property that <strong>two checksums can be concatenated
as if they were one long stream</strong>. In a malicious context this is
known as a <a href="https://en.wikipedia.org/wiki/Length_extension_attack">length extension attack</a>, but it’s a real benefit
here.</p>

<p>It’s like the zlib authors anticipated my needs, because the zlib
library has a function <em>exactly</em> for this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">adler32_combine</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">adler1</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">adler2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len2</span><span class="p">);</span>
</code></pre></div></div>

<p>I just have to keep track of the data checksum <code class="language-plaintext highlighter-rouge">adler2</code> and I can
compute the proper checksum later.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">data_adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// initial value</span>
<span class="k">while</span> <span class="p">(</span><span class="n">processing_input</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">data_adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="n">data_adler</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
    <span class="n">total</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// ...</span>
<span class="kt">uint32_t</span> <span class="n">header_adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">header_adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="n">header_adler</span><span class="p">,</span> <span class="n">header</span><span class="p">,</span> <span class="n">header_size</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="n">adler</span> <span class="o">=</span> <span class="n">adler32_combine</span><span class="p">(</span><span class="n">header_adler</span><span class="p">,</span> <span class="n">data_adler</span><span class="p">,</span> <span class="n">total</span><span class="p">);</span>
</code></pre></div></div>
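<p>To see why combining works, here’s a minimal sketch that doesn’t depend on zlib at all. The names <code class="language-plaintext highlighter-rouge">my_adler32</code> and <code class="language-plaintext highlighter-rouge">my_adler32_combine</code> are hypothetical stand-ins of my own, not zlib’s implementation, and the initial checksum value is assumed to be 1, which is what <code class="language-plaintext highlighter-rouge">adler32(0, 0, 0)</code> returns. Prefixing a stream with a header adds the header’s byte sum to <code class="language-plaintext highlighter-rouge">a</code> once, and to <code class="language-plaintext highlighter-rouge">b</code> once for each byte that follows, so only the length of the second stream is needed.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define MOD_ADLER 65521

/* Continue an Adler-32 checksum over buf. Start from adler = 1. */
static uint32_t
my_adler32(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t a = adler & 0xffff;
    uint32_t b = adler >> 16;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

/* Checksum of stream1 followed by stream2, given only the two
 * independent checksums and the length of stream2. */
static uint32_t
my_adler32_combine(uint32_t adler1, uint32_t adler2, uint64_t len2)
{
    uint64_t a1 = adler1 & 0xffff, b1 = adler1 >> 16;
    uint64_t a2 = adler2 & 0xffff, b2 = adler2 >> 16;
    uint64_t rem = len2 % MOD_ADLER;

    /* Both checksums include the initial 1, so subtract one of them.
     * Adding MOD_ADLER first keeps the arithmetic non-negative. */
    uint64_t a = (a1 + a2 + MOD_ADLER - 1) % MOD_ADLER;

    /* Every byte of stream2 would have seen (a1 - 1) extra in "a". */
    uint64_t b = (b1 + b2 + rem * (a1 + MOD_ADLER - 1)) % MOD_ADLER;

    return (uint32_t)((b << 16) | a);
}
```

<p>For example, checksumming “Wiki” and “pedia” separately and combining them with a <code class="language-plaintext highlighter-rouge">len2</code> of 5 gives <code class="language-plaintext highlighter-rouge">0x11E60398</code>, the same value as checksumming “Wikipedia” in one pass.</p>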

<h3 id="preventing-back-references">Preventing back-references</h3>

<p>This part is more complicated and it helps to have some familiarity
with zlib. Every time zlib is asked to compress data, it’s given a
<a href="http://www.bolet.org/~pornin/deflate-flush.html">flush parameter</a>. Under normal operation, this value is always
<code class="language-plaintext highlighter-rouge">Z_NO_FLUSH</code> until the end of the stream, in which case it’s finalized
with <code class="language-plaintext highlighter-rouge">Z_FINISH</code>. Other flushing options force it to emit data sooner
at the cost of reduced compression ratio. This would primarily be used
to eliminate output latency on interactive streams (e.g. compressed
SSH sessions).</p>

<p>The necessary flush option for this situation is <code class="language-plaintext highlighter-rouge">Z_FULL_FLUSH</code>. It
forces out all output data and resets the dictionary: a fence.
<strong>Future inputs cannot reference anything before a full flush.</strong> Since
the header is uncompressed, it will not reference itself either.
Ignoring the checksum problem, I can safely modify these bytes.</p>

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>To fully demonstrate all of this, I’ve put together an example using
one of my favorite image formats, <a href="https://en.wikipedia.org/wiki/Netpbm_format">Netpbm P6</a>.</p>

<ul>
  <li><a href="https://github.com/skeeto/zlib-mutate-demo">https://github.com/skeeto/zlib-mutate-demo</a></li>
</ul>

<p>In the P6 format, the image header is an ASCII description of the
image’s dimensions followed immediately by raw pixel data.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>P6
width height
depth
[RGB bytes]
</code></pre></div></div>

<p>It’s a bit contrived, but it’s the project I used to work it all out.
The demo reads arbitrary raw byte data on standard input and uses it
to produce a zlib-compressed PPM file on standard output. It doesn’t
know the size of the input ahead of time, nor does it naively buffer
it all. There’s no dynamic allocation (except for what zlib does
internally), but the program can process arbitrarily large input. The
only requirement is that <strong>standard output is seekable</strong>. Using the
technique described above, it patches the header within the zlib
stream with the final image dimensions after the input has been
exhausted.</p>

<p>If you’re on a Debian system, you can use <code class="language-plaintext highlighter-rouge">zlib-flate</code> to decompress
raw zlib streams (gzip shares zlib’s DEFLATE compression, but can’t decompress raw zlib streams). Alternatively
your system’s <code class="language-plaintext highlighter-rouge">openssl</code> program may have zlib support. Here’s running
it on itself as input. Remember, you can’t pipe it into zlib-flate
because the output needs to be seekable in order to write the header.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./zppm &lt; zppm &gt; out.ppmz
$ zlib-flate -uncompress &lt; out.ppmz &gt; out.ppm
</code></pre></div></div>

<p><img src="/img/zppm.png" alt="" /></p>

<p>Unfortunately due to the efficiency-mindedness of zlib, its use
requires careful bookkeeping that’s easy to get wrong. It’s a little
machine that at each step needs to be either fed more input or its
output buffer cleared. Even with all the error checking stripped away,
it’s still too much to go over in full here, but I’ll summarize the
parts.</p>

<p>First I process an empty buffer with compression disabled. The output
buffer will be discarded, so the input buffer could be left uninitialized,
but I don’t want to <a href="http://valgrind.org/">upset anyone</a>. All I need is the output
size, which I use to seek over the to-be-written header. I use
<code class="language-plaintext highlighter-rouge">Z_FULL_FLUSH</code> as described, and there’s no loop because I presume my
output buffer is easily big enough for this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">bufin</span><span class="p">[</span><span class="mi">4096</span><span class="p">];</span>
<span class="kt">char</span> <span class="n">bufout</span><span class="p">[</span><span class="mi">4096</span><span class="p">];</span>

<span class="n">z_stream</span> <span class="n">z</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">next_in</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufin</span><span class="p">,</span>
    <span class="p">.</span><span class="n">avail_in</span> <span class="o">=</span> <span class="n">HEADER_SIZE</span><span class="p">,</span>
    <span class="p">.</span><span class="n">next_out</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufout</span><span class="p">,</span>
    <span class="p">.</span><span class="n">avail_out</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">),</span>
<span class="p">};</span>
<span class="n">deflateInit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">,</span> <span class="n">Z_NO_COMPRESSION</span><span class="p">);</span>
<span class="n">memset</span><span class="p">(</span><span class="n">bufin</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">HEADER_SIZE</span><span class="p">);</span>
<span class="n">deflate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">,</span> <span class="n">Z_FULL_FLUSH</span><span class="p">);</span>
<span class="n">fseek</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">)</span> <span class="o">-</span> <span class="n">z</span><span class="p">.</span><span class="n">avail_out</span><span class="p">,</span> <span class="n">SEEK_SET</span><span class="p">);</span>
</code></pre></div></div>

<p>Next I enable compression and reset the checksum. This makes zlib
track the checksum for all of the non-header input. Otherwise I’d be
throwing away all its checksum work and repeating it myself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">deflateParams</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">,</span> <span class="n">Z_BEST_COMPRESSION</span><span class="p">,</span> <span class="n">Z_DEFAULT_STRATEGY</span><span class="p">);</span>
<span class="n">z</span><span class="p">.</span><span class="n">adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>I won’t include it in this article, but what follows is a standard
zlib compression loop, consuming all the input data. There’s one key
difference compared to a normal zlib compression loop: when the input
is exhausted, instead of <code class="language-plaintext highlighter-rouge">Z_FINISH</code> I use <code class="language-plaintext highlighter-rouge">Z_SYNC_FLUSH</code> to force
everything out. The problem with <code class="language-plaintext highlighter-rouge">Z_FINISH</code> is that it will write the
checksum, but we’re not ready for that.</p>

<p>With all the input processed, it’s time to go back to rewrite the
header. Rather than mess around with magic byte offsets, I start a
second, temporary zlib stream and do the <code class="language-plaintext highlighter-rouge">Z_FULL_FLUSH</code> like before,
but this time with the real header. In deciding the header size, I
reserved 6 characters for the width and 10 characters for the height.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sprintf</span><span class="p">(</span><span class="n">bufin</span><span class="p">,</span> <span class="s">"P6</span><span class="se">\n</span><span class="s">%-6lu</span><span class="se">\n</span><span class="s">%-10lu</span><span class="se">\n</span><span class="s">255</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="n">adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">adler</span> <span class="o">=</span> <span class="n">adler32</span><span class="p">(</span><span class="n">adler</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufin</span><span class="p">,</span> <span class="n">HEADER_SIZE</span><span class="p">);</span>

<span class="n">z_stream</span> <span class="n">zh</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">next_in</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufin</span><span class="p">,</span>
    <span class="p">.</span><span class="n">avail_in</span> <span class="o">=</span> <span class="n">HEADER_SIZE</span><span class="p">,</span>
    <span class="p">.</span><span class="n">next_out</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufout</span><span class="p">,</span>
    <span class="p">.</span><span class="n">avail_out</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">),</span>
<span class="p">};</span>
<span class="n">deflateInit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">zh</span><span class="p">,</span> <span class="n">Z_NO_COMPRESSION</span><span class="p">);</span>
<span class="n">deflate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">zh</span><span class="p">,</span> <span class="n">Z_FULL_FLUSH</span><span class="p">);</span>
<span class="n">fseek</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">SEEK_SET</span><span class="p">);</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">bufout</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">)</span> <span class="o">-</span> <span class="n">zh</span><span class="p">.</span><span class="n">avail_out</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
<span class="n">fseek</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">SEEK_END</span><span class="p">);</span>
<span class="n">deflateEnd</span><span class="p">(</span><span class="o">&amp;</span><span class="n">zh</span><span class="p">);</span>
</code></pre></div></div>
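<p>The rewrite only lines up if the real header occupies exactly as many bytes as the placeholder did, which is what the padded fields are for. Here’s a small sketch of that idea on its own. The function name <code class="language-plaintext highlighter-rouge">ppm_fixed_header</code> and the constant <code class="language-plaintext highlighter-rouge">PPM_HEADER_SIZE</code> are hypothetical, and the value 25 is simply my count of the bytes the format string above produces (3 + 7 + 11 + 4), so treat it as an assumption rather than the demo’s actual <code class="language-plaintext highlighter-rouge">HEADER_SIZE</code>.</p>

```c
#include <stddef.h>
#include <stdio.h>

/* "P6\n" (3) + 6 digits and "\n" (7) + 10 digits and "\n" (11) + "255\n" (4) */
#define PPM_HEADER_SIZE 25

/* Write a space-padded, fixed-width P6 header into buf.
 * Returns PPM_HEADER_SIZE on success, or -1 if a dimension would
 * overflow its reserved field or buf is too small. */
static int
ppm_fixed_header(char *buf, size_t cap,
                 unsigned long width, unsigned long height)
{
    if (width > 999999UL) {
        return -1;  /* more than 6 digits */
    }
    if (height / 10 > 999999999UL) {
        return -1;  /* more than 10 digits */
    }
    int n = snprintf(buf, cap, "P6\n%-6lu\n%-10lu\n255\n", width, height);
    if (n != PPM_HEADER_SIZE || (size_t)n >= cap) {
        return -1;
    }
    return n;
}
```

<p>Netpbm readers accept any run of whitespace between header tokens, so the trailing spaces are harmless, and a 1×1 image produces a header exactly as long as a 999999-pixel-wide one.</p>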

<p>The header is now complete, so I can go back to finish the original
compression stream. Again, I assume the output buffer is big enough
for these final bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">z</span><span class="p">.</span><span class="n">adler</span> <span class="o">=</span> <span class="n">adler32_combine</span><span class="p">(</span><span class="n">adler</span><span class="p">,</span> <span class="n">z</span><span class="p">.</span><span class="n">adler</span><span class="p">,</span> <span class="n">z</span><span class="p">.</span><span class="n">total_in</span> <span class="o">-</span> <span class="n">HEADER_SIZE</span><span class="p">);</span>
<span class="n">z</span><span class="p">.</span><span class="n">next_out</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">bufout</span><span class="p">;</span>
<span class="n">z</span><span class="p">.</span><span class="n">avail_out</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">);</span>
<span class="n">deflate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">,</span> <span class="n">Z_FINISH</span><span class="p">);</span>
<span class="n">fwrite</span><span class="p">(</span><span class="n">bufout</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bufout</span><span class="p">)</span> <span class="o">-</span> <span class="n">z</span><span class="p">.</span><span class="n">avail_out</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
<span class="n">deflateEnd</span><span class="p">(</span><span class="o">&amp;</span><span class="n">z</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s a lot more code than I expected, but it wasn’t too hard to work
out. If you want to get into the nitty gritty and <em>really</em> hack a zlib
stream, check out <a href="https://tools.ietf.org/html/rfc1950">RFC 1950</a> and <a href="https://tools.ietf.org/html/rfc1951">RFC 1951</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Inspecting C's qsort Through Animation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/05/"/>
    <id>urn:uuid:7d86c669-ff40-3210-7e28-78b801e35e50</id>
    <updated>2016-09-05T21:17:11Z</updated>
    <category term="c"/><category term="linux"/><category term="media"/><category term="video"/>
    <content type="html">
      <![CDATA[<p>The C standard library includes a qsort() function for sorting
arbitrary buffers given a comparator function. The name comes from its
<a href="https://gallium.inria.fr/~maranget/X/421/09/bentley93engineering.pdf">original Unix implementation, “quicker sort,”</a> a variation of
the well-known quicksort algorithm. The C standard doesn’t specify an
algorithm, except to say that it may be unstable (C99 §7.20.5.2¶4) —
equal elements have an unspecified order. As such, different C
libraries use different algorithms, and even when using the same
algorithm they make different implementation trade-offs.</p>

<p>I added a drawing routine to a comparison function to see what the
sort function was doing for different C libraries. Every time it’s
called for a comparison, it writes out a snapshot of the array as a
Netpbm PPM image. It’s <a href="/blog/2011/11/28/">easy to turn concatenated PPMs into a GIF or
video</a>. Here’s my code if you want to try it yourself:</p>

<ul>
  <li><a href="/download/qsort-animate.c">qsort-animate.c</a></li>
</ul>

<p>Adjust the parameters at the top to taste. Rather than call rand() in
the standard library, I included xorshift64star() with a hard-coded
seed so that the array will be shuffled exactly the same across all
platforms. This makes for a better comparison.</p>
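<p>If you’d rather not download the file, the idea fits in a short sketch. This is not the linked program, just an illustration under my own assumptions: the array length <code class="language-plaintext highlighter-rouge">N</code>, the drawing style (one white pixel per column), and the names <code class="language-plaintext highlighter-rouge">frames</code>, <code class="language-plaintext highlighter-rouge">frame_out</code>, <code class="language-plaintext highlighter-rouge">shuffle</code>, and <code class="language-plaintext highlighter-rouge">draw_frame</code> are all hypothetical. The generator is the usual xorshift64* and needs a non-zero seed.</p>

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N 64  /* hypothetical array length; also the image width and height */

static int array[N];
static long frames;      /* number of comparator calls so far */
static FILE *frame_out;  /* destination for frames; NULL disables drawing */

/* xorshift64*: deterministic across platforms. Seed must be non-zero. */
static uint64_t
xorshift64star(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * UINT64_C(0x2545F4914F6CDD1D);
}

/* Fill the array with 0..N-1, then Fisher-Yates shuffle it. */
static void
shuffle(uint64_t seed)
{
    for (int i = 0; i < N; i++) {
        array[i] = i;
    }
    for (int i = N - 1; i > 0; i--) {
        int j = (int)(xorshift64star(&seed) % (uint64_t)(i + 1));
        int tmp = array[i];
        array[i] = array[j];
        array[j] = tmp;
    }
}

/* Emit the current array as one P6 image: column x has a white
 * pixel at row array[x], everything else is black. */
static void
draw_frame(void)
{
    fprintf(frame_out, "P6\n%d %d\n255\n", N, N);
    for (int y = 0; y < N; y++) {
        for (int x = 0; x < N; x++) {
            unsigned char v = array[x] == y ? 255 : 0;
            unsigned char rgb[3] = {v, v, v};
            fwrite(rgb, 1, sizeof(rgb), frame_out);
        }
    }
}

/* An ordinary int comparator that also records a frame per call. */
static int
compare(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    frames++;
    if (frame_out) {
        draw_frame();
    }
    return (x > y) - (x < y);
}
```

<p>To use it, set <code class="language-plaintext highlighter-rouge">frame_out = stdout</code>, call <code class="language-plaintext highlighter-rouge">shuffle()</code> with some non-zero seed, then call <code class="language-plaintext highlighter-rouge">qsort(array, N, sizeof(array[0]), compare)</code>. The comparator only reads the array, so whatever state the library leaves it in between comparisons is what ends up in the frame.</p>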

<p>To get an optimized GIF on unix-like systems, run it like so.
(Microsoft’s <a href="https://web.archive.org/web/20161126142829/http://radiance-online.org:82/pipermail/radiance-dev/2016-March/001578.html">UCRT currently has serious bugs</a> with pipes, so it
was run differently in that case.)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./a.out | convert -delay 10 ppm:- gif:- | gifsicle -O3 &gt; sort.gif
</code></pre></div></div>

<p>The number of animation frames reflects the efficiency of the sort,
but this isn’t really a benchmark. The input array is fully shuffled,
and real data often is not. For a benchmark, have a look at <a href="http://calmerthanyouare.org/2013/05/31/qsort-shootout.html">a libc
qsort() shootout of sorts</a> instead.</p>

<p>To help you follow along, <strong>clicking on any animation will restart it.</strong></p>

<h3 id="glibc">glibc</h3>

<p><img src="/img/qsort/glibc.gif" alt="" class="resetable" title="glibc" /></p>

<p>Sorted in <strong>307 frames</strong>. glibc prefers to use mergesort, which,
unlike quicksort, isn’t an in-place algorithm, so it has to allocate
memory. That allocation could fail for huge arrays, and, since qsort()
can’t fail, it uses quicksort as a backup. You can really see the
mergesort in action: changes are made that we cannot see until later,
when it’s copied back into the original array.</p>

<h3 id="dietlibc-032">dietlibc (0.32)</h3>

<p>Sorted in <strong>503 frames</strong>. <a href="https://www.fefe.de/dietlibc/">dietlibc</a> is an alternative C
standard library for Linux. It’s optimized for size, which shows
through its slower performance. It looks like a quicksort that always
chooses the last element as the pivot.</p>

<p><img src="/img/qsort/diet.gif" alt="" class="resetable" title="diet" /></p>

<p>Update: Felix von Leitner, the primary author of dietlibc, has alerted
me that, as of version 0.33, it now chooses a random pivot. This
comment from the source describes it:</p>

<blockquote>
  <p>We chose the rightmost element in the array to be sorted as pivot,
which is OK if the data is random, but which is horrible if the data
is already sorted. Try to improve by exchanging it with a random
other pivot.</p>
</blockquote>

<h3 id="musl">musl</h3>

<p>Sorted in <strong>637 frames</strong>. <a href="https://www.musl-libc.org/">musl libc</a> is another alternative C
standard library for Linux. It’s my personal preference when I
statically link Linux binaries. Its qsort() looks a lot like a heapsort,
and with some research I see it’s actually <a href="http://www.keithschwarz.com/smoothsort/">smoothsort</a>, a
heapsort variant.</p>

<p><img src="/img/qsort/musl.gif" alt="" class="resetable" title="musl" /></p>

<h3 id="bsd">BSD</h3>

<p>Sorted in <strong>354 frames</strong>. I ran it on both OpenBSD and FreeBSD with
identical results, so, unsurprisingly, they share an implementation.
It’s quicksort, and what’s neat about it is at the beginning you can
see it searching for a median for use as the pivot. This helps avoid
the O(n^2) worst case.</p>

<p><img src="/img/qsort/bsd-qsort.gif" alt="" class="resetable" title="BSD qsort" /></p>

<p>BSD also includes a mergesort() with the same prototype, except with
an <code class="language-plaintext highlighter-rouge">int</code> return for reporting failures. This one sorted in <strong>247
frames</strong>. Like glibc before, there’s some behind-the-scenes work that isn’t
captured. But even more, notice how the markers disappear during the
merge? It’s running the comparator against copies, stored outside the
original array. Sneaky!</p>

<p><img src="/img/qsort/bsd-mergesort.gif" alt="" class="resetable" title="BSD mergesort" /></p>

<p>Again, BSD also includes heapsort(), so I ran that too. It sorted in
<strong>418 frames</strong>. It definitely looks like a heapsort, and the worse
performance is similar to musl. It seems heapsort is a poor fit for
this data.</p>

<p><img src="/img/qsort/bsd-heapsort.gif" alt="" class="resetable" title="BSD heapsort" /></p>

<h3 id="cygwin">Cygwin</h3>

<p>It turns out Cygwin borrowed its qsort() from BSD. It’s pixel
identical to the above. I hadn’t noticed until I looked at the frame
counts.</p>

<p><img src="/img/qsort/cygwin.gif" alt="" class="resetable" title="Cygwin (BSD)" /></p>

<h3 id="msvcrtdll-mingw-and-ucrt-visual-studio">MSVCRT.DLL (MinGW) and UCRT (Visual Studio)</h3>

<p>MinGW builds against MSVCRT.DLL, found on every Windows system despite
its <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">unofficial status</a>. Until recently Microsoft didn’t
include a C standard library as part of the OS, but that changed with
their <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/vcblog/2015/03/03/introducing-the-universal-crt/">Universal CRT (UCRT) announcement</a>. I thought I’d try
them both.</p>

<p>Turns out they borrowed their old qsort() for the UCRT, and the result
is the same: sorted in <strong>417 frames</strong>. It chooses a pivot from the
median of the ends and the middle, swaps the pivot to the middle, then
partitions. Looking to the middle for the pivot makes sorting
pre-sorted arrays much more efficient.</p>

<p><img src="/img/qsort/ucrt.gif" alt="" class="resetable" title="Microsoft UCRT" /></p>
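<p>Choosing the median of the ends and the middle is easy to sketch. To be clear, this is not Microsoft’s code, only an illustration of the selection step as I understand it from the animation, and the function name <code class="language-plaintext highlighter-rouge">median_of_three</code> is my own.</p>

```c
#include <stddef.h>

/* Among v[lo], v[mid], and v[hi], return the index holding the
 * median value, where mid is halfway between lo and hi. */
static size_t
median_of_three(const int *v, size_t lo, size_t hi)
{
    size_t a = lo;
    size_t b = lo + (hi - lo) / 2;
    size_t c = hi;

    /* Order the first two candidates so that v[a] <= v[b]. */
    if (v[a] > v[b]) {
        size_t t = a;
        a = b;
        b = t;
    }
    /* If the third is the largest, the middle one is the median. */
    if (v[b] <= v[c]) {
        return b;
    }
    /* Otherwise v[b] is the largest: the median is the larger of
     * the remaining two. */
    return v[a] > v[c] ? a : c;
}
```

<p>On an already-sorted array this always returns the middle index, so each partition splits evenly instead of peeling off one element at a time.</p>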

<h3 id="pelles-c">Pelles C</h3>

<p>Finally I ran it against <a href="http://www.smorgasbordet.com/pellesc/">Pelles C</a>, a C compiler for
Windows. It sorted in <strong>463 frames</strong>. I can’t find any information
about it, but it looks like some sort of hybrid between quicksort and
insertion sort. Like BSD qsort(), it finds a good median for the
pivot, partitions the elements, and if a partition is small enough, it
switches to insertion sort. This should behave well on mostly-sorted
arrays, but poorly on well-shuffled arrays (like this one).</p>

<p><img src="/img/qsort/pellesc.gif" alt="" class="resetable" title="Pelles C" /></p>

<h3 id="more-implementations">More Implementations</h3>

<p>That’s everything that was readily accessible to me. If you can run it
against something new, I’m certainly interested in seeing more
implementations.</p>

<script type="text/javascript">
(function() {
    var r = document.querySelectorAll('.resetable');
    for (var i = 0; i < r.length; i++) {
        r[i].onclick = function() {
            var src = this.src;
            var height = this.height;
            this.src = "";
            this.height = height;
            // setTimeout() required for IE
            var _this = this;
            setTimeout(function() { _this.src = src; }, 0);
        };
    }
}());
</script>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to Read and Write Other Process Memory</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/09/03/"/>
    <id>urn:uuid:205f20eb-a47e-3506-fd8f-4b416fc08133</id>
    <updated>2016-09-03T21:53:26Z</updated>
    <category term="win32"/><category term="linux"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>I recently put together a little game memory cheat tool called
<a href="https://github.com/skeeto/memdig">MemDig</a>. It can find the address of a particular game value
(score, lives, gold, etc.) after being given that value at different
points in time. With the address, it can then modify that value to
whatever is desired.</p>

<p>I’ve been using tools like this going back 20 years, but I never tried
to write one myself until now. There are many memory cheat tools to
pick from these days, the most prominent being <a href="http://www.cheatengine.org/">Cheat Engine</a>.
These tools use the platform’s debugging API, so of course any good
debugger could do the same thing, though a debugger won’t be
specialized appropriately (e.g. locating the particular address and
locking its value).</p>

<p>My motivation was bypassing an in-app purchase in a single player
Windows game. I wanted to convince the game I had made the purchase
when, in fact, I hadn’t. Once I had it working successfully, I ported
MemDig to Linux since I thought it would be interesting to compare.
I’ll start with Windows for this article.</p>

<h3 id="windows">Windows</h3>

<p>Only three Win32 functions are needed, and you could almost guess at
how it works.</p>

<ul>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms684320">OpenProcess()</a></li>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms680553">ReadProcessMemory()</a></li>
  <li><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms681674">WriteProcessMemory()</a></li>
</ul>

<p>It’s very straightforward <s>and, for this purpose, is probably the
simplest API for any platform</s> (see update).</p>

<p>As you probably guessed, you first need to open the process, given its
process ID (integer). You’ll need to select the <em>desired access</em> as a
bit set. To read memory, you need the <code class="language-plaintext highlighter-rouge">PROCESS_VM_READ</code> and
<code class="language-plaintext highlighter-rouge">PROCESS_QUERY_INFORMATION</code> rights. To write memory, you need the
<code class="language-plaintext highlighter-rouge">PROCESS_VM_WRITE</code> and <code class="language-plaintext highlighter-rouge">PROCESS_VM_OPERATION</code> rights. Alternatively
you could just ask for all rights with <code class="language-plaintext highlighter-rouge">PROCESS_ALL_ACCESS</code>, but I
prefer to be precise.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">PROCESS_VM_READ</span> <span class="o">|</span>
               <span class="n">PROCESS_QUERY_INFORMATION</span> <span class="o">|</span>
               <span class="n">PROCESS_VM_WRITE</span> <span class="o">|</span>
               <span class="n">PROCESS_VM_OPERATION</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">proc</span> <span class="o">=</span> <span class="n">OpenProcess</span><span class="p">(</span><span class="n">access</span><span class="p">,</span> <span class="n">FALSE</span><span class="p">,</span> <span class="n">pid</span><span class="p">);</span>
</code></pre></div></div>

<p>And then to read or write:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">;</span> <span class="c1">// target process address</span>
<span class="n">SIZE_T</span> <span class="n">written</span><span class="p">;</span>
<span class="n">ReadProcessMemory</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">);</span>
<span class="c1">// or</span>
<span class="n">WriteProcessMemory</span><span class="p">(</span><span class="n">proc</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">);</span>
</code></pre></div></div>

<p>Don’t forget to check the return value and verify <code class="language-plaintext highlighter-rouge">written</code>. Finally,
don’t forget to <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms724211">close it</a> when you’re done.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CloseHandle</span><span class="p">(</span><span class="n">proc</span><span class="p">);</span>
</code></pre></div></div>

<p>That’s all there is to it. For the full cheat tool you’d need to find
the mapped regions of memory, via <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa366907">VirtualQueryEx</a>. It’s not
as simple, but I’ll leave that for another article.</p>

<h3 id="linux">Linux</h3>

<p>Unfortunately there’s no standard, cross-platform debugging API for
unix-like systems. Most have a ptrace() system call, though each works
a little differently. Note that ptrace() is not part of POSIX, but
appeared in System V Release 4 (SVr4) and BSD, and was then copied elsewhere.
The following will all be specific to Linux, though the procedure is
similar on other unix-likes.</p>

<p>In typical Linux fashion, if it involves other processes, you use the
standard file API on the /proc filesystem. Each process has a
directory under /proc named as its process ID. In this directory is a
virtual file called “mem”, which is a file view of that process’
entire address space, including unmapped regions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">file</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s">"/proc/%ld/mem"</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">pid</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">O_RDWR</span><span class="p">);</span>
</code></pre></div></div>

<p>The catch is that while you can open this file, you can’t actually
read or write on that file without attaching to the process as a
debugger. You’ll just get EIO errors. To attach, use ptrace() with
<code class="language-plaintext highlighter-rouge">PTRACE_ATTACH</code>. This asynchronously delivers a <code class="language-plaintext highlighter-rouge">SIGSTOP</code> signal to
the target, which has to be waited on with waitpid().</p>

<p>You could select the target address with <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/lseek.html">lseek()</a>, but it’s
cleaner and more efficient just to do it all in one system call with
<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html">pread()</a> and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html">pwrite()</a>. I’ve left out the error
checking, but the return value of each function should be checked:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_ATTACH</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">waitpid</span><span class="p">(</span><span class="n">pid</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="kt">off_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">...;</span> <span class="c1">// target process address</span>
<span class="n">pread</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">addr</span><span class="p">);</span>
<span class="c1">// or</span>
<span class="n">pwrite</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">value</span><span class="p">),</span> <span class="n">addr</span><span class="p">);</span>

<span class="n">ptrace</span><span class="p">(</span><span class="n">PTRACE_DETACH</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>The process will (and must) be stopped during this procedure, so do
your reads/writes quickly and get out. The kernel will deliver the
writes to the other process’ virtual memory.</p>

<p>Like before, don’t forget to close.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>To find the mapped regions in the real cheat tool, you would read and
parse the virtual text file /proc/<em>pid</em>/maps. I don’t know if I’d call
this stringly-typed method elegant — the kernel converts the data into
string form and the caller immediately converts it right back — but
that’s the official API.</p>
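<p>As a rough sketch of that parsing step (not the code from the actual
tool), the following reads each line of the maps file with sscanf(3) and
tallies the writable regions. The helper name <code class="language-plaintext highlighter-rouge">count_writable_regions</code> and
its out-parameter are my own inventions for illustration, and it only
looks at the first three fields of each line.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>  // getpid(), for the usage described below

// Hypothetical helper: count the writable mapped regions of a process
// by parsing /proc/<pid>/maps, storing their combined size in
// *total_bytes. Returns -1 if the maps file cannot be opened.
static long
count_writable_regions(pid_t pid, unsigned long *total_bytes)
{
    assert(total_bytes != NULL);

    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/maps", (long)pid);
    FILE *maps = fopen(path, "r");
    if (!maps)
        return -1;

    long count = 0;
    *total_bytes = 0;
    char line[4096];
    while (fgets(line, sizeof(line), maps)) {
        unsigned long beg, end;
        char perms[5];
        // Each line begins: "start-end perms offset dev inode [path]"
        if (sscanf(line, "%lx-%lx %4s", &beg, &end, perms) == 3) {
            if (perms[1] == 'w') {
                count++;  // [beg, end) is a candidate for scanning
                *total_bytes += end - beg;
            }
        }
    }
    fclose(maps);
    return count;
}
```

<p>Calling it with getpid() inspects the current process. A scanner
would record each <code class="language-plaintext highlighter-rouge">[beg, end)</code> range rather than just counting them.</p>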

<p>Update: Konstantin Khlebnikov has pointed out the
<a href="http://man7.org/linux/man-pages/man2/process_vm_readv.2.html">process_vm_readv()</a> and <a href="http://man7.org/linux/man-pages/man2/process_vm_writev.2.html">process_vm_writev()</a>
system calls, available since Linux 3.2 (January 2012) and glibc 2.15
(March 2012). These system calls do not require ptrace(), nor does the
remote process need to be stopped. They’re equivalent to
ReadProcessMemory() and WriteProcessMemory(), except there’s no
requirement to first “open” the process.</p>
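<p>Here’s a minimal sketch of a read using process_vm_readv(), assuming
Linux 3.2 or later and glibc 2.15 or later. The <code class="language-plaintext highlighter-rouge">peek</code> helper is a
hypothetical name of mine. Reading another process still requires the
same permissions as attaching with ptrace().</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper: copy len bytes from address addr in process pid
// into buf. Returns 0 on success, -1 on error or a short read.
static int
peek(pid_t pid, void *buf, size_t len, unsigned long addr)
{
    assert(buf != NULL);
    struct iovec local  = {.iov_base = buf,          .iov_len = len};
    struct iovec remote = {.iov_base = (void *)addr, .iov_len = len};
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    return n == (ssize_t)len ? 0 : -1;
}
```

<p>A write is the mirror image using process_vm_writev() with the same
argument layout.</p>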

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Automatic Deletion of Incomplete Output Files</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/08/07/"/>
    <id>urn:uuid:431fafe9-6630-363e-4596-85eb3a289ec2</id>
    <updated>2016-08-07T02:00:37Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Conventionally, a program that creates an output file will delete its
incomplete output should an error occur while writing the file. It’s
risky to leave behind a file that the user may rightfully confuse for
a valid file. They might not have noticed the error.</p>

<p>For example, compression programs such as gzip, bzip2, and xz, when
given a compressed file as an argument, will create a new file with the
compression extension removed. They write to this file as the
compressed input is being processed. If the compressed stream contains
an error in the middle, the partially-completed output is removed.</p>

<p>There are exceptions of course, such as programs that download files
over a network. The partial result has value, especially if the
transfer can be <a href="https://tools.ietf.org/html/rfc7233">continued from where it left off</a>. The
convention is to append another extension, such as “.part”, to
indicate a partial output.</p>

<p>The straightforward solution is to always delete the file as part of
error handling. A non-interactive program would report the error on
standard error, delete the file, and exit with an error code. However,
there are at least two situations where error handling would be unable
to operate: unhandled signals (usually including a segmentation fault)
and power failures. A partial or corrupted output file will be left
behind, possibly looking like a valid file.</p>

<p>A common, more complex approach is to name the file differently from
its final name while being written. If written successfully, the
completed file is renamed into place. This is already <a href="http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/">required for
durable replacement</a>, so it’s basically free for many
applications. In the worst case, where the program is unable to clean
up, the obviously incomplete file is left behind only wasting space.</p>
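<p>A minimal sketch of that tried-and-true approach on POSIX might look
like the following. The helper name <code class="language-plaintext highlighter-rouge">write_then_rename</code> and its
parameters are hypothetical. It treats any write error (including an
interrupted call) as a failure, and it doesn’t fsync the parent
directory, which full durability would also need.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Hypothetical helper: write len bytes to the path tmp, then rename it
// to final only if every step succeeded. Returns 0 on success, or -1
// on error after a best-effort removal of the temporary file.
static int
write_then_rename(const char *tmp, const char *final,
                  const void *data, size_t len)
{
    assert(tmp && final && data);
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd == -1)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    // Flush to storage before the rename makes the file visible.
    int err = fsync(fd) == -1;
    err |= close(fd) == -1;
    if (err || rename(tmp, final) == -1) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

<p>If the program dies before the rename, only the temporary name is
left behind, so nobody mistakes it for the finished file.</p>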

<p>Looking to be more robust, I had the following misguided idea: <strong>Rely
completely on the operating system to perform cleanup in the case of a
failure.</strong> Initially the file would be configured to be automatically
deleted when the final handle is closed. This takes care of all
abnormal exits, and possibly even power failures. The program can just
exit on error without deleting the file. Once written successfully,
the automatic-delete indicator is cleared so that the file survives.</p>

<p>The target application for this technique supports both Linux and
Windows, so I would need to figure it out for both systems. On
Windows, there’s the flag <code class="language-plaintext highlighter-rouge">FILE_FLAG_DELETE_ON_CLOSE</code>. I’d just need
to find a way to clear it. On POSIX, the file would be unlinked while
being written, and linked into the filesystem on success. The latter
turns out to be a lot harder than I expected.</p>

<h3 id="solution-for-windows">Solution for Windows</h3>

<p>I’ll start with Windows since the technique actually works fairly well
here — ignoring the usual, dumb Win32 filesystem caveats. This is a
little surprising, since it’s usually Win32 that makes these things
far more difficult than they should be.</p>

<p>The primary Win32 function for opening and creating files is
<a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx">CreateFile</a>. There are many options, but the key is
<code class="language-plaintext highlighter-rouge">FILE_FLAG_DELETE_ON_CLOSE</code>. Here’s how an application might typically
open a file for output.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">GENERIC_WRITE</span><span class="p">;</span>
<span class="n">DWORD</span> <span class="n">create</span> <span class="o">=</span> <span class="n">CREATE_ALWAYS</span><span class="p">;</span>
<span class="n">DWORD</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">FILE_FLAG_DELETE_ON_CLOSE</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">f</span> <span class="o">=</span> <span class="n">CreateFile</span><span class="p">(</span><span class="s">"out.tmp"</span><span class="p">,</span> <span class="n">access</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">create</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>This special flag asks Windows to delete the file as soon as the last
handle to the <em>file object</em> is closed. Notice I said file object, not
file, since <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20160108-00/?p=92821">these are different things</a>. The catch: This flag
is a property of the file object, not the file, and cannot be removed.</p>

<p>However, the solution is simple. Create a new link to the file so that
it survives deletion. This even works for files residing on network
shares.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">CreateHardLink</span><span class="p">(</span><span class="s">"out"</span><span class="p">,</span> <span class="s">"out.tmp"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">CloseHandle</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>  <span class="c1">// deletes out.tmp file</span>
</code></pre></div></div>

<p>The gotcha is that the underlying filesystem must be NTFS. FAT32
doesn’t support hard links. Unfortunately, since FAT32 remains the
least common denominator and is still widely used for removable media,
depending on the application, your users may expect support for saving
files to FAT32. A workaround is probably required.</p>

<h3 id="solution-for-linux">Solution for Linux</h3>

<p>This is where things really fall apart. It’s just <em>barely</em> possible on
Linux, it’s messy, and it’s not portable anywhere else. There’s no way
to do this for POSIX in general.</p>

<p>My initial thought was to create a file then unlink it. Unlike the
situation on Windows, files can be unlinked while they’re currently
open by a process. These files are finally deleted when the last file
descriptor (the last reference) is closed. Unfortunately, using
unlink(2) to remove the last link to a file prevents that file from
being linked again.</p>

<p>Instead, the solution is to use the relatively new (since Linux 3.11),
Linux-specific <code class="language-plaintext highlighter-rouge">O_TMPFILE</code> flag when creating the file. Instead of a
filename, this variation of open(2) takes a directory and creates an
unnamed, temporary file in it. These files are special in that they’re
permitted to be given a name in the filesystem at some future point.</p>

<p>For this example, I’ll assume the output is relative to the current
working directory. If it’s not, you’ll need to open an additional file
descriptor for the parent directory, and also use openat(2) to avoid
possible race conditions (since paths can change from under you). The
number of ways this can fail is already rapidly multiplying.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">open</span><span class="p">(</span><span class="s">"."</span><span class="p">,</span> <span class="n">O_TMPFILE</span><span class="o">|</span><span class="n">O_WRONLY</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
</code></pre></div></div>

<p>The catch is that only a handful of filesystems support <code class="language-plaintext highlighter-rouge">O_TMPFILE</code>.
It’s like the FAT32 problem above, but worse. You could easily end up
on a filesystem where it’s not supported, so you will almost certainly
need a workaround.</p>

<p>Linking a file from a file descriptor is where things get messier. The
file descriptor must be linked with linkat(2) from its name on the
/proc virtual filesystem, constructed as a string. The following
snippet comes straight from the Linux open(2) manpage.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"/proc/self/fd/%d"</span><span class="p">,</span> <span class="n">fd</span><span class="p">);</span>
<span class="n">linkat</span><span class="p">(</span><span class="n">AT_FDCWD</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">AT_FDCWD</span><span class="p">,</span> <span class="s">"out"</span><span class="p">,</span> <span class="n">AT_SYMLINK_FOLLOW</span><span class="p">);</span>
</code></pre></div></div>

<p>Even on Linux, /proc isn’t always available, such as within a chroot
or a container, so this part can fail as well. In theory there’s a way
to do this with the Linux-specific <code class="language-plaintext highlighter-rouge">AT_EMPTY_PATH</code> and avoid /proc,
but I couldn’t get it to work.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Note: this doesn't actually work for me.</span>
<span class="n">linkat</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">AT_FDCWD</span><span class="p">,</span> <span class="s">"out"</span><span class="p">,</span> <span class="n">AT_EMPTY_PATH</span><span class="p">);</span>
</code></pre></div></div>

<p>Given the poor portability (even within Linux), the number of ways
this can go wrong, and that a workaround is definitely needed anyway,
I’d say this technique is worthless. I’m going to stick with the
tried-and-true approach for this one.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Appending to a File from Multiple Processes</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/08/03/"/>
    <id>urn:uuid:93473b6d-3be3-3d0c-d7d5-6ad485c1e9a0</id>
    <updated>2016-08-03T16:17:44Z</updated>
    <category term="c"/><category term="linux"/><category term="posix"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p>Suppose you have multiple processes appending output to the same file
without explicit synchronization. These processes might be working in
parallel on different parts of the same problem, or these might be
threads blocked individually reading different external inputs. There
are two concerns that come into play:</p>

<p>1) <strong>The append must be atomic</strong> such that it doesn’t clobber previous
    appends by other threads and processes. For example, suppose a
    write requires two separate operations: first moving the file
    pointer to the end of the file, then performing the write. There
    would be a race condition should another process or thread
    intervene in between with its own write.</p>

<p>2) <strong>The output will be interleaved.</strong> The primary solution is to
   design the data format as atomic records, where the ordering of
   records is unimportant — like rows in a relational database. This
   could be as simple as a text file with each line as a record. The
   concern is then ensuring records are written atomically.</p>

<p>This article discusses processes, but the same applies to threads when
directly dealing with file descriptors.</p>

<h3 id="appending">Appending</h3>

<p>The first concern is solved by the operating system, with one caveat.
On POSIX systems, opening a file with the <code class="language-plaintext highlighter-rouge">O_APPEND</code> flag will
guarantee that <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html">writes always safely append</a>.</p>

<blockquote>
  <p>If the <code class="language-plaintext highlighter-rouge">O_APPEND</code> flag of the file status flags is set, the file
offset shall be set to the end of the file prior to each write and
no intervening file modification operation shall occur between
changing the file offset and the write operation.</p>
</blockquote>

<p>However, this says nothing about interleaving. <strong>Two processes
successfully appending to the same file will result in all their bytes
in the file in order, but not necessarily contiguously.</strong></p>
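<p>To make this concrete, here’s a small sketch that appends one record
with a single write(2) on an <code class="language-plaintext highlighter-rouge">O_APPEND</code> descriptor. The name
<code class="language-plaintext highlighter-rouge">append_record</code> is hypothetical, and opening the file on every call is
only for brevity. A real program would keep the descriptor open.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper: append one record using a single write(2).
// Returns 0 only if that one call wrote the entire record.
static int
append_record(const char *path, const char *record)
{
    assert(path && record);
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
    if (fd == -1)
        return -1;
    size_t len = strlen(record);
    ssize_t n = write(fd, record, len);  // offset moves to EOF first
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```

<p>The flag guarantees where the write lands. It says nothing about the
write not being split up, which is the second concern.</p>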

<p>The caveat is that not all filesystems are POSIX-compatible. Two
famous examples are NFS and the Hadoop Distributed File System (HDFS).
On these networked filesystems, appends are simulated and subject to
race conditions.</p>

<p>On POSIX systems, fopen(3) with the <code class="language-plaintext highlighter-rouge">a</code> flag <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/fopen.html">will use
<code class="language-plaintext highlighter-rouge">O_APPEND</code></a>, so you don’t necessarily need to use open(2). On
Linux this can be verified for any language’s standard library with
strace.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">fopen</span><span class="p">(</span><span class="s">"/dev/null"</span><span class="p">,</span> <span class="s">"a"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And the result of the trace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace -e open ./a.out
open("/dev/null", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
</code></pre></div></div>

<p>For Win32, the equivalent is the <code class="language-plaintext highlighter-rouge">FILE_APPEND_DATA</code> access right, and
similarly <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/gg258116(v=vs.85).aspx">only applies to “local files.”</a></p>

<h3 id="interleaving-and-pipes">Interleaving and Pipes</h3>

<p>The interleaving problem has two layers, and gets more complicated the
more correct you want to be. Let’s start with pipes.</p>

<p>On POSIX, a pipe is unseekable and doesn’t have a file position, so
appends are the only kind of write possible. When writing to a pipe
(or FIFO), writes less than the system-defined <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> are
guaranteed to be atomic and non-interleaving.</p>

<blockquote>
  <p>Write requests of <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes or less shall not be interleaved
with data from other processes doing writes on the same pipe. Writes
of greater than <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes may have data interleaved, on
arbitrary boundaries, with writes by other processes, […]</p>
</blockquote>

<p>The minimum value for <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> for POSIX systems is 512 bytes. On
Linux it’s 4kB, and on other systems <a href="http://ar.to/notes/posix">it’s as high as 32kB</a>.
As long as each record is less than 512 bytes, a simple write(2) will
do. None of this depends on a filesystem since no files are involved.</p>
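<p>As a sketch of how a program might enforce this limit itself, the
following asks the system for the pipe’s actual <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> via
fpathconf(3) and refuses records that are too large. The helper name
<code class="language-plaintext highlighter-rouge">write_record</code> and the choice of <code class="language-plaintext highlighter-rouge">EMSGSIZE</code> for the error are my own
assumptions.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <unistd.h>

// Hypothetical helper: write a record to a pipe only if it is small
// enough for POSIX to guarantee it will not be interleaved.
// Returns 0 on success, -1 on error (errno is EMSGSIZE if too large).
static int
write_record(int pipefd, const void *record, size_t len)
{
    assert(record != NULL);
    long limit = fpathconf(pipefd, _PC_PIPE_BUF);
    if (limit == -1)
        limit = _POSIX_PIPE_BUF;  // guaranteed minimum of 512
    if (len > (size_t)limit) {
        errno = EMSGSIZE;
        return -1;
    }
    ssize_t n = write(pipefd, record, len);
    return n == (ssize_t)len ? 0 : -1;
}
```

<p>A strictly portable program could skip the query and simply compare
against 512.</p>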

<p>If <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes isn’t enough, the POSIX writev(2) can be
used to <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/writev.html">atomically write up to <code class="language-plaintext highlighter-rouge">IOV_MAX</code> buffers</a> of
<code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes. The minimum value for <code class="language-plaintext highlighter-rouge">IOV_MAX</code> is 16, but is
typically 1024. This means the maximum safe atomic write size for
pipes — and therefore the largest record size — for a perfectly
portable program is 8kB (16✕512). On Linux it’s 4MB.</p>

<p>That’s all at the system call level. There’s another layer to contend
with: buffered I/O in your language’s standard library. Your program
may pass data in appropriately-sized pieces for atomic writes to the
I/O library, but it may be undoing your hard work, concatenating all
these writes into a buffer, splitting apart your records. For this
part of the article, I’ll focus on single-threaded C programs.</p>

<p>Suppose you’re writing a simple space-separated format with one line
per record.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">baz</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">condition</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%d %d %f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">,</span> <span class="n">baz</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Whether or not this works depends on how <code class="language-plaintext highlighter-rouge">stdout</code> is buffered. C
standard library streams (<code class="language-plaintext highlighter-rouge">FILE *</code>) have three buffering modes:
unbuffered, line buffered, and fully buffered. Buffering is configured
through setbuf(3) and setvbuf(3), and the initial buffering state of a
stream depends on various factors. For buffered streams, the default
buffer is at least <code class="language-plaintext highlighter-rouge">BUFSIZ</code> bytes, itself at least 256 (C99
§7.19.2¶7). Note: threads share this buffer.</p>

<p>Since each record in the above program easily fits inside 256 bytes,
if stdout is a line buffered pipe then this program will interleave
correctly on any POSIX system without further changes.</p>

<p>If instead your output is comma-separated values (CSV) and <a href="https://tools.ietf.org/html/rfc4180">your
records may contain new line characters</a>, there are two
approaches. In each, the record must still be no larger than
<code class="language-plaintext highlighter-rouge">PIPE_BUF</code> bytes.</p>

<ul>
  <li>
    <p>Unbuffered pipe: construct the record in a buffer (e.g. with sprintf(3))
and output the entire buffer in a single fwrite(3). While I believe
this will always work in practice, it’s not guaranteed by the C
specification, which defines fwrite(3) as a series of fputc(3) calls
(C99 §7.19.8.2¶2).</p>
  </li>
  <li>
    <p>Fully buffered pipe: set a sufficiently large stream buffer and
follow each record with a fflush(3). Unlike fwrite(3) on an
unbuffered stream, the specification says the buffer will be
“transmitted to the host environment as a block” (C99 §7.19.3¶3),
so this should be perfectly correct on any POSIX system.</p>
  </li>
</ul>
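<p>The fully buffered approach might be sketched like this. The
<code class="language-plaintext highlighter-rouge">emit_record</code> helper and its two-field record are hypothetical. The
stream is assumed to have been given a buffer larger than any record
with setvbuf(3) before any other operation on it.</p>

```c
#include <assert.h>
#include <stdio.h>

// Hypothetical helper: emit one CSV record, whose quoted field may
// contain newline characters, then flush so the stream's buffer is
// handed to the system as one block. Assumes the stream was set up
// earlier with something like:
//     static char iobuf[4096];
//     setvbuf(out, iobuf, _IOFBF, sizeof(iobuf));
// This sketch does not escape quotes inside name.
static int
emit_record(FILE *out, const char *name, int value)
{
    assert(out && name);
    if (fprintf(out, "\"%s\",%d\n", name, value) < 0)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

<p>The buffer size is the thing to get right. If a record is larger than
the stream buffer, the library has to flush mid-record and the record
may be split.</p>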

<p>If your situation is more complicated than this, you’ll probably have
to bypass your standard library buffered I/O and call write(2) or
writev(2) yourself.</p>

<h4 id="practical-application">Practical Application</h4>

<p>If interleaving writes to a pipe stdout sounds contrived, here’s the
real life scenario: GNU xargs with its <code class="language-plaintext highlighter-rouge">--max-procs</code> (<code class="language-plaintext highlighter-rouge">-P</code>) option to
process inputs in parallel.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xargs -n1 -P$(nproc) myprogram &lt; inputs.txt | cat &gt; outputs.csv
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">| cat</code> ensures the output of each <code class="language-plaintext highlighter-rouge">myprogram</code> process is
connected to the same pipe rather than to the same file.</p>

<p>A non-portable alternative to <code class="language-plaintext highlighter-rouge">| cat</code>, especially if you’re
dispatching processes and threads yourself, is the splice(2) system
call on Linux. It efficiently moves the output from the pipe to the
output file without an intermediate copy to userspace. GNU Coreutils’
cat doesn’t use this.</p>
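<p>A rough, Linux-only sketch of the splice(2) alternative follows. The
<code class="language-plaintext highlighter-rouge">drain_pipe</code> name and the 64kB chunk size are arbitrary choices of
mine. Note that splice(2) fails with <code class="language-plaintext highlighter-rouge">EINVAL</code> if the output file was
opened with <code class="language-plaintext highlighter-rouge">O_APPEND</code>, and some filesystems don’t support it at all.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: move everything from a pipe into an output file
// with splice(2) until all writers have closed their end. Returns the
// total bytes moved, or -1 on error. out_fd must not use O_APPEND.
static long long
drain_pipe(int pipe_rd, int out_fd)
{
    assert(pipe_rd >= 0 && out_fd >= 0);
    long long total = 0;
    for (;;) {
        ssize_t n = splice(pipe_rd, NULL, out_fd, NULL, 1 << 16, 0);
        if (n == 0)
            return total;  // every write end is closed
        if (n < 0)
            return -1;
        total += n;
    }
}
```

<p>The parent would create the pipe, hand the write end to each child as
its standard output, close its own copy of the write end, then call
this on the read end.</p>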

<h4 id="win32-pipes">Win32 Pipes</h4>

<p>On Win32, <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365152(v=vs.85).aspx">anonymous pipes</a> have no semantics regarding
interleaving. <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365150(v=vs.85).aspx">Named pipes</a> have per-client buffers that
prevent interleaving. However, the pipe buffer size is unspecified,
and requesting a particular size is only advisory, so it comes down to
trial and error, though the unstated limits should be comparatively
generous.</p>

<h3 id="interleaving-and-files">Interleaving and Files</h3>

<p>Suppose instead of a pipe we have an <code class="language-plaintext highlighter-rouge">O_APPEND</code> file on POSIX. Common
wisdom states that the same <code class="language-plaintext highlighter-rouge">PIPE_BUF</code> atomic write rule applies.
While this often works, especially on Linux, this is not correct. The
POSIX specification doesn’t require it and <a href="http://www.notthewizard.com/2014/06/17/are-files-appends-really-atomic/">there are systems where it
doesn’t work</a>.</p>

<p>If you know the particular limits of your operating system <em>and</em>
filesystem, and you don’t care much about portability, then maybe you
can get away with interleaving appends. For full portability, pipes
are required.</p>

<p>On Win32, writes on local files up to the underlying drive’s sector
size (typically 512 bytes to 4kB) are atomic. Otherwise the only
options are deprecated Transactional NTFS (TxF), or manually
synchronizing your writes. All in all, it’s going to take more work to
get correct.</p>

<h3 id="conclusion">Conclusion</h3>

<p>My true use case for mucking around with clean, atomic appends is to
compute giant CSV tables in parallel, with the intention of later
loading into a SQL database (e.g. SQLite) for analysis. A more robust
and traditional approach would be to write results directly into the
database as they’re computed. But I like the platform-neutral
intermediate CSV files — good for archival and sharing — and the
simplicity of programs generating the data — concerned only with
atomic write semantics rather than calling into a particular SQL
database API.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Const and Optimization in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/07/25/"/>
    <id>urn:uuid:f785bc3b-dd3d-3952-2696-91eafe6b019d</id>
    <updated>2016-07-25T02:06:04Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Today there was a <a href="https://redd.it/4udqwj">question on /r/C_Programming</a> about the
effect of C’s <code class="language-plaintext highlighter-rouge">const</code> on optimization. Variations of this question
have been asked many times over the past two decades. Personally, I
blame the naming of <code class="language-plaintext highlighter-rouge">const</code>.</p>

<p>Given this program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">int</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">foo</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">);</span>
        <span class="n">y</span> <span class="o">+=</span> <span class="n">x</span><span class="p">;</span>  <span class="c1">// this load not optimized out</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function <code class="language-plaintext highlighter-rouge">foo</code> takes a pointer to const, which is a promise from
the author of <code class="language-plaintext highlighter-rouge">foo</code> that it won’t modify the value of <code class="language-plaintext highlighter-rouge">x</code>. Given this
information, it would seem the compiler may assume <code class="language-plaintext highlighter-rouge">x</code> is always zero,
and therefore <code class="language-plaintext highlighter-rouge">y</code> is always zero.</p>

<p>However, inspecting the assembly output of several different compilers
shows that <code class="language-plaintext highlighter-rouge">x</code> is loaded each time around the loop. Here’s gcc 4.9.2
at -O3, with annotations, for x86-64:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbp</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">xor</span>    <span class="nb">ebp</span><span class="p">,</span> <span class="nb">ebp</span>              <span class="c1">; y = 0</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>              <span class="c1">; loop variable i</span>
     <span class="nf">sub</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x18</span>             <span class="c1">; allocate x</span>
     <span class="nf">mov</span>    <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">],</span> <span class="mi">0</span>    <span class="c1">; x = 0</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>        <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">add</span>    <span class="nb">ebp</span><span class="p">,</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>  <span class="c1">; y += x  (not optimized?)</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">add</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x18</span>             <span class="c1">; deallocate x</span>
     <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebp</span>              <span class="c1">; return y</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">pop</span>    <span class="nb">rbp</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>The output of clang 3.5 (with -fno-unroll-loops) is the same, except
ebp and ebx are swapped, and the computation of <code class="language-plaintext highlighter-rouge">&amp;x</code> is hoisted out of
the loop, into <code class="language-plaintext highlighter-rouge">r14</code>.</p>

<p>Are both compilers failing to take advantage of this useful
information? Wouldn’t it be undefined behavior for <code class="language-plaintext highlighter-rouge">foo</code> to modify
<code class="language-plaintext highlighter-rouge">x</code>? Surprisingly, the answer is no. <em>In this situation</em>, this would
be a perfectly legal definition of <code class="language-plaintext highlighter-rouge">foo</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">foo</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="n">readonly_x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">readonly_x</span><span class="p">;</span>  <span class="c1">// cast away const</span>
    <span class="p">(</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key thing to remember is that <a href="http://yarchive.net/comp/const.html"><strong><code class="language-plaintext highlighter-rouge">const</code> doesn’t mean
constant</strong></a>. Chalk it up as a misnomer. It’s not an
optimization tool. It’s there to inform programmers — not the compiler
— as a tool to catch a certain class of mistakes at compile time. I
like it in APIs because it communicates how a function will use
certain arguments, or how the caller is expected to handle returned
pointers. It’s usually not strong enough for the compiler to change
its behavior.</p>

<p>Despite what I just said, occasionally the compiler <em>can</em> take
advantage of <code class="language-plaintext highlighter-rouge">const</code> for optimization. The C99 specification, in
§6.7.3¶5, has one sentence just for this:</p>

<blockquote>
  <p>If an attempt is made to modify an object defined with a
const-qualified type through use of an lvalue with
non-const-qualified type, the behavior is undefined.</p>
</blockquote>

<p>The original <code class="language-plaintext highlighter-rouge">x</code> wasn’t const-qualified, so this rule didn’t apply.
And there aren’t any rules against casting away <code class="language-plaintext highlighter-rouge">const</code> to modify an
object that isn’t itself <code class="language-plaintext highlighter-rouge">const</code>. This means the above (mis)behavior
of <code class="language-plaintext highlighter-rouge">foo</code> isn’t undefined behavior <em>for this call</em>. Notice how the
undefined-ness of <code class="language-plaintext highlighter-rouge">foo</code> depends on how it was called.</p>
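<p>To make that concrete, here’s a minimal sketch that puts the
misbehaving <code class="language-plaintext highlighter-rouge">foo</code> together with a <code class="language-plaintext highlighter-rouge">bar</code> whose <code class="language-plaintext highlighter-rouge">x</code> is not
const-qualified. The body of <code class="language-plaintext highlighter-rouge">bar</code> is my assumption of the shape
implied by the assembly listing (ten calls, accumulating into <code class="language-plaintext highlighter-rouge">y</code>).
Since the write through the cast pointer is well defined here, the
compiler has to honor it.</p>

```c
/* Same (mis)behaving foo as above: casts away const and increments. */
void
foo(const int *readonly_x)
{
    int *x = (int *)readonly_x;
    (*x)++;
}

/* x is defined WITHOUT const, so foo's write is well defined here. */
int
bar(void)
{
    int x = 0;
    int y = 0;
    for (int i = 0; i < 10; i++) {
        foo(&x);
        y += x;  /* x must be observed after each call */
    }
    return y;
}
```

<p>With this definition of <code class="language-plaintext highlighter-rouge">foo</code>, the loop adds 1 through 10, so
<code class="language-plaintext highlighter-rouge">bar</code> returns 55 rather than zero.</p>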

<p>With one tiny tweak to <code class="language-plaintext highlighter-rouge">bar</code>, I can make this rule apply, allowing the
optimizer to do some work on it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">const</span> <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>The compiler may now assume that <strong><code class="language-plaintext highlighter-rouge">foo</code> modifying <code class="language-plaintext highlighter-rouge">x</code> is undefined
behavior, therefore <em>it never happens</em></strong>. For better or worse, this is
a major part of how a C optimizer reasons about your programs. The
compiler is free to assume <code class="language-plaintext highlighter-rouge">x</code> never changes, allowing it to optimize
out both the per-iteration load and <code class="language-plaintext highlighter-rouge">y</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>            <span class="c1">; loop variable i</span>
     <span class="nf">sub</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x10</span>           <span class="c1">; allocate x</span>
     <span class="nf">mov</span>    <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">],</span> <span class="mi">0</span>  <span class="c1">; x = 0</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>      <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">add</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x10</span>           <span class="c1">; deallocate x</span>
     <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>            <span class="c1">; return 0</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>The load disappears, <code class="language-plaintext highlighter-rouge">y</code> is gone, and the function always returns
zero.</p>

<p>Curiously, the specification <em>almost</em> allows the compiler to go
further. Consider what would happen if <code class="language-plaintext highlighter-rouge">x</code> were allocated somewhere
off the stack in read-only memory. That transformation would look like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">__x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">foo</span><span class="p">(</span><span class="o">&amp;</span><span class="n">__x</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We would see a few more instructions shaved off (<a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">-fPIC, small code
model</a>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">section</span> <span class="nv">.rodata</span>
<span class="nl">x:</span>   <span class="kd">dd</span>     <span class="mi">0</span>

<span class="nf">section</span> <span class="nv">.text</span>
<span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>        <span class="c1">; loop variable i</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">x</span><span class="p">]</span>    <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>        <span class="c1">; return 0</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>Because the address of <code class="language-plaintext highlighter-rouge">x</code> is taken and “leaked,” this last transform
is not permitted. If <code class="language-plaintext highlighter-rouge">bar</code> is called recursively such that a second
address is taken for <code class="language-plaintext highlighter-rouge">x</code>, that second pointer would compare equal
(<code class="language-plaintext highlighter-rouge">==</code>) to the first pointer despite pointing to a semantically distinct
object, which is forbidden (§6.5.9¶6).</p>
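<p>The distinct-address requirement can be observed directly. This is
only a sketch, and <code class="language-plaintext highlighter-rouge">addresses_differ</code> is a made-up name for
illustration: it leaks the address of its own <code class="language-plaintext highlighter-rouge">const</code> local to a
nested invocation, which compares it against its own copy.</p>

```c
/* Returns 1 if the caller's x and this invocation's x have different
 * addresses. Pass a null pointer to start; the function recurses once. */
int
addresses_differ(const int *outer_x)
{
    const int x = 0;
    if (!outer_x) {
        return addresses_differ(&x);  /* leak &x to a nested invocation */
    }
    return outer_x != &x;  /* two live objects: must compare unequal */
}
```

<p>Both objects are alive at the time of the comparison, so a conforming
implementation must return 1. Merging every <code class="language-plaintext highlighter-rouge">x</code> into a single
read-only static would make it return 0.</p>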

<p>Even with this special <code class="language-plaintext highlighter-rouge">const</code> rule, stick to using <code class="language-plaintext highlighter-rouge">const</code> for
yourself and for your fellow human programmers. Let the optimizer
reason for itself about what is constant and what is not.</p>

<p>Travis Downs nicely summed up this article in the comments:</p>

<blockquote>
  <p>In general, <code class="language-plaintext highlighter-rouge">const</code> <em>declarations</em> can’t help the optimizer, but
<code class="language-plaintext highlighter-rouge">const</code> <em>definitions</em> can.</p>
</blockquote>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Four Ways to Compile C for Windows</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/06/13/"/>
    <id>urn:uuid:1e99288c-0500-36f5-9fe7-262e6c6287c4</id>
    <updated>2016-06-13T04:13:25Z</updated>
    <category term="c"/><category term="cpp"/><category term="win32"/>
    <content type="html">
      <![CDATA[<p><em>Update 2020: If you’re on Windows, just use <a href="https://github.com/skeeto/w64devkit"><strong>w64devkit</strong></a>.
It’s <a href="/blog/2020/05/15/">my own toolchain distribution</a>, and it’s the best option
available. <a href="/blog/2020/09/25/">Everything you need</a> is in one package.</em></p>

<p>I primarily work on and develop for unix-like operating systems —
Linux in particular. However, when it comes to desktop applications,
most potential users are on Windows. Rather than develop on Windows,
which I’d rather avoid, I’ll continue developing, testing, and
debugging on Linux while keeping portability in mind. Unfortunately
every option I’ve found for building Windows C programs has some
significant limitations. These limitations inform my approach to
portability and restrict the C language features used by the program
for all platforms.</p>

<p>As of this writing I’ve identified four different practical ways to
build C applications for Windows. This information will definitely
become further and further out of date as this article ages, so if
you’re visiting from the future take a moment to look at the date.
Except for LLVM shaking things up recently, development tooling on
unix-like systems has had the same basic form for the past 15 years
(i.e. dominated by GCC). While Visual C++ has been around for more
than two decades, the tooling on Windows has seen more churn by
comparison.</p>

<p>Before I get into the specifics, let me point out a glaring problem
common to all four: Unicode arguments and filenames. Microsoft jumped
the gun and adopted UTF-16 early. UTF-16 is a kludge, a worst of all
worlds, being a variable length encoding (surrogate pairs), backwards
incompatible (<a href="http://utf8everywhere.org/">unlike UTF-8</a>), and having byte-order issues (BOM).
Most Win32 functions that accept strings generally come in two flavors,
ANSI and UTF-16. The standard, portable C library functions wrap the
ANSI-flavored functions. This means <strong>portable C programs can’t interact
with Unicode filenames</strong>. (Update 2021: <a href="/blog/2021/12/30/">Now they can</a>.) They must
call the non-portable, Windows-specific versions. This includes <code class="language-plaintext highlighter-rouge">main</code>
itself, which is only handed ANSI-truncated arguments.</p>
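<p>As a sketch of what “variable length” means here, this is roughly
how a single code point becomes one or two UTF-16 code units. The
function name <code class="language-plaintext highlighter-rouge">utf16_encode</code> is my own invention for illustration,
not part of any Windows or standard C API.</p>

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Writes one or two code
 * units to out and returns how many were written, or 0 if cp is not a
 * valid scalar value (a surrogate or beyond U+10FFFF). */
int
utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;  /* fits in a single code unit */
        return 1;
    }
    cp -= 0x10000;              /* 20 bits remain */
    out[0] = (uint16_t)(0xd800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xdc00 | (cp & 0x3ff));  /* low surrogate */
    return 2;
}
```

<p>For example, U+1F600 encodes as the surrogate pair 0xD83D 0xDE00,
while U+0041 stays a single unit. Code that assumes one unit per
character breaks on the former.</p>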

<p>Compare this to unix-like systems, which generally adopted UTF-8, but
rather as a convention than as a hard rule. The operating system
doesn’t know or care about Unicode. Program arguments and filenames
are just zero-terminated bytestrings. Implicitly decoding these as
UTF-8 <a href="https://utcc.utoronto.ca/~cks/space/blog/python/Python3UnicodeIssue">would be a mistake anyway</a>. What happens when the
encoding isn’t valid?</p>

<p>This doesn’t <em>have</em> to be a problem on Windows. A Windows standard C
library could connect to Windows’ Unicode-flavored functions and
encode to/from UTF-8 as needed, allowing portable programs to maintain
the bytestring illusion. It’s only that none of the existing standard
C libraries do it this way.</p>

<h3 id="mingw-w64">Mingw-w64</h3>

<p>Of course my first natural choice is MinGW, specifically the
<a href="http://mingw-w64.org/doku.php">Mingw-w64</a> fork. It’s GCC ported to Windows. You can
continue relying on GCC-specific features when you need them. It’s got
all the core language features up through C11, plus the common
extensions. It’s probably packaged by your Linux distribution of
choice, making it trivial to cross-compile programs and libraries from
Linux — and with Wine you can even execute them on x86. Like regular
GCC, it outputs GDB-friendly DWARF debugging information, so you can
debug applications with GDB.</p>

<p>If I’m using Mingw-w64 on Windows, <del>I prefer to do so from inside
Cygwin</del>. Since it provides a complete POSIX environment, it maximizes
portability for the whole tool chain. This isn’t strictly required.</p>

<p>However, it has one big flaw. Unlike unix-like systems, Windows doesn’t
supply a system standard C library. That’s the compiler’s job. But
Mingw-w64 doesn’t have one. Instead it links against <code class="language-plaintext highlighter-rouge">msvcrt.dll</code>,
<del>which <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">isn’t officially supported by Microsoft</a>. It just
happens to exist on modern Windows installations. Since it’s not
supported,</del> it’s way out of date and doesn’t support much of C99. A lot
of these problems are patched over by the compiler, <del>but if you’re
relying on Mingw-w64, you still have to stick to some C89 library
features, such as limiting yourself to the C89 printf specifiers</del>.</p>

<p><del>Update: Mārtiņš Možeiko has pointed out <code class="language-plaintext highlighter-rouge">__USE_MINGW_ANSI_STDIO</code>, an
undocumented feature that fixes the printf family. I now use this by
default in all of my Mingw-w64 builds. It fixes most of the formatted
output issues, except that it’s incompatible with the <a href="https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-g_t_0040code_007bformat_007d-function-attribute-3318"><code class="language-plaintext highlighter-rouge">format</code> function
attribute</a>.</del> (Update 2021: Mingw-w64 now does the right thing
out of the box.)</p>
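<p>For the historical record, sticking to C89 printf specifiers looked
something like the following sketch. The helper name <code class="language-plaintext highlighter-rouge">format_size</code>
is hypothetical. Instead of C99’s <code class="language-plaintext highlighter-rouge">%zu</code>, the value is cast to
<code class="language-plaintext highlighter-rouge">unsigned long</code> and printed with <code class="language-plaintext highlighter-rouge">%lu</code>.</p>

```c
#include <stddef.h>
#include <stdio.h>

/* Format n into buf using only C89 printf features. buf needs room
 * for the digits plus a terminator; 21 bytes covers a 64-bit value.
 * Returns the number of characters written, as sprintf does. */
int
format_size(char *buf, size_t n)
{
    /* Caveat: unsigned long is 32 bits on 64-bit Windows, so values
     * above ULONG_MAX are truncated by this cast. */
    return sprintf(buf, "%lu", (unsigned long)n);
}
```

<p>The trade-off is the truncation noted in the comment, which is
exactly the kind of compromise an outdated C library forces.</p>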

<p><del>Another problem is that <a href="http://thelinuxjedi.blogspot.com/2014/07/tripping-up-using-mingw.html">position-independent code generation is
broken</a>, and so ASLR is not an option. This means binaries produced
by Mingw-w64 are less secure than they should be. There are also a
number of <a href="https://gcc.gnu.org/ml/gcc-bugs/2015-05/msg02025.html">subtle code generation bugs</a> that might arise if you’re
doing something unusual.</del> (Update 2021: Mingw-w64 makes PIE mandatory.)</p>

<h3 id="visual-c">Visual C++</h3>

<p>The behemoth usually considered in this situation is Visual Studio and
the Visual C++ build tools. I strongly prefer open source development
tools, and Visual Studio is obviously the <em>least</em> open source option, but
at least it’s cost-free these days. Now, I have absolutely no interest
in Visual Studio, but fortunately the Visual C++ compiler and
associated build tools can be used standalone, supporting both C and
C++.</p>

<p>Included is a “vcvars” batch file — vcvars64.bat for x64. Execute that
batch file in a cmd.exe console and the Visual C++ command line build
tools will be made available in that console and in any programs
executed from it (your editor). It includes the compiler (cl.exe),
linker (link.exe), assembler (ml64.exe), disassembler (dumpbin.exe),
and more. It also includes a <a href="/blog/2016/04/30/">mostly POSIX-complete</a> make called
nmake.exe. All these tools are noisy and print a copyright banner on
every invocation, so get used to passing <code class="language-plaintext highlighter-rouge">-nologo</code> every time, which
suppresses some of it.</p>

<p>When I said behemoth, I meant it. In my experience it literally takes
<em>hours</em> (unattended) to install Visual Studio 2015. <del>The good news is you
don’t actually need it all anymore. The build tools <a href="http://landinghub.visualstudio.com/visual-cpp-build-tools">are available
standalone</a>. While it’s still a larger and slower installation
process than it really should be, it’s much more reasonable to
install. It’s good enough that I’d even say I’m comfortable relying on
it for Windows builds.</del> (Update: The build tools are unfortunately no
longer standalone.)</p>

<p>That being said, it’s not without its flaws. Microsoft has never
announced any plans to support C99. They only care about C++, with C as
a second class citizen. Since C++11 incorporated most of C99 and
Microsoft supports C++11, Visual Studio 2015 supports most of C99. The
only things missing as far as I can tell are variable length arrays
(VLAs), complex numbers, and C99’s array parameter declarators, since
none of these were adopted by C++. Some C99 features are considered
extensions (as they would be for C89), so you’ll also get warnings about
them, which can be disabled.</p>

<p>The command line interface (option flags, intermediates, etc.) isn’t
quite reconcilable with the unix-like ecosystem (i.e. GCC, Clang), so
<strong>you’ll need separate Makefiles</strong>, or you’ll need to use a build
system that generates Visual C++ Makefiles.</p>

<p><del>Debugging is a major problem.</del> (Update 2022: It’s actually quite good
once <a href="/blog/2022/06/26/">you know how to do it</a>.) Visual C++ outputs separate .pdb
<a href="https://en.wikipedia.org/wiki/Program_database">program database</a> files, which aren’t usable from GDB. Visual
Studio has a built-in debugger, though it’s not included in the
standalone Visual C++ build tools. <del>I’m still searching for a decent
debugging solution for this scenario. I tried WinDbg, but I can’t stand
it.</del> (Update 2022: <a href="https://www.youtube.com/watch?v=r9eQth4Q5jg">RemedyBG is amazing</a>.)</p>

<p>In general the output code performance is on par with GCC and Clang,
so you’re not really gaining or losing performance with Visual C++.</p>

<h3 id="clang">Clang</h3>

<p>Unsurprisingly, <a href="http://clang.llvm.org/">Clang</a> has been ported to Windows. It’s like
Mingw-w64 in that you get the same features and interface across
platforms.</p>

<p>Unlike Mingw-w64, it doesn’t link against msvcrt.dll. Instead <strong>it
relies directly on the official Windows SDK</strong>. You’ll basically need
to install the Visual C++ build tools as if you were going to build with
Visual C++. This means no practical cross-platform builds and you’re
still relying on the proprietary Microsoft toolchain. In the past you
even had to use Microsoft’s linker, but LLVM now provides its own.</p>

<p>It generates GDB-friendly DWARF debug information (in addition to
CodeView) so in theory <strong>you can debug with GDB</strong> again. I haven’t
given this a thorough evaluation yet.</p>

<h3 id="pelles-c">Pelles C</h3>

<p>Finally there’s <a href="http://www.smorgasbordet.com/pellesc/">Pelles C</a>. It’s cost-free but not open
source. It’s a reasonable, small install that includes a full IDE with
an integrated debugger and command line tools. It has its own C
library and Win32 SDK with the most complete C11 support around. It
also supports OpenMP 3.1. All in all it’s pretty nice and is something
I wouldn’t be afraid to rely upon for Windows builds.</p>

<p>Like Visual C++, it has a couple of “povars” batch files to set up the
right environment, which includes a C compiler, linker, assembler,
etc. The compiler interface mostly mimics cl.exe, though there are far
fewer code generation options. The make program, pomake.exe, mimics
nmake.exe, but is even less POSIX-complete. The compiler’s <strong>output
code performance is also noticeably poorer than GCC, Clang, and Visual
C++</strong>. It’s definitely a less mature compiler.</p>

<p>It outputs CodeView debugging information, so <strong>GDB is of no use</strong>.
The best solution is to simply use the debugger built into the IDE,
which can be invoked directly from the command line. You don’t
normally need to code from within the IDE just to use the debugger.</p>

<p>Like Visual C++, it’s Windows only, so cross-compilation isn’t really
in the picture.</p>

<p>If performance isn’t of high importance, and you don’t require
specific code generation options, then Pelles C is a nice choice for
Windows builds.</p>

<h3 id="other-options">Other Options</h3>

<p>I’m sure there are a few other options out there, and I’d like to hear
about them so I can try them out. I focused on these since they’re all
cost free and easy to download. If I have to register or pay, then
it’s not going to beat these options.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>You Can't Always Hash Pointers in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/05/30/"/>
    <id>urn:uuid:0fa3c99b-88ed-3a02-0342-4ee7536cc7ed</id>
    <updated>2016-05-30T23:59:46Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>Occasionally I’ve needed to key a hash table with C pointers. I don’t
care about the contents of the object itself — especially if it might
change — just its pointer identity. For example, suppose I’m using
null-terminated strings as keys and I know these strings will always
be interned in a common table. These strings can be compared directly
by their pointer values (<code class="language-plaintext highlighter-rouge">str_a == str_b</code>) rather than, more slowly,
by their contents (<code class="language-plaintext highlighter-rouge">strcmp(str_a, str_b) == 0</code>). The intern table
ensures that these expressions both have the same result.</p>

<p>As a key in a hash table, or other efficient map/dictionary data
structure, I’ll need to turn pointers into numerical values. However,
<strong>C pointers aren’t integers</strong>. Following certain rules it’s permitted
to cast pointers to integers and back, but doing so will reduce the
program’s portability. The most important consideration is that <strong>the
integer form isn’t guaranteed to have any meaningful or stable
value</strong>. In other words, even in a conforming implementation, the same
pointer might cast to two different integer values. This would break
any algorithm that isn’t keenly aware of the implementation details.</p>

<p>To show why this is, I’m going to be citing the relevant parts of the
C99 standard (ISO/IEC 9899:1999). The <a href="http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf">draft for C99</a> is freely
available (and what I use myself since I’m a cheapass). My purpose is
<em>not</em> to discourage you from casting pointers to integers and using
the result. The vast majority of the time this works fine and as you
would expect. I just think it’s an interesting part of the language,
and C/C++ programmers should be aware of the potential trade-offs.</p>

<h3 id="integer-to-pointer-casts">Integer to pointer casts</h3>

<p>What does the standard have to say about casting integers to pointers?
§6.3.2.3¶5:</p>

<blockquote>
  <p>An integer may be converted to any pointer type. Except as
previously specified, the result is implementation-defined, might
not be correctly aligned, might not point to an entity of the
referenced type, and might be a trap representation.</p>
</blockquote>

<p>It also includes a footnote:</p>

<blockquote>
  <p>The mapping functions for converting a pointer to an integer or an
integer to a pointer are intended to be consistent with the
addressing structure of the execution environment.</p>
</blockquote>

<p>Casting an integer to a pointer depends entirely on the
implementation. This is intended for things like memory mapped
hardware. The programmer may need to access memory at a specific
physical address, which would be encoded in the source as an integer
constant and cast to a pointer of the appropriate type.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">read_sensor_voltage</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="o">*</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="mh">0x1ffc</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It may also be used by a loader and dynamic linker to compute the
virtual address of various functions and variables, then cast to a
pointer before use.</p>

<p>Both cases are already dependent on implementation defined behavior,
so there’s nothing lost in relying on these casts.</p>

<p>An integer constant expression of 0 is a special case. It casts to a
NULL pointer in all implementations (§6.3.2.3¶3). However, a NULL
pointer doesn’t necessarily point to address zero, nor is it
necessarily a zero bit pattern (i.e. beware <code class="language-plaintext highlighter-rouge">memset</code> and <code class="language-plaintext highlighter-rouge">calloc</code> on
memory with pointers). It’s just guaranteed never to compare equal
to a pointer to any valid object, and it is undefined behavior to dereference.</p>
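<p>In practice that means assigning <code class="language-plaintext highlighter-rouge">NULL</code> to pointer members
explicitly rather than trusting zeroed memory. A minimal sketch, with a
hypothetical <code class="language-plaintext highlighter-rouge">struct node</code> and constructor invented for illustration:</p>

```c
#include <stdlib.h>

struct node {
    struct node *next;
    int value;
};

/* Allocate a node without assuming that all-bits-zero is a null
 * pointer. Returns NULL if allocation fails. */
struct node *
node_new(int value)
{
    struct node *n = malloc(sizeof(*n));
    if (n) {
        n->next = NULL;  /* explicit null, whatever its bit pattern */
        n->value = value;
    }
    return n;
}
```

<p>On mainstream implementations <code class="language-plaintext highlighter-rouge">calloc</code> would produce the same
result, but only the explicit assignment is guaranteed by the
standard.</p>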

<h3 id="pointer-to-integer-casts">Pointer to integer casts</h3>

<p>What about the other way around? §6.3.2.3¶6:</p>

<blockquote>
  <p>Any pointer type may be converted to an integer type. Except as
previously specified, the result is implementation-defined. If the
result cannot be represented in the integer type, the behavior is
undefined. The result need not be in the range of values of any
integer type.</p>
</blockquote>

<p>Like before, it’s implementation defined. However, the negatives are a
little stronger: the cast itself may be undefined behavior. I
speculate this is tied to integer overflow. The last part makes
pointer to integer casts optional for an implementation. This is one
way that the hash table above would be less portable.</p>

<p>When the cast is always possible, an implementation can provide an
integer type wide enough to hold any pointer value. §7.18.1.4¶1:</p>

<blockquote>
  <p>The following type designates a signed integer type with the
property that any valid pointer to void can be converted to this
type, then converted back to pointer to void, and the result will
compare equal to the original pointer:</p>

  <p><code class="language-plaintext highlighter-rouge">intptr_t</code></p>

  <p>The following type designates an unsigned integer type with the
property that any valid pointer to void can be converted to this
type, then converted back to pointer to void, and the result will
compare equal to the original pointer:</p>

  <p><code class="language-plaintext highlighter-rouge">uintptr_t</code></p>

  <p>These types are optional.</p>
</blockquote>

<p>The take-away is that the integer has no meaningful value. The only
guarantee is that the integer can be cast back into a void pointer
that will compare equally. It would be perfectly legal for an
implementation to pass these assertions (and still sometimes fail).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">example</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr_a</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">ptr_b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ptr_a</span> <span class="o">==</span> <span class="n">ptr_b</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">uintptr_t</span> <span class="n">int_a</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">ptr_a</span><span class="p">;</span>
        <span class="kt">uintptr_t</span> <span class="n">int_b</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">ptr_b</span><span class="p">;</span>
        <span class="n">assert</span><span class="p">(</span><span class="n">int_a</span> <span class="o">!=</span> <span class="n">int_b</span><span class="p">);</span>
        <span class="n">assert</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">int_a</span> <span class="o">==</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">int_b</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the bits don’t have any particular meaning, arithmetic
operations involving them will also have no meaning. When a pointer
might map to two different integers, the hash values might not match
up, breaking hash tables that rely on them. Even with <code class="language-plaintext highlighter-rouge">uintptr_t</code>
provided, casting pointers to integers isn’t useful without also
relying on implementation defined properties of the result.</p>
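<p>For reference, the kind of pointer hash under discussion usually
looks something like this sketch. The name <code class="language-plaintext highlighter-rouge">hash_ptr</code> and the mixing
steps are my own choices for illustration, and the multiplier is just an
arbitrary odd constant. It leans on exactly the implementation-defined
property described above: that equal pointers always convert to equal
integers.</p>

```c
#include <stdint.h>

/* Hash a pointer by its identity. ASSUMPTION (not guaranteed by the
 * standard): uintptr_t exists, and equal pointers always convert to
 * equal integer values on this implementation. */
uint64_t
hash_ptr(const void *p)
{
    uint64_t h = (uint64_t)(uintptr_t)p;
    h ^= h >> 32;
    h *= 0x9e3779b97f4a7c15ULL;  /* arbitrary odd multiplier */
    h ^= h >> 32;
    return h;
}
```

<p>On typical flat-address implementations this behaves as expected. On
an implementation where one pointer may cast to two different integers,
lookups with this hash would silently miss.</p>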

<h3 id="reasons-for-this-pointer-insanity">Reasons for this pointer insanity</h3>

<p>What purpose could such strange pointer-to-integer casts serve?</p>

<p>A security-conscious implementation may choose to annotate pointers
with additional information by setting unused bits. It might be for
<a href="https://www.usenix.org/legacy/event/sec09/tech/full_papers/akritidis.pdf">baggy bounds checks</a> or, someday, in an <a href="http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html">undefined behavior
sanitizer</a>. Before dereferencing annotated pointers, the
metadata bits would be checked for validity, and cleared/set before
use as an address. Or it may <a href="/blog/2016/04/10/">map the same object at multiple virtual
addresses</a> to avoid setting/clearing the metadata bits,
providing interoperability with code unaware of the annotations. When
pointers are compared, these bits would be ignored.</p>

<p>When these annotated pointers are cast to integers, the metadata bits
will be present, but a program using the integer wouldn’t know their
meaning without tying itself closely to that implementation.
Completely unused bits may even be filled with random garbage when
cast. It’s allowed.</p>

<p>You may have been thinking before about using a union or <code class="language-plaintext highlighter-rouge">char *</code> to
bypass the cast and access the raw pointer bytes, but you’d run into
the same problems on the same implementations.</p>

<h3 id="conforming-programs">Conforming programs</h3>

<p>The standard makes a distinction between <em>strictly conforming
programs</em> (§4¶5) and <em>conforming programs</em> (§4¶7). A strictly
conforming program must not produce output that depends on
implementation-defined behavior, nor may it exceed any minimum implementation
limit. Very few programs fit in this category, and any program using <code class="language-plaintext highlighter-rouge">uintptr_t</code>
is excluded since the type is optional. Here are more examples of code that isn’t
strictly conforming:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">printf</span><span class="p">(</span><span class="s">"%zu"</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">));</span> <span class="c1">// §6.5.3.4</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">);</span>      <span class="c1">// §6.5¶4</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="n">INT_MAX</span><span class="p">);</span>      <span class="c1">// §5.2.4.2.1</span>
</code></pre></div></div>

<p>On the other hand, a <em>conforming program</em> is allowed to depend on
implementation defined behavior. Relying on meaningful, stable values
for pointers cast to <code class="language-plaintext highlighter-rouge">uintptr_t</code>/<code class="language-plaintext highlighter-rouge">intptr_t</code> is conforming even if your
program may exhibit bugs on some implementations.</p>
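<p>Because the type is optional, a conforming program can at least detect it. The standard says <code class="language-plaintext highlighter-rouge">&lt;stdint.h&gt;</code> defines the <code class="language-plaintext highlighter-rouge">UINTPTR_MAX</code> macro only when <code class="language-plaintext highlighter-rouge">uintptr_t</code> is provided. The sketch below uses that; the function names <code class="language-plaintext highlighter-rouge">have_uintptr</code> and <code class="language-plaintext highlighter-rouge">round_trip</code> are hypothetical.</p>

```c
#include <stdint.h>

/* Report whether this implementation provides uintptr_t. */
int
have_uintptr(void)
{
#ifdef UINTPTR_MAX
    return 1;
#else
    return 0;
#endif
}

/* Round-trip a pointer through uintptr_t when the type exists. The
 * only promise (C11 7.20.1.4) is that the result compares equal to
 * the original pointer; the integer in between is opaque. */
void *
round_trip(void *ptr)
{
#ifdef UINTPTR_MAX
    uintptr_t bits = (uintptr_t)ptr;
    return (void *)bits;
#else
    return ptr;  /* no uintptr_t: nothing to convert through */
#endif
}
```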

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Mapping Multiple Memory Views in User Space</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/04/10/"/>
    <id>urn:uuid:373e602e-0d43-3e03-f02c-2d169eb14df5</id>
    <updated>2016-04-10T21:59:16Z</updated>
    <category term="c"/><category term="linux"/><category term="win32"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>Modern operating systems run processes within <em>virtual memory</em> using a
piece of hardware called a <em>memory management unit</em> (MMU). The MMU
consults a <em>page table</em> that defines how virtual memory maps onto
<em>physical memory</em>. The operating system is responsible for maintaining
this page table, mapping and unmapping virtual memory to physical
memory as needed by the processes it’s running. If a process accesses
a page that is not currently mapped, it will trigger a <em>page fault</em>
and the execution of the offending thread will be paused until the
operating system maps that page.</p>

<p>This functionality allows for a neat hack: A physical memory address
can be mapped to multiple virtual memory addresses at the same time. A
process running with such a mapping will see these regions of memory
as aliased — views of the same physical memory. A store to one of
these addresses will simultaneously appear across all of them.</p>

<p>Some useful applications of this feature include:</p>

<ul>
  <li>An extremely fast, large memory “copy” by mapping the source memory
over top of the destination memory.</li>
  <li>Trivial interoperability between code instrumented with <a href="https://www.usenix.org/legacy/event/sec09/tech/full_papers/akritidis.pdf">baggy
bounds checking</a> [PDF] and non-instrumented code. A few bits
of each pointer are reserved to tag the pointer with the size of its
memory allocation. For compactness, the stored size is rounded up to
a power of two, making it “baggy.” Instrumented code checks this tag
before making a possibly-unsafe dereference. Normally, instrumented
code would need to clear (or set) these bits before dereferencing or
before passing it to non-instrumented code. Instead, the allocation
could be mapped simultaneously at each location for every possible
tag, making the pointer valid no matter its tag bits.</li>
  <li>Two responses to <a href="/blog/2016/03/31/">my last post on hotpatching</a> suggested
that, instead of modifying the instruction directly, memory
containing the modification could be mapped over top of the code. I
would copy the code to another place in memory, safely modify it in
private, switch the page protections from write to execute (both for
W^X and for <a href="https://web.archive.org/web/20190323050330/http://stackoverflow.com/a/18905927">other hardware limitations</a>), then map it over
the target. Restoring the original behavior would be as simple as
unmapping the change.</li>
</ul>

<p>Both POSIX and Win32 allow user space applications to create these
aliased mappings. The original purpose for these APIs is for shared
memory between processes, where the same physical memory is mapped
into two different processes’ virtual memory. But the OS doesn’t stop
us from mapping the shared memory to a different address within the
same process.</p>

<h3 id="posix-memory-mapping">POSIX Memory Mapping</h3>

<p>On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions
are <code class="language-plaintext highlighter-rouge">shm_open(3)</code>, <code class="language-plaintext highlighter-rouge">ftruncate(2)</code>, and <code class="language-plaintext highlighter-rouge">mmap(2)</code>.</p>

<p>First, create a file descriptor to shared memory using <code class="language-plaintext highlighter-rouge">shm_open</code>. It
has very similar semantics to <code class="language-plaintext highlighter-rouge">open(2)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">shm_open</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">int</span> <span class="n">oflag</span><span class="p">,</span> <span class="n">mode_t</span> <span class="n">mode</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">name</code> works much like a filesystem path, but is actually a
different namespace (though on Linux it <em>is</em> a tmpfs mounted at
<code class="language-plaintext highlighter-rouge">/dev/shm</code>). Resources created here (<code class="language-plaintext highlighter-rouge">O_CREAT</code>) will persist until
explicitly deleted (<code class="language-plaintext highlighter-rouge">shm_unlink(3)</code>) or until the system reboots. It’s
an oversight in POSIX that a name is required even if we never intend
to access it by name. File descriptors can be shared with other
processes via <code class="language-plaintext highlighter-rouge">fork(2)</code> or through UNIX domain sockets, so a name
isn’t strictly required.</p>

<p>OpenBSD introduced <a href="http://man.openbsd.org/OpenBSD-current/man3/shm_mkstemp.3"><code class="language-plaintext highlighter-rouge">shm_mkstemp(3)</code></a> to solve this problem,
but it’s not widely available. On Linux, as of this writing, the
<code class="language-plaintext highlighter-rouge">O_TMPFILE</code> flag may or may not provide a fix (<a href="http://comments.gmane.org/gmane.linux.man/9815">it’s
undocumented</a>).</p>

<p>The portable workaround is to attempt to choose a unique name, open
the file with <code class="language-plaintext highlighter-rouge">O_CREAT | O_EXCL</code> (either atomically create the file or
fail), <code class="language-plaintext highlighter-rouge">shm_unlink</code> the shared memory object as soon as possible, then
cross our fingers. The shared memory object will still exist (the file
descriptor keeps it alive) but will no longer be accessible by name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">shm_open</span><span class="p">(</span><span class="s">"/example"</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span> <span class="o">|</span> <span class="n">O_EXCL</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">handle_error</span><span class="p">();</span> <span class="c1">// non-local exit</span>
<span class="n">shm_unlink</span><span class="p">(</span><span class="s">"/example"</span><span class="p">);</span>
</code></pre></div></div>

<p>The shared memory object is brand new (<code class="language-plaintext highlighter-rouge">O_EXCL</code>) and is therefore of
zero size. <code class="language-plaintext highlighter-rouge">ftruncate</code> sets it to the desired size. This does <em>not</em>
need to be a multiple of the page size. Failing to allocate memory
will result in a bus error on access.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">);</span>
<span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally <code class="language-plaintext highlighter-rouge">mmap</code> the shared memory into place just as if it were a file.
We can choose an address (aligned to a page) or let the operating
system choose one for us (NULL). If we don’t plan on making any more
mappings, we can also close the file descriptor. The shared memory
object will be freed as soon as it is completely unmapped (<code class="language-plaintext highlighter-rouge">munmap(2)</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</code></pre></div></div>

<p>At this point both <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> have different addresses but point (via
the page table) to the same physical memory. Changes to one are
reflected in the other. So this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mh">0xdeafbeef</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%p %p 0x%x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="o">*</span><span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>Will print out something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x6ffffff0000 0x6fffffe0000 0xdeafbeef
</code></pre></div></div>

<p>It’s also possible to do all this only with <code class="language-plaintext highlighter-rouge">open(2)</code> and <code class="language-plaintext highlighter-rouge">mmap(2)</code> by
mapping the same file twice, but you’d need to worry about where to
put the file, where it’s going to be backed, and the operating system
will have certain obligations about syncing it to storage somewhere.
Using POSIX shared memory is simpler and faster.</p>
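<p>For comparison, the file-backed variant might be sketched like this. The function name <code class="language-plaintext highlighter-rouge">file_alias_check</code> and the <code class="language-plaintext highlighter-rouge">/tmp</code> template are my own assumptions for illustration. <code class="language-plaintext highlighter-rouge">mkstemp(3)</code> picks the unique name, and unlinking the file right away mirrors the <code class="language-plaintext highlighter-rouge">shm_unlink</code> trick above. Unlike the earlier snippets, every step is checked.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one temporary file at two addresses, store through the first
 * view, and load through the second. Returns 1 if the store was
 * visible through the alias, 0 if not, and -1 if setup failed. */
int
file_alias_check(void)
{
    char path[] = "/tmp/alias-XXXXXX";  /* assumed writable location */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);  /* the descriptor keeps the file alive */

    size_t size = sizeof(uint32_t);
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }

    int prot = PROT_READ | PROT_WRITE;
    uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    close(fd);  /* no more mappings needed */
    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED)
            munmap(a, size);
        if (b != MAP_FAILED)
            munmap(b, size);
        return -1;
    }

    *a = 0xdeafbeef;
    int ok = (a != b) && (*b == 0xdeafbeef);
    munmap(a, size);
    munmap(b, size);
    return ok;
}
```

<p>The cost is the one described above: the file lives on whatever filesystem backs <code class="language-plaintext highlighter-rouge">/tmp</code>, so the operating system may have to write the pages out to storage.</p>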

<h3 id="windows-memory-mapping">Windows Memory Mapping</h3>

<p>Windows is very similar, but directly supports anonymous shared
memory. The key functions are <code class="language-plaintext highlighter-rouge">CreateFileMapping</code> and
<code class="language-plaintext highlighter-rouge">MapViewOfFile</code> (or <code class="language-plaintext highlighter-rouge">MapViewOfFileEx</code>).</p>

<p>First create a file mapping object from an invalid handle value. Like
POSIX, the word “file” is used without actually involving files.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">);</span>
<span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">CreateFileMapping</span><span class="p">(</span><span class="n">INVALID_HANDLE_VALUE</span><span class="p">,</span>
                             <span class="nb">NULL</span><span class="p">,</span>
                             <span class="n">PAGE_READWRITE</span><span class="p">,</span>
                             <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
                             <span class="nb">NULL</span><span class="p">);</span>
</code></pre></div></div>

<p>There’s no truncate step because the space is allocated at creation
time via the two-part size argument.</p>

<p>Then, just like <code class="language-plaintext highlighter-rouge">mmap</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="n">MapViewOfFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="kt">uint32_t</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">MapViewOfFile</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>If I wanted to choose the target address myself, I’d call
<code class="language-plaintext highlighter-rouge">MapViewOfFileEx</code> instead, which takes the address as an additional
argument.</p>

<p>From here on it’s the same as above.</p>

<h3 id="generalizing-the-api">Generalizing the API</h3>

<p>Having some fun with this, I came up with a general API to allocate an
aliased mapping at an arbitrary number of addresses.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">);</span>
</code></pre></div></div>

<p>Values in the address array must either be page-aligned or NULL to
allow the operating system to choose, in which case the map address is
written to the array.</p>
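<p>A caller could validate each entry up front with something like the following. This is a sketch under stated assumptions: <code class="language-plaintext highlighter-rouge">is_page_aligned</code> is a hypothetical helper that isn’t part of <code class="language-plaintext highlighter-rouge">memalias.c</code>, it’s POSIX-only since it asks <code class="language-plaintext highlighter-rouge">sysconf(3)</code> for the page size, and it assumes that casting a pointer to <code class="language-plaintext highlighter-rouge">uintptr_t</code> yields a plain address.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Returns 1 if ptr is NULL or sits on a page boundary, 0 otherwise.
 *
 * ASSUMPTION: pointer-to-integer casts yield plain addresses, as on
 * mainstream POSIX systems. */
int
is_page_aligned(const void *ptr)
{
    if (ptr == NULL)
        return 1;  /* NULL lets the operating system choose */
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return 0;  /* page size unavailable */
    return (uintptr_t)ptr % (uintptr_t)page == 0;
}
```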

<p>It returns 0 on success. It may fail if the size is too small (0) or too
large, if there are too many open file descriptors, etc.</p>

<p>Pass the same pointers back to <code class="language-plaintext highlighter-rouge">memory_alias_unmap</code> to free the
mappings. When called correctly it cannot fail, so there’s no return
value.</p>

<p>The full source is here: <a href="/download/memalias.c">memalias.c</a></p>

<h4 id="posix">POSIX</h4>

<p>Starting with the simpler of the two functions, the POSIX
implementation looks like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">munmap</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The complex part is creating the mapping:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">path</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">path</span><span class="p">),</span> <span class="s">"/%s(%ld,%p)"</span><span class="p">,</span>
             <span class="n">__FUNCTION__</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span><span class="p">)</span><span class="n">getpid</span><span class="p">(),</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">addrs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">shm_open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span> <span class="o">|</span> <span class="n">O_EXCL</span><span class="p">,</span> <span class="mo">0600</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">shm_unlink</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
    <span class="n">ftruncate</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">size</span><span class="p">,</span>
                        <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_SHARED</span><span class="p">,</span>
                        <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">MAP_FAILED</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memory_alias_unmap</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">addrs</span><span class="p">);</span>
            <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The shared object name includes the process ID and pointer array
address, so there really shouldn’t be any non-malicious name
collisions, even if called from multiple threads in the same process.</p>

<p>Otherwise it just walks the array setting up the mappings.</p>

<h4 id="windows">Windows</h4>

<p>The Windows version is very similar.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">memory_alias_unmap</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">size</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">UnmapViewOfFile</span><span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since Windows tracks the size internally, the <code class="language-plaintext highlighter-rouge">size</code> argument is unneeded and ignored.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">memory_alias_map</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">naddr</span><span class="p">,</span> <span class="kt">void</span> <span class="o">**</span><span class="n">addrs</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">m</span> <span class="o">=</span> <span class="n">CreateFileMapping</span><span class="p">(</span><span class="n">INVALID_HANDLE_VALUE</span><span class="p">,</span>
                                 <span class="nb">NULL</span><span class="p">,</span>
                                 <span class="n">PAGE_READWRITE</span><span class="p">,</span>
                                 <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span>
                                 <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">m</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
    <span class="n">DWORD</span> <span class="n">access</span> <span class="o">=</span> <span class="n">FILE_MAP_ALL_ACCESS</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">naddr</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">MapViewOfFileEx</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">access</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">addrs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memory_alias_unmap</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">addrs</span><span class="p">);</span>
            <span class="n">CloseHandle</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">CloseHandle</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the future I’d like to find some unique applications of these
multiple memory views.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Hotpatching a C Function on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/03/31/"/>
    <id>urn:uuid:49f6ea3c-d44a-3bed-1aad-70ad47e325c6</id>
    <updated>2016-03-31T23:59:59Z</updated>
    <category term="x86"/><category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In this post I’m going to do a silly, but interesting, exercise that
should never be done in any program that actually matters. I’m going to
write a program that changes one of its function definitions while
it’s actively running and using that function. Unlike <a href="/blog/2014/12/23/">last
time</a>, this won’t involve shared libraries, but it will require
x86_64 and GCC. Most of the time it will work with Clang, too, but
it’s missing an important compiler option that makes it stable.</p>

<p>If you want to see it all up front, here’s the full source:
<a href="/download/hotpatch.c">hotpatch.c</a></p>

<p>Here’s the function that I’m going to change:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s dead simple, but that’s just for demonstration purposes. This
will work with any function of arbitrary complexity. The definition
will be changed to this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"goodbye %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">x</span><span class="o">++</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I was only going to change the string, but I figured I should make it a
little more interesting.</p>

<p>Here’s how it’s going to work: I’m going to overwrite the beginning of
the function with an unconditional jump that immediately moves control
to the new definition of the function. It’s vital that the function
prototype does not change, since that would be a <em>far</em> more complex
problem.</p>

<p><strong>But first there’s some preparation to be done.</strong> The target needs to
be augmented with some GCC function attributes to prepare it for its
redefinition. As is, there are three possible problems that need to be
dealt with:</p>

<ul>
  <li>I want to hotpatch this function <em>while it is being used</em> by another
thread <em>without</em> any synchronization. It may even be executing the
function at the same time I clobber its first instructions with my
jump. If it’s in between these instructions, disaster will strike.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">ms_hook_prologue</code> function attribute. This tells
GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP
that I can safely clobber. This idea originated in Microsoft’s Win32
API, hence the “ms” in the name.</p>

<ul>
  <li>The prologue NOP needs to be updated atomically. I can’t let the
other thread see a half-written instruction or, again, disaster. On
x86 this means I have an alignment requirement. Since I’m
overwriting an 8-byte instruction, I’m specifically going to need
8-byte alignment to get an atomic write.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">aligned</code> function attribute, ensuring the
hotpatch prologue is properly aligned.</p>

<ul>
  <li>The final problem is that there must be exactly one copy of this
function in the compiled program. It must never be inlined or
cloned, since these won’t be hotpatched.</li>
</ul>

<p>As you might have guessed, this is primarily fixed with the <code class="language-plaintext highlighter-rouge">noinline</code>
function attribute. Since GCC may also clone the function and call
that instead, it also needs the <code class="language-plaintext highlighter-rouge">noclone</code> attribute.</p>

<p>Even further, if GCC determines there are no side effects, it may
cache the return value and only ever call the function once. To
convince GCC that there’s a side effect, I added an empty inline
assembly string (<code class="language-plaintext highlighter-rouge">__asm("")</code>). Since <code class="language-plaintext highlighter-rouge">puts()</code> has a side effect
(output), this isn’t truly necessary for this particular example, but
I’m being thorough.</p>

<p>What does the function look like now?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span> <span class="p">((</span><span class="n">ms_hook_prologue</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">8</span><span class="p">)))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noinline</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noclone</span><span class="p">))</span>
<span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span><span class="p">(</span><span class="s">""</span><span class="p">);</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And what does the assembly look like?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -Mintel -d hotpatch
0000000000400848 &lt;hello&gt;:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 &lt;puts@plt&gt;
</code></pre></div></div>

<p>It’s 8-byte aligned and it has the 8-byte NOP: that <code class="language-plaintext highlighter-rouge">lea</code> instruction
does nothing. It copies <code class="language-plaintext highlighter-rouge">rsp</code> into itself and changes no flags. Why
not 8 1-byte NOPs? I need to replace exactly one instruction with
exactly one other instruction. I can’t have another thread in between
those NOPs.</p>

<h3 id="hotpatching">Hotpatching</h3>

<p>Next, let’s take a look at the function that will perform the
hotpatch. I’ve written a generic patching function for this purpose.
This part is entirely specific to x86.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hotpatch</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">target</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">replacement</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// 8-byte aligned?</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">page</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="o">~</span><span class="mh">0xfff</span><span class="p">);</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_WRITE</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">replacement</span> <span class="o">-</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="kt">uint8_t</span> <span class="n">bytes</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
        <span class="kt">uint64_t</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">instruction</span> <span class="o">=</span> <span class="p">{</span> <span class="p">{</span><span class="mh">0xe9</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">}</span> <span class="p">};</span>
    <span class="o">*</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">=</span> <span class="n">instruction</span><span class="p">.</span><span class="n">value</span><span class="p">;</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It takes the address of the function to be patched and the address of
the function to replace it. As mentioned, the target <em>must</em> be 8-byte
aligned (enforced by the assert). It’s also important this function is
only called by one thread at a time, even on different targets. If
that were a concern, I’d wrap it in a mutex to create a critical
section.</p>

<p>There are a number of things going on here, so let’s go through them
one at a time:</p>

<h4 id="make-the-function-writeable">Make the function writeable</h4>

<p>The .text segment will not be writeable by default. This is for both
security and safety. Before I can hotpatch the function I need to make
the function writeable. To make the function writeable, I need to make
its page writeable. To make its page writeable, I need to call
<code class="language-plaintext highlighter-rouge">mprotect()</code>. If there was another thread monkeying with the page
attributes of this page at the same time (another thread calling
<code class="language-plaintext highlighter-rouge">hotpatch()</code>) I’d be in trouble.</p>

<p>It finds the page by rounding the target address down to the nearest
4096, the assumed page size (sorry hugepages). <em>Warning</em>: I’m being a
bad programmer and not checking the result of <code class="language-plaintext highlighter-rouge">mprotect()</code>. If it
fails, the program will crash and burn. It will always fail on systems
with W^X enforcement, which will likely become the standard <a href="https://marc.info/?t=145942649500004">in the
future</a>. Under W^X (“write XOR execute”), memory can either
be writeable or executable, but never both at the same time.</p>

<p>What if the function straddles pages? Well, I’m only patching the
first 8 bytes, which, thanks to alignment, will sit entirely inside
the page I just found. It’s not an issue.</p>

<p>At the end of the function, I <code class="language-plaintext highlighter-rouge">mprotect()</code> the page back to
non-writeable.</p>
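
<p>As a minimal sketch of a more defensive version of this step, the page size can be queried with <code class="language-plaintext highlighter-rouge">sysconf()</code> instead of assumed, and the result of <code class="language-plaintext highlighter-rouge">mprotect()</code> can be passed back to the caller. The helper names <code class="language-plaintext highlighter-rouge">page_base()</code> and <code class="language-plaintext highlighter-rouge">protect_page()</code> are hypothetical and not part of hotpatch.c.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: round addr down to the start of its page.
 * pagesize must be a power of two. */
static void *
page_base(const void *addr, uintptr_t pagesize)
{
    return (void *)((uintptr_t)addr & ~(pagesize - 1));
}

/* Hypothetical helper: change the protection of the single page that
 * contains addr. Returns 0 on success and -1 on failure, with errno
 * set by the failing call. */
static int
protect_page(const void *addr, int prot)
{
    long pagesize = sysconf(_SC_PAGESIZE);  /* ask rather than assume 4096 */
    if (pagesize <= 0) {
        return -1;
    }
    return mprotect(page_base(addr, (uintptr_t)pagesize),
                    (size_t)pagesize, prot);
}
```

<p>With a return value to check, a caller could refuse to write the patch when the page cannot be made writeable, such as under W^X enforcement, rather than crashing on the store.</p>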

<h4 id="create-the-instruction">Create the instruction</h4>

<p>I’m assuming the replacement function is within 2GB of the original in
virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s
no 64-bit relative jump, and I only have 8 bytes to work within
anyway. Looking that up in <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">the Intel manual</a>, I see this:</p>

<p><img src="/img/misc/jmp-e9.png" alt="" /></p>

<p>Fortunately it’s a really simple instruction. It’s opcode 0xE9 and
it’s followed immediately by the 32-bit displacement. The instruction
is 5 bytes wide.</p>

<p>To compute the relative jump, I take the difference between the
functions, minus 5. Why the 5? The jump address is computed from the
position <em>after</em> the jump instruction and, as I said, it’s 5 bytes
wide.</p>

<p>I put 0xE9 in a byte array, followed by the little endian
displacement. The astute may notice that the displacement is signed
(it can go “up” or “down”) and I used an unsigned integer. That’s
because it will overflow nicely to the right value and make those
shifts clean.</p>
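
<p>The displacement arithmetic can be sketched in isolation. The function below is an illustration, not code from hotpatch.c: the name <code class="language-plaintext highlighter-rouge">encode_jmp()</code> is hypothetical, it assumes a 64-bit <code class="language-plaintext highlighter-rouge">uintptr_t</code>, and unlike the original it checks that the replacement really is within reach of a 32-bit displacement.</p>

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode "jmp rel32" (opcode 0xE9) from target to
 * replacement into out[8]. The three trailing bytes are zero padding
 * that is never executed. Returns 0 on success, or -1 if the
 * displacement does not fit in a signed 32-bit integer. */
static int
encode_jmp(uint8_t out[8], uintptr_t target, uintptr_t replacement)
{
    /* The displacement is relative to the end of the 5-byte jump. */
    int64_t diff = (int64_t)(replacement - target) - 5;
    if (diff < INT32_MIN || diff > INT32_MAX) {
        return -1;
    }
    uint32_t rel = (uint32_t)(int32_t)diff;  /* two's complement bits */
    memset(out, 0, 8);
    out[0] = 0xe9;
    out[1] = (uint8_t)(rel >> 0);   /* little endian: low byte first */
    out[2] = (uint8_t)(rel >> 8);
    out[3] = (uint8_t)(rel >> 16);
    out[4] = (uint8_t)(rel >> 24);
    return 0;
}
```

<p>For a target at 0x400848 and a replacement at 0x400900, the difference is 0xb8, so the displacement is 0xb3 and the encoded bytes begin <code class="language-plaintext highlighter-rouge">e9 b3 00 00 00</code>. A replacement at the lower address 0x400800 gives a displacement of -77, stored as <code class="language-plaintext highlighter-rouge">b3 ff ff ff</code>.</p>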

<p>Finally, the instruction byte array I just computed is written over
the hotpatch NOP as a single, atomic, 64-bit store.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    *(uint64_t *)target = instruction.value;
</code></pre></div></div>

<p>Other threads will see either the NOP or the jump, nothing in between.
There’s no synchronization, so other threads may continue to execute
the NOP for a brief moment even though I’ve clobbered it, but that’s
fine.</p>
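
<p>One way to make the single store explicit, sketched here under the assumption that the compiler is GCC or Clang, is the <code class="language-plaintext highlighter-rouge">__atomic_store_n</code> builtin. The function name <code class="language-plaintext highlighter-rouge">store_instruction()</code> is hypothetical. The idea is to request one 64-bit store outright instead of relying on how a plain assignment happens to be compiled.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: overwrite the 8 bytes at target with bytes[]
 * using one 64-bit store. target must be 8-byte aligned so that the
 * store cannot be torn across two accesses. */
static void
store_instruction(void *target, const uint8_t bytes[8])
{
    assert(((uintptr_t)target & 7) == 0);
    uint64_t value;
    memcpy(&value, bytes, sizeof(value));  /* pack bytes without a union */
    __atomic_store_n((uint64_t *)target, value, __ATOMIC_RELEASE);
}
```

<p>The page holding the target still has to be writeable before this is called, exactly as with the plain assignment.</p>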

<h3 id="trying-it-out">Trying it out</h3>

<p>Here’s what my test program looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">worker</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">arg</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">hello</span><span class="p">();</span>
        <span class="n">usleep</span><span class="p">(</span><span class="mi">100000</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">pthread_t</span> <span class="kr">thread</span><span class="p">;</span>
    <span class="n">pthread_create</span><span class="p">(</span><span class="o">&amp;</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">worker</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="n">getchar</span><span class="p">();</span>
    <span class="n">hotpatch</span><span class="p">(</span><span class="n">hello</span><span class="p">,</span> <span class="n">new_hello</span><span class="p">);</span>
    <span class="n">pthread_join</span><span class="p">(</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I fire off the other thread to keep it pinging at <code class="language-plaintext highlighter-rouge">hello()</code>. The
main thread waits until I hit enter to give the program input, then
calls <code class="language-plaintext highlighter-rouge">hotpatch()</code> and changes the function called by
the “worker” thread. I’ve now changed the behavior of the worker
thread without its knowledge. In a more practical situation, this
could be used to update parts of a running program without restarting
or even synchronizing.</p>

<h3 id="further-reading">Further Reading</h3>

<p>These related articles have been shared with me since publishing this
article:</p>

<ul>
  <li><a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583">Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?</a></li>
  <li><a href="http://jbremer.org/x86-api-hooking-demystified/">x86 API Hooking Demystified</a></li>
  <li><a href="http://conf.researchr.org/event/pldi-2016/pldi-2016-papers-living-on-the-edge-rapid-toggling-probes-with-cross-modification-on-x86">Living on the edge: Rapid-toggling probes with cross modification on x86</a></li>
  <li><a href="https://lwn.net/Articles/620640/">arm64: alternatives runtime patching</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Calling the Native API While Freestanding</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/02/28/"/>
    <id>urn:uuid:3649a761-d3dc-391b-7f24-a28398100102</id>
    <updated>2016-02-28T23:47:22Z</updated>
    <category term="c"/><category term="win32"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>When developing <a href="/blog/2016/01/31/">minimal, freestanding Windows programs</a>, it’s
obviously beneficial to take full advantage of dynamic libraries that
are already linked rather than duplicate that functionality in the
application itself. Every Windows process automatically, and
involuntarily, has kernel32.dll and ntdll.dll loaded into its process
space before it starts. As discussed previously, kernel32.dll provides
the Windows API (Win32). The other, ntdll.dll, provides the <em>Native
API</em> for user space applications, and is the focus of this article.</p>

<p>The Native API is a low-level API, a foundation for the implementation
of the Windows API and various components that don’t use the Windows
API (drivers, etc.). It includes a runtime library (RTL) suitable for
replacing important parts of the C standard library, unavailable to
freestanding programs. Very useful for a minimal program.</p>

<p>Unfortunately, <em>using</em> the Native API is a bit of a minefield. Not all
of the documented Native API functions are actually exported by
ntdll.dll, making them inaccessible both for linking and
GetProcAddress(). Some are exported, but not documented as such.
Others are documented as exported but are not documented <em>when</em> (which
release of Windows). If a particular function wasn’t exported until
Windows 8, I don’t want to use it when supporting Windows 7.</p>

<p>This is further complicated by the Microsoft Windows SDK, where many
of these functions are just macros that alias C runtime functions.
Naturally, MinGW closely follows suit. For example, in both cases,
here is how the Native API function <code class="language-plaintext highlighter-rouge">RtlCopyMemory</code> is “declared.”</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define RtlCopyMemory(dest,src,n) memcpy((dest),(src),(n))
</span></code></pre></div></div>

<p>This is certainly not useful for freestanding programs, though it has
a significant benefit for <em>hosted</em> programs: The C compiler knows the
semantics of <code class="language-plaintext highlighter-rouge">memcpy()</code> and can properly optimize around it. Any C
compiler worth its salt will replace a small or aligned, fixed-sized
<code class="language-plaintext highlighter-rouge">memcpy()</code> or <code class="language-plaintext highlighter-rouge">memmove()</code> with the equivalent inlined code. For
example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="n">buffer0</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="kt">char</span> <span class="n">buffer1</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="c1">// ...</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">buffer0</span><span class="p">,</span> <span class="n">buffer1</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="c1">// ...</span>
</code></pre></div></div>

<p>On x86_64 (GCC 4.9.3, -Os), this <code class="language-plaintext highlighter-rouge">memcpy()</code> call is replaced with
two instructions. This isn’t possible when calling an opaque function
in a non-standard dynamic library. The side effects could be anything.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">movaps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span> <span class="o">+</span> <span class="mi">48</span><span class="p">]</span>
    <span class="nf">movaps</span>  <span class="p">[</span><span class="nb">rsp</span> <span class="o">+</span> <span class="mi">32</span><span class="p">],</span> <span class="nv">xmm0</span>
</code></pre></div></div>

<p>These Native API macro aliases are what have allowed certain Wine
issues <a href="https://bugs.winehq.org/show_bug.cgi?id=38783">to slip by unnoticed for years</a>. Very few user space
applications actually call Native API functions, even when addressed
directly by name in the source. The development suite is pulling a
bait and switch.</p>

<p>Like <a href="/blog/2014/12/09/">last time I danced at the edge of the compiler</a>, this has
caused headaches in my recent experimentation with freestanding
executables. The MinGW headers assume that the programs including them
will link against a C runtime. Dirty hack warning: To work around it,
I have to undo the definition in the MinGW headers and make my own.
For example, to use the real <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="cp">#undef RtlMoveMemory
</span><span class="kr">__declspec</span><span class="p">(</span><span class="n">dllimport</span><span class="p">)</span>
<span class="kt">void</span> <span class="nf">RtlMoveMemory</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>Anywhere I might have previously used <code class="language-plaintext highlighter-rouge">memmove()</code> I can instead
use <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code>. Or I could trivially supply my own wrapper:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">memmove</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">RtlMoveMemory</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">d</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As of this writing, the same approach is not reliable with
<code class="language-plaintext highlighter-rouge">RtlCopyMemory()</code>, the cousin to <code class="language-plaintext highlighter-rouge">memcpy()</code>. As far as I can tell, it
was only exported starting in Windows 7 SP1 and Wine 1.7.46 (June
2015). Use <code class="language-plaintext highlighter-rouge">RtlMoveMemory()</code> instead. The overlap-handling overhead is
negligible compared to the function call overhead anyway.</p>

<p>As a side note: one reason besides minimalism for not implementing
your own <code class="language-plaintext highlighter-rouge">memmove()</code> is that it can’t be implemented efficiently in a
conforming C program. According to the language specification, your
implementation of <code class="language-plaintext highlighter-rouge">memmove()</code> would not be permitted to compare its
pointer arguments with <code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, <code class="language-plaintext highlighter-rouge">&lt;=</code>, or <code class="language-plaintext highlighter-rouge">&gt;=</code>. That would lead to
undefined behavior when pointing to unrelated objects (ISO/IEC
9899:2011 §6.5.8¶5). The simplest legal approach is to allocate a
temporary buffer, copy the source buffer into it, then copy it into
the destination buffer. However, buffer allocation may fail — i.e.
NULL return from <code class="language-plaintext highlighter-rouge">malloc()</code> — introducing a failure case to
<code class="language-plaintext highlighter-rouge">memmove()</code>, which isn’t supposed to fail.</p>

<p>Update July 2016: Alex Elsayed pointed out a solution to the
<code class="language-plaintext highlighter-rouge">memmove()</code> problem in the comments. In short: iterate over the
buffers bytewise (<code class="language-plaintext highlighter-rouge">char *</code>) using equality (<code class="language-plaintext highlighter-rouge">==</code>) tests to check for
an overlap. In theory, a compiler could optimize away the loop and
make it efficient.</p>
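
<p>The equality-only idea can be sketched as follows. This is an illustration of the approach as described above, not the code from the comment, and the name <code class="language-plaintext highlighter-rouge">memmove_by_equality()</code> is hypothetical. It relies on pointer equality being defined even between unrelated objects (ISO/IEC 9899:2011 §6.5.9), so the only question it asks is whether the destination starts somewhere inside the source buffer.</p>

```c
#include <stddef.h>

/* Hypothetical name: a memmove() that never uses <, >, <=, or >= on
 * its pointer arguments. */
void *
memmove_by_equality(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Scan the source for the destination's first byte using only ==.
     * Each s + i stays inside the source object. */
    int dest_inside_src = 0;
    for (size_t i = 0; i < n; i++) {
        if (s + i == d) {
            dest_inside_src = 1;
            break;
        }
    }

    if (dest_inside_src) {
        /* dest begins within src: copy backward so that source bytes
         * are read before they are overwritten. */
        for (size_t i = n; i > 0; i--) {
            d[i - 1] = s[i - 1];
        }
    } else {
        /* Disjoint buffers, or dest begins before src: forward is safe. */
        for (size_t i = 0; i < n; i++) {
            d[i] = s[i];
        }
    }
    return dest;
}
```

<p>The scan makes this linear in the worst case even before any copying starts, which is the cost a compiler would have to optimize away for it to be competitive.</p>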

<p>I keep mentioning Wine because I’ve been careful to ensure my
applications run correctly with it. So far it’s worked <em>perfectly</em>
with both Windows API and Native API functions. Thanks to the hard
work behind the Wine project, despite being written sharply against
the Windows API, these tiny programs remain relatively portable (x86
and ARM). It’s a good fit for graphical applications (games), but I
would <em>never</em> write a command line application like this. The command
line has always been a second class citizen on Windows.</p>

<p>Now that I’ve got these Native API issues sorted out, I’ve
significantly expanded the capabilities of my tiny, freestanding
programs without adding anything to their size. Functions like
<code class="language-plaintext highlighter-rouge">RtlUnicodeToUTF8N()</code> and <code class="language-plaintext highlighter-rouge">RtlUTF8ToUnicodeN()</code> will surely be handy.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Small, Freestanding Windows Executables</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/01/31/"/>
    <id>urn:uuid:8eddc701-52d3-3b0c-a8a8-dd13da6ead2c</id>
    <updated>2016-01-31T22:53:03Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>Update</strong>: This is old and <a href="/blog/2023/02/15/">was <strong>updated in 2023</strong></a>!</p>

<p>Recently I’ve been experimenting with freestanding C programs on
Windows. <em>Freestanding</em> refers to programs that don’t link, either
statically or dynamically, against a standard library (i.e. libc).
This is typical for operating systems and <a href="/blog/2014/12/09/">similar, bare metal
situations</a>. Normally a C compiler can make assumptions about the
semantics of functions provided by the C standard library. For
example, the compiler will likely replace a call to a small,
fixed-size <code class="language-plaintext highlighter-rouge">memmove()</code> with move instructions. Since a freestanding
program would supply its own, it may have different semantics.</p>

<p>My usual go-to for C/C++ on Windows is <a href="http://mingw-w64.org/">Mingw-w64</a>, which has
greatly suited my needs the past couple of years. It’s <a href="https://packages.debian.org/search?keywords=mingw-w64">packaged on
Debian</a>, and, when combined with Wine, allows me to fully develop
Windows applications on Linux. Being GCC, it’s also great for
cross-platform development since it’s essentially the same compiler as
the other platforms. The primary difference is the interface to the
operating system (POSIX vs. Win32).</p>

<p>However, it has one glaring flaw inherited from MinGW: it links
against msvcrt.dll, an ancient version of the Microsoft C runtime
library that currently ships with Windows. Besides being dated and
quirky, <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">it’s not an official part of Windows</a> and never has
been, despite its inclusion with every release since Windows 95.
Mingw-w64 doesn’t have a C library of its own, instead patching over
some of the flaws of msvcrt.dll and linking against it.</p>

<p>Since so much depends on msvcrt.dll despite its unofficial nature,
it’s unlikely Microsoft will ever drop it from future releases of
Windows. However, if strict correctness is a concern, we must ask
Mingw-w64 not to link against it. An alternative would be
<a href="http://plibc.sourceforge.net/">PlibC</a>, though the LGPL licensing is unfortunate. Another is
Cygwin, which is a very complete POSIX environment, but is heavy and
GPL-encumbered.</p>

<p>Sometimes I’d prefer to be more direct: <a href="https://hero.handmade.network/forums/code-discussion/t/94-guide_-_how_to_avoid_c_c++_runtime_on_windows">skip the C standard library
altogether</a> and talk directly to the operating system. On Windows
that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only
links against system DLLs.</p>

<h3 id="linux-vs-windows">Linux vs. Windows</h3>

<p>The most important benefit of a standard library like libc is a
portable, uniform interface to the host system. So long as the
standard library suits its needs, the same program can run anywhere.
Without it, the program needs an implementation of each
host-specific interface.</p>

<p>On Linux, operating system requests at the lowest level are made
directly via system calls. This requires a bit of assembly language
for each supported architecture (<code class="language-plaintext highlighter-rouge">int 0x80</code> on x86, <code class="language-plaintext highlighter-rouge">syscall</code> on
x86-64, <code class="language-plaintext highlighter-rouge">swi</code> on ARM, etc.). The POSIX functions of the various Linux
libc implementations are built on top of this mechanism.</p>

<p>For example, here’s a function for a 1-argument system call on x86-64.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">syscall1</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">result</span><span class="p">;</span>
    <span class="n">__asm__</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
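        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">arg</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>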
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">exit()</code> is implemented on top. Note: A <em>real</em> libc would do
cleanup before exiting, like calling registered <code class="language-plaintext highlighter-rouge">atexit()</code> functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;syscall.h&gt;</span><span class="c1">  // defines SYS_exit</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">code</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">code</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
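
<p>The same pattern extends to system calls with more arguments. The sketch below assumes x86-64 Linux, where the first three arguments go in <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rsi</code>, and <code class="language-plaintext highlighter-rouge">rdx</code>. The names <code class="language-plaintext highlighter-rouge">syscall3()</code> and <code class="language-plaintext highlighter-rouge">write_fd()</code> are hypothetical. The clobber list tells the compiler that the <code class="language-plaintext highlighter-rouge">syscall</code> instruction overwrites <code class="language-plaintext highlighter-rouge">rcx</code> and <code class="language-plaintext highlighter-rouge">r11</code> and that the kernel may read or write memory.</p>

```c
#include <sys/syscall.h>  /* defines SYS_write */

/* Hypothetical helper: a 3-argument system call on x86-64 Linux. */
static long
syscall3(long n, long a, long b, long c)
{
    long result;
    __asm__ volatile (
        "syscall"
        : "=a"(result)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return result;
}

/* Hypothetical helper: write len bytes from buf to a file descriptor.
 * Returns the byte count, or a negative errno value on failure, since
 * the raw system call reports errors in its return value. */
static long
write_fd(int fd, const void *buf, unsigned long len)
{
    return syscall3(SYS_write, fd, (long)buf, (long)len);
}
```

<p>Unlike the libc <code class="language-plaintext highlighter-rouge">write()</code>, nothing here sets <code class="language-plaintext highlighter-rouge">errno</code>. A caller has to test for a negative result itself.</p>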

<p>The situation is simpler on Windows. Its low level system calls are
undocumented and unstable, changing across even minor updates. The
formal, stable interface is through the exported functions in
kernel32.dll. In fact, kernel32.dll is essentially a standard library
on its own (making the term “freestanding” in this case dubious). It
includes functions usually found only in user-space, like string
manipulation, formatted output, font handling, and heap management
(similar to <code class="language-plaintext highlighter-rouge">malloc()</code>). It’s not POSIX, but it has analogs to much of
the same functionality.</p>

<h3 id="program-entry">Program Entry</h3>

<p>The standard entry for a C program is <code class="language-plaintext highlighter-rouge">main()</code>. However, this is not
the application’s <em>true</em> entry. The entry is in the C library, which
does some initialization before calling your <code class="language-plaintext highlighter-rouge">main()</code>. When <code class="language-plaintext highlighter-rouge">main()</code>
returns, it performs cleanup and exits. Without a C library, programs
don’t start at <code class="language-plaintext highlighter-rouge">main()</code>.</p>

<p>On Linux the default entry is the symbol <code class="language-plaintext highlighter-rouge">_start</code>. Its prototype
would look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Returning from this function leads to a segmentation fault, so it’s up
to your application to perform the exit system call rather than
return.</p>

<p>On Windows, the entry depends on the type of application. The two
relevant subsystems today are the <em>console</em> and <em>windows</em> subsystems.
The former is for console applications (duh). These programs may still
create windows and such, but must always have a controlling console.
The latter is primarily for programs that don’t run in a console,
though they can still create an associated console if they like. In
Mingw-w64, give <code class="language-plaintext highlighter-rouge">-mconsole</code> (default) or <code class="language-plaintext highlighter-rouge">-mwindows</code> to the linker to
choose the subsystem.</p>

<p>The default <a href="https://msdn.microsoft.com/en-us/library/f9t8842e.aspx">entry for each is slightly different</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike Linux’s <code class="language-plaintext highlighter-rouge">_start</code>, Windows programs can safely return from these
functions, similar to <code class="language-plaintext highlighter-rouge">main()</code>, hence the <code class="language-plaintext highlighter-rouge">int</code> return. The <code class="language-plaintext highlighter-rouge">WINAPI</code>
macro means the function may have a special calling convention,
depending on the platform.</p>

<p>On any system, you can choose a different entry symbol or address
using the <code class="language-plaintext highlighter-rouge">--entry</code> option to the GNU linker.</p>

<h3 id="disabling-libgcc">Disabling libgcc</h3>

<p>One problem I’ve run into is Mingw-w64 generating code that calls
<code class="language-plaintext highlighter-rouge">__chkstk_ms()</code> from libgcc. I believe this is a long-standing bug,
since <code class="language-plaintext highlighter-rouge">-ffreestanding</code> should prevent these sorts of helper functions
from being used. The workaround I’ve found is to disable <a href="https://metricpanda.com/rival-fortress-update-45-dealing-with-__chkstk-__chkstk_ms-when-cross-compiling-for-windows/">the stack
probe</a> and pre-commit the whole stack.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000
</code></pre></div></div>

<p>Alternatively you could link against libgcc (statically) with <code class="language-plaintext highlighter-rouge">-lgcc</code>,
but, again, I’m going for a tiny executable.</p>

<h3 id="a-freestanding-example">A freestanding example</h3>

<p>Here’s an example of a Windows “Hello, World” that doesn’t use a C
library.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="n">WINAPI</span>
<span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">msg</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="n">DWORD</span><span class="p">[]){</span><span class="mi">0</span><span class="p">},</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To build it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
    -o example.exe example.c \
    -lkernel32
</code></pre></div></div>

<p>Notice I manually linked against kernel32.dll. The stripped final
result is only 4kB, mostly PE padding. There are <a href="http://www.phreedom.org/research/tinype/">techniques to trim
this down even further</a>, but for a substantial program it
wouldn’t make a significant difference.</p>

<p>From here you could create a GUI by linking against <code class="language-plaintext highlighter-rouge">user32.dll</code> and
<code class="language-plaintext highlighter-rouge">gdi32.dll</code> (both also part of Win32) and calling the appropriate
functions. I already <a href="/blog/2015/06/06/">ported my OpenGL demo</a> to a freestanding
.exe, dropping GLFW and directly using Win32 and WGL. It’s much less
portable, but the final .exe is only 4kB, down from the original 104kB
(static linking against GLFW).</p>

<p>I may go this route for <a href="http://7drl.org/2016/01/13/7drl-2016-announced-for-5-13-march/">the upcoming 7DRL 2016</a> in March.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Counting Processor Cores in Emacs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/10/14/"/>
    <id>urn:uuid:dbfba1a0-b3af-356d-4d01-96917d622906</id>
    <updated>2015-10-14T03:17:16Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>One of the great advantages of dependency analysis is parallelization.
Modern processors reorder instructions whose results don’t affect each
other. Compilers reorder expressions and statements to improve
throughput. Build systems know which outputs are inputs for other
targets and can choose any arbitrary build order within that
constraint. This article involves the last case.</p>

<p>The build system I use most often is GNU Make, either directly or
indirectly (Autoconf, CMake). It’s far from perfect, but it does what
I need. I almost always invoke it from within Emacs rather than in a
terminal. In fact, I do it so often that I’ve wrapped Emacs’ <code class="language-plaintext highlighter-rouge">compile</code>
command for rapid invocation.</p>

<p>I recently helped a co-worker set this up for himself, so it had
me thinking about the problem again. The situation <a href="https://github.com/skeeto/.emacs.d">in my
config</a> is much more complicated than it needs to be, so I’ll
share a simplified version instead.</p>

<p>First bring in the usual goodies (we’re going to be making closures):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;;; -*- lexical-binding: t; -*-</span>
<span class="p">(</span><span class="nb">require</span> <span class="ss">'cl-lib</span><span class="p">)</span>
</code></pre></div></div>

<p>We need a couple of configuration variables.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defvar</span> <span class="nv">quick-compile-command</span> <span class="s">"make -k "</span><span class="p">)</span>
<span class="p">(</span><span class="nb">defvar</span> <span class="nv">quick-compile-build-file</span> <span class="s">"Makefile"</span><span class="p">)</span>
</code></pre></div></div>

<p>Then a couple of interactive functions to set these on the fly. It’s
not strictly necessary, but I like giving each a key binding. I also
like having a history available via <code class="language-plaintext highlighter-rouge">read-string</code>, so I can switch
between a couple of different options with ease.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">quick-compile-set-command</span> <span class="p">(</span><span class="nv">command</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">interactive</span>
   <span class="p">(</span><span class="nb">list</span> <span class="p">(</span><span class="nv">read-string</span> <span class="s">"Command: "</span> <span class="nv">quick-compile-command</span><span class="p">)))</span>
  <span class="p">(</span><span class="nb">setf</span> <span class="nv">quick-compile-command</span> <span class="nv">command</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">quick-compile-set-build-file</span> <span class="p">(</span><span class="nv">build-file</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">interactive</span>
   <span class="p">(</span><span class="nb">list</span> <span class="p">(</span><span class="nv">read-string</span> <span class="s">"Build file: "</span> <span class="nv">quick-compile-build-file</span><span class="p">)))</span>
  <span class="p">(</span><span class="nb">setf</span> <span class="nv">quick-compile-build-file</span> <span class="nv">build-file</span><span class="p">))</span>
</code></pre></div></div>

<p>Now finally to the good part. Below, <code class="language-plaintext highlighter-rouge">quick-compile</code> is a
non-interactive function that returns an interactive closure ready to
be bound to any key I desire. It takes an optional target. This means
I don’t use the above <code class="language-plaintext highlighter-rouge">quick-compile-set-command</code> to choose a target,
only for setting other options. That will make more sense in a moment.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">cl-defun</span> <span class="nv">quick-compile</span> <span class="p">(</span><span class="k">&amp;optional</span> <span class="p">(</span><span class="nv">target</span> <span class="s">""</span><span class="p">))</span>
  <span class="s">"Return an interactive function that runs `compile' for TARGET."</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
    <span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">save-buffer</span><span class="p">)</span>  <span class="c1">; so I don't get asked</span>
    <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">default-directory</span>
            <span class="p">(</span><span class="nv">locate-dominating-file</span>
             <span class="nv">default-directory</span> <span class="nv">quick-compile-build-file</span><span class="p">)))</span>
      <span class="p">(</span><span class="k">if</span> <span class="nv">default-directory</span>
          <span class="p">(</span><span class="nb">compile</span> <span class="p">(</span><span class="nv">concat</span> <span class="nv">quick-compile-command</span> <span class="s">" "</span> <span class="nv">target</span><span class="p">))</span>
        <span class="p">(</span><span class="nb">error</span> <span class="s">"Cannot find %s"</span> <span class="nv">quick-compile-build-file</span><span class="p">)))))</span>
</code></pre></div></div>

<p>It traverses up (down?) the directory hierarchy towards root looking
for a Makefile — or whatever is set for <code class="language-plaintext highlighter-rouge">quick-compile-build-file</code>
— then invokes the build system there. I <a href="http://aegis.sourceforge.net/auug97.pdf">don’t believe in recursive
<code class="language-plaintext highlighter-rouge">make</code></a>.</p>

<p>So how do I put this to use? I clobber some key bindings I don’t
otherwise care about. A better choice might be the F-keys, but my
muscle memory is already committed elsewhere.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">global-set-key</span> <span class="p">(</span><span class="nv">kbd</span> <span class="s">"C-x c"</span><span class="p">)</span> <span class="p">(</span><span class="nv">quick-compile</span><span class="p">))</span> <span class="c1">; default target</span>
<span class="p">(</span><span class="nv">global-set-key</span> <span class="p">(</span><span class="nv">kbd</span> <span class="s">"C-x C"</span><span class="p">)</span> <span class="p">(</span><span class="nv">quick-compile</span> <span class="s">"clean"</span><span class="p">))</span>
<span class="p">(</span><span class="nv">global-set-key</span> <span class="p">(</span><span class="nv">kbd</span> <span class="s">"C-x t"</span><span class="p">)</span> <span class="p">(</span><span class="nv">quick-compile</span> <span class="s">"test"</span><span class="p">))</span>
<span class="p">(</span><span class="nv">global-set-key</span> <span class="p">(</span><span class="nv">kbd</span> <span class="s">"C-x r"</span><span class="p">)</span> <span class="p">(</span><span class="nv">quick-compile</span> <span class="s">"run"</span><span class="p">))</span>
</code></pre></div></div>

<p>Each of those invokes a different target without second guessing me.
Let me tell you, having “clean” at the tip of my fingers is wonderful.</p>

<h3 id="parallel-builds">Parallel Builds</h3>

<p>An extension common to many different <code class="language-plaintext highlighter-rouge">make</code> programs is <code class="language-plaintext highlighter-rouge">-j</code>, which
asks <code class="language-plaintext highlighter-rouge">make</code> to build targets in parallel where possible. These days,
when multi-core machines are the norm, you nearly always want to use
this option, ideally set to the number of logical processor cores on
your system. It’s a huge time-saver.</p>

<p>My recent revelation was that my default build command could be
better: <code class="language-plaintext highlighter-rouge">make -k</code> is minimal. It should at least include <code class="language-plaintext highlighter-rouge">-j</code>, but
choosing an argument (number of processor cores) is a problem. Today I
use different machines with 2, 4, or 8 cores, so most of the time any
given number will be wrong. I could use a per-system configuration,
but I’d rather not. Unfortunately GNU Make will not automatically
detect the number of cores. That leaves the matter up to Emacs Lisp.</p>

<p>Emacs doesn’t currently have a built-in function that returns the
number of processor cores. I’ll need to reach into the operating
system to figure it out. My usual development environments are Linux,
Windows, and OpenBSD, so my solution should work on each. I’ve ranked
them by order of importance.</p>

<h4 id="number-of-cores-on-linux">Number of cores on Linux</h4>

<p>Linux has the <code class="language-plaintext highlighter-rouge">/proc</code> virtual filesystem in the fashion of Plan 9,
allowing different aspects of the system to be explored through the
standard filesystem API. The relevant file here is <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code>,
listing useful information about each of the system’s processors. To
get the number of processors, count the number of processor entries in
this file. I’ve wrapped it in a <code class="language-plaintext highlighter-rouge">file-exists-p</code> check so that it returns
<code class="language-plaintext highlighter-rouge">nil</code> on other operating systems instead of throwing an error.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nv">file-exists-p</span> <span class="s">"/proc/cpuinfo"</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">with-temp-buffer</span>
    <span class="p">(</span><span class="nv">insert-file-contents</span> <span class="s">"/proc/cpuinfo"</span><span class="p">)</span>
    <span class="p">(</span><span class="nv">how-many</span> <span class="s">"^processor[[:space:]]+:"</span><span class="p">)))</span>
</code></pre></div></div>
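
<p>For comparison outside Emacs, the same counting rule could be sketched in
C. This is only an illustration under the assumption that the input is
cpuinfo-style text. The function name <code class="language-plaintext highlighter-rouge">count_processor_lines</code> is
hypothetical, and it hand-matches the pattern used above rather than
calling a regex library.</p>

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Count lines matching ^processor[[:space:]]+: in a cpuinfo-style stream. */
int
count_processor_lines(FILE *f)
{
    char line[256];
    int count = 0;
    int at_start = 1;  /* is the next chunk the beginning of a line? */
    while (fgets(line, sizeof(line), f)) {
        int is_start = at_start;
        size_t len = strlen(line);
        at_start = len > 0 && line[len - 1] == '\n';
        if (!is_start || strncmp(line, "processor", 9) != 0)
            continue;  /* continuation of a long line, or no match */
        const char *p = line + 9;
        if (!isspace((unsigned char)*p))
            continue;  /* the pattern needs at least one space */
        while (isspace((unsigned char)*p))
            p++;
        count += *p == ':';
    }
    return count;
}
```

<p>On Linux it would be called with a stream opened on <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code>.</p>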

<h4 id="number-of-cores-on-windows">Number of cores on Windows</h4>

<p>When I was first researching how to do this on Windows, I thought I
would need to invoke the <code class="language-plaintext highlighter-rouge">wmic</code> command line program and hope the
output could be parsed the same way on different versions of the
operating system and tool. However, it turns out the solution for
Windows is trivial. The environment variable <code class="language-plaintext highlighter-rouge">NUMBER_OF_PROCESSORS</code>
gives every process the answer for free. Being an environment
variable, it will need to be parsed.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">number-of-processors</span> <span class="p">(</span><span class="nv">getenv</span> <span class="s">"NUMBER_OF_PROCESSORS"</span><span class="p">)))</span>
  <span class="p">(</span><span class="nb">when</span> <span class="nv">number-of-processors</span>
    <span class="p">(</span><span class="nv">string-to-number</span> <span class="nv">number-of-processors</span><span class="p">)))</span>
</code></pre></div></div>

<h4 id="number-of-cores-on-bsd">Number of cores on BSD</h4>

<p>This seems to work the same across all the BSDs, including OS X,
though I haven’t yet tested it exhaustively. Invoke <code class="language-plaintext highlighter-rouge">sysctl</code>, which
returns an undecorated number to be parsed.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">with-temp-buffer</span>
  <span class="p">(</span><span class="nb">ignore-errors</span>
    <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">zerop</span> <span class="p">(</span><span class="nv">call-process</span> <span class="s">"sysctl"</span> <span class="no">nil</span> <span class="no">t</span> <span class="no">nil</span> <span class="s">"-n"</span> <span class="s">"hw.ncpu"</span><span class="p">))</span>
      <span class="p">(</span><span class="nv">string-to-number</span> <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)))))</span>
</code></pre></div></div>

<p>Also not complicated, but it’s the heaviest solution of the three.</p>

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>Join all these together with <code class="language-plaintext highlighter-rouge">or</code>, call it <code class="language-plaintext highlighter-rouge">numcores</code>, and ta-da.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">setf</span> <span class="nv">quick-compile-command</span> <span class="p">(</span><span class="nb">format</span> <span class="s">"make -kj%d"</span> <span class="p">(</span><span class="nv">numcores</span><span class="p">)))</span>
</code></pre></div></div>

<p>Now <code class="language-plaintext highlighter-rouge">make</code> is invoked correctly on any system by default.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Recovering Live Data with GDB</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/09/15/"/>
    <id>urn:uuid:5fa83dc1-2d5c-3313-b2b9-f4fb73ef5d9e</id>
    <updated>2015-09-15T14:53:44Z</updated>
    <category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>I recently ran into a problem where <a href="https://github.com/skeeto/reddit-related">long-running program</a>
output was trapped in a C <code class="language-plaintext highlighter-rouge">FILE</code> buffer. The program had been running
for two days straight printing its results, but the last few kilobytes
of output were missing. It wouldn’t output these last bytes until the
program completed its day-long (or worse!) cleanup operation and
exited. This is easy to fix — and, honestly, the cleanup step was
unnecessary anyway — but I didn’t want to start all over and wait
two more days to recompute the result.</p>

<p>Here’s a minimal example of the situation. The first loop represents
the long-running computation and the infinite loop represents a
cleanup job that will never complete.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="cm">/* Compute output. */</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%d/%d "</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span> <span class="o">*</span> <span class="n">i</span><span class="p">);</span>
    <span class="n">putchar</span><span class="p">(</span><span class="sc">'\n'</span><span class="p">);</span>

    <span class="cm">/* "Slow" cleanup operation ... */</span>
    <span class="k">for</span> <span class="p">(;;)</span>
        <span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="buffered-output-review">Buffered Output Review</h3>

<p>Both <code class="language-plaintext highlighter-rouge">printf</code> and <code class="language-plaintext highlighter-rouge">putchar</code> are C library functions and are usually
buffered in some way. That is, each call to these functions doesn’t
necessarily send data out of the program. This is in contrast to the
POSIX functions <code class="language-plaintext highlighter-rouge">read</code> and <code class="language-plaintext highlighter-rouge">write</code>, which are unbuffered system calls.
Since system calls are relatively expensive, buffered input and output
is used to change a large number of system calls on small buffers into
a single system call on a single large buffer.</p>
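
<p>The idea can be sketched with a tiny hand-rolled output buffer on top of
<code class="language-plaintext highlighter-rouge">write</code>. This is not how any particular C library implements <code class="language-plaintext highlighter-rouge">FILE</code>.
The names <code class="language-plaintext highlighter-rouge">struct bufout</code>, <code class="language-plaintext highlighter-rouge">bufout_put</code>, and <code class="language-plaintext highlighter-rouge">bufout_flush</code> are
hypothetical, and the 4kB size is just an assumption. The counter makes
it visible how many system calls were actually issued.</p>

```c
#include <string.h>
#include <unistd.h>

struct bufout {
    int fd;          /* destination file descriptor */
    size_t len;      /* bytes currently held in buf */
    long nsyscalls;  /* number of write() calls issued so far */
    char buf[4096];
};

/* Send all buffered bytes to the descriptor. Returns 0 on success. */
int
bufout_flush(struct bufout *b)
{
    size_t off = 0;
    while (off < b->len) {
        ssize_t r = write(b->fd, b->buf + off, b->len - off);
        b->nsyscalls++;
        if (r <= 0)
            return -1;  /* treat errors and zero-length writes as failure */
        off += (size_t)r;
    }
    b->len = 0;
    return 0;
}

/* Append n bytes, flushing only when the buffer is full. */
int
bufout_put(struct bufout *b, const void *data, size_t n)
{
    const char *p = data;
    while (n > 0) {
        size_t avail = sizeof(b->buf) - b->len;
        if (avail == 0) {
            if (bufout_flush(b) != 0)
                return -1;
            avail = sizeof(b->buf);
        }
        size_t take = n < avail ? n : avail;
        memcpy(b->buf + b->len, p, take);
        b->len += take;
        p += take;
        n -= take;
    }
    return 0;
}
```

<p>A hundred 10-byte calls to <code class="language-plaintext highlighter-rouge">bufout_put</code> fit in the buffer and issue no
system calls at all until <code class="language-plaintext highlighter-rouge">bufout_flush</code> is called.</p>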

<p>Typically, stdout is <em>line-buffered</em> if connected to a terminal. When
the program completes a line of output, the user probably wants to see
it immediately. So, if you compile the example program and run it at
your terminal you will probably see the output before the program
hangs on the infinite loop.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -std=c99 example.c
$ ./a.out
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81
</code></pre></div></div>

<p>However, when stdout is connected to a file or pipe, it’s generally
buffered to something like 4kB. For this program, the output will
remain empty no matter how long you wait. It’s trapped in a <code class="language-plaintext highlighter-rouge">FILE</code>
buffer in process memory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./a.out &gt; output.txt
</code></pre></div></div>

<p>The primary way to fix this is to use the <code class="language-plaintext highlighter-rouge">fflush</code> function, to force
the buffer empty before starting a long, non-output operation.
Unfortunately for me I didn’t think of this two days earlier.</p>
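
<p>Here is a small sketch of that effect using a temporary file in place of
<code class="language-plaintext highlighter-rouge">output.txt</code>. The helper names <code class="language-plaintext highlighter-rouge">file_size_now</code> and <code class="language-plaintext highlighter-rouge">fflush_demo</code> are
hypothetical, and it assumes a POSIX system for <code class="language-plaintext highlighter-rouge">fileno</code> and <code class="language-plaintext highlighter-rouge">fstat</code>. It
reports how many bytes have reached the file before and after the flush.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Bytes that have actually reached the file behind f, or -1 on error. */
long
file_size_now(FILE *f)
{
    struct stat st;
    if (fstat(fileno(f), &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Write a short line through a fully-buffered stream and report the
 * file size before and after fflush(). Returns 0 on success. */
int
fflush_demo(long *before, long *after)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;
    setvbuf(f, NULL, _IOFBF, 4096);  /* like stdout redirected to a file */
    fputs("0/0 1/1 2/4\n", f);
    *before = file_size_now(f);      /* output is still in the FILE buffer */
    fflush(f);
    *after = file_size_now(f);       /* now it has been written out */
    fclose(f);
    return 0;
}
```

<p>In the real program the equivalent fix is a single <code class="language-plaintext highlighter-rouge">fflush(stdout)</code>
placed before the cleanup loop.</p>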

<h3 id="debugger-to-the-rescue">Debugger to the Rescue</h3>

<p>Fortunately there <em>is</em> a way to interrupt a running program and
manipulate its state: a debugger. First, find the process ID of the
running program (the one writing to <code class="language-plaintext highlighter-rouge">output.txt</code> above).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pgrep a.out
12934
</code></pre></div></div>

<p>Now attach GDB, which will pause the program’s execution.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gdb ./a.out
Reading symbols from ./a.out...(no debugging symbols found)...done.
gdb&gt; attach 12934
Attaching to program: /tmp/a.out, process 12934
... snip ...
0x0000000000400598 in main ()
gdb&gt;
</code></pre></div></div>

<p>From here I could examine the stdout <code class="language-plaintext highlighter-rouge">FILE</code> struct and try to extract
the buffer contents by hand. However, the easiest thing to do is
perform the call I forgot in the first place: <code class="language-plaintext highlighter-rouge">fflush(stdout)</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb&gt; call fflush(stdout)
$1 = 0
gdb&gt; quit
Detaching from program: /tmp/a.out, process 12934
</code></pre></div></div>

<p>The program is still running, but the output has been recovered.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat output.txt
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81
</code></pre></div></div>

<h3 id="why-cleanup">Why Cleanup?</h3>

<p>As I said, in my case the cleanup operation was entirely unnecessary,
so it would be safe to just kill the program at this point. It was
taking a really long time to tear down a humongous data structure (on
the order of 50GB) one little node at a time with <code class="language-plaintext highlighter-rouge">free</code>. Obviously,
the memory would be freed much more quickly by the OS when the program
exited.</p>

<p>Freeing memory in the program was only to satisfy <a href="http://valgrind.org/">Valgrind</a>,
since it’s so incredibly useful for debugging. Not freeing the data
structure would hide actual memory leaks in Valgrind’s final report.
For the real “production” run, I should have disabled cleanup.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Mandelbrot Set with SIMD Intrinsics</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/07/10/"/>
    <id>urn:uuid:daec6c9d-346a-3e22-a7b9-486e713c5e5d</id>
    <updated>2015-07-10T19:46:45Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>When I started this blog 8 years ago, <a href="/blog/2007/09/02/">my first post</a> was about
the Mandelbrot set. Since then, both technology and my own skills have
improved (or so I like to believe!), so I’m going to take another look
at it, this time using three different <em>Single Instruction, Multiple
Data</em> (SIMD) instruction sets: SSE2, AVX, and NEON. The latter two
didn’t exist when the last article was published. In this article I
demonstrate SIMD bringing a 5.8x speedup to a fractal renderer.</p>

<p>If you want to take a look at my code before reading further:</p>

<ul>
  <li><a href="https://github.com/skeeto/mandel-simd">https://github.com/skeeto/mandel-simd</a></li>
</ul>

<p><img src="/img/fractal/mandel-plain-small.png" alt="" /></p>

<p>Having multiple CPU cores allows different instructions to operate
on (usually) different data independently. In contrast, under SIMD a
specific operation (single instruction) acts upon several values
(multiple data) at once. It’s another form of parallelization. For
example, with image processing — perhaps the most common use case —
this means multiple pixels could be computed within the same number of
cycles it would normally take to compute just one. SIMD is generally
implemented on CPUs through wide registers: 64, 128, 256, and even 512
bits wide. Values are packed into the register like an array and are
operated on independently, lane by lane. Integer instructions often come
in both wrapping and <em>saturation arithmetic</em> (clamped, non-wrapping)
variants.</p>
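
<p>To make the clamped versus wrapping distinction concrete, here is a
scalar sketch of what a saturating 8-bit add does to each lane. The
function names are hypothetical, and real SIMD hardware would process all
lanes with one instruction instead of a loop.</p>

```c
#include <stdint.h>

/* Wrapping add: the result is reduced modulo 256. */
uint8_t
add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Saturating add: the result is clamped to 255. */
uint8_t
add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}

/* Apply the saturating add to each of n lanes independently. */
void
add_sat_lanes(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = add_sat_u8(a[i], b[i]);
}
```

<p>For pixel data the clamped result is usually what you want: an
over-bright value stays white rather than wrapping around to something
dark.</p>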

<p>Rather than hand-code all this in assembly, I’m using yet another
technique I picked up from the always-educational <a href="https://handmadehero.org/">Handmade Hero</a>:
compiler intrinsics. The code is all C, but in place of C’s operators
are pseudo-function calls operating on special SIMD types. These
aren’t actual function calls, they’re intrinsics. The compiler will
emit a specific assembly instruction for each intrinsic, sort of like
an inline function. This is more flexible for mixing with other C
code, the compiler will manage all the registers, and the compiler
will attempt to re-order and interleave instructions to maximize
throughput. It’s a big win!</p>

<h3 id="some-simd-history">Some SIMD History</h3>

<p>The first widely available consumer SIMD hardware was probably the MMX
instruction set, introduced to 32-bit x86 in 1997. This provided eight
64-bit registers, <code class="language-plaintext highlighter-rouge">mm0</code> - <code class="language-plaintext highlighter-rouge">mm7</code>, aliasing the older x87 floating
point registers and holding packed integer values. This was
extended by AMD with its 3DNow! instruction set, adding floating point
instructions.</p>

<p>However, you don’t need to worry about any of that because these both
were superseded by <em>Streaming SIMD Extensions</em> (SSE) in 1999. SSE has
128-bit registers — confusingly named <code class="language-plaintext highlighter-rouge">xmm0</code> - <code class="language-plaintext highlighter-rouge">xmm7</code> — and a much
richer instruction set. SSE has been extended with SSE2 (2001), SSE3
(2004), SSSE3 (2006), SSE4.1 (2007), and SSE4.2 (2008). x86-64 doesn’t
have SSE2 as an extension but instead as a core component of the
architecture (adding <code class="language-plaintext highlighter-rouge">xmm8</code>- <code class="language-plaintext highlighter-rouge">xmm15</code>), baking it into its ABI.</p>

<p>In 2009, ARM introduced the NEON instruction set as part of ARMv7.
Like SSE, it has 128-bit registers, but its instruction set is more
consistent and uniform. One of its most visible features over SSE is a
<em>stride</em> load parameter making it flexible for a wider variety data
arrangements. NEON is available on your <a href="https://www.raspberrypi.org/">Raspberry Pi</a>, which is
why I’m using it here.</p>

<p>In 2011, Intel and AMD introduced the <em>Advanced Vector Extensions</em>
(AVX) instruction set. Essentially it’s SSE with 256-bit registers,
named <code class="language-plaintext highlighter-rouge">ymm0</code> - <code class="language-plaintext highlighter-rouge">ymm15</code>. That means operating on 8 single-precision
floats at once! As of this writing, this extension is just starting
to become commonplace on desktops and laptops. It also has extensions:
AVX2 (2013) and AVX-512 (2015).</p>

<h3 id="starting-with-c">Starting with C</h3>

<p>Moving on to the code, in <code class="language-plaintext highlighter-rouge">mandel.c</code> you’ll find <code class="language-plaintext highlighter-rouge">mandel_basic</code>, a
straight C implementation that produces a monochrome image. Normally I
would post the code here within the article, but it’s 30 lines long
and most of it isn’t of any particular interest.</p>

<p>I didn’t use C99’s complex number support because, continuing to
follow the approach of Handmade Hero, I intended to port this code
directly into SIMD intrinsics. It’s much easier to work from a
straight non-SIMD implementation towards one with compiler intrinsics
than coding with compiler intrinsics right away. In fact, I’d say it’s
<em>almost</em> trivial, since I got it right on the first attempt for all three.</p>

<p>There’s just one unusual part:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">s</span><span class="o">-&gt;</span><span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
   <span class="cm">/* ... */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is an Open Multi-Processing (OpenMP) pragma. It’s a higher-level
threading API than POSIX or Win32 threads. OpenMP takes care of all
thread creation, work scheduling, and cleanup. In this case, the <code class="language-plaintext highlighter-rouge">for</code>
loop is parallelized such that each row of the image will be scheduled
individually to a thread, with one thread spawned for each CPU core.
This one line saves all the trouble of managing a work queue and such.
I also use it in my SIMD implementations, composing both forms of
parallelization for maximum performance.</p>
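<p>As a minimal sketch of that row-per-task pattern (not the code from
<code class="language-plaintext highlighter-rouge">mandel.c</code>), here is a self-contained loop with the same pragma. The
<code class="language-plaintext highlighter-rouge">render_row</code> and <code class="language-plaintext highlighter-rouge">render</code> names are hypothetical stand-ins, and the
row contents are a placeholder pattern rather than the Mandelbrot set.</p>

```c
#include <stddef.h>

/* Hypothetical stand-in for the real per-row rendering work. */
static void render_row(unsigned char *row, int width, int y)
{
    for (int x = 0; x < width; x++)
        row[x] = (unsigned char)((x ^ y) & 0xff);
}

/* Fill a width*height grayscale buffer, scheduling one row at a time.
 * Built without OpenMP support, the pragma is ignored and the loop
 * simply runs serially with the same result. */
void render(unsigned char *buf, int width, int height)
{
    #pragma omp parallel for schedule(dynamic, 1)
    for (int y = 0; y < height; y++) {
        /* Each iteration writes only its own row, so no locking is needed. */
        render_row(buf + (size_t)y * width, width, y);
    }
}
```

<p>With gcc or clang, OpenMP is enabled with <code class="language-plaintext highlighter-rouge">-fopenmp</code>. Dynamic
scheduling with a chunk size of 1 suits this workload because rows
crossing the set take far longer than rows that escape quickly.</p>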

<p>I did it in single precision because I really want to exploit SIMD.
Obviously, being half as wide as double precision, twice as many
single precision operands can fit in a SIMD register.</p>

<p>On my wife’s i7-4770 (8 logical cores), <strong>it takes 29.9ms to render
one image</strong> using the defaults (1440x1080, real{-2.5, 1.5}, imag{-1.5,
1.5}, 256 iterations). I’ll use the same machine for both the SSE2 and
AVX benchmarks.</p>

<h4 id="sse2-mandelbrot-set">SSE2 Mandelbrot Set</h4>

<p>The first translation I did was SSE2 (<code class="language-plaintext highlighter-rouge">mandel_sse2.c</code>). As with just
about any optimization, it’s more complex and harder to read than the
straight version. Again, I won’t post the code here, especially when
this one has doubled to 60 lines long.</p>

<p>Porting to SSE2 (and SIMD in general) is simply a matter of converting
all assignments and arithmetic operators to their equivalent
intrinsics. The <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel Intrinsics Guide</a> is a godsend for this
step. It’s easy to search for specific operations and it tells you
what headers they come from. Notice that there are no C arithmetic
operators until the very end, after the results have been extracted
from SSE and pixels are being written.</p>

<p>There are two new types present in this version, <code class="language-plaintext highlighter-rouge">__m128</code> and
<code class="language-plaintext highlighter-rouge">__m128i</code>. These will be mapped to SSE registers by the compiler, sort
of like the old (outdated) C <code class="language-plaintext highlighter-rouge">register</code> keyword. One big difference is
that it’s legal to take the address of these values with <code class="language-plaintext highlighter-rouge">&amp;</code>, and the
compiler will worry about the store/load operations. The first type is
for floating point values and the second is for integer values. At
first it’s annoying for these to be separate types (the CPU doesn’t
care), but it becomes a set of compiler-checked rails for avoiding
mistakes.</p>

<p>Here’s how assignment was written in the straight C version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">iter_scale</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">/</span> <span class="n">s</span><span class="o">-&gt;</span><span class="n">iterations</span><span class="p">;</span>
</code></pre></div></div>

<p>And here’s the SSE version. SSE intrinsics are prefixed with <code class="language-plaintext highlighter-rouge">_mm</code>,
and the “ps” stands for “packed single-precision.”</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__m128</span> <span class="n">iter_scale</span> <span class="o">=</span> <span class="n">_mm_set_ps1</span><span class="p">(</span><span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span> <span class="o">/</span> <span class="n">s</span><span class="o">-&gt;</span><span class="n">iterations</span><span class="p">);</span>
</code></pre></div></div>

<p>This sets all four <em>lanes</em> of the register to the same value (a
<em>broadcast</em>). Lanes can also be assigned individually, such as at the
beginning of the innermost loop.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__m128</span> <span class="n">mx</span> <span class="o">=</span> <span class="n">_mm_set_ps</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
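<p>The argument order is easy to get backwards: <code class="language-plaintext highlighter-rouge">_mm_set_ps</code> takes the
highest lane first. A small sketch that makes the order visible by
storing the register back to memory (the <code class="language-plaintext highlighter-rouge">lanes_from</code> helper is
hypothetical, not part of the original program):</p>

```c
#include <xmmintrin.h>

/* Write x+0 .. x+3 into out[0] .. out[3] by way of one SSE register. */
void lanes_from(float x, float out[4])
{
    /* Arguments run from the highest lane down to the lowest, so the
     * last argument lands in lane 0. */
    __m128 mx = _mm_set_ps(x + 3, x + 2, x + 1, x + 0);

    /* An unaligned store writes lane 0 to out[0], lane 1 to out[1], ... */
    _mm_storeu_ps(out, mx);
}
```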

<p>This next part shows why the SSE2 version is longer. Here’s the
straight C version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">zr1</span> <span class="o">=</span> <span class="n">zr</span> <span class="o">*</span> <span class="n">zr</span> <span class="o">-</span> <span class="n">zi</span> <span class="o">*</span> <span class="n">zi</span> <span class="o">+</span> <span class="n">cr</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">zi1</span> <span class="o">=</span> <span class="n">zr</span> <span class="o">*</span> <span class="n">zi</span> <span class="o">+</span> <span class="n">zr</span> <span class="o">*</span> <span class="n">zi</span> <span class="o">+</span> <span class="n">ci</span><span class="p">;</span>
<span class="n">zr</span> <span class="o">=</span> <span class="n">zr1</span><span class="p">;</span>
<span class="n">zi</span> <span class="o">=</span> <span class="n">zi1</span><span class="p">;</span>
</code></pre></div></div>

<p>To make it easier to read in the absence of operator syntax, I broke
out the intermediate values. Here’s the same operation across four
different complex values simultaneously. The purpose of these
intrinsics should be easy to guess from their names.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__m128</span> <span class="n">zr2</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">zr</span><span class="p">,</span> <span class="n">zr</span><span class="p">);</span>
<span class="n">__m128</span> <span class="n">zi2</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">zi</span><span class="p">,</span> <span class="n">zi</span><span class="p">);</span>
<span class="n">__m128</span> <span class="n">zrzi</span> <span class="o">=</span> <span class="n">_mm_mul_ps</span><span class="p">(</span><span class="n">zr</span><span class="p">,</span> <span class="n">zi</span><span class="p">);</span>
<span class="n">zr</span> <span class="o">=</span> <span class="n">_mm_add_ps</span><span class="p">(</span><span class="n">_mm_sub_ps</span><span class="p">(</span><span class="n">zr2</span><span class="p">,</span> <span class="n">zi2</span><span class="p">),</span> <span class="n">cr</span><span class="p">);</span>
<span class="n">zi</span> <span class="o">=</span> <span class="n">_mm_add_ps</span><span class="p">(</span><span class="n">_mm_add_ps</span><span class="p">(</span><span class="n">zrzi</span><span class="p">,</span> <span class="n">zrzi</span><span class="p">),</span> <span class="n">ci</span><span class="p">);</span>
</code></pre></div></div>
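<p>To try that step in isolation, here is a self-contained sketch that
wraps the same five intrinsics in a function operating on plain arrays.
The <code class="language-plaintext highlighter-rouge">step_sse</code> name and the load/store wrapping are my own assumptions
for illustration; the real code keeps everything in registers across
iterations.</p>

```c
#include <xmmintrin.h>

/* Apply one iteration of z = z*z + c to four complex points at once.
 * Each array holds one component of the four points, one per lane. */
void step_sse(float zr[4], float zi[4], const float cr[4], const float ci[4])
{
    __m128 vzr = _mm_loadu_ps(zr);
    __m128 vzi = _mm_loadu_ps(zi);
    __m128 vcr = _mm_loadu_ps(cr);
    __m128 vci = _mm_loadu_ps(ci);

    __m128 zr2  = _mm_mul_ps(vzr, vzr);   /* zr*zr */
    __m128 zi2  = _mm_mul_ps(vzi, vzi);   /* zi*zi */
    __m128 zrzi = _mm_mul_ps(vzr, vzi);   /* zr*zi */

    /* real: zr*zr - zi*zi + cr, imaginary: 2*zr*zi + ci */
    _mm_storeu_ps(zr, _mm_add_ps(_mm_sub_ps(zr2, zi2), vcr));
    _mm_storeu_ps(zi, _mm_add_ps(_mm_add_ps(zrzi, zrzi), vci));
}
```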

<p>There are a bunch of swizzle instructions added in SSSE3 and beyond
for re-arranging bytes within registers. With those I could eliminate
that last bit of non-SIMD code at the end of the function for packing
pixels. In an earlier version I used them, but since pixel packing
isn’t a hot spot in this code (it’s outside the tight, innermost
loop), it didn’t impact the final performance, so I took it out for
the sake of simplicity.</p>

<p><strong>The running time is now 8.56ms per image</strong>, a 3.5x speedup. That’s
close to the theoretical 4x speedup from moving to 4-lane SIMD. That’s
fast enough to render fullscreen at 60FPS.</p>

<h4 id="avx-mandelbrot-set">AVX Mandelbrot Set</h4>

<p>With SSE2 explained, there’s not much to say about AVX
(<code class="language-plaintext highlighter-rouge">mandel_avx.c</code>). The only differences are the use of <code class="language-plaintext highlighter-rouge">__m256</code>,
<code class="language-plaintext highlighter-rouge">__m256i</code>, the <code class="language-plaintext highlighter-rouge">_mm256</code> intrinsic prefix, and that this operates on 8
points on the complex plane instead of 4.</p>

<p>It’s interesting that the AVX naming conventions are subtly improved
over SSE. For example, here are the SSE broadcast intrinsics.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">_mm_set1_epi8</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm_set1_epi16</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm_set1_epi32</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm_set1_epi64x</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm_set1_pd</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm_set_ps1</code></li>
</ul>

<p>Notice the oddball at the end? That’s discrimination against sufferers
of obsessive-compulsive personality disorder. This was fixed in AVX’s
broadcast intrinsics:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">_mm256_set1_epi8</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm256_set1_epi16</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm256_set1_epi32</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm256_set1_epi64x</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm256_set1_pd</code></li>
  <li><code class="language-plaintext highlighter-rouge">_mm256_set1_ps</code></li>
</ul>

<p><strong>The running time here is 5.20ms per image</strong>, a 1.6x speedup from
SSE2. That’s not too far from the theoretical 2x speedup from using
twice as many lanes. We can render at 60FPS and spend most of the time
waiting around for the next vsync.</p>

<h4 id="neon-mandelbrot-set">NEON Mandelbrot Set</h4>

<p>NEON is ARM’s take on SIMD. It’s what you’d find on your phone and
tablet rather than desktop or laptop. NEON <a href="http://people.xiph.org/~tterribe/daala/neon_tutorial.pdf">behaves much like a
co-processor</a>: NEON instructions are (cheaply) dispatched
asynchronously to their own instruction pipeline, but transferring
data back out of NEON is expensive and will stall the ARM pipeline
until the NEON pipeline catches up.</p>

<p>Going beyond <code class="language-plaintext highlighter-rouge">__m128</code> and <code class="language-plaintext highlighter-rouge">__m256</code>, <a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/ARM-NEON-Intrinsics.html">NEON intrinsics</a> have a
type for each of the possible packings. On x86, the old stack-oriented
x87 floating-point instructions are replaced with SSE single-value
(“ss”, “sd”) instructions. On ARM, there’s no reason to use NEON to
operate on single values, so these “packings” don’t exist. Instead
there are half-wide packings. Note the lack of double-precision
support.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">float32x2_t</code>, <code class="language-plaintext highlighter-rouge">float32x4_t</code></li>
  <li><code class="language-plaintext highlighter-rouge">int16x4_t</code>, <code class="language-plaintext highlighter-rouge">int16x8_t</code></li>
  <li><code class="language-plaintext highlighter-rouge">int32x2_t</code>, <code class="language-plaintext highlighter-rouge">int32x4_t</code></li>
  <li><code class="language-plaintext highlighter-rouge">int64x1_t</code>, <code class="language-plaintext highlighter-rouge">int64x2_t</code></li>
  <li><code class="language-plaintext highlighter-rouge">int8x16_t</code>, <code class="language-plaintext highlighter-rouge">int8x8_t</code></li>
  <li><code class="language-plaintext highlighter-rouge">uint16x4_t</code>, <code class="language-plaintext highlighter-rouge">uint16x8_t</code></li>
  <li><code class="language-plaintext highlighter-rouge">uint32x2_t</code>, <code class="language-plaintext highlighter-rouge">uint32x4_t</code></li>
  <li><code class="language-plaintext highlighter-rouge">uint64x1_t</code>, <code class="language-plaintext highlighter-rouge">uint64x2_t</code></li>
  <li><code class="language-plaintext highlighter-rouge">uint8x16_t</code>, <code class="language-plaintext highlighter-rouge">uint8x8_t</code></li>
</ul>

<p>Again, the CPU doesn’t really care about any of these types. It’s all
to help the compiler help us. For example, we don’t want to multiply a
<code class="language-plaintext highlighter-rouge">float32x4_t</code> and a <code class="language-plaintext highlighter-rouge">float32x2_t</code> since it wouldn’t have a meaningful
result.</p>

<p>Otherwise everything is similar (<code class="language-plaintext highlighter-rouge">mandel_neon.c</code>). NEON intrinsics are
(less-cautiously) prefixed with <code class="language-plaintext highlighter-rouge">v</code> and suffixed with a type (<code class="language-plaintext highlighter-rouge">_f32</code>,
<code class="language-plaintext highlighter-rouge">_u32</code>, etc.).</p>
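<p>For comparison with the SSE version, here is a rough sketch of the
same iteration step written with NEON intrinsics. It is not taken from
<code class="language-plaintext highlighter-rouge">mandel_neon.c</code>; the <code class="language-plaintext highlighter-rouge">step4</code> function, the <code class="language-plaintext highlighter-rouge">HAVE_NEON</code> macro, and the
scalar fallback (so the file still builds on non-ARM targets) are all
assumptions made for illustration.</p>

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* One iteration of z = z*z + c on four points, one point per lane. */
void step4(float zr[4], float zi[4], const float cr[4], const float ci[4])
{
#if HAVE_NEON
    /* "v" prefix, "q" for the full 128-bit width, "_f32" type suffix */
    float32x4_t vzr  = vld1q_f32(zr);
    float32x4_t vzi  = vld1q_f32(zi);
    float32x4_t zr2  = vmulq_f32(vzr, vzr);
    float32x4_t zi2  = vmulq_f32(vzi, vzi);
    float32x4_t zrzi = vmulq_f32(vzr, vzi);
    vst1q_f32(zr, vaddq_f32(vsubq_f32(zr2, zi2), vld1q_f32(cr)));
    vst1q_f32(zi, vaddq_f32(vaddq_f32(zrzi, zrzi), vld1q_f32(ci)));
#else
    /* Scalar fallback for targets without NEON */
    for (int i = 0; i < 4; i++) {
        float zr1 = zr[i] * zr[i] - zi[i] * zi[i] + cr[i];
        float zi1 = zr[i] * zi[i] + zr[i] * zi[i] + ci[i];
        zr[i] = zr1;
        zi[i] = zi1;
    }
#endif
}
```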

<p>The performance on my model Raspberry Pi 2 (900 MHz quad-core ARM
Cortex-A7) <strong>is 545ms per frame without NEON and 232ms with NEON</strong>, a
2.3x speedup. This isn’t nearly as impressive as SSE2, also at 4
lanes. My implementation almost certainly needs more work, especially
since I know less about ARM than x86.</p>

<h3 id="compiling-with-intrinsics">Compiling with Intrinsics</h3>

<p>For the x86 build, I wanted the same binary to have AVX, SSE2, and
plain C versions, selected by a command line switch and feature
availability, so that I could easily compare benchmarks. Without any
special options, gcc and clang will make conservative assumptions
about the CPU features of the target machine. In order to build using
AVX intrinsics, I need the compiler to assume the target has AVX. The
<code class="language-plaintext highlighter-rouge">-mavx</code> argument does this.</p>

<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">mandel_avx.o </span><span class="o">:</span> <span class="nf">mandel_avx.c</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="err">-mavx</span> <span class="err">-o</span> <span class="err">$@</span> <span class="err">$^</span>

<span class="nl">mandel_sse2.o </span><span class="o">:</span> <span class="nf">mandel_sse2.c</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="err">-msse2</span> <span class="err">-o</span> <span class="err">$@</span> <span class="err">$^</span>

<span class="nl">mandel_neon.o </span><span class="o">:</span> <span class="nf">mandel_neon.c</span>
    <span class="err">$(CC)</span> <span class="err">-c</span> <span class="err">$(CFLAGS)</span> <span class="nv">-mfpu</span><span class="o">=</span>neon <span class="nt">-o</span> <span class="nv">$@</span> <span class="nv">$^</span>
</code></pre></div></div>

<p>All x86-64 CPUs have SSE2, but I included <code class="language-plaintext highlighter-rouge">-msse2</code> anyway for clarity.
It also enables SSE2 for 32-bit x86 builds.</p>

<p>It’s absolutely critical that each is done in a separate translation
unit with its own flags. Suppose I instead compiled everything with the same flags, like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc -msse2 -mavx mandel.c mandel_sse2.c mandel_avx.c
</code></pre></div></div>

<p>The compiler will likely use some AVX instructions outside of the
explicit intrinsics, meaning it’s going to crash on a machine without
AVX (“illegal instruction”). The main program needs to be compiled
with AVX disabled. That’s where it will test for AVX before executing
any special instructions.</p>

<h4 id="feature-testing">Feature Testing</h4>

<p>Intrinsics are well-supported across different compilers (surprisingly,
even including the late-to-the-party Microsoft). Unfortunately testing
for CPU features differs across compilers. Intel advertises a
<code class="language-plaintext highlighter-rouge">_may_i_use_cpu_feature</code> intrinsic, but it’s not supported in either
gcc or clang. gcc has a <code class="language-plaintext highlighter-rouge">__builtin_cpu_supports</code> built-in, but it’s
only supported by gcc.</p>

<p>The most portable solution I came up with is <code class="language-plaintext highlighter-rouge">cpuid.h</code> (x86 specific).
It’s supported by at least gcc and clang. The clang version of the
header is <em>much</em> better documented, so if you want to read up on how
this works, read that one.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;cpuid.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span>
<span class="nf">is_avx_supported</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">eax</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">ebx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">ecx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">edx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">__get_cpuid</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">eax</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ebx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ecx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">edx</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">ecx</span> <span class="o">&amp;</span> <span class="n">bit_AVX</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And in use:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">use_avx</span> <span class="o">&amp;&amp;</span> <span class="n">is_avx_supported</span><span class="p">())</span>
    <span class="n">mandel_avx</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">spec</span><span class="p">);</span>
<span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">use_sse2</span><span class="p">)</span>
    <span class="n">mandel_sse2</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">spec</span><span class="p">);</span>
<span class="k">else</span>
    <span class="nf">mandel_basic</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">spec</span><span class="p">);</span>
</code></pre></div></div>
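<p>One caveat: the CPUID bit only says the CPU implements AVX. The
operating system must also save the wider registers on context switch,
which it advertises through the OSXSAVE bit and the XCR0 register. A
more thorough check might look like the following sketch. The
<code class="language-plaintext highlighter-rouge">is_avx_usable</code> and <code class="language-plaintext highlighter-rouge">read_xcr0</code> names are hypothetical, and the inline
assembly assumes gcc or clang on x86.</p>

```c
#include <cpuid.h>

/* Read XCR0 using the xgetbv instruction. Only valid once CPUID has
 * reported OSXSAVE, otherwise the instruction itself faults. */
static unsigned long long read_xcr0(void)
{
    unsigned int lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((unsigned long long)hi << 32) | lo;
}

/* Returns 1 if AVX is implemented by the CPU and enabled by the OS. */
int is_avx_usable(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!(ecx & bit_AVX) || !(ecx & bit_OSXSAVE))
        return 0;
    /* XCR0 bits 1 and 2: the OS preserves XMM and YMM state. */
    return (read_xcr0() & 0x6) == 0x6;
}
```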

<p>I don’t know how to test for NEON, nor do I have the necessary
hardware to test it, so on ARM I assume it’s always available.</p>
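<p>For what it’s worth, on Linux one possible approach is to ask the
kernel through the auxiliary vector. This is an untested-on-hardware
sketch, assuming a C library that provides <code class="language-plaintext highlighter-rouge">getauxval</code> and kernel headers
that define <code class="language-plaintext highlighter-rouge">HWCAP_NEON</code> for 32-bit ARM. The <code class="language-plaintext highlighter-rouge">is_neon_supported</code>
name is hypothetical.</p>

```c
#if defined(__linux__)
#include <sys/auxv.h>
#endif
#if defined(__linux__) && defined(__arm__)
#include <asm/hwcap.h>
#endif

/* Returns 1 if NEON appears to be available, otherwise 0. */
int is_neon_supported(void)
{
#if defined(__aarch64__)
    /* Assumption: standard 64-bit ARM targets include Advanced SIMD. */
    return 1;
#elif defined(__linux__) && defined(__arm__) && defined(HWCAP_NEON)
    /* 32-bit ARM Linux: the kernel reports NEON in the AT_HWCAP bits. */
    return (getauxval(AT_HWCAP) & HWCAP_NEON) ? 1 : 0;
#else
    /* Not ARM, or no known way to ask: report unavailable. */
    return 0;
#endif
}
```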

<h3 id="conclusion">Conclusion</h3>

<p>Using SIMD intrinsics for the Mandelbrot set was just an exercise to
learn how to use them. Unlike in Handmade Hero, where it makes a 1080p
60FPS software renderer feasible, I don’t have an immediate, practical
use for CPU SIMD, but, like so many similar techniques, I like having
it ready in my toolbelt for the next time an opportunity arises.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Minimal OpenGL 3.3 Core Profile Demo</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/06/06/"/>
    <id>urn:uuid:ada32e48-ae67-3da5-9772-7e61fee602c3</id>
    <updated>2015-06-06T19:35:34Z</updated>
    <category term="opengl"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>When I was first attempting to learn OpenGL years ago, what I really
wanted was a complete, minimal example program. OpenGL has enormous
flexibility and I wanted to <a href="http://www.skorks.com/2010/04/on-the-value-of-fundamentals-in-software-development/">fully understand the fundamentals</a>
in isolation before moving on to more advanced features. I had been
advised to specifically learn <em>core profile</em>, which drops nearly all
the legacy parts of the API.</p>

<p>However, since much of the OpenGL-related content to be found online,
even today, <a href="http://www.shamusyoung.com/twentysidedtale/?p=23079">is outdated</a> — and, worse, it’s not marked as
such — good, modern core profile examples have been hard to come by.
The relevant examples I could find at the time were more complicated
than necessary, due to the common problem that full 3D graphics are
too closely conflated with OpenGL. The examples would include matrix
libraries, texture loading, etc. This is a big reason I ended up
<a href="/blog/2013/06/10/">settling on WebGL</a>: a clean slate in a completely different
community. (The good news is that this situation has already improved
dramatically over the last few years!)</p>

<p>Until recently, <a href="/tags/webgl/">all of my OpenGL experience had been WebGL</a>.
Wanting to break out of that, earlier this year I set up a minimal
OpenGL 3.3 core profile demo in C, using <a href="http://www.glfw.org/">GLFW</a> and
<a href="https://github.com/skaslev/gl3w">gl3w</a>. You can find it here:</p>

<ul>
  <li><a href="https://github.com/skeeto/opengl-demo">https://github.com/skeeto/opengl-demo</a></li>
</ul>

<p>No 3D graphics, no matrix library, no textures. It’s just a spinning
red square.</p>

<p><img src="/img/screenshot/opengl-demo.png" alt="" /></p>

<p>It supports both Linux and Windows. The Windows build is static, so
it compiles to a single, easily distributable, standalone binary. With
some minor tweaking it would probably support the BSDs as well. For
simplicity’s sake, the shaders are baked right into the source as
strings, but if you’re extending the demo for your own use, you may
want to move them out into their own source files.</p>

<h3 id="why-opengl-33">Why OpenGL 3.3?</h3>

<p>I chose OpenGL 3.3 in particular for three reasons:</p>

<ul>
  <li>Core and compatibility profiles were introduced in OpenGL 3.2
(2009). Obviously anything that focuses on core profile is going to
be 3.2 and up.</li>
  <li>OpenGL 3.3 (2010) <a href="https://www.opengl.org/wiki/History_of_OpenGL#OpenGL_3.3_.282010.29">introduced version 3.30</a> of the shading
language. This was a big deal and there’s little reason not to take
advantage of it. I specifically wanted to use the new <code class="language-plaintext highlighter-rouge">layout</code>
keyword.</li>
  <li>Mesa 10.0 (2013) <a href="http://en.wikipedia.org/wiki/Mesa_%28computer_graphics%29#Implementations_of_rendering_APIs">supports up to OpenGL 3.3</a>. Mesa is the
prominent 3D graphics library for open source operating systems.
It’s what applications use for both hardware-accelerated and
software OpenGL rendering. This means the demo will work on any
modern Linux installation. (Note: when running on older hardware
without OpenGL 3.3 support, you may have to force software rendering
with the environment variable <code class="language-plaintext highlighter-rouge">LIBGL_ALWAYS_SOFTWARE=1</code>. The
software renderer will take advantage of your CPU’s SIMD features.)</li>
</ul>
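<p>To illustrate the <code class="language-plaintext highlighter-rouge">layout</code> keyword together with a shader baked
into the source as a string, here is a small sketch. The shader text
and the <code class="language-plaintext highlighter-rouge">get_vertex_shader</code> accessor are hypothetical and not copied
from the demo.</p>

```c
/* A hypothetical minimal GLSL 3.30 vertex shader stored as a C string.
 * The layout qualifier pins the "point" attribute to index 0. */
static const char vertex_shader[] =
    "#version 330\n"
    "layout(location = 0) in vec2 point;\n"
    "void main() {\n"
    "    gl_Position = vec4(point, 0.0, 1.0);\n"
    "}\n";

/* Accessor so other files can hand the source to the GL compiler. */
const char *get_vertex_shader(void)
{
    return vertex_shader;
}
```

<p>Because the attribute location is fixed in the shader itself, the
host program can refer to attribute 0 directly instead of looking it up
by name after linking.</p>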

<p>As far as “desktop” OpenGL goes, 3.3 is currently <em>the</em> prime target.</p>

<h3 id="why-glfw">Why GLFW?</h3>

<p>Until <a href="https://www.khronos.org/egl/">EGL</a> someday fills this role, the process for obtaining an
OpenGL context is specific to each operating system, where it’s
generally a pain in the butt. GLUT, the OpenGL Utility Toolkit, was a
library to make this process uniform across the different platforms.
It also normalized user input (keyboard and mouse) and provided some
basic (and outdated) utility functions.</p>

<p>The original GLUT isn’t quite open source (licensing issues) and it’s
no longer maintained. The open source replacement for GLUT is
<a href="http://freeglut.sourceforge.net/">FreeGLUT</a>. It’s what you’d typically find on a Linux
system in place of the original GLUT.</p>

<p>I just need a portable library that creates a window, handles keyboard
and mouse events in that window, and gives me an OpenGL 3.3 core
profile context. FreeGLUT does this well, but we can do better. One
problem is that it includes a whole bunch of legacy cruft from GLUT:
immediate mode rendering utilities, menus, spaceball support, lots of
global state, and only one OpenGL context per process.</p>

<p>One of the biggest problems is that <strong>FreeGLUT doesn’t have a swap
interval function</strong>. This is used to lock the application’s redraw
rate to the system’s screen refresh rate, preventing screen tearing
and excessive resource consumption. I originally used FreeGLUT for the
demo and, as a workaround, added my own macro that finds the
system’s swap interval function, but it was a total hack.</p>

<p>The demo was initially written with FreeGLUT, but I switched over to
<a href="http://www.glfw.org/">GLFW</a> since it’s smaller, simpler, cleaner, and more modern.
GLFW also has portable joystick handling. With the plethora of modern
context+window creation libraries out there, it seems there’s not much
reason to use FreeGLUT anymore.</p>

<p><a href="https://www.libsdl.org/">SDL 2.0</a> would also be an excellent choice. It goes beyond GLFW
with threading, audio, networking, image loading, and timers:
basically all the stuff you’d need when writing a game.</p>

<p>I’m sure there are some other good alternatives, especially when
you’re not sticking to plain C, but these are the libraries I’m
familiar with at the time of this article.</p>

<h3 id="why-gl3w">Why gl3w?</h3>

<p>If you didn’t think the interface between OpenGL and the operating
system was messy enough, I have good news for you. Neither the
operating system nor the video card drivers are going to provide any
of the correct headers, nor will you have anything meaningful to link
against! For these, you’re on your own.</p>

<p>The OpenGL Extension Wrangler Library (GLEW) was invented to solve this
problem. It dynamically loads the system’s OpenGL libraries and finds
all the relevant functions at run time. That way your application
avoids linking to anything too specific. At compile time, it provides
the headers defining all of the OpenGL functions.</p>

<p>Over the years, GLEW has become outdated, to this day having no
support for core profile. So instead I used a replacement called
<a href="https://github.com/skaslev/gl3w">gl3w</a>. It’s just like GLEW, but, as the name suggests, oriented
around core profile … exactly what I needed. Unlike GLEW, it is
generated directly from Khronos’ documentation by a script. In
practice, you drop the generated code directly into your project
(embedded) rather than rely on the system to provide it as a library.</p>

<p>A great (and probably better) alternative to gl3w is
<a href="https://bitbucket.org/alfonse/glloadgen/wiki/Home">glLoadgen</a>. It’s the same idea — an automatically
generated OpenGL loader — but allows for full customization of the
output, such as the inclusion of select OpenGL extensions.</p>

<h3 id="conclusion">Conclusion</h3>

<p>While I hope it serves as an educational resource for others, I
primarily have it for my own record-keeping, pedagogical, and
reference purposes, born out of a weekend’s worth of research. It’s a
starting point for future projects, and it’s somewhere easy to start
when I want to experiment with an idea.</p>

<p>Plus, someday I want to write a sweet, standalone game with fancy
OpenGL graphics.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Raw Linux Threads via System Calls</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/05/15/"/>
    <id>urn:uuid:9d5de15b-9308-3715-2bd7-565d6649ab2f</id>
    <updated>2015-05-15T17:33:40Z</updated>
    <category term="x86"/><category term="linux"/><category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><em>This article has <a href="/blog/2016/09/23/">a followup</a>.</em></p>

<p>Linux has an elegant and beautiful design when it comes to threads:
threads are nothing more than processes that share a virtual address
space and file descriptor table. Threads spawned by a process are
additional child processes of the main “thread’s” parent process.
They’re manipulated through the same process management system calls,
eliminating the need for a separate set of thread-related system
calls. It’s elegant in the same way file descriptors are elegant.</p>

<p>Normally on Unix-like systems, processes are created with fork(). The
new process gets its own address space and file descriptor table that
starts as a copy of the original. (Linux uses copy-on-write to do this
part efficiently.) However, this is too high level for creating
threads, so Linux has a separate <a href="http://man7.org/linux/man-pages/man2/clone.2.html">clone()</a> system call. It
works just like fork() except that it accepts a number of flags to
adjust its behavior, primarily to share parts of the parent’s
execution context with the child.</p>
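
<p>To make the fork() comparison concrete, here’s a minimal C sketch that is not part of the demo. It assumes Linux with glibc’s syscall() wrapper, and the helper name <code class="language-plaintext highlighter-rouge">clone_like_fork</code> is made up for illustration. With only SIGCHLD in the flags and a null stack pointer, the raw clone system call should behave like fork(): the child gets a copy of the address space, so its write to <code class="language-plaintext highlighter-rouge">counter</code> never shows up in the parent.</p>

```c
#define _DEFAULT_SOURCE  /* for syscall() on glibc */
#include <signal.h>      /* SIGCHLD */
#include <sys/syscall.h> /* SYS_clone */
#include <sys/wait.h>    /* waitpid, WIFEXITED, WEXITSTATUS */
#include <unistd.h>      /* syscall, _exit */

/* Written by the child. The parent's copy stays untouched because
 * CLONE_VM is not set, so the address space is copied, not shared. */
static int counter = 1;

/* Hypothetical helper: create a child with the raw clone system call,
 * passing only SIGCHLD as flags and a null stack pointer.
 * Returns the child's exit status, or -1 on failure. */
int clone_like_fork(void)
{
    long pid = syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL);
    if (pid == -1)
        return -1;
    if (pid == 0) {
        counter = 42;   /* changes only the child's private copy */
        _exit(counter);
    }

    int status;
    if (waitpid((pid_t)pid, &status, 0) != (pid_t)pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

<p>Adding CLONE_VM is what would make the two share memory, but then the child also needs a stack of its own, which is what the rest of this article is about.</p>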

<p>It’s <em>so</em> simple that <strong>it takes less than 15 instructions to spawn a
thread with its own stack</strong>, no libraries needed, and no need to call
Pthreads! In this article I’ll demonstrate how to do this on x86-64.
All of the code will be written in <a href="http://www.nasm.us/">NASM</a> syntax since, IMHO,
it’s by far the best (see: <a href="/blog/2015/04/19/">nasm-mode</a>).</p>

<p>I’ve put the complete demo here if you want to see it all at once:</p>

<ul>
  <li><a href="https://github.com/skeeto/pure-linux-threads-demo">Pure assembly, library-free Linux threading demo</a></li>
</ul>

<h3 id="an-x86-64-primer">An x86-64 Primer</h3>

<p>I want you to be able to follow along even if you aren’t familiar with
x86-64 assembly, so here’s a short primer of the relevant pieces. If
you already know x86-64 assembly, feel free to skip to the next
section.</p>

<p>x86-64 has 16 64-bit <em>general purpose registers</em>, primarily used to
manipulate integers, including memory addresses. There are <em>many</em> more
registers than this with more specific purposes, but we won’t need
them for threading.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rsp</code> : stack pointer</li>
  <li><code class="language-plaintext highlighter-rouge">rbp</code> : “base” pointer (still used in debugging and profiling)</li>
  <li><code class="language-plaintext highlighter-rouge">rax</code> <code class="language-plaintext highlighter-rouge">rbx</code> <code class="language-plaintext highlighter-rouge">rcx</code> <code class="language-plaintext highlighter-rouge">rdx</code> : general purpose (notice: a, b, c, d)</li>
  <li><code class="language-plaintext highlighter-rouge">rdi</code> <code class="language-plaintext highlighter-rouge">rsi</code> : “destination” and “source”, now meaningless names</li>
  <li><code class="language-plaintext highlighter-rouge">r8</code> <code class="language-plaintext highlighter-rouge">r9</code> <code class="language-plaintext highlighter-rouge">r10</code> <code class="language-plaintext highlighter-rouge">r11</code> <code class="language-plaintext highlighter-rouge">r12</code> <code class="language-plaintext highlighter-rouge">r13</code> <code class="language-plaintext highlighter-rouge">r14</code> <code class="language-plaintext highlighter-rouge">r15</code> : added for x86-64</li>
</ul>

<p><img src="/img/x86/register.png" alt="" /></p>

<p>The “r” prefix indicates that they’re 64-bit registers. It won’t be
relevant in this article, but the same name prefixed with “e”
indicates the lower 32 bits of these same registers, and no prefix
indicates the lowest 16 bits. This is because x86 was <a href="/blog/2014/12/09/">originally a
16-bit architecture</a>, extended to 32 bits, then to 64 bits.
Historically each of these registers had a specific, unique
purpose, but on x86-64 they’re almost completely interchangeable.</p>

<p>There’s also a “rip” instruction pointer register that conceptually
walks along the machine instructions as they’re being executed, but,
unlike the other registers, it can only be manipulated indirectly.
Remember that data and code <a href="http://en.wikipedia.org/wiki/Von_Neumann_architecture">live in the same address space</a>, so
rip is not much different than any other data pointer.</p>

<h4 id="the-stack">The Stack</h4>

<p>The rsp register points to the “top” of the call stack. The stack
keeps track of who called the current function, in addition to local
variables and other function state (a <em>stack frame</em>). I put “top” in
quotes because the stack actually grows <em>downward</em> on x86 towards
lower addresses, so the stack pointer points to the lowest address on
the stack. This piece of information is critical when talking about
threads, since we’ll be allocating our own stacks.</p>

<p>The stack is also sometimes used to pass arguments to another
function. This happens much less frequently on x86-64, especially with
the <a href="http://wiki.osdev.org/System_V_ABI">System V ABI</a> used by Linux, where the first 6 arguments are
passed via registers. The return value is passed back via rax. When
calling another function, integer/pointer arguments are
passed in these registers in this order:</p>

<ul>
  <li>rdi, rsi, rdx, rcx, r8, r9</li>
</ul>

<p>So, for example, to perform a function call like <code class="language-plaintext highlighter-rouge">foo(1, 2, 3)</code>, store
1, 2 and 3 in rdi, rsi, and rdx, then <code class="language-plaintext highlighter-rouge">call</code> the function. The <code class="language-plaintext highlighter-rouge">mov</code>
instruction stores the source (second) operand in its destination
(first) operand. The <code class="language-plaintext highlighter-rouge">call</code> instruction pushes the current value of
rip onto the stack, then sets rip (<em>jumps</em>) to the address of the
target function. When the callee is ready to return, it uses the <code class="language-plaintext highlighter-rouge">ret</code>
instruction to <em>pop</em> the original rip value off the stack and back
into rip, returning control to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="mi">2</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">3</span>
    <span class="nf">call</span> <span class="nv">foo</span>
</code></pre></div></div>

<p>Called functions <em>must</em> preserve the contents of these registers (the
same value must be stored when the function returns):</p>

<ul>
  <li>rbx, rsp, rbp, r12, r13, r14, r15</li>
</ul>

<h4 id="system-calls">System Calls</h4>

<p>When making a <em>system call</em>, the argument registers are <a href="http://man7.org/linux/man-pages/man2/syscall.2.html">slightly
different</a>. Notice rcx has been changed to r10.</p>

<ul>
  <li>rdi, rsi, rdx, r10, r8, r9</li>
</ul>

<p>Each system call has an integer identifying it. This number is
different on each platform, but, in Linux’s case, <a href="https://www.youtube.com/watch?v=1Mg5_gxNXTo#t=8m28">it will <em>never</em>
change</a>. Instead of <code class="language-plaintext highlighter-rouge">call</code>, rax is set to the number of the
desired system call and the <code class="language-plaintext highlighter-rouge">syscall</code> instruction makes the request to
the OS kernel. Prior to x86-64, this was done with an old-fashioned
interrupt. Because interrupts are slow, a special,
statically-positioned “vsyscall” page (now deprecated as a <a href="http://en.wikipedia.org/wiki/Return-oriented_programming">security
hazard</a>), later <a href="https://lwn.net/Articles/446528/">vDSO</a>, is provided to allow certain system
calls to be made as function calls. We’ll only need the <code class="language-plaintext highlighter-rouge">syscall</code>
instruction in this article.</p>

<p>So, for example, the write() system call has this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">write</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">);</span>
</code></pre></div></div>

<p>On x86-64, the write() system call is at the top of <a href="https://filippo.io/linux-syscall-table/">the system call
table</a> as call 1 (read() is 0). Standard output is file
descriptor 1 by default (standard input is 0). The following bit of
code will write 10 bytes of data from the memory address <code class="language-plaintext highlighter-rouge">buffer</code> (a
symbol defined elsewhere in the assembly program) to standard output.
The number of bytes written, or a negative errno value on error, will be returned in rax.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">1</span>        <span class="c1">; fd</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">buffer</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="mi">10</span>       <span class="c1">; 10 bytes</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">1</span>        <span class="c1">; SYS_write</span>
    <span class="nf">syscall</span>
</code></pre></div></div>
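
<p>For comparison, here’s roughly the same request made from C by system call number. This is a sketch assuming Linux and glibc’s syscall() function, and <code class="language-plaintext highlighter-rouge">raw_write</code> is a hypothetical name. One difference worth knowing: the glibc wrapper reports failure as -1 with the cause in <code class="language-plaintext highlighter-rouge">errno</code>, whereas the bare <code class="language-plaintext highlighter-rouge">syscall</code> instruction leaves a negative errno value in rax.</p>

```c
#define _DEFAULT_SOURCE  /* for syscall() on glibc */
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h> /* SYS_write */
#include <unistd.h>      /* syscall */

/* Hypothetical helper: invoke write() by its system call number.
 * glibc loads the number and the three arguments into the proper
 * registers and executes the system call instruction for us. */
long raw_write(int fd, const void *buf, size_t count)
{
    return syscall(SYS_write, fd, buf, count);
}
```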

<h4 id="effective-addresses">Effective Addresses</h4>

<p>There’s one last thing you need to know: registers often hold a memory
address (i.e. a pointer), and you need a way to read the data behind
that address. In NASM syntax, wrap the register in brackets (e.g.
<code class="language-plaintext highlighter-rouge">[rax]</code>), which, if you’re familiar with C, would be the same as
<em>dereferencing</em> the pointer.</p>

<p>These bracket expressions, called an <em>effective address</em>, may be
limited mathematical expressions to offset that <em>base</em> address
entirely within a single instruction. This expression can include
another register (<em>index</em>), a power-of-two <em>scale</em> (bit shift), and
an immediate signed <em>offset</em>. For example, <code class="language-plaintext highlighter-rouge">[rax + rdx*8 + 12]</code>. If
rax is a pointer to a struct, and rdx is an array index to an element
in an array in that struct, only a single instruction is needed to read
that element. NASM is smart enough to allow the assembly programmer to
break this mold a little bit with more complex expressions, so long as
it can reduce it to the <code class="language-plaintext highlighter-rouge">[base + index*2^exp + offset]</code> form.</p>
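
<p>If it helps to see this from the C side, here’s a small sketch with a made-up <code class="language-plaintext highlighter-rouge">struct record</code>. I’ve assumed a 16-byte header rather than the 12 in the example above so that the 8-byte elements stay naturally aligned. The first function is the kind of access a compiler can typically turn into a single load, and the second spells out the same base + index*scale + offset arithmetic by hand.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a 16-byte header followed by 8-byte elements. */
struct record {
    uint32_t id;
    uint32_t flags;
    uint32_t length;
    uint32_t reserved;
    uint64_t values[8];
};

/* With r in rdi and i in rsi, this can be a single instruction
 * along the lines of: mov rax, [rdi + rsi*8 + 16] */
uint64_t record_get(const struct record *r, size_t i)
{
    return r->values[i];
}

/* The same address computed by hand: base + index*scale + offset. */
const uint64_t *record_addr(const struct record *r, size_t i)
{
    const unsigned char *base = (const unsigned char *)r;
    return (const uint64_t *)(base + i * sizeof(uint64_t)
                                   + offsetof(struct record, values));
}
```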

<p>The details of addressing aren’t important for this article, so
don’t worry too much about it if that didn’t make sense.</p>

<h3 id="allocating-a-stack">Allocating a Stack</h3>

<p>Threads share everything except for registers, a stack, and
thread-local storage (TLS). The OS and underlying hardware will
automatically ensure that registers are per-thread. Since it’s not
essential, I won’t cover thread-local storage in this article. In
practice, the stack is often used for thread-local data anyway. That
leaves the stack, and before we can spawn a new thread, we need to
allocate a stack, which is nothing more than a memory buffer.</p>

<p>The trivial way to do this would be to reserve some fixed .bss
(zero-initialized) storage for threads in the executable itself, but I
want to do it the Right Way and allocate the stack dynamically, just
as Pthreads, or any other threading library, would. Otherwise the
application would be limited to a compile-time fixed number of
threads.</p>

<p>You <a href="http://marek.vavrusa.com/c/memory/2015/02/20/memory/">can’t just read from and write to arbitrary addresses</a> in
virtual memory; you first <a href="/blog/2015/03/19/">have to ask the kernel to allocate
pages</a>. There are two system calls on Linux to do this:</p>

<ul>
  <li>
    <p>brk(): Extends (or shrinks) the heap of a running process, typically
located somewhere shortly after the .bss segment. Many allocators
will do this for small or initial allocations. This is a less
optimal choice for thread stacks because the stacks will be very
near other important data, near other stacks, and lack a guard page
(by default). It would be somewhat easier for an attacker to exploit
a buffer overflow. A guard page is a locked-down page just past the
absolute end of the stack that will trigger a segmentation fault on
a stack overflow, rather than allow a stack overflow to trash other
memory undetected. A guard page could still be created manually with
mprotect(). Also, there’s no room for these stacks to grow.</p>
  </li>
  <li>
    <p>mmap(): Use an anonymous mapping to allocate a contiguous set of
pages at some randomized memory location. As we’ll see, you can even
tell the kernel specifically that you’re going to use this memory as
a stack. Also, this is simpler than using brk() anyway.</p>
  </li>
</ul>
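
<p>As an illustration of the guard page idea, here’s a hedged C sketch that sets one up by hand with mprotect(). The helper name <code class="language-plaintext highlighter-rouge">stack_create_guarded</code> is hypothetical, and I’m assuming Linux with glibc. Since the stack grows downward, an overflow runs off the low end of the buffer, so the inaccessible page goes at the lowest address.</p>

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>    /* mmap, mprotect, munmap */
#include <unistd.h>      /* sysconf */

/* Hypothetical helper: map `size` usable bytes plus one extra page at
 * the low end, then revoke all access to that lowest page. Returns the
 * lowest usable address, or NULL on failure. */
void *stack_create_guarded(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    size_t total = size + (size_t)page;
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Any access to the guard page now raises a segmentation fault. */
    if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return base + page;
}
```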

<p>On x86-64, mmap() is system call 9. I’ll define a function to allocate
a stack with this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>The mmap() system call takes 6 arguments, but when creating an
anonymous memory map the last two arguments are ignored. For our
purposes, it looks like this C prototype.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">mmap</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">length</span><span class="p">,</span> <span class="kt">int</span> <span class="n">prot</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">);</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">flags</code>, we’ll choose a private, anonymous mapping that, being a
stack, grows downward. Even with that last flag, the system call will
still return the bottom address of the mapping, which will be
important to remember later. It’s just a simple matter of setting the
arguments in the registers and making the system call.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">%define SYS_mmap	9
%define STACK_SIZE	(4096 * 1024)	</span><span class="c1">; 4 MB
</span>
<span class="nl">stack_create:</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">0</span>
    <span class="nf">mov</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nv">STACK_SIZE</span>
    <span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nv">PROT_WRITE</span> <span class="o">|</span> <span class="nv">PROT_READ</span>
    <span class="nf">mov</span> <span class="nv">r10</span><span class="p">,</span> <span class="nv">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="nv">MAP_PRIVATE</span> <span class="o">|</span> <span class="nv">MAP_GROWSDOWN</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_mmap</span>
    <span class="nf">syscall</span>
    <span class="nf">ret</span>
</code></pre></div></div>
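
<p>For readers more comfortable in C, this is approximately what the assembly above asks for, written against the C library’s mmap() instead of the raw system call. Treat it as a sketch assuming Linux and glibc. Note the different failure convention: the library function returns <code class="language-plaintext highlighter-rouge">MAP_FAILED</code>, while the raw system call leaves a negative errno value in rax.</p>

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS and MAP_GROWSDOWN on glibc */
#include <stddef.h>
#include <sys/mman.h>

#define STACK_SIZE (4096 * 1024) /* 4 MB, matching the assembly */

/* Returns the lowest address of a new stack-sized mapping, or NULL.
 * The last two arguments (fd and offset) are ignored for an anonymous
 * mapping; -1 and 0 are the conventional values. */
void *stack_create(void)
{
    void *p = mmap(NULL, STACK_SIZE, PROT_WRITE | PROT_READ,
                   MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```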

<p>Now we can allocate new stacks (or stack-sized buffers) as needed.</p>

<h3 id="spawning-a-thread">Spawning a Thread</h3>

<p>Spawning a thread is so simple that it doesn’t even require a branch
instruction! It’s a call to clone() with two arguments: clone flags
and a pointer to the new thread’s stack. It’s important to note that,
as in many cases, the glibc wrapper function has the arguments in a
different order than the system call. With the set of flags we’re
using, it takes two arguments.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">sys_clone</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">child_stack</span><span class="p">);</span>
</code></pre></div></div>

<p>Our thread spawning function will have this C prototype. It takes a
function as its argument and starts the thread running that function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">thread_create</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="p">)(</span><span class="kt">void</span><span class="p">));</span>
</code></pre></div></div>

<p>The function pointer argument is passed via rdi, per the ABI. Store
this for safekeeping on the stack (<code class="language-plaintext highlighter-rouge">push</code>) in preparation for calling
stack_create(). When it returns, the address of the low end of the stack
will be in rax.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">thread_create:</span>
    <span class="nf">push</span> <span class="nb">rdi</span>
    <span class="nf">call</span> <span class="nv">stack_create</span>
    <span class="nf">lea</span> <span class="nb">rsi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nv">STACK_SIZE</span> <span class="o">-</span> <span class="mi">8</span><span class="p">]</span>
    <span class="nf">pop</span> <span class="kt">qword</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
    <span class="nf">mov</span> <span class="nb">rdi</span><span class="p">,</span> <span class="nb">CL</span><span class="nv">ONE_VM</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_FS</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_FILES</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_SIGHAND</span> <span class="o">|</span> <span class="err">\</span>
             <span class="nf">CLONE_PARENT</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_THREAD</span> <span class="o">|</span> <span class="nb">CL</span><span class="nv">ONE_IO</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_clone</span>
    <span class="nf">syscall</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The second argument to clone() is a pointer to the <em>high address</em> of
the stack (specifically, just above the stack). So we need to add
<code class="language-plaintext highlighter-rouge">STACK_SIZE</code> to rax to get the high end. This is done with the <code class="language-plaintext highlighter-rouge">lea</code>
instruction: <strong>l</strong>oad <strong>e</strong>ffective <strong>a</strong>ddress. Despite the brackets,
it doesn’t actually read memory at that address, but instead stores
the address in the destination register (rsi). I’ve moved it back by 8
bytes because I’m going to place the thread function pointer at the
“top” of the new stack in the next instruction. You’ll see why in a
moment.</p>

<p><img src="/img/x86/clone.png" alt="" /></p>

<p>Remember that the function pointer was pushed onto the stack for
safekeeping. This is popped off the current stack and written to that
reserved space on the new stack.</p>
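
<p>The pointer arithmetic behind those two instructions can be sketched in C. The names <code class="language-plaintext highlighter-rouge">stack_seed</code> and <code class="language-plaintext highlighter-rouge">example_thread</code> are hypothetical, and this only prepares the memory; it does not spawn anything. The returned pointer corresponds to the value placed in rsi, which becomes the new thread’s rsp.</p>

```c
#include <stddef.h>

typedef void (*thread_fn)(void);

/* Hypothetical helper mirroring the lea and pop instructions: reserve
 * one pointer-sized slot at the high end of the stack, store the thread
 * function there, and return the address of that slot. */
thread_fn *stack_seed(void *low, size_t size, thread_fn fn)
{
    thread_fn *slot = (thread_fn *)((char *)low + size - sizeof(thread_fn));
    *slot = fn;
    return slot;
}

/* Placeholder thread function, used only for illustration. */
void example_thread(void)
{
}
```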

<p>As you can see, it takes a lot of flags to create a thread with
clone(). Most things aren’t shared with the child by default, so lots
of options need to be enabled. See the clone(2) man page for full
details on these flags.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CLONE_THREAD</code>: Put the new process in the same thread group.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_VM</code>: Runs in the same virtual memory space.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_PARENT</code>: Share a parent with the caller.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_SIGHAND</code>: Share signal handlers.</li>
  <li><code class="language-plaintext highlighter-rouge">CLONE_FS</code>, <code class="language-plaintext highlighter-rouge">CLONE_FILES</code>, <code class="language-plaintext highlighter-rouge">CLONE_IO</code>: Share filesystem information, the file descriptor table, and the I/O context, respectively.</li>
</ul>
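
<p>Here’s the same flag set built in C, as a sketch that takes the constants from the kernel’s <code class="language-plaintext highlighter-rouge">linux/sched.h</code> header. Both function names are hypothetical. The second function encodes a dependency documented in clone(2): <code class="language-plaintext highlighter-rouge">CLONE_THREAD</code> requires <code class="language-plaintext highlighter-rouge">CLONE_SIGHAND</code>, which in turn requires <code class="language-plaintext highlighter-rouge">CLONE_VM</code>.</p>

```c
#include <linux/sched.h> /* CLONE_* constants from the kernel headers */

/* The flag set used by thread_create in this article. */
unsigned long thread_clone_flags(void)
{
    return CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
           CLONE_PARENT | CLONE_THREAD | CLONE_IO;
}

/* Hypothetical check for the dependency chain described in clone(2):
 * CLONE_THREAD needs CLONE_SIGHAND, and CLONE_SIGHAND needs CLONE_VM.
 * Returns 1 if the flags satisfy it, 0 otherwise. */
int clone_flags_consistent(unsigned long flags)
{
    if ((flags & CLONE_THREAD) && !(flags & CLONE_SIGHAND))
        return 0;
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return 0;
    return 1;
}
```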

<p>A new thread will be created and the syscall will return in each of
the two threads at the same instruction, <em>exactly</em> like fork(). All
registers will be identical between the threads, except for rax, which
will be 0 in the new thread, and rsp which has the same value as rsi
in the new thread (the pointer to the new stack).</p>

<p><strong>Now here’s the really cool part</strong>, and the reason branching isn’t
needed. There’s no reason to check rax to determine if we are the
original thread (in which case we return to the caller) or if we’re
the new thread (in which case we jump to the thread function).
Remember how we seeded the new stack with the thread function? When
the new thread returns (<code class="language-plaintext highlighter-rouge">ret</code>), it will jump to the thread function
with a completely empty stack. The original thread, using the original
stack, will return to the caller.</p>

<p>The value returned by thread_create() is the process ID of the new
thread, which is essentially the thread object (e.g. Pthreads’
<code class="language-plaintext highlighter-rouge">pthread_t</code>).</p>

<h3 id="cleaning-up">Cleaning Up</h3>

<p>The thread function has to be careful not to return (<code class="language-plaintext highlighter-rouge">ret</code>) since
there’s nowhere to return. It will fall off the stack and terminate
the program with a segmentation fault. Remember that threads are just
processes? It must use the exit() syscall to terminate. This won’t
terminate the other threads.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">%define SYS_exit	60
</span>
<span class="nl">exit:</span>
    <span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">SYS_exit</span>
    <span class="nf">syscall</span>
</code></pre></div></div>

<p>Before exiting, it should free its stack with the munmap() system
call, so that no resources are leaked by the terminated thread. The
equivalent of pthread_join() by the main parent would be to use the
wait4() system call on the thread process.</p>
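
<p>To see the exit and wait4() calls working together from C, here’s a hedged sketch with a hypothetical helper name. A forked child process stands in for the thread here, since clone(2) notes that the wait family does not report on threads created with <code class="language-plaintext highlighter-rouge">CLONE_THREAD</code>. I’m assuming Linux and glibc.</p>

```c
#define _DEFAULT_SOURCE   /* for syscall() and wait4() on glibc */
#include <stddef.h>
#include <sys/resource.h>
#include <sys/syscall.h>  /* SYS_exit */
#include <sys/wait.h>     /* wait4, WIFEXITED, WEXITSTATUS */
#include <unistd.h>       /* fork, syscall, _exit */

/* Hypothetical helper: fork a child that terminates itself with the
 * raw exit system call, then reap it with wait4(). Returns the exit
 * status seen by the parent, or -1 on failure. */
int raw_exit_status(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        syscall(SYS_exit, code); /* ends only the calling thread */
        _exit(code);             /* not reached; fallback only */
    }

    int status;
    if (wait4(pid, &status, 0, NULL) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```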

<h3 id="more-exploration">More Exploration</h3>

<p>If you found this interesting, be sure to check out the full demo link
at the top of this article. Now with the ability to spawn threads,
it’s a great opportunity to explore and experiment with x86’s
synchronization primitives, such as the <code class="language-plaintext highlighter-rouge">lock</code> instruction prefix,
<code class="language-plaintext highlighter-rouge">xadd</code>, and <a href="/blog/2014/09/02/">compare-and-exchange</a> (<code class="language-plaintext highlighter-rouge">cmpxchg</code>). I’ll discuss
these in a future article.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>A Basic Just-In-Time Compiler</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/03/19/"/>
    <id>urn:uuid:95e0437f-61f0-3932-55b7-f828e171d9ca</id>
    <updated>2015-03-19T04:57:55Z</updated>
    <category term="c"/><category term="tutorial"/><category term="netsec"/><category term="x86"/><category term="posix"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=17747759">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/akxq8q/a_basic_justintime_compiler/">on reddit</a>.</em></p>

<p><a href="http://redd.it/2z68di">Monday’s /r/dailyprogrammer challenge</a> was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term
(<code class="language-plaintext highlighter-rouge">u(0)</code>) and a sequence of operations, <code class="language-plaintext highlighter-rouge">f</code>, to apply to the previous
term (<code class="language-plaintext highlighter-rouge">u(n + 1) = f(u(n))</code>) to compute the next term. Since it’s an
easy challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.</p>

<!--more-->

<p>For example, the relation <code class="language-plaintext highlighter-rouge">u(n + 1) = (u(n) + 2) * 3 - 5</code> would be
input as <code class="language-plaintext highlighter-rouge">+2 *3 -5</code>. If <code class="language-plaintext highlighter-rouge">u(0) = 0</code> then,</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">u(1) = 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(2) = 4</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(3) = 13</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(4) = 40</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(5) = 121</code></li>
  <li>…</li>
</ul>

<p>Rather than write an interpreter to apply the sequence of operations,
for <a href="https://gist.github.com/skeeto/3a1aa3df31896c9956dc">my submission</a> (<a href="/download/jit.c">mirror</a>) I took the opportunity to
write a simple x86-64 Just-In-Time (JIT) compiler. So rather than
stepping through the operations one by one, my program converts the
operations into native machine code and lets the hardware do the work
directly. In this article I’ll go through how it works and how I did
it.</p>

<p><strong>Update</strong>: The <a href="http://redd.it/2zna5q">follow-up challenge</a> uses Reverse Polish
notation to allow for more complicated expressions. I wrote another
JIT compiler for <a href="https://gist.github.com/anonymous/f7e4a5086a2b0acc83aa">my submission</a> (<a href="/download/rpn-jit.c">mirror</a>).</p>

<h3 id="allocating-executable-memory">Allocating Executable Memory</h3>

<p>Modern operating systems have page-granularity protections for
different parts of <a href="http://marek.vavrusa.com/c/memory/2015/02/20/memory/">process memory</a>: read, write, and execute.
Code can only be executed from memory with the execute bit set on its
page, memory can only be changed when its write bit is set, and some
pages aren’t allowed to be read. In a running process, the pages
holding program code and loaded libraries will have their write bit
cleared and execute bit set. Most of the other pages will have their
execute bit cleared and their write bit set.</p>

<p>The reason for this is twofold. First, it significantly increases the
security of the system. If untrusted input was read into executable
memory, an attacker could input machine code (<em>shellcode</em>) into the
buffer, then exploit a flaw in the program to cause control flow to
jump to and execute that code. If the attacker is only able to write
code to non-executable memory, this attack becomes a lot harder. The
attacker has to rely on code already loaded into executable pages
(<a href="http://en.wikipedia.org/wiki/Return-oriented_programming"><em>return-oriented programming</em></a>).</p>

<p>Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, <code class="language-plaintext highlighter-rouge">NULL</code>
points to a special page with read, write, and execute disabled.</p>

<h4 id="an-instruction-buffer">An Instruction Buffer</h4>

<p>Memory returned by <code class="language-plaintext highlighter-rouge">malloc()</code> and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through <code class="language-plaintext highlighter-rouge">malloc()</code>, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an <code class="language-plaintext highlighter-rouge">asmbuf</code> struct.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PAGE_SIZE 4096
</span>
<span class="k">struct</span> <span class="n">asmbuf</span> <span class="p">{</span>
    <span class="kt">uint8_t</span> <span class="n">code</span><span class="p">[</span><span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint64_t</span><span class="p">)];</span>
    <span class="kt">uint64_t</span> <span class="n">count</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use <code class="language-plaintext highlighter-rouge">sysconf(_SC_PAGESIZE)</code> to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.</p>
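
<p>As a sketch of both points, the compile-time check below confirms that the struct above fills exactly one 4kB page (the counter takes the final 8 bytes), and the helper asks the system for its actual page size. <code class="language-plaintext highlighter-rouge">runtime_page_size</code> is a hypothetical name, and I’m assuming a C11 compiler on a POSIX system.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h> /* sysconf */

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* The code array is sized so that the struct is exactly one page. */
_Static_assert(sizeof(struct asmbuf) == PAGE_SIZE,
               "asmbuf should occupy exactly one 4kB page");

/* Hypothetical helper: the page size reported at run time,
 * or 0 if it could not be determined. */
size_t runtime_page_size(void)
{
    long n = sysconf(_SC_PAGESIZE);
    return n > 0 ? (size_t)n : 0;
}
```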

<p>Instead of <code class="language-plaintext highlighter-rouge">malloc()</code>, the compiler allocates memory as an anonymous
memory map (<code class="language-plaintext highlighter-rouge">mmap()</code>). It’s anonymous because it’s not backed by a
file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Windows doesn’t have POSIX <code class="language-plaintext highlighter-rouge">mmap()</code>, so on that platform we use
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> instead. Here’s the equivalent in Win32.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">type</span> <span class="o">=</span> <span class="n">MEM_RESERVE</span> <span class="o">|</span> <span class="n">MEM_COMMIT</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">VirtualAlloc</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">PAGE_READWRITE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Anyone reading closely should notice that I haven’t actually requested
that the memory be executable, which is, like, the whole point of all
this! This was intentional. Some operating systems employ a security
feature called W^X: “write xor execute.” That is, memory is either
writable or executable, but never both at the same time. This makes
the shellcode attack I described before even harder. For <a href="http://www.tedunangst.com/flak/post/now-or-never-exec">well-behaved
JIT compilers</a> it means memory protections need to be adjusted
after code generation and before execution.</p>

<p>The POSIX <code class="language-plaintext highlighter-rouge">mprotect()</code> function is used to change memory protections.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or on Win32 (that last parameter is not allowed to be <code class="language-plaintext highlighter-rouge">NULL</code>),</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">old</span><span class="p">;</span>
    <span class="n">VirtualProtect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PAGE_EXECUTE_READ</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">old</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, instead of <code class="language-plaintext highlighter-rouge">free()</code> it gets unmapped.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And on Win32,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">VirtualFree</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MEM_RELEASE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I won’t list the definitions here, but there are two “methods” for
inserting instructions and immediate values into the buffer. This will
be raw machine code, so the caller will be acting a bit like an
assembler.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">asmbuf_ins</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">ins</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">asmbuf_immediate</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
</code></pre></div></div>
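
<p>For reference, here’s a minimal sketch of how those two functions might look. Treat it as an assumption rather than the actual definitions: it supposes <code class="language-plaintext highlighter-rouge">struct asmbuf</code> is a page-sized struct holding a <code class="language-plaintext highlighter-rouge">code</code> byte array plus a <code class="language-plaintext highlighter-rouge">count</code> of bytes used, and that the page is 4096 bytes. Instructions are written most significant byte first, so the constant reads the same as the disassembler listing, while immediates are copied exactly as they sit in memory.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096  /* assumed page size */
#endif

/* Assumed layout: the byte count shares the page with the code. */
struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Append the low `size` bytes of `ins`, most significant byte first,
 * so 0x4889f8 lands in memory as 48 89 f8. */
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    assert(size >= 1 && size <= 8);
    assert(buf->count + (uint64_t)size <= sizeof(buf->code));
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (uint8_t)((ins >> (i * 8)) & 0xff);
}

/* Append `size` bytes of an immediate exactly as stored in memory
 * (little endian on x86-64, which is what the CPU expects). */
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    assert(size >= 0);
    assert(buf->count + (uint64_t)size <= sizeof(buf->code));
    memcpy(buf->code + buf->count, value, (size_t)size);
    buf->count += (uint64_t)size;
}
```

<p>The assertions are only there to keep the sketch from running off the end of the page. A real implementation would probably want to report the overflow to the caller instead.</p>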

<h3 id="calling-conventions">Calling Conventions</h3>

<p>We’re only going to be concerned with three of x86-64’s many
registers: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rax</code>, and <code class="language-plaintext highlighter-rouge">rdx</code>. These are 64-bit (<code class="language-plaintext highlighter-rouge">r</code>) extensions
of <a href="/blog/2014/12/09/">the original 16-bit 8086 registers</a>. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here’s what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">recurrence</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>
</code></pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions">The System V AMD64 ABI calling convention</a> says that the first
integer/pointer function argument is passed in the <code class="language-plaintext highlighter-rouge">rdi</code> register.
When our JIT-compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in <code class="language-plaintext highlighter-rouge">rax</code> when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy <code class="language-plaintext highlighter-rouge">rdi</code> to <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rdi</span>
</code></pre></div></div>

<p>There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in <code class="language-plaintext highlighter-rouge">asmbuf</code>. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in <code class="language-plaintext highlighter-rouge">rcx</code> rather than <code class="language-plaintext highlighter-rouge">rdi</code>. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.</p>

<p>The very last thing it will do, assuming the result is in <code class="language-plaintext highlighter-rouge">rax</code>, is
return to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">ret</span>
</code></pre></div></div>

<p>So we know the assembly, but what do we pass to <code class="language-plaintext highlighter-rouge">asmbuf_ins()</code>? This
is where we get our hands dirty.</p>

<h4 id="finding-the-code">Finding the Code</h4>

<p>If you want to do this the Right Way, you go download the x86-64
documentation, look up the instructions we’re using, and manually work
out the bytes we need and how the operands fit into it. You know, like
they used to do <a href="/blog/2016/11/17/">out of necessity</a> back in the 60’s.</p>

<p>Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file <code class="language-plaintext highlighter-rouge">peek.s</code> and hand it to <code class="language-plaintext highlighter-rouge">nasm</code>. It will produce a raw binary
with the machine code, which we’ll disassemble with <code class="language-plaintext highlighter-rouge">ndisasm</code> (the
NASM disassembler).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret
</code></pre></div></div>

<p>That’s straightforward. The first instruction is 3 bytes and the
return is 1 byte.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4889f8</span><span class="p">);</span>  <span class="c1">// mov   rax, rdi</span>
<span class="c1">// ... generate code ...</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mh">0xc3</span><span class="p">);</span>      <span class="c1">// ret</span>
</code></pre></div></div>

<p>For each operation, we’ll set it up so the operand will already be
loaded into <code class="language-plaintext highlighter-rouge">rdi</code> regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32 bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use <code class="language-plaintext highlighter-rouge">0x0123456789abcdef</code> as the
operand.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rdi</span><span class="p">,</span> <span class="mh">0x0123456789abcdef</span>
</code></pre></div></div>

<p>Which disassembled with <code class="language-plaintext highlighter-rouge">ndisasm</code> is,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301
</code></pre></div></div>

<p>Notice the operand listed little endian immediately after the
instruction. That’s also easy!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">operand</span><span class="p">;</span>
<span class="n">scanf</span><span class="p">(</span><span class="s">"%ld"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mh">0x48bf</span><span class="p">);</span>         <span class="c1">// mov   rdi, operand</span>
<span class="n">asmbuf_immediate</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
</code></pre></div></div>

<p>Apply the same discovery process individually for each operator you
want to support, accumulating the result in <code class="language-plaintext highlighter-rouge">rax</code> for each.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">switch</span> <span class="p">(</span><span class="n">operator</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="sc">'+'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4801f8</span><span class="p">);</span>   <span class="c1">// add   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'-'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4829f8</span><span class="p">);</span>   <span class="c1">// sub   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'*'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mh">0x480fafc7</span><span class="p">);</span> <span class="c1">// imul  rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'/'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mh">0x4899</span><span class="p">);</span>     <span class="c1">// cqo   (sign-extend rax into rdx)</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x48f7ff</span><span class="p">);</span>   <span class="c1">// idiv  rdi</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As an exercise, try adding support for the modulus operator (<code class="language-plaintext highlighter-rouge">%</code>), XOR
(<code class="language-plaintext highlighter-rouge">^</code>), and bit shifts (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the <a href="https://old.reddit.com/r/dailyprogrammer/comments/2z68di/_/cpgkcx7">closed form solution</a> to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.</p>
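
<p>As a starting point for that exercise, here’s a hedged sketch of byte sequences for those extra operators, worked out from the x86-64 encoding rules (run them through <code class="language-plaintext highlighter-rouge">ndisasm</code> before trusting them). The helper <code class="language-plaintext highlighter-rouge">emit_extra_operator()</code> is hypothetical and writes into a plain byte array so that it stands alone. It assumes the accumulator is in <code class="language-plaintext highlighter-rouge">rax</code> and the operand in <code class="language-plaintext highlighter-rouge">rdi</code>. Modulus uses <code class="language-plaintext highlighter-rouge">cqo</code> to sign-extend the dividend before <code class="language-plaintext highlighter-rouge">idiv</code>, and the shifts first copy the operand to <code class="language-plaintext highlighter-rouge">rcx</code> because a variable shift count must be in <code class="language-plaintext highlighter-rouge">cl</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of asmbuf): copy the machine code for one
 * extra operator into `out` (at least 8 bytes) and return its length, or
 * 0 for an unknown operator. Accumulator in rax, operand in rdi. */
size_t
emit_extra_operator(uint8_t *out, char op)
{
    static const uint8_t mod[] = {
        0x48, 0x99,         /* cqo          ; sign-extend rax into rdx */
        0x48, 0xf7, 0xff,   /* idiv  rdi    ; remainder lands in rdx   */
        0x48, 0x89, 0xd0    /* mov   rax, rdx                          */
    };
    static const uint8_t bxor[] = {
        0x48, 0x31, 0xf8    /* xor   rax, rdi */
    };
    static const uint8_t shl[] = {
        0x48, 0x89, 0xf9,   /* mov   rcx, rdi ; shift count must be in cl */
        0x48, 0xd3, 0xe0    /* shl   rax, cl                              */
    };
    static const uint8_t sar[] = {
        0x48, 0x89, 0xf9,   /* mov   rcx, rdi                             */
        0x48, 0xd3, 0xf8    /* sar   rax, cl  ; arithmetic (signed) shift */
    };
    const uint8_t *code;
    size_t len;

    assert(out != NULL);
    switch (op) {
    case '%': code = mod;  len = sizeof(mod);  break;
    case '^': code = bxor; len = sizeof(bxor); break;
    case '<': code = shl;  len = sizeof(shl);  break;
    case '>': code = sar;  len = sizeof(sar);  break;
    default:  return 0;
    }
    memcpy(out, code, len);
    return len;
}
```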

<h3 id="calling-the-generated-code">Calling the Generated Code</h3>

<p>Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a <code class="language-plaintext highlighter-rouge">void *</code> just to avoid repeating myself, since that will implicitly
cast to the correct function pointer prototype.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_finalize</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">recurrence</span><span class="p">)(</span><span class="kt">long</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="o">-&gt;</span><span class="n">code</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="n">x</span><span class="p">[</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">recurrence</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]);</span>
</code></pre></div></div>

<p>That’s pretty cool if you ask me! Now this was an extremely simplified
situation. There’s no branching, no intermediate values, no function
calls, and I didn’t even touch the stack (push, pop). The recurrence
relation definition in this challenge is practically an assembly
language itself, so after the initial setup it’s a 1:1 translation.</p>

<p>I’d like to build a JIT compiler more advanced than this in the
future. I just need to find a suitable problem that’s more complicated
than this one, warrants having a JIT compiler, but is still simple
enough that I could, on some level, justify not using LLVM.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Goblin-COM 7DRL 2015</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/03/15/"/>
    <id>urn:uuid:362ccedf-9538-358f-9474-5befd8bce4de</id>
    <updated>2015-03-15T21:56:12Z</updated>
    <category term="game"/><category term="media"/><category term="win32"/><category term="c"/>
    <content type="html">
      <![CDATA[<p>Yesterday I completed my third entry to the annual Seven Day Roguelike
(7DRL) challenge (previously: <a href="/blog/2013/03/17/">2013</a> and <a href="/blog/2014/03/31/">2014</a>). This
year’s entry is called <strong>Goblin-COM</strong>.</p>

<p><a href="/img/screenshot/gcom.png"><img src="/img/screenshot/gcom-thumb.png" alt="" /></a></p>

<ul>
  <li>Download/Source: <a href="https://github.com/skeeto/goblin-com">Goblin-COM</a></li>
  <li>Telnet play (no saves): <code class="language-plaintext highlighter-rouge">telnet gcom.nullprogram.com</code></li>
  <li><a href="https://www.youtube.com/watch?v=QW3Uul7-Iss">Video review</a> by Akhier Dragonheart</li>
</ul>

<p>As with previous years, the ideas behind the game are not all that
original. The goal was to be a fantasy version of <a href="http://en.wikipedia.org/wiki/UFO:_Enemy_Unknown">classic
X-COM</a> with an ANSI terminal interface. You are the ruler of a
fledgling human nation that is under attack by invading goblins. You
hire heroes, operate squads, construct buildings, and manage resource
income.</p>

<p>The inspiration this year came from watching <a href="https://www.youtube.com/playlist?list=PL2xITSnTC0YkB2-B8fs-02YVT81AE0WtP">BattleBunny</a> play
<a href="http://openxcom.org/">OpenXCOM</a>, an open source clone of the original X-COM. It
had its major 1.0 release last year. Like the early days of
<a href="https://www.openttd.org/en/">OpenTTD</a>, it currently depends on the original game assets.
But also like OpenTTD, it surpasses the original game in every way, so
there’s no reason to bother running the original anymore. I’ve also
recently been watching <a href="https://youtu.be/bwPLKud0rP4">One F Jef play Silent Storm</a>, which is
another turn-based squad game with a similar combat simulation.</p>

<p>As in X-COM, the game is broken into two modes of play: the geoscape
(strategic) and the battlescape (tactical). Unfortunately I ran out of
time and didn’t get to the battlescape part, though I’d like to add it
in the future. What’s left is a sort-of city-builder with some squad
management. You can hire heroes and send them out in squads to
eliminate goblins, but rather than dropping to the battlescape,
battles always auto-resolve in your favor. Despite this, the game
still has a story, a win state, and a lose state. I won’t say what
they are, so you have to play it for yourself!</p>

<h3 id="terminal-emulator-layer">Terminal Emulator Layer</h3>

<p>My previous entries were HTML5 games, but this entry is a plain old
standalone application. C has been my preferred language for the past
few months, so that’s what I used. Both UTF-8-capable ANSI terminals
and the Windows console are supported, so it should be perfectly
playable on any modern machine. Note, though, that some of the
poorer-quality terminal emulators that you’ll find in your Linux
distribution’s repositories (rxvt and its derivatives) are not
Unicode-capable, which means they won’t work with G-COM.</p>

<p>I <strong>didn’t make use of ncurses</strong>, instead opting to write my own
terminal graphics engine. That’s because I wanted a <a href="/blog/2014/12/09/">single, small
binary</a> that was easy to build, and I didn’t want to mess around
with <a href="http://pdcurses.sourceforge.net/">PDCurses</a>. I’ve also been studying the Win32 API lately, so
writing my own terminal platform layer would be rather easy to do anyway.</p>

<p>I experimented with a number of terminal emulators — LXTerminal,
Konsole, GNOME/MATE terminal, PuTTY, xterm, mintty, Terminator — but
the least capable “terminal” <em>by far</em> is the Windows console, so it
was the one to dictate the capabilities of the graphics engine. Some
ANSI terminals are capable of 256 colors, bold, underline, and
strikethrough fonts, but a highly portable API is basically <strong>limited
to 16 colors</strong> (RGBCMYKW with two levels of intensity) for each of the
foreground and background, and no other special text properties.</p>

<p>ANSI terminals also have a concept of a default foreground color and a
default background color. Most applications that output color (git,
grep, ls) leave the background color alone and are careful to choose
neutral foreground colors. G-COM always sets the background color, so
that the game looks the same no matter what the default colors are.
Also, the Windows console doesn’t really have default colors anyway,
even if I wanted to use them.</p>

<p>I put in partial support for Unicode because I wanted to use
interesting characters in the game (≈, ♣, ∩, ▲). Windows has supported
Unicode for a long time now, but since they added it <em>too</em> early,
they’re locked into the <a href="http://utf8everywhere.org/">outdated UTF-16</a>. For me this wasn’t
too bad, because few computers, Linux included, are equipped to render
characters outside of the <a href="http://en.wikipedia.org/wiki/Plane_(Unicode)">Basic Multilingual Plane</a> anyway, so
there’s no need to deal with surrogate pairs. This is especially true
for the Windows console, which can only render a <em>very</em> small set of
characters: another limit on my graphics engine. Internally individual
codepoints are handled as <code class="language-plaintext highlighter-rouge">uint16_t</code> and strings are handled as UTF-8.</p>

<p>I said <em>partial</em> support because, in addition to the above, it has no
support for combining characters, or any other situation where a
codepoint takes up something other than one space in the terminal.
This requires lookup tables and dealing with <a href="/blog/2014/06/13/">pitfalls</a>, but
since I got to control exactly which characters were going to be used,
I didn’t need any of that.</p>
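
<p>To illustrate why sticking to the Basic Multilingual Plane keeps things simple, here’s a minimal sketch of turning a <code class="language-plaintext highlighter-rouge">uint16_t</code> codepoint into UTF-8. The function name is hypothetical and this isn’t lifted from G-COM. With only 16 bits to work with, the encoding never needs more than three bytes.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one BMP codepoint as UTF-8 into `out`
 * (room for at least 3 bytes) and return the number of bytes written.
 * Surrogate values (U+D800 to U+DFFF) are not rejected in this sketch. */
int
utf8_encode_bmp(uint16_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                    /* 0xxxxxxx */
        return 1;
    } else if (cp < 0x800) {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));    /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));  /* 10xxxxxx */
        return 2;
    } else {
        out[0] = (unsigned char)(0xe0 | (cp >> 12));          /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));  /* 10xxxxxx */
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));         /* 10xxxxxx */
        return 3;
    }
}
```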

<p>In spite of the limitations, I’m <em>really</em> happy with the graphical
results. The waves are animated continuously, even while the game is
paused, and it looks great. Here’s GNOME Terminal’s rendering, which I
think looked the best by default.</p>

<video width="480" height="400" controls="" loop="" autoplay="">
  <source src="/vid/gcom.webm" type="video/webm" />
  <source src="/vid/gcom.mp4" type="video/mp4" />
</video>

<p>I’ll talk about how G-COM actually communicates with the terminal in
another article. The interface between the game and the graphics
engine is really clean (<code class="language-plaintext highlighter-rouge">device.h</code>), so it would be an interesting
project to write a back end that renders the game to a regular window,
no terminal needed.</p>

<h4 id="color-directive">Color Directive</h4>

<p>I came up with a format directive to help me colorize everything. It
runs in addition to the standard <code class="language-plaintext highlighter-rouge">printf</code> directives. Here’s an example,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">panel_printf</span><span class="p">(</span><span class="o">&amp;</span><span class="n">panel</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"Really save and quit? (Rk{y}/Rk{n})"</span><span class="p">);</span>
</code></pre></div></div>

<p>The color is specified by two characters (foreground, then background), and the text it applies to
is wrapped in curly brackets. There are eight colors to pick from:
RGBCMYKW. That covers all the binary values for red, green, and blue.
To specify an “intense” (bright) color, capitalize it. That means the
<code class="language-plaintext highlighter-rouge">Rk{...}</code> above makes the wrapped text bright red.</p>

<p><img src="/img/screenshot/gcom-yn.png" alt="" /></p>

<p>Nested directives are also supported. (And, yes, that <code class="language-plaintext highlighter-rouge">K</code> means “high
intensity black,” a.k.a. dark gray. A <code class="language-plaintext highlighter-rouge">w</code> means “low intensity white,”
a.k.a. light gray.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">panel_printf</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="o">++</span><span class="p">,</span> <span class="s">"Kk{♦}    wk{Rk{B}uild}     Kk{♦}"</span><span class="p">);</span>
</code></pre></div></div>

<p>And it mixes with the normal <code class="language-plaintext highlighter-rouge">printf</code> directives:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">panel_printf</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="o">++</span><span class="p">,</span> <span class="s">"(Rk{m}) Yk{Mine} [%s]"</span><span class="p">,</span> <span class="n">cost</span><span class="p">);</span>
</code></pre></div></div>
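
<p>Here’s a rough sketch of how such a directive could be expanded for an ANSI terminal. It isn’t G-COM’s implementation. The function names are hypothetical, the base colors (light gray on black) are an assumption, and bright colors use the widely supported 90–97 and 100–107 codes, which aren’t part of the original ANSI set. A small stack remembers the enclosing colors so nested directives restore properly.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Map a color letter to its ANSI index (0-7), or -1 if it isn't one.
 * ANSI order: black, red, green, yellow, blue, magenta, cyan, white. */
static int
color_index(char c)
{
    static const char letters[] = "krgybmcw";
    const char *p;
    if (c == '\0')
        return -1;
    p = strchr(letters, tolower((unsigned char)c));
    return p ? (int)(p - letters) : -1;
}

/* Expand "Rk{text}" directives in `in` into ANSI SGR escapes in `out`.
 * Returns 0 on success, -1 if `out` is too small or nesting is too deep. */
int
color_expand(const char *in, char *out, size_t outsize)
{
    struct { int fg, bg; } stack[16];
    int depth = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return -1;
    stack[0].fg = 37;  /* assumed base: light gray on black */
    stack[0].bg = 40;
    while (*in) {
        int fg = color_index(in[0]);
        int bg = fg < 0 ? -1 : color_index(in[1]);
        int w;
        if (bg >= 0 && in[2] == '{') {          /* open a directive */
            if (depth == 15)
                return -1;
            depth++;
            stack[depth].fg = fg + (isupper((unsigned char)in[0]) ? 90 : 30);
            stack[depth].bg = bg + (isupper((unsigned char)in[1]) ? 100 : 40);
            in += 3;
        } else if (*in == '}' && depth > 0) {   /* close it, restore outer */
            depth--;
            in++;
        } else {                                /* ordinary character */
            if (n + 1 >= outsize)
                return -1;
            out[n++] = *in++;
            continue;
        }
        w = snprintf(out + n, outsize - n, "\033[%d;%dm",
                     stack[depth].fg, stack[depth].bg);
        if (w < 0 || (size_t)w >= outsize - n)
            return -1;
        n += (size_t)w;
    }
    out[n] = '\0';
    return 0;
}
```

<p>One consequence of a format like this is that any two color letters followed by an opening curly bracket are read as a directive, so literal text has to avoid that pattern. In this sketch the result would be fed to an ordinary <code class="language-plaintext highlighter-rouge">printf</code>-style formatter afterwards.</p>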

<h3 id="single-binary">Single Binary</h3>

<p>The GNU linker has a really nice feature for linking arbitrary binary
data into your application. I used this to embed my assets into a
single binary so that the user doesn’t need to worry about any sort of
data directory or anything like that. Here’s what the <code class="language-plaintext highlighter-rouge">make</code> rule
would look like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$(LD) -r -b binary -o $@ $^
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-r</code> specifies that output should be relocatable — i.e. it can be
fed back into the linker later when linking the final binary. The <code class="language-plaintext highlighter-rouge">-b
binary</code> says that the input is just an opaque binary file (“plain”
text included). The linker will create three symbols for each input
file:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">_binary_filename_start</code></li>
  <li><code class="language-plaintext highlighter-rouge">_binary_filename_end</code></li>
  <li><code class="language-plaintext highlighter-rouge">_binary_filename_size</code></li>
</ul>

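<p>Which you can then access from your C program like so (here for an input file named <code class="language-plaintext highlighter-rouge">filename.txt</code>, since non-alphanumeric characters in the name become underscores):</p>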

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="k">const</span> <span class="kt">char</span> <span class="n">_binary_filename_txt_start</span><span class="p">[];</span>
</code></pre></div></div>

<p>I used this to embed the story texts, and I’ve used it in the past to
embed images and textures. If you were to link zlib, you could easily
compress these assets, too. I’m surprised this sort of thing isn’t
done more often!</p>

<h3 id="dumb-game-saves">Dumb Game Saves</h3>

<p>To save time, and because it doesn’t really matter, saves are just
memory dumps. I took another page from <a href="http://handmadehero.org/">Handmade Hero</a> and
allocate everything in a single, contiguous block of memory. With one
exception, there are no pointers, so the entire block is relocatable.
When references are needed, it’s done via integers into the embedded
arrays. This allows it to be cleanly reloaded in another process
later. As a side effect, it also means there are no dynamic
allocations (<code class="language-plaintext highlighter-rouge">malloc()</code>) while the game is running. Here’s roughly
what it looks like.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">game</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">map_seed</span><span class="p">;</span>
    <span class="n">map_t</span> <span class="o">*</span><span class="n">map</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">time</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">wood</span><span class="p">,</span> <span class="n">gold</span><span class="p">,</span> <span class="n">food</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">population</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">goblin_spawn_rate</span><span class="p">;</span>
    <span class="n">invader_t</span> <span class="n">invaders</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="n">squad_t</span> <span class="n">squads</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="n">hero_t</span> <span class="n">heroes</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span>
    <span class="n">game_event_t</span> <span class="n">events</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
<span class="p">}</span> <span class="n">game_t</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">map</code> pointer is that one exception, but that’s because it’s
generated fresh after loading from the <code class="language-plaintext highlighter-rouge">map_seed</code>. Saving and loading
is trivial (error checking omitted) and <em>very</em> fast.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">game_save</span><span class="p">(</span><span class="n">game_t</span> <span class="o">*</span><span class="n">game</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">out</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">game</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">game</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">out</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">game_t</span> <span class="o">*</span>
<span class="nf">game_load</span><span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">in</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">game_t</span> <span class="o">*</span><span class="n">game</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">game</span><span class="p">));</span>
    <span class="n">fread</span><span class="p">(</span><span class="n">game</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">game</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span>
    <span class="n">game</span><span class="o">-&gt;</span><span class="n">map</span> <span class="o">=</span> <span class="n">map_generate</span><span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">map_seed</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">game</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The data isn’t important enough to bother with <a href="http://lwn.net/Articles/322823/">rename+fsync</a>
durability. I’ll risk the data if it makes savescumming that much
harder!</p>

<p>The downside to this technique is that saves are generally not
portable across architectures (particularly where endianness differs),
and may not even be portable between different platforms on the same
architecture. I only needed to persist a single game state on the same
machine, so this wouldn’t be a problem.</p>
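
<p>If a stale or foreign save ever did become a concern, one cheap guard is
to put a small header in front of the raw dump. The following is only a
sketch under assumptions: <code class="language-plaintext highlighter-rouge">struct save_header</code>, the <code class="language-plaintext highlighter-rouge">GCOM</code> tag,
<code class="language-plaintext highlighter-rouge">save_blob()</code>, and <code class="language-plaintext highlighter-rouge">load_blob()</code> are hypothetical names that are not
part of the game. It does not make saves portable. It only rejects a
file whose byte order or struct size does not match the running build.</p>

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical header written ahead of the raw struct dump. */
struct save_header {
    char     magic[4];  /* arbitrary tag chosen for this sketch */
    uint32_t endian;    /* reads back differently on a byte-swapped host */
    uint32_t size;      /* sizeof the dumped struct on the saving build */
};

#define SAVE_ENDIAN_MARK 0x01020304u

/* Write header plus raw bytes. Returns 1 on success, 0 on failure. */
static int
save_blob(const void *state, size_t size, FILE *out)
{
    struct save_header h = {
        {'G', 'C', 'O', 'M'}, SAVE_ENDIAN_MARK, (uint32_t)size
    };
    return fwrite(&h, sizeof(h), 1, out) == 1 &&
           fwrite(state, size, 1, out) == 1;
}

/* Read back only if the header matches this build. Returns 1 or 0. */
static int
load_blob(void *state, size_t size, FILE *in)
{
    struct save_header h;
    if (fread(&h, sizeof(h), 1, in) != 1)
        return 0;
    if (memcmp(h.magic, "GCOM", 4) != 0)
        return 0;
    if (h.endian != SAVE_ENDIAN_MARK || h.size != size)
        return 0;
    return fread(state, size, 1, in) == 1;
}
```

<p>A caller would pass <code class="language-plaintext highlighter-rouge">game</code> and <code class="language-plaintext highlighter-rouge">sizeof(*game)</code>, then regenerate the
map from the seed exactly as before.</p>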

<h3 id="final-results">Final Results</h3>

<p>I’m definitely going to be reusing some of this code in future
projects. The G-COM terminal graphics layer is nifty, and I already
like it better than ncurses, whose API I’ve always thought was kind of
ugly and old-fashioned. I like writing terminal applications.</p>

<p>Just like the last couple of years, the final game is a lot simpler
than I had planned at the beginning of the week. Most things take
longer to code than I initially expect. I’m still enjoying playing it,
which is a really good sign. When I play, I’m having enough fun to
deliberately delay the end of the game so that I can sprawl my nation
out over the island and generate crazy income.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Generic C Reference Counting</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/02/17/"/>
    <id>urn:uuid:58357076-8f76-3506-0c5e-198bfc711f8d</id>
    <updated>2015-02-17T04:06:11Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>As a result of making regular use of <a href="/blog/2014/10/21/">object-oriented programming in
C</a>, I’ve discovered a useful reference counting technique for the
occasional dynamically allocated structs that need it. The situation
arises when the same struct instance is shared between an arbitrary
number of other data structures and I need to keep track of it all.</p>

<p>It’s <em>incredibly</em> simple and lives entirely in a header file, so
without further ado (<code class="language-plaintext highlighter-rouge">ref.h</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma once
</span>
<span class="k">struct</span> <span class="n">ref</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">count</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">ref_inc</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="n">ref</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">((</span><span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="p">)</span><span class="n">ref</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">ref_dec</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="n">ref</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">--</span><span class="p">((</span><span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="p">)</span><span class="n">ref</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">ref</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">(</span><span class="n">ref</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It has only two fields: the reference count and a “method” that knows
how to free the object once the reference count hits 0. Structs using
this reference counter will know how to free themselves, so callers
will never call a specific <code class="language-plaintext highlighter-rouge">*_destroy()</code>/<code class="language-plaintext highlighter-rouge">*_free()</code> function. Instead
they call <code class="language-plaintext highlighter-rouge">ref_dec()</code> to decrement the reference counter and let it
happen on its own.</p>

<p>I decided to go with a signed count because it allows for better error
checking. It may be worth putting an <code class="language-plaintext highlighter-rouge">assert()</code> in <code class="language-plaintext highlighter-rouge">ref_inc()</code> and
<code class="language-plaintext highlighter-rouge">ref_dec()</code> to ensure the count is always non-negative. I chose an
<code class="language-plaintext highlighter-rouge">int</code> because it’s fast, and anything smaller will be padded out to
<em>at least</em> that size anyway. On x86-64, <code class="language-plaintext highlighter-rouge">struct ref</code> is 16 bytes.</p>
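
<p>As a rough sketch of that suggestion, the assertions might look like
the following. The exact conditions are an assumption: the increment
checks that the count is non-negative, and the decrement checks that it
is positive beforehand so the result cannot go below zero. The
<code class="language-plaintext highlighter-rouge">demo_free()</code> method and its counter are hypothetical and exist only
to show the calls.</p>

```c
#include <assert.h>

struct ref {
    void (*free)(const struct ref *);
    int count;
};

static inline void
ref_inc(const struct ref *ref)
{
    assert(ref->count >= 0);  /* a negative count means earlier misuse */
    ((struct ref *)ref)->count++;
}

static inline void
ref_dec(const struct ref *ref)
{
    assert(ref->count > 0);   /* releasing more often than acquiring */
    if (--((struct ref *)ref)->count == 0)
        ref->free(ref);
}

/* Hypothetical free method that only counts how often it ran. */
static int demo_freed;

static void
demo_free(const struct ref *ref)
{
    (void)ref;
    demo_freed++;
}
```

<p>These checks are best effort. Once an object has really been freed,
reading its count is already undefined, so they mostly catch unbalanced
calls on objects that are still alive.</p>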

<p>This is basically all there is to a C++ <a href="http://en.cppreference.com/w/cpp/memory/shared_ptr">shared_ptr</a>,
leveraging C++’s destructors and performing all increment/decrement
work automatically.</p>

<h3 id="thread-safety">Thread Safety</h3>

<p>Those increments and decrements aren’t thread safe, so this won’t work
as-is when data structures are shared between threads. If you’re sure
that you’re using GCC on a capable platform, you can make use of its
<a href="http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html">atomic builtins</a>, making the reference counter completely thread
safe.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">ref_inc</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="n">ref</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__sync_add_and_fetch</span><span class="p">((</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">ref</span><span class="o">-&gt;</span><span class="n">count</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">ref_dec</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="n">ref</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">__sync_sub_and_fetch</span><span class="p">((</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">ref</span><span class="o">-&gt;</span><span class="n">count</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">ref</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">(</span><span class="n">ref</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or if you’re using C11, <a href="/blog/2014/09/02/">make use of the new stdatomic.h</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">ref_inc</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="n">ref</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">atomic_fetch_add</span><span class="p">((</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">ref</span><span class="o">-&gt;</span><span class="n">count</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">ref_dec</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="n">ref</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">atomic_fetch_sub</span><span class="p">((</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">ref</span><span class="o">-&gt;</span><span class="n">count</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span>
        <span class="n">ref</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">(</span><span class="n">ref</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
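
<p>One caveat: the generic functions in <code class="language-plaintext highlighter-rouge">stdatomic.h</code> are specified to
operate on atomic types, and as far as I know some compilers (Clang, for
one) reject a plain <code class="language-plaintext highlighter-rouge">int *</code> argument. A variant that declares the
counter as <code class="language-plaintext highlighter-rouge">atomic_int</code> avoids relying on that. This is a sketch, and
<code class="language-plaintext highlighter-rouge">demo_free()</code> is a hypothetical method used only for illustration.</p>

```c
#include <stdatomic.h>

struct ref {
    void (*free)(const struct ref *);
    atomic_int count;  /* atomic type, so no int * cast is needed */
};

static inline void
ref_inc(const struct ref *ref)
{
    /* the cast only removes const, as in the non-atomic version */
    atomic_fetch_add(&((struct ref *)ref)->count, 1);
}

static inline void
ref_dec(const struct ref *ref)
{
    /* atomic_fetch_sub returns the value held before the subtraction */
    if (atomic_fetch_sub(&((struct ref *)ref)->count, 1) == 1)
        ref->free(ref);
}

/* Hypothetical free method that only counts how often it ran. */
static int demo_freed;

static void
demo_free(const struct ref *ref)
{
    (void)ref;
    demo_freed++;
}
```

<p>The size and layout of <code class="language-plaintext highlighter-rouge">atomic_int</code> are up to the implementation, so
the 16-byte figure for <code class="language-plaintext highlighter-rouge">struct ref</code> should be rechecked with
<code class="language-plaintext highlighter-rouge">sizeof</code> on the target platform.</p>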

<h3 id="whats-that-const">What’s That Const?</h3>

<p>There’s a very deliberate decision to make all of the function
arguments <code class="language-plaintext highlighter-rouge">const</code>, for both reference counting functions and the
<code class="language-plaintext highlighter-rouge">free()</code> method. This may seem wrong because these functions are
<em>specifically</em> intended to modify the reference count. There are
dangerous-looking casts in each case to remove the <code class="language-plaintext highlighter-rouge">const</code>.</p>

<p>The reason for this is that it’s likely for someone holding a
<code class="language-plaintext highlighter-rouge">const</code> pointer to one of these objects to want to keep their own
reference. Their promise not to modify the object doesn’t <em>really</em>
apply to the reference count, which is merely embedded metadata. They
would need to cast the <code class="language-plaintext highlighter-rouge">const</code> away before being permitted to call
<code class="language-plaintext highlighter-rouge">ref_inc()</code> and <code class="language-plaintext highlighter-rouge">ref_dec()</code>. Rather than litter the program with
dangerous casts, the casts are all kept in one place — in the
reference counting functions — where they’re strictly limited to
mutating the reference counting fields.</p>
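
<p>To make that concrete, here is a sketch of a holder that only ever sees
a <code class="language-plaintext highlighter-rouge">const</code> pointer yet keeps its own reference without a single cast
at the call site. <code class="language-plaintext highlighter-rouge">struct observer</code>, <code class="language-plaintext highlighter-rouge">observer_watch()</code>, and
<code class="language-plaintext highlighter-rouge">demo_free()</code> are hypothetical names invented for this example.</p>

```c
#include <stddef.h>

struct ref {
    void (*free)(const struct ref *);
    int count;
};

static inline void
ref_inc(const struct ref *ref)
{
    ((struct ref *)ref)->count++;
}

static inline void
ref_dec(const struct ref *ref)
{
    if (--((struct ref *)ref)->count == 0)
        ref->free(ref);
}

/* Hypothetical read-only observer that keeps one object alive. */
struct observer {
    const struct ref *held;
};

static void
observer_watch(struct observer *o, const struct ref *target)
{
    if (target)
        ref_inc(target);   /* take the new reference first ... */
    if (o->held)
        ref_dec(o->held);  /* ... so re-watching the same object is safe */
    o->held = target;
}

/* Hypothetical free method that only counts how often it ran. */
static int demo_freed;

static void
demo_free(const struct ref *ref)
{
    (void)ref;
    demo_freed++;
}
```

<p>This only holds up when the underlying object was not itself defined
<code class="language-plaintext highlighter-rouge">const</code>. Modifying a genuinely <code class="language-plaintext highlighter-rouge">const</code> object through a cast is
undefined behavior in C.</p>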

<p>On a related note, the <code class="language-plaintext highlighter-rouge">stdlib.h</code> <code class="language-plaintext highlighter-rouge">free()</code> function doesn’t take a
<code class="language-plaintext highlighter-rouge">const</code> pointer, so the <code class="language-plaintext highlighter-rouge">free()</code> method taking a <code class="language-plaintext highlighter-rouge">const</code> pointer is a
slight departure from the norm. Taking a non-<code class="language-plaintext highlighter-rouge">const</code> pointer <a href="http://yarchive.net/comp/const.html">was a
mistake in the C standard library</a>. The <code class="language-plaintext highlighter-rouge">free()</code> function
mutates the pointer itself — including all other pointers to that
object — making it invalid. Semantically, it doesn’t mutate the
memory <em>behind</em> the pointer, so it’s not actually violating the
<code class="language-plaintext highlighter-rouge">const</code>. To compare, the <a href="http://lxr.free-electrons.com/source/include/linux/slab.h#L144">Linux kernel <code class="language-plaintext highlighter-rouge">kfree()</code></a> takes a
<code class="language-plaintext highlighter-rouge">const void *</code>.</p>

<p>Just as users may need to increment and decrement the counters on
<code class="language-plaintext highlighter-rouge">const</code> objects, they’ll also need to be able to <code class="language-plaintext highlighter-rouge">free()</code> them, so
it’s also a <code class="language-plaintext highlighter-rouge">const</code>.</p>

<h3 id="usage-example">Usage Example</h3>

<p>So how does one use this generic reference counter? Embed a <code class="language-plaintext highlighter-rouge">struct
ref</code> in your own structure and use our old friend: the
<code class="language-plaintext highlighter-rouge">container_of()</code> macro. For anyone who’s forgotten, this macro is not
part of standard C, but you can define it with <code class="language-plaintext highlighter-rouge">offsetof()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
</span></code></pre></div></div>

<p>Here’s a dumb linked list example where each node is individually
reference counted. Adding an extra 16 bytes to each of your linked
list nodes isn’t normally going to help with much, but if the tail of
the linked list is being shared between different data structures
(such as other lists), reference counting makes things a lot simpler.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">node</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">id</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
    <span class="kt">float</span> <span class="n">value</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">ref</span> <span class="n">refcount</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>I put <code class="language-plaintext highlighter-rouge">refcount</code> at the end so that we’ll have to use <code class="language-plaintext highlighter-rouge">container_of()</code>
in this example. It conveniently casts away the <code class="language-plaintext highlighter-rouge">const</code> for us.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">node_free</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">ref</span> <span class="o">*</span><span class="n">ref</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">container_of</span><span class="p">(</span><span class="n">ref</span><span class="p">,</span> <span class="k">struct</span> <span class="n">node</span><span class="p">,</span> <span class="n">refcount</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">child</span> <span class="o">=</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="n">free</span><span class="p">(</span><span class="n">node</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">child</span><span class="p">)</span>
        <span class="n">ref_dec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">child</span><span class="o">-&gt;</span><span class="n">refcount</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice that it recursively decrements its child’s reference count
afterwards (intentionally tail recursive). A whole list will clean
itself up when the head is freed and no part of the list is shared.</p>

<p>The allocation function sets up the <code class="language-plaintext highlighter-rouge">free()</code> function pointer and
initializes the count to 1.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">node</span> <span class="o">*</span>
<span class="nf">node_create</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">id</span><span class="p">,</span> <span class="kt">float</span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">node</span><span class="p">));</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">id</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">id</span><span class="p">),</span> <span class="s">"%s"</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="n">node</span><span class="o">-&gt;</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">node</span><span class="o">-&gt;</span><span class="n">refcount</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">ref</span><span class="p">){</span><span class="n">node_free</span><span class="p">,</span> <span class="mi">1</span><span class="p">};</span>
    <span class="k">return</span> <span class="n">node</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(Side note: I used <code class="language-plaintext highlighter-rouge">snprintf()</code> because <a href="https://randomascii.wordpress.com/2013/04/03/stop-using-strncpy-already/"><code class="language-plaintext highlighter-rouge">strncpy()</code> is
broken</a> and <code class="language-plaintext highlighter-rouge">strlcpy()</code> is non-standard, so it’s the most
straightforward way to do this in standard C.)</p>

<p>And to start making some use of the reference counter, here’s push and
pop.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">node_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">node</span> <span class="o">**</span><span class="n">nodes</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">id</span><span class="p">,</span> <span class="kt">float</span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">node_create</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>
    <span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="o">*</span><span class="n">nodes</span><span class="p">;</span>
    <span class="o">*</span><span class="n">nodes</span> <span class="o">=</span> <span class="n">node</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">node</span> <span class="o">*</span>
<span class="nf">node_pop</span><span class="p">(</span><span class="k">struct</span> <span class="n">node</span> <span class="o">**</span><span class="n">nodes</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="o">*</span><span class="n">nodes</span><span class="p">;</span>
    <span class="o">*</span><span class="n">nodes</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">)</span>
        <span class="n">ref_inc</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">refcount</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">node</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice <code class="language-plaintext highlighter-rouge">node_pop()</code> increments the reference count of the new head
node before returning. That’s because the node now has an additional
reference: from <code class="language-plaintext highlighter-rouge">*nodes</code> <em>and</em> from the node that was just popped.
It’s up to the caller to free the returned node, which would decrement
the count of the new head node, but not free it. Alternatively
<code class="language-plaintext highlighter-rouge">node_pop()</code> could set <code class="language-plaintext highlighter-rouge">next</code> on the returned node to NULL rather than
increment the counter, which would also prevent the returned node from
freeing the new head when it gets freed. But it’s probably more useful
for the returned node to keep functioning as a list. That’s what the
reference counting is for, after all.</p>
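
<p>For comparison, the alternative described above could be sketched as
follows. <code class="language-plaintext highlighter-rouge">node_pop_detached()</code> and the <code class="language-plaintext highlighter-rouge">nodes_freed</code> counter are
hypothetical names, and the supporting definitions are repeated (with
<code class="language-plaintext highlighter-rouge">const char *</code> ids and an allocation check added) so the block stands
on its own. The list must be non-empty when popping.</p>

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct ref {
    void (*free)(const struct ref *);
    int count;
};

static inline void
ref_dec(const struct ref *ref)
{
    if (--((struct ref *)ref)->count == 0)
        ref->free(ref);
}

struct node {
    char id[64];
    float value;
    struct node *next;
    struct ref refcount;
};

static int nodes_freed;  /* hypothetical counter, only to observe the demo */

static void
node_free(const struct ref *ref)
{
    struct node *node = container_of(ref, struct node, refcount);
    struct node *child = node->next;
    free(node);
    nodes_freed++;
    if (child)
        ref_dec(&child->refcount);
}

static struct node *
node_create(const char *id, float value)
{
    struct node *node = malloc(sizeof(*node));
    if (!node)
        return NULL;
    snprintf(node->id, sizeof(node->id), "%s", id);
    node->value = value;
    node->next = NULL;
    node->refcount = (struct ref){node_free, 1};
    return node;
}

static void
node_push(struct node **nodes, const char *id, float value)
{
    struct node *node = node_create(id, value);
    if (!node)
        return;  /* allocation failed: leave the list unchanged */
    node->next = *nodes;
    *nodes = node;
}

/* Variant of node_pop(): no increment, the popped node is cut loose. */
static struct node *
node_pop_detached(struct node **nodes)
{
    struct node *node = *nodes;
    *nodes = node->next;  /* the list inherits the popped node's reference */
    node->next = NULL;    /* the popped node no longer points into the list */
    return node;
}
```

<p>The reference that the popped node held on its successor is handed to
<code class="language-plaintext highlighter-rouge">*nodes</code>, so no count changes. The price is that the returned node is
a list of one.</p>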

<p>Finally, a simple program to exercise it all. It reads ID/value pairs
from standard input.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">node_print</span><span class="p">(</span><span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">node</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;</span> <span class="n">node</span><span class="p">;</span> <span class="n">node</span> <span class="o">=</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%s = %f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">id</span><span class="p">,</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">nodes</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">id</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
    <span class="kt">float</span> <span class="n">value</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">scanf</span><span class="p">(</span><span class="s">" %63s %f"</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span>
        <span class="n">node_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">nodes</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nodes</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">node_print</span><span class="p">(</span><span class="n">nodes</span><span class="p">);</span>
        <span class="k">struct</span> <span class="n">node</span> <span class="o">*</span><span class="n">old</span> <span class="o">=</span> <span class="n">node_pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">nodes</span><span class="p">);</span>
        <span class="n">node_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">nodes</span><span class="p">,</span> <span class="s">"foobar"</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">);</span>
        <span class="n">node_print</span><span class="p">(</span><span class="n">nodes</span><span class="p">);</span>
        <span class="n">ref_dec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">old</span><span class="o">-&gt;</span><span class="n">refcount</span><span class="p">);</span>
        <span class="n">ref_dec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">nodes</span><span class="o">-&gt;</span><span class="n">refcount</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’ve used this technique several times over the past few months. It’s
trivial to remember, so I just code it up from scratch each time I
need it.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Interactive Programming in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/12/23/"/>
    <id>urn:uuid:203e981d-b086-393e-27c0-db18dacfc4bf</id>
    <updated>2014-12-23T05:43:41Z</updated>
    <category term="c"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p>I’m a huge fan of interactive programming (see: <a href="/blog/2012/10/31/">JavaScript</a>,
<a href="/blog/2011/08/30/">Java</a>, <a href="http://common-lisp.net/project/slime/">Lisp</a>, <a href="https://github.com/clojure-emacs/cider">Clojure</a>). That is, modifying and
extending a program while it’s running. For certain kinds of non-batch
applications, it takes much of the tedium out of testing and tweaking
during development. Until last week I didn’t know how to apply
interactive programming to C. How does one go about redefining
functions in a running C program?</p>

<p>Last week in <a href="http://handmadehero.org/">Handmade Hero</a> (days 21-25), Casey Muratori added
interactive programming to the game engine. This is especially useful
in game development, where the developer might want to tweak, say, a
boss fight without having to restart the entire game after each tweak.
Now that I’ve seen it done, it seems so obvious. <strong>The secret is to
build almost the entire application as a shared library.</strong></p>

<p>This puts a serious constraint on the design of the program: <strong>it
cannot keep any state in global or static variables</strong>, though this
<a href="/blog/2014/10/12/">should be avoided anyway</a>. Global state will be lost each
time the shared library is reloaded. In some situations, this can also
restrict use of the C standard library, including functions like
<code class="language-plaintext highlighter-rouge">malloc()</code>, depending on how these functions are implemented or
linked. For example, if the C standard library is statically linked,
functions with global state may introduce global state into the shared
library. It’s difficult to know what’s safe to use. This works fine in
Handmade Hero because the core game, the part loaded as a shared
library, makes no use of external libraries, including the standard
library.</p>

<p>Additionally, the shared library must be careful with its use of
function pointers. The functions being pointed at will no longer exist
after a reload. This is a real issue when combining interactive
programming with <a href="/blog/2014/10/21/">object oriented C</a>.</p>

<h3 id="an-example-with-the-game-of-life">An example with the Game of Life</h3>

<p>To demonstrate how this works, let’s go through an example. I wrote a
simple ncurses Game of Life demo that’s easy to modify. You can get
the entire source here if you’d like to play around with it yourself
on a Unix-like system.</p>

<ul>
  <li><a href="https://github.com/skeeto/interactive-c-demo">https://github.com/skeeto/interactive-c-demo</a></li>
</ul>

<p><strong>Quick start</strong>:</p>

<ol>
  <li>In a terminal run <code class="language-plaintext highlighter-rouge">make</code> then <code class="language-plaintext highlighter-rouge">./main</code>. Press <code class="language-plaintext highlighter-rouge">r</code> to randomize and <code class="language-plaintext highlighter-rouge">q</code>
to quit.</li>
  <li>Edit <code class="language-plaintext highlighter-rouge">game.c</code> to change the Game of Life rules, add colors, etc.</li>
  <li>In a second terminal run <code class="language-plaintext highlighter-rouge">make</code>. Your changes will be reflected
immediately in the original program!</li>
</ol>

<p><img src="/img/screenshot/live-c.gif" alt="" /></p>

<p>As of this writing, Handmade Hero is being written on Windows, so
Casey is using a DLL and the Win32 API, but the same technique can be
applied on Linux, or any other Unix-like system, using libdl. That’s
what I’ll be using here.</p>

<p>The program will be broken into two parts: the Game of Life shared
library (“game”) and a wrapper (“main”) whose job is only to load the
shared library, reload it when it updates, and call it at a regular
interval. The wrapper is agnostic about the operation of the “game”
portion, so it could be re-used almost untouched in another project.</p>

<p>To avoid maintaining a whole bunch of function pointer assignments in
several places, the API to the “game” is enclosed in a struct. This
also eliminates warnings from the C compiler about mixing data and
function pointers. The layout and contents of the <code class="language-plaintext highlighter-rouge">game_state</code>
struct are private to the game itself. The wrapper will only handle a
pointer to this struct.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">game_state</span><span class="p">;</span>

<span class="k">struct</span> <span class="n">game_api</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">game_state</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">init</span><span class="p">)();</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">finalize</span><span class="p">)(</span><span class="k">struct</span> <span class="n">game_state</span> <span class="o">*</span><span class="n">state</span><span class="p">);</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">reload</span><span class="p">)(</span><span class="k">struct</span> <span class="n">game_state</span> <span class="o">*</span><span class="n">state</span><span class="p">);</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">unload</span><span class="p">)(</span><span class="k">struct</span> <span class="n">game_state</span> <span class="o">*</span><span class="n">state</span><span class="p">);</span>
    <span class="n">bool</span> <span class="p">(</span><span class="o">*</span><span class="n">step</span><span class="p">)(</span><span class="k">struct</span> <span class="n">game_state</span> <span class="o">*</span><span class="n">state</span><span class="p">);</span>
<span class="p">};</span>
</code></pre></div></div>

<p>In the demo the API is made of 5 functions. The first 4 are primarily
concerned with loading and unloading.</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">init()</code>: Allocate and return a state to be passed to every other
API call. This will be called once when the program starts and never
again, even after reloading. If we were concerned about using
<code class="language-plaintext highlighter-rouge">malloc()</code> in the shared library, the wrapper would be responsible
for performing the actual memory allocation.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">finalize()</code>: The opposite of <code class="language-plaintext highlighter-rouge">init()</code>, to free all resources held
by the game state.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">reload()</code>: Called immediately after the library is reloaded. This
is the chance to sneak in some additional initialization in the
running program. Normally this function will be empty. It’s only
used temporarily during development.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">unload()</code>: Called just before the library is unloaded, before a new
version is loaded. This is a chance to prepare the state for use by
the next version of the library. This can be used to update structs
and such, if you wanted to be really careful. This would also
normally be empty.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">step()</code>: Called at a regular interval to run the game. A real game
will likely have a few more functions like this.</p>
  </li>
</ul>

<p>The library will provide a filled out API struct as a global variable,
<code class="language-plaintext highlighter-rouge">GAME_API</code>. <strong>This is the only exported symbol in the entire shared
library!</strong> All functions will be declared static, including the ones
referenced by the structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="k">struct</span> <span class="n">game_api</span> <span class="n">GAME_API</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">.</span><span class="n">init</span>     <span class="o">=</span> <span class="n">game_init</span><span class="p">,</span>
    <span class="p">.</span><span class="n">finalize</span> <span class="o">=</span> <span class="n">game_finalize</span><span class="p">,</span>
    <span class="p">.</span><span class="n">reload</span>   <span class="o">=</span> <span class="n">game_reload</span><span class="p">,</span>
    <span class="p">.</span><span class="n">unload</span>   <span class="o">=</span> <span class="n">game_unload</span><span class="p">,</span>
    <span class="p">.</span><span class="n">step</span>     <span class="o">=</span> <span class="n">game_step</span>
<span class="p">};</span>
</code></pre></div></div>
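
<p>To make the shape of the library side concrete, here is a minimal sketch of a “game” that fills out this struct. The <code class="language-plaintext highlighter-rouge">generation</code> field and the quit-after-three-steps rule are hypothetical, invented purely for illustration. The real demo’s <code class="language-plaintext highlighter-rouge">game.c</code> holds the Game of Life grid and ncurses state instead.</p>

```c
#include <stdbool.h>
#include <stdlib.h>

/* Private to the library; the wrapper only ever sees a pointer. */
struct game_state {
    int generation;   /* hypothetical field, for illustration only */
};

struct game_api {
    struct game_state *(*init)(void);
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

static struct game_state *game_init(void)
{
    /* Zero-initialized state, allocated once for the whole run. */
    return calloc(1, sizeof(struct game_state));
}

static void game_finalize(struct game_state *state)
{
    free(state);
}

static void game_reload(struct game_state *state)
{
    (void)state;   /* normally empty */
}

static void game_unload(struct game_state *state)
{
    (void)state;   /* normally empty */
}

static bool game_step(struct game_state *state)
{
    state->generation++;
    /* Returning false asks the wrapper to quit (hypothetical rule). */
    return state->generation < 3;
}

/* The only symbol with external linkage. */
const struct game_api GAME_API = {
    .init     = game_init,
    .finalize = game_finalize,
    .reload   = game_reload,
    .unload   = game_unload,
    .step     = game_step
};
```

<p>Because everything except <code class="language-plaintext highlighter-rouge">GAME_API</code> is static, the wrapper has exactly one name to look up, and the state outlives any individual version of the code since it lives on the heap.</p>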

<h4 id="dlopen-dlsym-and-dlclose">dlopen, dlsym, and dlclose</h4>

<p>The wrapper is focused on calling <code class="language-plaintext highlighter-rouge">dlopen()</code>, <code class="language-plaintext highlighter-rouge">dlsym()</code>, and
<code class="language-plaintext highlighter-rouge">dlclose()</code> in the right order at the right time. The game will be
compiled to the file <code class="language-plaintext highlighter-rouge">libgame.so</code>, so that’s what will be loaded. It’s
written in the source with a <code class="language-plaintext highlighter-rouge">./</code> to force the name to be used as a
filename. The wrapper keeps track of everything in a <code class="language-plaintext highlighter-rouge">game</code> struct.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">GAME_LIBRARY</span> <span class="o">=</span> <span class="s">"./libgame.so"</span><span class="p">;</span>

<span class="k">struct</span> <span class="n">game</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">handle</span><span class="p">;</span>
    <span class="n">ino_t</span> <span class="n">id</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">game_api</span> <span class="n">api</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">game_state</span> <span class="o">*</span><span class="n">state</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">handle</code> is the value returned by <code class="language-plaintext highlighter-rouge">dlopen()</code>. The <code class="language-plaintext highlighter-rouge">id</code> is the
inode of the shared library, as returned by <code class="language-plaintext highlighter-rouge">stat()</code>. The rest is
defined above. Why the inode? We could use a timestamp instead, but
that’s indirect. What we really care about is if the shared object
file is actually a different file than the one that was loaded. The
file will never be updated in place; it will be replaced by the
compiler/linker, so the timestamp isn’t what’s important.</p>
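
<p>The inode comparison can be isolated into a small helper. This is only a sketch of the idea, and the name <code class="language-plaintext highlighter-rouge">file_replaced()</code> is hypothetical. The demo performs the same check inline in <code class="language-plaintext highlighter-rouge">game_load()</code>.</p>

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Returns true, and records the new inode in *last_id, when the file
 * at path exists and is a different file than the one last seen.
 * Returns false if the file is missing or unchanged. */
static bool file_replaced(const char *path, ino_t *last_id)
{
    struct stat attr;
    if (stat(path, &attr) != 0)
        return false;              /* e.g. the linker hasn't written it yet */
    if (attr.st_ino == *last_id)
        return false;              /* still the same file */
    *last_id = attr.st_ino;
    return true;
}
```

<p>Starting with <code class="language-plaintext highlighter-rouge">*last_id</code> at 0 makes the first successful call report a change. Rewriting the file in place would keep the same inode and go unnoticed, which is why this relies on the build replacing the file.</p>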

<p>Using the inode is a much simpler situation than in Handmade Hero. Due
to Windows’ broken file locking behavior, the game DLL can’t be
replaced while it’s being used. To work around this limitation, the
build system and the loader have to rely on randomly-generated
filenames.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">game_load</span><span class="p">(</span><span class="k">struct</span> <span class="n">game</span> <span class="o">*</span><span class="n">game</span><span class="p">)</span>
</code></pre></div></div>

<p>The purpose of the <code class="language-plaintext highlighter-rouge">game_load()</code> function is to load the game API into
a <code class="language-plaintext highlighter-rouge">game</code> struct, but only if either it hasn’t been loaded yet or if
it’s been updated. Since it has several independent failure
conditions, let’s examine it in parts.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">stat</span> <span class="n">attr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">((</span><span class="n">stat</span><span class="p">(</span><span class="n">GAME_LIBRARY</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">attr</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">id</span> <span class="o">!=</span> <span class="n">attr</span><span class="p">.</span><span class="n">st_ino</span><span class="p">))</span> <span class="p">{</span>
</code></pre></div></div>

<p>First, use <code class="language-plaintext highlighter-rouge">stat()</code> to determine if the library’s inode is different
than the one that’s already loaded. The <code class="language-plaintext highlighter-rouge">id</code> field will be 0
initially, so as long as <code class="language-plaintext highlighter-rouge">stat()</code> succeeds, this will load the library
the first time.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">handle</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">api</span><span class="p">.</span><span class="n">unload</span><span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">state</span><span class="p">);</span>
        <span class="n">dlclose</span><span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">handle</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>If a library is already loaded, unload it first, being sure to call
<code class="language-plaintext highlighter-rouge">unload()</code> to inform the library that it’s being updated. <strong>It’s
critically important that <code class="language-plaintext highlighter-rouge">dlclose()</code> happens before <code class="language-plaintext highlighter-rouge">dlopen()</code>.</strong> On
my system, <code class="language-plaintext highlighter-rouge">dlopen()</code> looks only at the string it’s given, not the
file behind it. Even though the file has been replaced on the
filesystem, <code class="language-plaintext highlighter-rouge">dlopen()</code> will see that the string matches a library
already opened and return a pointer to the old library. (Is this a
bug?) The handles are reference counted internally by libdl.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">void</span> <span class="o">*</span><span class="n">handle</span> <span class="o">=</span> <span class="n">dlopen</span><span class="p">(</span><span class="n">GAME_LIBRARY</span><span class="p">,</span> <span class="n">RTLD_NOW</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally load the game library. There’s a race condition here that
cannot be helped due to limitations of <code class="language-plaintext highlighter-rouge">dlopen()</code>. The library may
have been updated <em>again</em> since the call to <code class="language-plaintext highlighter-rouge">stat()</code>. Since we can’t
ask <code class="language-plaintext highlighter-rouge">dlopen()</code> about the inode of the library it opened, we can’t
know. But as this is only used during development, not in production,
it’s not a big deal.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="p">(</span><span class="n">handle</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">handle</span> <span class="o">=</span> <span class="n">handle</span><span class="p">;</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">id</span> <span class="o">=</span> <span class="n">attr</span><span class="p">.</span><span class="n">st_ino</span><span class="p">;</span>
        <span class="cm">/* ... more below ... */</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">handle</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">dlopen()</code> fails, it will return <code class="language-plaintext highlighter-rouge">NULL</code>. In the case of ELF, this
will happen if the compiler/linker is still in the process of writing
out the shared library. Since the unload was already done, this means
no game will be loaded when <code class="language-plaintext highlighter-rouge">game_load()</code> returns. The user of the
struct needs to be prepared for this eventuality. It will need to try
loading again later (e.g. a few milliseconds). It may be worth filling
the API with stub functions when no library is loaded.</p>
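
<p>A stub API along those lines might look like the following sketch. The <code class="language-plaintext highlighter-rouge">stub_*</code> names and <code class="language-plaintext highlighter-rouge">STUB_API</code> are hypothetical and not part of the demo.</p>

```c
#include <stdbool.h>
#include <stddef.h>

struct game_state;

struct game_api {
    struct game_state *(*init)(void);
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

/* Never allocates: the real init() should run once a library loads. */
static struct game_state *stub_init(void)
{
    return NULL;
}

/* Shared do-nothing body for finalize, reload, and unload. */
static void stub_noop(struct game_state *state)
{
    (void)state;
}

/* Report "keep running" so the main loop survives until the next load. */
static bool stub_step(struct game_state *state)
{
    (void)state;
    return true;
}

static const struct game_api STUB_API = {
    .init     = stub_init,
    .finalize = stub_noop,
    .reload   = stub_noop,
    .unload   = stub_noop,
    .step     = stub_step
};
```

<p>Assigning <code class="language-plaintext highlighter-rouge">game-&gt;api = STUB_API</code> in the failure branches would let the caller invoke <code class="language-plaintext highlighter-rouge">step()</code> unconditionally instead of checking the handle first. The trade-off is that a failed load becomes silent.</p>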

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">const</span> <span class="k">struct</span> <span class="n">game_api</span> <span class="o">*</span><span class="n">api</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">handle</span><span class="p">,</span> <span class="s">"GAME_API"</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">api</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">api</span> <span class="o">=</span> <span class="o">*</span><span class="n">api</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
            <span class="n">game</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="n">game</span><span class="o">-&gt;</span><span class="n">api</span><span class="p">.</span><span class="n">init</span><span class="p">();</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">api</span><span class="p">.</span><span class="n">reload</span><span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">state</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">dlclose</span><span class="p">(</span><span class="n">game</span><span class="o">-&gt;</span><span class="n">handle</span><span class="p">);</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">handle</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
        <span class="n">game</span><span class="o">-&gt;</span><span class="n">id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>When the library loads without error, look up the <code class="language-plaintext highlighter-rouge">GAME_API</code> struct
that was mentioned before and copy it into the local struct. Copying
rather than using the pointer avoids one more layer of indirection
when making function calls. The game state is initialized if it hasn’t
been already, and the <code class="language-plaintext highlighter-rouge">reload()</code> function is called to inform the game
it’s just been reloaded.</p>

<p>If looking up the <code class="language-plaintext highlighter-rouge">GAME_API</code> fails, close the handle and consider it
a failure.</p>

<p>The main loop calls <code class="language-plaintext highlighter-rouge">game_load()</code> each time around. And that’s it!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">game</span> <span class="n">game</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">game_load</span><span class="p">(</span><span class="o">&amp;</span><span class="n">game</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">game</span><span class="p">.</span><span class="n">handle</span><span class="p">)</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">game</span><span class="p">.</span><span class="n">api</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">game</span><span class="p">.</span><span class="n">state</span><span class="p">))</span>
                <span class="k">break</span><span class="p">;</span>
        <span class="n">usleep</span><span class="p">(</span><span class="mi">100000</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">game_unload</span><span class="p">(</span><span class="o">&amp;</span><span class="n">game</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
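
<p>The <code class="language-plaintext highlighter-rouge">game_unload()</code> called at the end of <code class="language-plaintext highlighter-rouge">main()</code> isn’t shown above. A plausible sketch, assuming it simply mirrors <code class="language-plaintext highlighter-rouge">game_load()</code>, follows. Check the repository for the actual definition.</p>

```c
#include <dlfcn.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

struct game_state;

struct game_api {
    struct game_state *(*init)(void);
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

struct game {
    void *handle;
    ino_t id;
    struct game_api api;
    struct game_state *state;
};

/* Final teardown: release the state, then drop the library.
 * Does nothing when no library is currently loaded. */
void game_unload(struct game *game)
{
    if (game->handle) {
        /* finalize() lives in the library, so call it before dlclose(). */
        game->api.finalize(game->state);
        game->state = NULL;
        dlclose(game->handle);
        game->handle = NULL;
        game->id = 0;
    }
}
```

<p>The ordering matters for the same reason as in <code class="language-plaintext highlighter-rouge">game_load()</code>: once <code class="language-plaintext highlighter-rouge">dlclose()</code> returns, the function pointers in <code class="language-plaintext highlighter-rouge">game-&gt;api</code> may no longer point at anything.</p>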

<p>Now that I have this technique in my toolbelt, it has me itching to
develop a proper, full game in C with OpenGL and all, perhaps in
<a href="/blog/2014/12/09/">another Ludum Dare</a>. The ability to develop interactively is very
appealing.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>How to build DOS COM files with GCC</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/12/09/"/>
    <id>urn:uuid:cff7d942-a91d-38b8-46fd-d05bbce0e212</id>
    <updated>2014-12-09T23:50:10Z</updated>
    <category term="c"/><category term="debian"/><category term="tutorial"/><category term="game"/>
    <content type="html">
      <![CDATA[<p><em>Update 2018: René Rebe builds upon this article in an <a href="https://www.youtube.com/watch?v=Y7vU5T6rKHE">interesting
follow-up video</a> (<a href="https://www.youtube.com/watch?v=EXiF7g8Hmt4">part 2</a>).</em></p>

<p><em>Update 2020: DOS Defender <a href="https://www.youtube.com/watch?v=6UjuFnZYkG4">was featured on GET OFF MY LAWN</a>.</em></p>

<p>This past weekend I participated in <a href="http://ludumdare.com/compo/2014/12/03/welcome-to-ludum-dare-31/">Ludum Dare #31</a>. Before the
theme was even announced, due to <a href="/blog/2014/11/22/">recent fascination</a> I wanted
to make an old school DOS game. DOSBox would be the target platform
since it’s the most practical way to run DOS applications anymore,
despite modern x86 CPUs still being fully backwards compatible all the
way back to the 16-bit 8086.</p>

<p>I successfully created and submitted a DOS game called <a href="http://ludumdare.com/compo/ludum-dare-31/?uid=8472">DOS
Defender</a>. It’s a 32-bit 80386 real mode DOS COM program. All
assets are embedded in the executable and there are no external
dependencies, so the entire game is packed into that 10kB binary.</p>

<ul>
  <li><a href="https://github.com/skeeto/dosdefender-ld31">https://github.com/skeeto/dosdefender-ld31</a></li>
  <li><a href="https://github.com/skeeto/dosdefender-ld31/releases/download/1.1.0/DOSDEF.COM">DOSDEF.COM</a> (10kB, v1.1.0, run in DOSBox)</li>
</ul>

<p><img src="/img/screenshot/dosdefender.gif" alt="" /></p>

<p>You’ll need a joystick/gamepad in order to play. I included mouse
support in the Ludum Dare release in order to make it easier to
review, but this was removed because it doesn’t work well.</p>

<p>The most technically interesting part is that <strong>I didn’t need <em>any</em>
DOS development tools to create this</strong>! I only used my every day Linux
C compiler (<code class="language-plaintext highlighter-rouge">gcc</code>). It’s not actually possible to build DOS Defender
in DOS. Instead, I’m treating DOS as an embedded platform, which is
the only form in which <a href="http://www.freedos.org/">DOS still exists today</a>. Along with
DOSBox and <a href="http://www.dosemu.org/">DOSEMU</a>, this is a pretty comfortable toolchain.</p>

<p>If all you care about is how to do this yourself, skip to the
“Tricking GCC” section, where we’ll write a “Hello, World” DOS COM
program with Linux’s GCC.</p>

<h3 id="finding-the-right-tools">Finding the right tools</h3>

<p>I didn’t have GCC in mind when I started this project. What really
triggered all of this was that I had noticed Debian’s <a href="http://linux.die.net/man/1/bcc">bcc</a>
package, Bruce’s C Compiler, that builds 16-bit 8086 binaries. It’s
kept around for compiling x86 bootloaders and such, but it can also be
used to compile DOS COM files, which was the part that interested me.</p>

<p>For some background: the Intel 8086 was a 16-bit microprocessor
released in 1978. It had none of the fancy features of today’s CPUs: no
memory protection, no floating point instructions, and only up to 1MB
of RAM addressable. All modern x86 desktops and laptops can still
pretend to be a 40-year-old 16-bit 8086 microprocessor, with the same
limited addressing and all. That’s some serious backwards
compatibility. This feature is called <em>real mode</em>. It’s the mode in
which all x86 computers boot. Modern operating systems switch to
<em>protected mode</em> as soon as possible, which provides virtual
addressing and safe multi-tasking. DOS is not one of these operating
systems.</p>

<p>Unfortunately, bcc is not an ANSI C compiler. It supports a subset of
K&amp;R C, along with inline x86 assembly. Unlike other 8086 C compilers,
it has no notion of “far” or “long” pointers, so inline assembly is
required to access <a href="http://en.wikipedia.org/wiki/X86_memory_segmentation">other memory segments</a> (VGA, clock, etc.).
Side note: the remnants of these 8086 “long pointers” still exist
today in the Win32 API: <code class="language-plaintext highlighter-rouge">LPSTR</code>, <code class="language-plaintext highlighter-rouge">LPWORD</code>, <code class="language-plaintext highlighter-rouge">LPDWORD</code>, etc. The inline
assembly isn’t anywhere near as nice as GCC’s inline assembly. The
assembly code has to manually load variables from the stack so, since
bcc supports two different calling conventions, the assembly ends up
being hard-coded to one calling convention or the other.</p>
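
<p>For readers unfamiliar with segments: a real mode address is a 16-bit segment paired with a 16-bit offset, and the CPU combines them as segment × 16 + offset. The following hosted C sketch just does that arithmetic. The function name is hypothetical, and the 20-bit mask models the original 8086’s wraparound.</p>

```c
#include <stdint.h>

/* 8086 real mode: linear = segment * 16 + offset, truncated to the
 * 20 address lines of the original chip. */
static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}
```

<p>For example, the VGA graphics framebuffer at A000:0000 is linear address 0xA0000, and the BIOS tick counter at 0040:006C is 0x46C. A “far” pointer carries both halves, which is why a compiler without them needs assembly to reach outside its own segment.</p>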

<p>Given all its limitations, I went looking for alternatives.</p>

<h3 id="djgpp">DJGPP</h3>

<p><a href="http://www.delorie.com/djgpp/">DJGPP</a> is the DOS port of GCC. It’s a very impressive project,
bringing almost all of POSIX to DOS. The DOS ports of many programs
are built with DJGPP. In order to achieve this, it only produces
32-bit protected mode programs. If a protected mode program needs to
manipulate hardware (e.g. VGA), it must make requests to a <a href="http://en.wikipedia.org/wiki/DOS_Protected_Mode_Interface">DOS
Protected Mode Interface</a> (DPMI) service. If I used DJGPP, I
couldn’t make a single, standalone binary as I had wanted, since I’d
need to include a DPMI server. There’s also a performance penalty for
making DPMI requests.</p>

<p>Getting a DJGPP toolchain working can be difficult, to put it kindly.
Fortunately I found a useful project, <a href="https://github.com/andrewwutw/build-djgpp">build-djgpp</a>, that makes
it easy, at least on Linux.</p>

<p>Either there’s a serious bug or the official DJGPP binaries <a href="http://www.delorie.com/djgpp/v2faq/faq6_7.html">have
become infected again</a>, because in my testing I kept getting
the “Not COFF: check for viruses” error message when running my
programs in DOSBox. To double check that it’s not an infection on my
own machine, I set up a DJGPP toolchain on my Raspberry Pi, to act as
a clean room. It’s impossible for this ARM-based device to get
infected with an x86 virus. It still had the same problem, and all the
binary hashes matched up between the machines, so it’s not my fault.</p>

<p>So given the DPMI issue and the above, I moved on.</p>

<h3 id="tricking-gcc">Tricking GCC</h3>

<p>What I finally settled on is a neat hack that involves “tricking” GCC
into producing real mode DOS COM files, so long as it can target 80386
(as is usually the case). The 80386 was released in 1985 and was the
first 32-bit x86 microprocessor. GCC still targets this instruction
set today, even in the x86-64 toolchain. Unfortunately, GCC cannot
actually produce 16-bit code, so my main goal of targeting 8086 would
not be achievable. This doesn’t matter, though, since DOSBox, my
intended platform, is an 80386 emulator.</p>

<p>In theory this should even work unchanged with MinGW, but there’s a
long-standing MinGW bug that prevents it from working right (“cannot
perform PE operations on non PE output file”). It’s still do-able, and
I did it myself, but you’ll need to drop the <code class="language-plaintext highlighter-rouge">OUTPUT_FORMAT</code> directive
and add an extra <code class="language-plaintext highlighter-rouge">objcopy</code> step (<code class="language-plaintext highlighter-rouge">objcopy -O binary</code>).</p>

<h4 id="hello-world-in-dos">Hello World in DOS</h4>

<p>To demonstrate how to do all this, let’s make a DOS “Hello, World” COM
program using GCC on Linux.</p>

<p>There’s a significant burden with this technique: <strong>there will be no
standard library</strong>. It’s basically like writing an operating system
from scratch, except for the few services DOS provides. This means no
<code class="language-plaintext highlighter-rouge">printf()</code> or anything of the sort. Instead we’ll ask DOS to print a
string to the terminal. Making a request to DOS means firing an
interrupt, which means inline assembly!</p>

<p>DOS has nine interrupts: 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26,
0x27, 0x2F. The big one, and the one we’re interested in, is 0x21,
function 0x09 (print string). Between DOS and BIOS, there are
<a href="http://www.o3one.org/hwdocs/bios_doc/dosref22.html">thousands of functions called this way</a>. I’m not going to try
to explain x86 assembly, but in short the function number is stuffed
into register <code class="language-plaintext highlighter-rouge">ah</code> and interrupt 0x21 is fired. Function 0x09 also
takes an argument, the pointer to the string to be printed, which is
passed in registers <code class="language-plaintext highlighter-rouge">dx</code> and <code class="language-plaintext highlighter-rouge">ds</code>.</p>

<p>Here’s the GCC inline assembly <code class="language-plaintext highlighter-rouge">print()</code> function. Strings passed to
this function must be terminated with a <code class="language-plaintext highlighter-rouge">$</code>. Why? Because DOS.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">print</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"mov   $0x09, %%ah</span><span class="se">\n</span><span class="s">"</span>
                  <span class="s">"int   $0x21</span><span class="se">\n</span><span class="s">"</span>
                  <span class="o">:</span> <span class="cm">/* no output */</span>
                  <span class="o">:</span> <span class="s">"d"</span><span class="p">(</span><span class="n">string</span><span class="p">)</span>
                  <span class="o">:</span> <span class="s">"ah"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
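
<p>Since the interrupt can’t be run on a modern host, here is a hosted stand-in that mimics only the termination rule of function 0x09. It’s a sketch with a hypothetical name, not anything used in DOS Defender.</p>

```c
#include <stdio.h>

/* Writes bytes from string to out up to, but not including, the first
 * '$', and returns how many bytes were written. As with the real DOS
 * call, a NUL byte is not a terminator, so the caller must guarantee
 * that a '$' is present. */
static size_t dos_print_sim(const char *string, FILE *out)
{
    size_t n = 0;
    while (string[n] != '$') {
        fputc(string[n], out);
        n++;
    }
    return n;
}
```

<p>One consequence is that a string printed this way can’t itself contain a <code class="language-plaintext highlighter-rouge">$</code>.</p>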

<p>The assembly is declared <code class="language-plaintext highlighter-rouge">volatile</code> because it has a side effect
(printing the string). To GCC, the assembly is an opaque hunk, and the
optimizer relies on the output/input/clobber constraints (the last
three lines). For DOS programs like this, all inline assembly will
have side effects. This is because it’s not being written for
optimization but to access hardware and DOS, things not accessible to
plain C.</p>

<p>Care must also be taken by the caller, because GCC doesn’t know that
the memory pointed to by <code class="language-plaintext highlighter-rouge">string</code> is ever read. It’s likely the array
that backs the string needs to be declared <code class="language-plaintext highlighter-rouge">volatile</code> too. This is all
foreshadowing into what’s to come: doing anything in this environment
is an endless struggle against the optimizer. Not all of these battles
can be won.</p>

<p>Now for the main function. The name of this function shouldn’t matter,
but I’m avoiding calling it <code class="language-plaintext highlighter-rouge">main()</code> since MinGW has funny ideas
about mangling this particular symbol, even when it’s asked not to.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">dosmain</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">print</span><span class="p">(</span><span class="s">"Hello, World!</span><span class="se">\n</span><span class="s">$"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>COM files are limited to 65,279 bytes in size. This is because an x86
memory segment is 64kB and COM files are simply loaded by DOS to
0x0100 in the segment and executed. There are no headers, it’s just a
raw binary. Since a COM program can never be of any significant size,
and no real linking needs to occur (freestanding), the entire thing
will be compiled as one translation unit. It will be one call to GCC
with a bunch of options.</p>

<h4 id="compiler-options">Compiler Options</h4>

<p>Here are the essential compiler options.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding
</code></pre></div></div>

<p>Since no standard libraries are in use, the only difference between
gnu99 and c99 is that trigraphs are disabled (as they should be) and
inline assembly can be written as <code class="language-plaintext highlighter-rouge">asm</code> instead of <code class="language-plaintext highlighter-rouge">__asm__</code>. It’s a
no brainer. This project will be so closely tied to GCC that I don’t
care about using GCC extensions anyway.</p>

<p>I’m using <code class="language-plaintext highlighter-rouge">-Os</code> to keep the compiled output as small as possible. It
will also make the program run faster. This is important when
targeting DOSBox because, by default, it will deliberately run as slow
as a machine from the 1980s. I want to be able to fit in that
constraint. If the optimizer is causing problems, you may need to
temporarily make this <code class="language-plaintext highlighter-rouge">-O0</code> to determine if the problem is your fault
or the optimizer’s fault.</p>

<p>You see, the optimizer doesn’t understand that the program will be
running in real mode, and under its addressing constraints. <strong>It will
perform all sorts of invalid optimizations that break your perfectly
valid programs.</strong> It’s not a GCC bug since we’re doing crazy stuff
here. I had to rework my code a number of times to stop the optimizer
from breaking my program. For example, I had to avoid returning
complex structs from functions because they’d sometimes be filled with
garbage. The real danger here is that a future version of GCC will be
more clever and will break more stuff. In this battle, <code class="language-plaintext highlighter-rouge">volatile</code> is
your friend.</p>

<p>The next option is <code class="language-plaintext highlighter-rouge">-nostdlib</code>, since there are no valid libraries for
us to link against, even statically.</p>

<p>The options <code class="language-plaintext highlighter-rouge">-m32 -march=i386</code> set the compiler to produce 80386 code.
If I were writing a bootloader for a modern computer, targeting 80686
would be fine, too, but DOSBox is 80386.</p>

<p>The <code class="language-plaintext highlighter-rouge">-ffreestanding</code> argument requires that GCC not emit code that
calls built-in standard library helper functions. Sometimes instead of
emitting code to do something, it emits code that calls a built-in
function to do it, especially with math operators. This was one of the
main problems I had with bcc, where this behavior couldn’t be
disabled. This is most commonly used in writing bootloaders and
kernels. And now DOS COM files.</p>

<h4 id="linker-options">Linker Options</h4>

<p>The <code class="language-plaintext highlighter-rouge">-Wl</code> option is used to pass arguments to the linker (<code class="language-plaintext highlighter-rouge">ld</code>). We
need it since we’re doing all this in one call to GCC.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-Wl,--nmagic,--script=com.ld
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">--nmagic</code> turns off page alignment of sections. One, we don’t
need this. Two, that would waste precious space. In my tests it
doesn’t appear to be necessary, but I’m including it just in case.</p>

<p>The <code class="language-plaintext highlighter-rouge">--script</code> option tells the linker that we want to use a custom
<a href="http://wiki.osdev.org/Linker_Scripts">linker script</a>. This allows us to precisely lay out the sections
(<code class="language-plaintext highlighter-rouge">text</code>, <code class="language-plaintext highlighter-rouge">data</code>, <code class="language-plaintext highlighter-rouge">bss</code>, <code class="language-plaintext highlighter-rouge">rodata</code>) of our program. Here’s the <code class="language-plaintext highlighter-rouge">com.ld</code>
script.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OUTPUT_FORMAT(binary)
SECTIONS
{
    . = 0x0100;
    .text :
    {
        *(.text);
    }
    .data :
    {
        *(.data);
        *(.bss);
        *(.rodata);
    }
    _heap = ALIGN(4);
}
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">OUTPUT_FORMAT(binary)</code> says not to put this into an ELF (or PE,
etc.) file. The linker should just dump the raw code. A COM file is
just raw code, so this means the linker will produce a COM file!</p>

<p>I had said that COM files are loaded to <code class="language-plaintext highlighter-rouge">0x0100</code>. The fourth line
offsets the binary to this location. The first byte of the COM file
will still be the first byte of code, but it will be designed to run
from that offset in memory.</p>

<p>What follows is all the sections, <code class="language-plaintext highlighter-rouge">text</code> (program), <code class="language-plaintext highlighter-rouge">data</code> (static
data), <code class="language-plaintext highlighter-rouge">bss</code> (zero-initialized data), <code class="language-plaintext highlighter-rouge">rodata</code> (strings). Finally I
mark the end of the binary with the symbol <code class="language-plaintext highlighter-rouge">_heap</code>. This will come in
handy later for writing <code class="language-plaintext highlighter-rouge">sbrk()</code>, after we’re done with “Hello,
World.” I’ve asked for the <code class="language-plaintext highlighter-rouge">_heap</code> position to be 4-byte aligned.</p>

<p>We’re almost there.</p>

<h4 id="program-startup">Program Startup</h4>

<p>The linker is usually aware of our entry point (<code class="language-plaintext highlighter-rouge">main</code>) and sets that
up for us. But since we asked for “binary” output, we’re on our own.
If the <code class="language-plaintext highlighter-rouge">print()</code> function is emitted first, our program’s execution
will begin with executing that function, which is invalid. Our program
needs a little header stanza to get things started.</p>

<p>The linker script has a <code class="language-plaintext highlighter-rouge">STARTUP</code> option for handling this, but to
keep it simple we’ll put that right in the program. This is usually
called <code class="language-plaintext highlighter-rouge">crt0.o</code> or <code class="language-plaintext highlighter-rouge">Boot.o</code>, in case those names ever come up in your
own reading. This inline assembly <em>must</em> be the very first thing in
our code, before any includes and such. DOS will do most of the setup
for us, we really just have to jump to the entry point.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="p">(</span><span class="s">".code16gcc</span><span class="se">\n</span><span class="s">"</span>
     <span class="s">"call  dosmain</span><span class="se">\n</span><span class="s">"</span>
     <span class="s">"mov   $0x4C, %ah</span><span class="se">\n</span><span class="s">"</span>
     <span class="s">"int   $0x21</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">.code16gcc</code> tells the assembler that we’re going to be running in
real mode, so that it makes the proper adjustment. Despite the name,
this will <em>not</em> make it produce 16-bit code! First it calls <code class="language-plaintext highlighter-rouge">dosmain</code>,
the function we wrote above. Then it informs DOS, using function
<code class="language-plaintext highlighter-rouge">0x4C</code> (terminate with return code), that we’re done, passing the exit
code along in the 1-byte register <code class="language-plaintext highlighter-rouge">al</code> (already set by <code class="language-plaintext highlighter-rouge">dosmain</code>).
This inline assembly is automatically <code class="language-plaintext highlighter-rouge">volatile</code> because it has no
inputs or outputs.</p>

<h4 id="everything-at-once">Everything at Once</h4>

<p>Here’s the entire C program.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asm</span> <span class="p">(</span><span class="s">".code16gcc</span><span class="se">\n</span><span class="s">"</span>
     <span class="s">"call  dosmain</span><span class="se">\n</span><span class="s">"</span>
     <span class="s">"mov   $0x4C,%ah</span><span class="se">\n</span><span class="s">"</span>
     <span class="s">"int   $0x21</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">print</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">asm</span> <span class="k">volatile</span> <span class="p">(</span><span class="s">"mov   $0x09, %%ah</span><span class="se">\n</span><span class="s">"</span>
                  <span class="s">"int   $0x21</span><span class="se">\n</span><span class="s">"</span>
                  <span class="o">:</span> <span class="cm">/* no output */</span>
                  <span class="o">:</span> <span class="s">"d"</span><span class="p">(</span><span class="n">string</span><span class="p">)</span>
                  <span class="o">:</span> <span class="s">"ah"</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">dosmain</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">print</span><span class="p">(</span><span class="s">"Hello, World!</span><span class="se">\n</span><span class="s">$"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I won’t repeat <code class="language-plaintext highlighter-rouge">com.ld</code>. Here’s the call to GCC.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc -std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding \
    -o hello.com -Wl,--nmagic,--script=com.ld hello.c
</code></pre></div></div>

<p>And testing it in DOSBox:</p>

<p><img src="/img/screenshot/dosbox-hello.png" alt="" /></p>

<p>From here if you want fancy graphics, it’s just a matter of making an
interrupt and <a href="http://www.brackeen.com/vga/index.html">writing to VGA memory</a>. If you want sound you can
perform an interrupt for the PC speaker. I haven’t sorted out how to
call Sound Blaster yet. It was from this point that I grew DOS
Defender.</p>

<h3 id="memory-allocation">Memory Allocation</h3>

<p>To cover one more thing, remember that <code class="language-plaintext highlighter-rouge">_heap</code> symbol? We can use it
to implement <code class="language-plaintext highlighter-rouge">sbrk()</code> for dynamic memory allocation within the main
program segment. This is real mode, and there’s no virtual memory, so
we’re free to write to any memory we can address at any time. Some of
this is reserved (i.e. low and high memory) for hardware. So using
<code class="language-plaintext highlighter-rouge">sbrk()</code> specifically isn’t <em>really</em> necessary, but it’s interesting
to implement ourselves.</p>

<p>As is normal on x86, your text and data segments are at a low address
(0x0100 in this case) and the stack is at a high address (around
0xffff in this case). On Unix-like systems, the memory returned by
<code class="language-plaintext highlighter-rouge">malloc()</code> comes from two places: <code class="language-plaintext highlighter-rouge">sbrk()</code> and <code class="language-plaintext highlighter-rouge">mmap()</code>. What <code class="language-plaintext highlighter-rouge">sbrk()</code>
does is allocate memory just above the text/data segments, growing
“up” towards the stack. Each call to <code class="language-plaintext highlighter-rouge">sbrk()</code> will grow this space (or
leave it exactly the same). That memory would then be managed by
<code class="language-plaintext highlighter-rouge">malloc()</code> and friends.</p>

<p>Here’s how we can get <code class="language-plaintext highlighter-rouge">sbrk()</code> in a COM program. Notice I have to
define my own <code class="language-plaintext highlighter-rouge">size_t</code>, since we don’t have a standard library.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="kt">unsigned</span> <span class="kt">short</span>  <span class="kt">size_t</span><span class="p">;</span>

<span class="k">extern</span> <span class="kt">char</span> <span class="n">_heap</span><span class="p">;</span>
<span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">hbreak</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">_heap</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">sbrk</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">ptr</span> <span class="o">=</span> <span class="n">hbreak</span><span class="p">;</span>
    <span class="n">hbreak</span> <span class="o">+=</span> <span class="n">size</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">ptr</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It just sets a pointer to <code class="language-plaintext highlighter-rouge">_heap</code> and grows it as needed. A slightly
smarter <code class="language-plaintext highlighter-rouge">sbrk()</code> would be careful about alignment as well.</p>
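
<p>As a rough sketch of what that smarter version could look like, here’s
a bump allocator that rounds the break up before handing out memory. It’s
written as hosted C so it can be tried outside of DOS. The <code class="language-plaintext highlighter-rouge">arena</code> array
stands in for the memory above <code class="language-plaintext highlighter-rouge">_heap</code>, and the names <code class="language-plaintext highlighter-rouge">sbrk_aligned</code> and
<code class="language-plaintext highlighter-rouge">arena</code>, along with the 4096-byte size, are hypothetical choices made
only for illustration.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the region that begins at _heap. */
static char arena[4096];
static char *hbreak = arena;

/* Bump allocator that rounds the break up to `align` before handing
 * out memory. `align` must be a power of two. Returns NULL when the
 * stand-in region has no room left. */
void *sbrk_aligned(size_t size, size_t align)
{
    uintptr_t base = (uintptr_t)hbreak;
    uintptr_t start = (base + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t pad = (size_t)(start - base);
    size_t avail = (size_t)(arena + sizeof(arena) - hbreak);
    if (pad > avail || size > avail - pad) {
        return NULL;
    }
    char *ptr = hbreak + pad;
    hbreak = ptr + size;
    return ptr;
}
```

<p>In the COM program the bounds check would presumably compare against
the stack pointer rather than the end of an array, since that’s what the
heap eventually runs into.</p>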

<p>In the making of DOS Defender an interesting thing happened. I was
(incorrectly) counting on the memory returned by my <code class="language-plaintext highlighter-rouge">sbrk()</code> being
zeroed. This was the case the first time the game ran. However, DOS
doesn’t zero this memory between programs. When I would run my game
again, <em>it would pick right up where it left off</em>, because the same
data structures with the same contents were loaded back into place. A
pretty cool accident! It’s part of what makes this a fun embedded
platform.</p>
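
<p>If a program does need zeroed memory, one option is to clear each block
as it’s handed out. The following is only a sketch under some
assumptions: it’s hosted C, <code class="language-plaintext highlighter-rouge">stale_heap</code> is a hypothetical stand-in for
the memory above <code class="language-plaintext highlighter-rouge">_heap</code>, and <code class="language-plaintext highlighter-rouge">sbrk_zeroed</code> is a made-up name. The
store goes through a <code class="language-plaintext highlighter-rouge">volatile</code> pointer so the compiler keeps the loop as
a loop rather than replacing it with a call to <code class="language-plaintext highlighter-rouge">memset()</code>, which a
COM program wouldn’t have.</p>

```c
#include <stddef.h>

/* Hypothetical stand-in for the memory above _heap. A test can fill
 * it with junk first to imitate leftovers from a previous run. */
char stale_heap[256];
static char *zbreak = stale_heap;

/* Like a plain bump allocator, but clears the block before returning
 * it. Returns NULL if the stand-in region is exhausted. */
void *sbrk_zeroed(size_t size)
{
    size_t avail = (size_t)(stale_heap + sizeof(stale_heap) - zbreak);
    if (size > avail) {
        return NULL;
    }
    char *ptr = zbreak;
    zbreak += size;
    volatile char *p = ptr;  /* volatile: keep this an explicit loop */
    for (size_t i = 0; i < size; i++) {
        p[i] = 0;
    }
    return ptr;
}
```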

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>C Object Oriented Programming</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/10/21/"/>
    <id>urn:uuid:3851ee30-1f9d-35af-e59f-e4be5023b2d5</id>
    <updated>2014-10-21T03:52:43Z</updated>
    <category term="c"/><category term="cpp"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><del>Object oriented programming, polymorphism in particular, is
essential to nearly any large, complex software system. Without it,
decoupling different system components is difficult.</del> (<em>Update in
2017</em>: I no longer agree with this statement.) C doesn’t come with
object oriented capabilities, so large C programs tend to grow their
own out of C’s primitives. This includes huge C projects like the
Linux kernel, BSD kernels, and SQLite.</p>

<h3 id="starting-simple">Starting Simple</h3>

<p>Suppose you’re writing a function <code class="language-plaintext highlighter-rouge">pass_match()</code> that takes an input
stream, an output stream, and a pattern. It works sort of like grep.
It passes to the output each line of input that matches the pattern.
The pattern string contains a shell glob pattern to be handled by
<a href="http://man7.org/linux/man-pages/man3/fnmatch.3.html">POSIX <code class="language-plaintext highlighter-rouge">fnmatch()</code></a>. Here’s what the interface looks like.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">pass_match</span><span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">in</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">out</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">);</span>
</code></pre></div></div>

<p>Glob patterns are simple enough that pre-compilation, as would be done
for a regular expression, is unnecessary. The bare string is enough.</p>

<p>Some time later the customer wants the program to support regular
expressions in addition to shell-style glob patterns. For efficiency’s
sake, regular expressions need to be pre-compiled and so will not be
passed to the function as a string. It will instead be a <a href="http://man7.org/linux/man-pages/man3/regexec.3.html">POSIX
<code class="language-plaintext highlighter-rouge">regex_t</code></a> object. A quick-and-dirty approach might be to
accept both and match whichever one isn’t NULL.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">pass_match</span><span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">in</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">out</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">,</span> <span class="n">regex_t</span> <span class="o">*</span><span class="n">re</span><span class="p">);</span>
</code></pre></div></div>

<p>Bleh. This is ugly and won’t scale well. What happens when more kinds
of filters are needed? It would be much better to accept a single
object that covers both cases, and possibly even another kind of
filter in the future.</p>

<h3 id="a-generalized-filter">A Generalized Filter</h3>

<p>One of the most common ways to customize the behavior of a
function in C is to pass a function pointer. For example, the final
argument to <a href="http://man7.org/linux/man-pages/man3/qsort.3.html"><code class="language-plaintext highlighter-rouge">qsort()</code></a> is a comparator that determines how
objects get sorted.</p>

<p>For <code class="language-plaintext highlighter-rouge">pass_match()</code>, this function would accept a string and return a
boolean value deciding if the string should be passed to the output
stream. It gets called once on each line of input.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">pass_match</span><span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">in</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">out</span><span class="p">,</span> <span class="n">bool</span> <span class="p">(</span><span class="o">*</span><span class="n">match</span><span class="p">)(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">));</span>
</code></pre></div></div>

<p>However, this has one of the <a href="/blog/2014/08/29/">same problems as <code class="language-plaintext highlighter-rouge">qsort()</code></a>:
the passed function lacks context. It needs a pattern string or
<code class="language-plaintext highlighter-rouge">regex_t</code> object to operate on. In other languages these would be
attached to the function as a closure, but C doesn’t have closures. It
would need to be smuggled in via a global variable, <a href="/blog/2014/10/12/">which is not
good</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">regex_t</span> <span class="n">regex</span><span class="p">;</span>  <span class="c1">// BAD!!!</span>

<span class="n">bool</span> <span class="nf">regex_match</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">regexec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="p">,</span> <span class="n">string</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Because of the global variable, in practice <code class="language-plaintext highlighter-rouge">pass_match()</code> would be
neither reentrant nor thread-safe. We could take a lesson from GNU’s
<code class="language-plaintext highlighter-rouge">qsort_r()</code> and accept a context to be passed to the filter function.
This simulates a closure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">pass_match</span><span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">in</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">out</span><span class="p">,</span>
                <span class="n">bool</span> <span class="p">(</span><span class="o">*</span><span class="n">match</span><span class="p">)(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">context</span><span class="p">);</span>
</code></pre></div></div>

<p>The provided context pointer would be passed to the filter function as
the second argument, and no global variables are needed. This would
probably be good enough for most purposes and it’s about as simple as
possible. The interface to <code class="language-plaintext highlighter-rouge">pass_match()</code> would cover any kind of
filter.</p>
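
<p>To make that concrete, here’s a minimal sketch of how this version
might be implemented and paired with a glob filter. The body of
<code class="language-plaintext highlighter-rouge">pass_match()</code>, the 4096-byte line buffer, and the <code class="language-plaintext highlighter-rouge">glob_match()</code>
helper are assumptions for illustration, not a definitive implementation.</p>

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Filter callback: here the context is the glob pattern string. */
bool glob_match(const char *string, void *context)
{
    const char *pattern = context;
    return fnmatch(pattern, string, 0) == 0;
}

/* Copies each matching line from in to out. Lines longer than the
 * buffer get split, which is good enough for a sketch. */
void pass_match(FILE *in, FILE *out,
                bool (*match)(const char *, void *), void *context)
{
    char line[4096];
    while (fgets(line, sizeof(line), in)) {
        size_t len = strcspn(line, "\n");
        char saved = line[len];
        line[len] = 0;  /* match without the trailing newline */
        bool pass = match(line, context);
        line[len] = saved;
        if (pass) {
            fputs(line, out);
        }
    }
}
```

<p>A caller would write something like
<code class="language-plaintext highlighter-rouge">pass_match(stdin, stdout, glob_match, pattern)</code>. A regex filter would
look the same except that the context would point to a compiled
<code class="language-plaintext highlighter-rouge">regex_t</code>.</p>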

<p>But wouldn’t it be nice to package the function and context together
as one object?</p>

<h3 id="more-abstraction">More Abstraction</h3>

<p>How about putting the context on a struct and making an interface out
of that? Here’s a tagged union that behaves as one or the other.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="n">filter_type</span> <span class="p">{</span> <span class="n">GLOB</span><span class="p">,</span> <span class="n">REGEX</span> <span class="p">};</span>

<span class="k">struct</span> <span class="n">filter</span> <span class="p">{</span>
    <span class="k">enum</span> <span class="n">filter_type</span> <span class="n">type</span><span class="p">;</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">;</span>
        <span class="n">regex_t</span> <span class="n">regex</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">context</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>There’s one function for interacting with this struct:
<code class="language-plaintext highlighter-rouge">filter_match()</code>. It checks the <code class="language-plaintext highlighter-rouge">type</code> member and calls the correct
function with the correct context.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bool</span> <span class="nf">filter_match</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">filter</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">filter</span><span class="o">-&gt;</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="n">GLOB</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">fnmatch</span><span class="p">(</span><span class="n">filter</span><span class="o">-&gt;</span><span class="n">context</span><span class="p">.</span><span class="n">pattern</span><span class="p">,</span> <span class="n">string</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="n">REGEX</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">regexec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">filter</span><span class="o">-&gt;</span><span class="n">context</span><span class="p">.</span><span class="n">regex</span><span class="p">,</span> <span class="n">string</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">abort</span><span class="p">();</span> <span class="c1">// programmer error</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And the <code class="language-plaintext highlighter-rouge">pass_match()</code> API now looks like this. This will be the final
change to <code class="language-plaintext highlighter-rouge">pass_match()</code>, both in implementation and interface.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">pass_match</span><span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">input</span><span class="p">,</span> <span class="kt">FILE</span> <span class="o">*</span><span class="n">output</span><span class="p">,</span> <span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">filter</span><span class="p">);</span>
</code></pre></div></div>

<p>It still doesn’t care how the filter works, so it’s good enough to
cover all future cases. It just calls <code class="language-plaintext highlighter-rouge">filter_match()</code> on the pointer
it was given. However, the <code class="language-plaintext highlighter-rouge">switch</code> and tagged union aren’t friendly
to extension. Really, it’s outright hostile. We finally have some
degree of polymorphism, but it’s crude. It’s like building duct tape
into a design. Adding new behavior means adding another <code class="language-plaintext highlighter-rouge">switch</code> case.
This is a step backwards. We can do better.</p>

<h4 id="methods">Methods</h4>

<p>With the <code class="language-plaintext highlighter-rouge">switch</code> we’re no longer taking advantage of function
pointers. So what about putting a function pointer on the struct?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter</span> <span class="p">{</span>
    <span class="n">bool</span> <span class="p">(</span><span class="o">*</span><span class="n">match</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The filter itself is passed as the first argument, providing context.
In object oriented languages, that’s the implicit <code class="language-plaintext highlighter-rouge">this</code> argument. To
avoid requiring the caller to worry about this detail, we’ll hide it
in a new <code class="language-plaintext highlighter-rouge">switch</code>-free version of <code class="language-plaintext highlighter-rouge">filter_match()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bool</span> <span class="nf">filter_match</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">filter</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">filter</span><span class="o">-&gt;</span><span class="n">match</span><span class="p">(</span><span class="n">filter</span><span class="p">,</span> <span class="n">string</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice we’re still lacking the actual context, the pattern string or
the regex object. Those will be different structs that embed the
filter struct.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter_regex</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="n">filter</span><span class="p">;</span>
    <span class="n">regex_t</span> <span class="n">regex</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">filter_glob</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="n">filter</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>For both, the original filter struct is the first member. This is
critical. We’re going to be using a trick called <em>type punning</em>. The
first member is guaranteed to be positioned at the beginning of the
struct, so a pointer to a <code class="language-plaintext highlighter-rouge">struct filter_glob</code> is also a pointer to a
<code class="language-plaintext highlighter-rouge">struct filter</code>. Notice any resemblance to inheritance?</p>
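
<p>The first-member guarantee can be checked at compile time. This is just
a sketch: the helper names <code class="language-plaintext highlighter-rouge">glob_as_filter()</code> and
<code class="language-plaintext highlighter-rouge">filter_as_glob()</code> are hypothetical, and <code class="language-plaintext highlighter-rouge">_Static_assert</code> assumes
a C11 compiler.</p>

```c
#include <stdbool.h>
#include <stddef.h>

struct filter {
    bool (*match)(struct filter *, const char *);
};

struct filter_glob {
    struct filter filter;
    const char *pattern;
};

/* C allows no padding before the first member, so this always holds.
 * Moving `filter` further down the struct would break the build. */
_Static_assert(offsetof(struct filter_glob, filter) == 0,
               "filter must be the first member");

/* "Upcast": taking the address of the embedded member needs no cast. */
struct filter *glob_as_filter(struct filter_glob *glob)
{
    return &glob->filter;
}

/* "Downcast": only valid when the filter really is embedded at the
 * start of a struct filter_glob. */
struct filter_glob *filter_as_glob(struct filter *filter)
{
    return (struct filter_glob *)filter;
}
```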

<p>Each type, glob and regex, needs its own match method.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bool</span>
<span class="nf">method_match_regex</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">filter</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="p">)</span> <span class="n">filter</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">regexec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">regex</span><span class="p">,</span> <span class="n">string</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="n">bool</span>
<span class="nf">method_match_glob</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">filter</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_glob</span> <span class="o">*</span><span class="n">glob</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">filter_glob</span> <span class="o">*</span><span class="p">)</span> <span class="n">filter</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">fnmatch</span><span class="p">(</span><span class="n">glob</span><span class="o">-&gt;</span><span class="n">pattern</span><span class="p">,</span> <span class="n">string</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’ve prefixed them with <code class="language-plaintext highlighter-rouge">method_</code> to indicate their intended usage. I
declared these <code class="language-plaintext highlighter-rouge">static</code> because they’re completely private. Other
parts of the program will only be accessing them through a function
pointer on the struct. This means we need some constructors in order
to set up those function pointers. (For simplicity, I’m not error
checking.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="nf">filter_regex_create</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">regex</span><span class="p">));</span>
    <span class="n">regcomp</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">regex</span><span class="p">,</span> <span class="n">pattern</span><span class="p">,</span> <span class="n">REG_EXTENDED</span><span class="p">);</span>
    <span class="n">regex</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">.</span><span class="n">match</span> <span class="o">=</span> <span class="n">method_match_regex</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="nf">filter_glob_create</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_glob</span> <span class="o">*</span><span class="n">glob</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">glob</span><span class="p">));</span>
    <span class="n">glob</span><span class="o">-&gt;</span><span class="n">pattern</span> <span class="o">=</span> <span class="n">pattern</span><span class="p">;</span>
    <span class="n">glob</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">.</span><span class="n">match</span> <span class="o">=</span> <span class="n">method_match_glob</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">glob</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now this is real polymorphism. It’s really simple from the user’s
perspective. They call the correct constructor and get a filter object
that has the desired behavior. This object can be passed around
trivially, and no other part of the program worries about how it’s
implemented. Best of all, since each method is a separate function
rather than a <code class="language-plaintext highlighter-rouge">switch</code> case, new kinds of filter subtypes can be
defined independently. Users can create their own filter types that
work just as well as the two “built-in” filters.</p>
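
<p>As a minimal sketch of that last point, here’s what a user-defined
filter type might look like. The <code class="language-plaintext highlighter-rouge">filter_prefix</code> type and its constructor
are hypothetical names invented for illustration, and the one-method
<code class="language-plaintext highlighter-rouge">struct filter</code> and <code class="language-plaintext highlighter-rouge">filter_match()</code> are restated here as
assumptions so the snippet stands alone.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Assumed shape of the interface at this point: a single method. */
struct filter {
    bool (*match)(struct filter *, const char *);
};

bool filter_match(struct filter *filter, const char *string)
{
    return filter->match(filter, string);
}

/* Hypothetical user-defined subtype: matches strings with a prefix. */
struct filter_prefix {
    struct filter filter;   /* must be first for the cast below */
    const char *prefix;     /* borrowed from the caller, not copied */
    size_t length;
};

static bool
method_match_prefix(struct filter *filter, const char *string)
{
    struct filter_prefix *prefix = (struct filter_prefix *) filter;
    return strncmp(string, prefix->prefix, prefix->length) == 0;
}

struct filter *filter_prefix_create(const char *prefix)
{
    assert(prefix != NULL);
    struct filter_prefix *f = malloc(sizeof(*f));
    if (!f)
        return NULL;
    f->prefix = prefix;
    f->length = strlen(prefix);
    f->filter.match = method_match_prefix;
    return &f->filter;
}
```

<p>Nothing in the “built-in” filters had to change, and callers still go
through <code class="language-plaintext highlighter-rouge">filter_match()</code> without knowing which kind of filter they hold.</p>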

<h4 id="cleaning-up">Cleaning Up</h4>

<p>Oops, the regex filter needs to be cleaned up when it’s done, but the
user, by design, won’t know how to do it. Let’s add a <code class="language-plaintext highlighter-rouge">free()</code> method.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter</span> <span class="p">{</span>
    <span class="n">bool</span> <span class="p">(</span><span class="o">*</span><span class="n">match</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">);</span>
<span class="p">};</span>

<span class="kt">void</span> <span class="nf">filter_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">filter</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">filter</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">(</span><span class="n">filter</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And the methods for each. These would also be assigned in the
constructor.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">method_free_regex</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="p">)</span> <span class="n">f</span><span class="p">;</span>
    <span class="n">regfree</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">regex</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">method_free_glob</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The glob constructor should perhaps <code class="language-plaintext highlighter-rouge">strdup()</code> its pattern as a
private copy, in which case it would be freed here.</p>
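
<p>Here’s a rough sketch of that variation, assuming the two-method
<code class="language-plaintext highlighter-rouge">struct filter</code> from above. Since <code class="language-plaintext highlighter-rouge">strdup()</code> only joined ISO C in
C23, this version copies the pattern with <code class="language-plaintext highlighter-rouge">malloc()</code> and
<code class="language-plaintext highlighter-rouge">memcpy()</code> as a stand-in. The allocation checks are my own addition.</p>

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

struct filter_glob {
    struct filter filter;
    char *pattern;          /* private copy owned by the filter */
};

static bool
method_match_glob(struct filter *f, const char *string)
{
    struct filter_glob *glob = (struct filter_glob *) f;
    return fnmatch(glob->pattern, string, 0) == 0;
}

static void
method_free_glob(struct filter *f)
{
    struct filter_glob *glob = (struct filter_glob *) f;
    free(glob->pattern);    /* release the private copy first */
    free(glob);
}

struct filter *filter_glob_create(const char *pattern)
{
    assert(pattern != NULL);
    size_t size = strlen(pattern) + 1;
    struct filter_glob *glob = malloc(sizeof(*glob));
    if (!glob)
        return NULL;
    glob->pattern = malloc(size);   /* stand-in for strdup() */
    if (!glob->pattern) {
        free(glob);
        return NULL;
    }
    memcpy(glob->pattern, pattern, size);
    glob->filter.match = method_match_glob;
    glob->filter.free = method_free_glob;
    return &glob->filter;
}
```

<p>With a private copy, the caller is free to reuse or modify its own
buffer after the constructor returns.</p>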

<h3 id="object-composition">Object Composition</h3>

<p>A good rule of thumb is to prefer composition over inheritance. Having
tidy filter objects opens up some interesting possibilities for
composition. Here’s an AND filter that composes two arbitrary filter
objects. It only matches when both its subfilters match. It supports
short circuiting, so put the faster, or most discriminating, filter
first in the constructor (user’s responsibility).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter_and</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="n">filter</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">sub</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="p">};</span>

<span class="k">static</span> <span class="n">bool</span>
<span class="nf">method_match_and</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_and</span> <span class="o">*</span><span class="n">and</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">filter_and</span> <span class="o">*</span><span class="p">)</span> <span class="n">f</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">filter_match</span><span class="p">(</span><span class="n">and</span><span class="o">-&gt;</span><span class="n">sub</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">s</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">filter_match</span><span class="p">(</span><span class="n">and</span><span class="o">-&gt;</span><span class="n">sub</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">s</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">method_free_and</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_and</span> <span class="o">*</span><span class="n">and</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">filter_and</span> <span class="o">*</span><span class="p">)</span> <span class="n">f</span><span class="p">;</span>
    <span class="n">filter_free</span><span class="p">(</span><span class="n">and</span><span class="o">-&gt;</span><span class="n">sub</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="n">filter_free</span><span class="p">(</span><span class="n">and</span><span class="o">-&gt;</span><span class="n">sub</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">f</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="nf">filter_and</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_and</span> <span class="o">*</span><span class="n">and</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">and</span><span class="p">));</span>
    <span class="n">and</span><span class="o">-&gt;</span><span class="n">sub</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
    <span class="n">and</span><span class="o">-&gt;</span><span class="n">sub</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">and</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">.</span><span class="n">match</span> <span class="o">=</span> <span class="n">method_match_and</span><span class="p">;</span>
    <span class="n">and</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">.</span><span class="n">free</span> <span class="o">=</span> <span class="n">method_free_and</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">and</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It can combine a regex filter and a glob filter, or two regex filters,
or two glob filters, or even other AND filters. It doesn’t care what
the subfilters are. Also, the <code class="language-plaintext highlighter-rouge">free()</code> method here frees its
subfilters. This means that the user doesn’t need to keep hold of
every filter created, just the “top” one in the composition.</p>

<p>To make composition filters easier to use, here are two “constant”
filters. These are statically allocated, shared, and are never
actually freed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bool</span>
<span class="nf">method_match_any</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="n">bool</span>
<span class="nf">method_match_none</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">method_free_noop</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">filter</span> <span class="n">FILTER_ANY</span>  <span class="o">=</span> <span class="p">{</span> <span class="n">method_match_any</span><span class="p">,</span>  <span class="n">method_free_noop</span> <span class="p">};</span>
<span class="k">struct</span> <span class="n">filter</span> <span class="n">FILTER_NONE</span> <span class="o">=</span> <span class="p">{</span> <span class="n">method_match_none</span><span class="p">,</span> <span class="n">method_free_noop</span> <span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">FILTER_NONE</code> filter will generally be used with a (theoretical)
<code class="language-plaintext highlighter-rouge">filter_or()</code> and <code class="language-plaintext highlighter-rouge">FILTER_ANY</code> will generally be used with the
previously defined <code class="language-plaintext highlighter-rouge">filter_and()</code>.</p>
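
<p>For the curious, a <code class="language-plaintext highlighter-rouge">filter_or()</code> could be sketched by mirroring
<code class="language-plaintext highlighter-rouge">filter_and()</code>. This is only an illustration under that assumption,
with the interface and constant filters restated so it compiles on its
own. <code class="language-plaintext highlighter-rouge">FILTER_NONE</code> is the identity element for OR in the same way
<code class="language-plaintext highlighter-rouge">FILTER_ANY</code> is for AND.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

bool filter_match(struct filter *f, const char *s) { return f->match(f, s); }
void filter_free(struct filter *f) { f->free(f); }

/* Constant filters, restated from the text above. */
static bool method_match_any(struct filter *f, const char *s)
{
    (void) f; (void) s;
    return true;
}

static bool method_match_none(struct filter *f, const char *s)
{
    (void) f; (void) s;
    return false;
}

static void method_free_noop(struct filter *f) { (void) f; }

struct filter FILTER_ANY  = { method_match_any,  method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };

/* Hypothetical OR filter: matches when either subfilter matches. */
struct filter_or {
    struct filter filter;
    struct filter *sub[2];
};

static bool method_match_or(struct filter *f, const char *s)
{
    struct filter_or *self = (struct filter_or *) f;
    /* || short-circuits: sub[1] is skipped when sub[0] matches. */
    return filter_match(self->sub[0], s) || filter_match(self->sub[1], s);
}

static void method_free_or(struct filter *f)
{
    struct filter_or *self = (struct filter_or *) f;
    filter_free(self->sub[0]);
    filter_free(self->sub[1]);
    free(self);
}

struct filter *filter_or(struct filter *a, struct filter *b)
{
    assert(a != NULL && b != NULL);
    struct filter_or *self = malloc(sizeof(*self));
    if (!self)
        return NULL;
    self->sub[0] = a;
    self->sub[1] = b;
    self->filter.match = method_match_or;
    self->filter.free = method_free_or;
    return &self->filter;
}
```

<p>Starting a fold from <code class="language-plaintext highlighter-rouge">&amp;FILTER_NONE</code> gives a filter that matches
nothing until at least one real subfilter is OR’d in.</p>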

<p>Here’s a simple program that composes multiple glob filters into a
single filter, one for each program argument.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">filter</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">FILTER_ANY</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">char</span> <span class="o">**</span><span class="n">p</span> <span class="o">=</span> <span class="n">argv</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span> <span class="n">p</span><span class="o">++</span><span class="p">)</span>
        <span class="n">filter</span> <span class="o">=</span> <span class="n">filter_and</span><span class="p">(</span><span class="n">filter_glob_create</span><span class="p">(</span><span class="o">*</span><span class="n">p</span><span class="p">),</span> <span class="n">filter</span><span class="p">);</span>
    <span class="n">pass_match</span><span class="p">(</span><span class="n">stdin</span><span class="p">,</span> <span class="n">stdout</span><span class="p">,</span> <span class="n">filter</span><span class="p">);</span>
    <span class="n">filter_free</span><span class="p">(</span><span class="n">filter</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice only one call to <code class="language-plaintext highlighter-rouge">filter_free()</code> is needed to clean up the
entire filter.</p>

<h3 id="multiple-inheritance">Multiple Inheritance</h3>

<p>As I mentioned before, the filter struct must be the first member of
filter subtype structs in order for type punning to work. If we want
to “inherit” from two different types like this, they would both need
to be in this position: a contradiction.</p>

<p>Fortunately type punning can be generalized such that the
first-member constraint isn’t necessary. This is commonly done through
a <code class="language-plaintext highlighter-rouge">container_of()</code> macro. Here’s a C99-conforming definition.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stddef.h&gt;</span><span class="cp">
</span>
<span class="cp">#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
</span></code></pre></div></div>

<p>Given a pointer to a member of a struct, the <code class="language-plaintext highlighter-rouge">container_of()</code> macro
allows us to back out to the containing struct. Suppose the regex
struct was defined differently, so that the <code class="language-plaintext highlighter-rouge">regex_t</code> member came
first.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter_regex</span> <span class="p">{</span>
    <span class="n">regex_t</span> <span class="n">regex</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="n">filter</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The constructor remains unchanged. The casts in the methods change to
the macro.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bool</span>
<span class="nf">method_match_regex</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="n">container_of</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="k">struct</span> <span class="n">filter_regex</span><span class="p">,</span> <span class="n">filter</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">regexec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">regex</span><span class="p">,</span> <span class="n">string</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">method_free_regex</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="n">container_of</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="k">struct</span> <span class="n">filter_regex</span><span class="p">,</span> <span class="n">filter</span><span class="p">);</span>
    <span class="n">regfree</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">regex</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">regex</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a constant, compile-time computed offset, so there should be no
practical performance impact. The filter can now participate freely in
other <em>intrusive</em> data structures, like linked lists and such. It’s
analogous to multiple inheritance.</p>
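
<p>To make that concrete, here’s a small sketch of one struct that embeds
two “base” structs, neither of them first. The <code class="language-plaintext highlighter-rouge">list_node</code> and
<code class="language-plaintext highlighter-rouge">filter_exact</code> types and the <code class="language-plaintext highlighter-rouge">count_matches()</code> helper are
hypothetical, made up for this illustration. The filter interface is
trimmed to a single method to keep it short.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct filter {
    bool (*match)(struct filter *, const char *);
};

/* Hypothetical intrusive singly linked list node. */
struct list_node {
    struct list_node *next;
};

/* Hypothetical type that "inherits" from both; neither base is first. */
struct filter_exact {
    const char *expected;
    struct list_node node;
    struct filter filter;
};

static bool
method_match_exact(struct filter *f, const char *s)
{
    struct filter_exact *exact = container_of(f, struct filter_exact, filter);
    return strcmp(exact->expected, s) == 0;
}

void filter_exact_init(struct filter_exact *exact, const char *expected)
{
    assert(exact != NULL && expected != NULL);
    exact->expected = expected;
    exact->node.next = NULL;
    exact->filter.match = method_match_exact;
}

/* Walk the list by node, recovering each containing filter as we go. */
size_t count_matches(struct list_node *head, const char *s)
{
    size_t count = 0;
    for (struct list_node *n = head; n; n = n->next) {
        struct filter_exact *exact = container_of(n, struct filter_exact, node);
        if (exact->filter.match(&exact->filter, s))
            count++;
    }
    return count;
}
```

<p>The list code only ever sees <code class="language-plaintext highlighter-rouge">struct list_node</code> and the filter code
only ever sees <code class="language-plaintext highlighter-rouge">struct filter</code>, yet both recover the same object.</p>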

<h3 id="vtables">Vtables</h3>

<p>Say we want to add a third method, <code class="language-plaintext highlighter-rouge">clone()</code>, to the filter API, to
make an independent copy of a filter, one that will need to be
separately freed. It will be like the copy constructor in C++.
Each kind of filter will need to define an appropriate “method” for
it. As long as new methods like this are added at the end, this
doesn’t break the API, but it does break the ABI regardless.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter</span> <span class="p">{</span>
    <span class="n">bool</span> <span class="p">(</span><span class="o">*</span><span class="n">match</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">clone</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">);</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The filter object is starting to get big. It’s got three pointers
(24 bytes on typical 64-bit systems), and these pointers are the same between
all instances of the same type. That’s a lot of redundancy. Instead,
these pointers could be shared between instances in a common table
called a <em>virtual method table</em>, commonly known as a <em>vtable</em>.</p>

<p>Here’s a vtable version of the filter API. The overhead is now only
one pointer regardless of the number of methods in the interface.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_vtable</span> <span class="o">*</span><span class="n">vtable</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">filter_vtable</span> <span class="p">{</span>
    <span class="n">bool</span> <span class="p">(</span><span class="o">*</span><span class="n">match</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">free</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">(</span><span class="o">*</span><span class="n">clone</span><span class="p">)(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="p">);</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Each type creates its own vtable and links to it in the constructor.
Here’s the regex filter re-written for the new vtable API and clone
method. These are all the tricks in one basket for a big object-oriented
C finale!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="nf">filter_regex_create</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">);</span>

<span class="k">struct</span> <span class="n">filter_regex</span> <span class="p">{</span>
    <span class="n">regex_t</span> <span class="n">regex</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">filter</span> <span class="n">filter</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="n">bool</span>
<span class="nf">method_match_regex</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="n">container_of</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="k">struct</span> <span class="n">filter_regex</span><span class="p">,</span> <span class="n">filter</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">regexec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">regex</span><span class="p">,</span> <span class="n">string</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">method_free_regex</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="n">container_of</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="k">struct</span> <span class="n">filter_regex</span><span class="p">,</span> <span class="n">filter</span><span class="p">);</span>
    <span class="n">regfree</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">regex</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">regex</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span>
<span class="nf">method_clone_regex</span><span class="p">(</span><span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="n">container_of</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="k">struct</span> <span class="n">filter_regex</span><span class="p">,</span> <span class="n">filter</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">filter_regex_create</span><span class="p">(</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">pattern</span><span class="p">);</span>
<span class="p">}</span>

<span class="cm">/* vtable */</span>
<span class="k">struct</span> <span class="n">filter_vtable</span> <span class="n">filter_regex_vtable</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">method_match_regex</span><span class="p">,</span> <span class="n">method_free_regex</span><span class="p">,</span> <span class="n">method_clone_regex</span>
<span class="p">};</span>

<span class="cm">/* constructor */</span>
<span class="k">struct</span> <span class="n">filter</span> <span class="o">*</span><span class="nf">filter_regex_create</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pattern</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">filter_regex</span> <span class="o">*</span><span class="n">regex</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">regex</span><span class="p">));</span>
    <span class="n">regex</span><span class="o">-&gt;</span><span class="n">pattern</span> <span class="o">=</span> <span class="n">pattern</span><span class="p">;</span>
    <span class="n">regcomp</span><span class="p">(</span><span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">regex</span><span class="p">,</span> <span class="n">pattern</span><span class="p">,</span> <span class="n">REG_EXTENDED</span><span class="p">);</span>
    <span class="n">regex</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">.</span><span class="n">vtable</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">filter_regex_vtable</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">regex</span><span class="o">-&gt;</span><span class="n">filter</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is almost exactly what’s going on behind the scenes in C++. When
a method/function is declared <code class="language-plaintext highlighter-rouge">virtual</code>, and therefore dispatches
based on the run-time type of its left-most argument, it’s listed in
the vtables for classes that implement it. Otherwise it’s just a
normal function. This is why functions need to be declared <code class="language-plaintext highlighter-rouge">virtual</code>
ahead of time in C++.</p>

<p>In conclusion, it’s relatively easy to get the core benefits of object
oriented programming in plain old C. It doesn’t require heavy use of
macros, nor do users of these systems need to know that underneath
it’s an object system, unless they want to extend it for themselves.</p>

<p>Here’s the whole example program in one place, if you’re interested in poking at it:</p>

<ul>
  <li><a href="https://gist.github.com/skeeto/5faa131b19673549d8ca">https://gist.github.com/skeeto/5faa131b19673549d8ca</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Global State: a Tale of Two Bad C APIs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/10/12/"/>
    <id>urn:uuid:8a1c5135-e669-308b-6605-58c86be3003b</id>
    <updated>2014-10-12T22:48:00Z</updated>
    <category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>Mutable global variables are evil. You’ve almost certainly heard that
before, but it’s worth repeating. Global state makes programs, and libraries
especially, harder to understand, harder to optimize, more fragile,
more error prone, and less useful. If you’re using global state in a
way that’s visible to users of your API, and it’s not essential to the
domain, you’re almost certainly doing something wrong.</p>

<p>In this article I’m going to use two well-established C APIs to
demonstrate why global state is bad for APIs: BSD regular expressions
and POSIX Getopt.</p>

<h3 id="bsd-regular-expressions">BSD Regular Expressions</h3>

<p>The <a href="http://man7.org/linux/man-pages/man3/re_comp.3.html">BSD regular expression API</a> dates back to 4.3BSD, released
in 1986. It’s just a pair of functions: one compiles the regex, the
other executes it on a string.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="nf">re_comp</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">regex</span><span class="p">);</span>
<span class="kt">int</span>   <span class="nf">re_exec</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s immediately obvious that there’s hidden internal state. Where
else would the resulting compiled regex object be? Also notice there’s
no <code class="language-plaintext highlighter-rouge">re_free()</code>, or similar, for releasing resources held by the
compiled result. That’s because, due to its limited design, it doesn’t
hold any. It’s entirely in static memory, which means there’s some
upper limit on the complexity of the regex given to this API. Suppose
an implementation <em>does</em> use dynamically allocated memory. It seems
this might not matter when only one compiled regex is allowed.
However, this would create warnings in Valgrind and make it harder to
use for bug testing.</p>

<p>This API is not thread-safe. Only one thread can use it at a time.
It’s not reentrant. While using a regex, calling another function that
might use a regex means you have to recompile when it returns, just in
case. The global state being entirely hidden, there’s no way to tell
if another part of the program used it.</p>

<h4 id="fixing-bsd-regular-expressions">Fixing BSD Regular Expressions</h4>

<p>This API has been deprecated for some time now, so hopefully no one’s
using it anymore. 15 years after the BSD regex API came out, POSIX
standardized <a href="http://man7.org/linux/man-pages/man3/regcomp.3.html">a much better API</a>. It operates on an opaque
<code class="language-plaintext highlighter-rouge">regex_t</code> object, on which all state is stored. There’s no global
state.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>    <span class="nf">regcomp</span><span class="p">(</span><span class="n">regex_t</span> <span class="o">*</span><span class="n">preg</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">regex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">cflags</span><span class="p">);</span>
<span class="kt">int</span>    <span class="nf">regexec</span><span class="p">(</span><span class="k">const</span> <span class="n">regex_t</span> <span class="o">*</span><span class="n">preg</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">string</span><span class="p">,</span> <span class="p">...);</span>
<span class="kt">size_t</span> <span class="nf">regerror</span><span class="p">(</span><span class="kt">int</span> <span class="n">errcode</span><span class="p">,</span> <span class="k">const</span> <span class="n">regex_t</span> <span class="o">*</span><span class="n">preg</span><span class="p">,</span> <span class="p">...);</span>
<span class="kt">void</span>   <span class="nf">regfree</span><span class="p">(</span><span class="n">regex_t</span> <span class="o">*</span><span class="n">preg</span><span class="p">);</span>
</code></pre></div></div>

<p>This is what a good API looks like.</p>
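
<p>As a minimal sketch of why this matters, the function below keeps two compiled patterns alive at the same time, which the BSD API cannot do. The name <code class="language-plaintext highlighter-rouge">matches_both</code> is made up for illustration; only <code class="language-plaintext highlighter-rouge">regcomp()</code>, <code class="language-plaintext highlighter-rouge">regexec()</code>, and <code class="language-plaintext highlighter-rouge">regfree()</code> come from POSIX.</p>

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s matches both patterns, 0 if it does not, and -1 if
 * either pattern fails to compile. Each compiled regex lives in its
 * own regex_t, so the two never interfere with one another. */
int matches_both(const char *pat_a, const char *pat_b, const char *s)
{
    regex_t a, b;
    if (regcomp(&a, pat_a, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    if (regcomp(&b, pat_b, REG_EXTENDED | REG_NOSUB) != 0) {
        regfree(&a);
        return -1;
    }
    int result = regexec(&a, s, 0, NULL, 0) == 0 &&
                 regexec(&b, s, 0, NULL, 0) == 0;
    regfree(&a);
    regfree(&b);
    return result;
}
```

<p>Because all state hangs off the two <code class="language-plaintext highlighter-rouge">regex_t</code> objects, the same function can be called from several threads at once without any coordination.</p>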

<h3 id="getopt">Getopt</h3>

<p>POSIX defines a C API called Getopt for parsing command line
arguments. It’s a single function that operates on the <code class="language-plaintext highlighter-rouge">argc</code> and
<code class="language-plaintext highlighter-rouge">argv</code> values provided to <code class="language-plaintext highlighter-rouge">main()</code>. An option string specifies which
options are valid and whether or not they require an argument. Typical
use looks like this,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">option</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">option</span> <span class="o">=</span> <span class="n">getopt</span><span class="p">(</span><span class="n">argc</span><span class="p">,</span> <span class="n">argv</span><span class="p">,</span> <span class="s">"ab:c:d"</span><span class="p">))</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">switch</span> <span class="p">(</span><span class="n">option</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">case</span> <span class="sc">'a'</span><span class="p">:</span>
            <span class="cm">/* ... */</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="cm">/* ... */</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">b</code> and <code class="language-plaintext highlighter-rouge">c</code> options require an argument, indicated by the colons.
When encountered, this argument is passed through a global variable
<code class="language-plaintext highlighter-rouge">optarg</code>. There are four external global variables in total.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">char</span> <span class="o">*</span><span class="n">optarg</span><span class="p">;</span>
<span class="k">extern</span> <span class="kt">int</span> <span class="n">optind</span><span class="p">,</span> <span class="n">opterr</span><span class="p">,</span> <span class="n">optopt</span><span class="p">;</span>
</code></pre></div></div>

<p>If an invalid option is found, <code class="language-plaintext highlighter-rouge">getopt()</code> will automatically print a
locale-specific error message and return <code class="language-plaintext highlighter-rouge">?</code>. The <code class="language-plaintext highlighter-rouge">opterr</code> variable
can be used to disable this message and the <code class="language-plaintext highlighter-rouge">optopt</code> variable is used
to get the actual invalid option character.</p>

<p>The <code class="language-plaintext highlighter-rouge">optind</code> variable keeps track of Getopt’s progress. It slides
along <code class="language-plaintext highlighter-rouge">argv</code> as each option is processed. In a minimal, strictly
POSIX-compliant Getopt, this is all the global state required.</p>

<p>The <code class="language-plaintext highlighter-rouge">argc</code> value in <code class="language-plaintext highlighter-rouge">main()</code>, and therefore the same parameter in
<code class="language-plaintext highlighter-rouge">getopt()</code>, is completely redundant and serves no real purpose. Just
like the C strings it points to, the <code class="language-plaintext highlighter-rouge">argv</code> vector is guaranteed to be
NULL-terminated. At best it’s a premature optimization.</p>

<h4 id="threading-an-reentrancy">Threading and Reentrancy</h4>

<p>The most immediate problem is that the entire program can only parse
one argument vector at a time. It’s not thread-safe. This leaves out
the possibility of parsing argument vectors in other threads. For
example, if the program is a server that exposes a shell-like
interface to remote users, and multiple threads are used to handle
those requests, it won’t be able to take advantage of Getopt.</p>

<p>The second problem is that, even in a single-threaded application, the
program can’t pause to parse a different argument vector before
returning. It’s not reentrant. For example, suppose one of the
arguments to the program is a string containing more arguments to be
parsed for some subsystem.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#  -s    Provide a set of sub-options to pass to XXX.
$ myprogram -s "-a -b -c foo"
</code></pre></div></div>

<p>In theory, the value of <code class="language-plaintext highlighter-rouge">optind</code> could be saved and restored. However,
this isn’t portable. POSIX doesn’t explicitly declare that the entire
state is captured by <code class="language-plaintext highlighter-rouge">optind</code>, nor is it required to be.
Implementations are allowed to have internal, hidden global state.
This has implications in resetting Getopt.</p>

<h4 id="resetting-getopt">Resetting Getopt</h4>

<p>In a minimal, strict Getopt, resetting Getopt for parsing another
argument vector is just a matter of setting <code class="language-plaintext highlighter-rouge">optind</code> back to its
original value of 1. However, this idiom isn’t portable, and POSIX
provides no portable method for resetting the global parser state.</p>

<p>Real implementations of Getopt go beyond POSIX. Probably the most
popular extra feature is option grouping. Typically, multiple options
can be grouped into a single argument, so long as only the final
option requires an argument.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ myprogram -adb foo
</code></pre></div></div>

<p>After processing <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">optind</code> cannot be incremented, because it’s
still working on the first argument. This means there’s another
internal counter for stepping across the group. In glibc this is
called <code class="language-plaintext highlighter-rouge">nextchar</code>. Setting <code class="language-plaintext highlighter-rouge">optind</code> to 1 will not reset this internal
counter, nor would it be detectable by Getopt if it was already set
to 1. The glibc way to reset Getopt is to set <code class="language-plaintext highlighter-rouge">optind</code> to 0, which is
otherwise an invalid value. Some other Getopt implementations follow
this idiom, but it’s not entirely portable.</p>

<p>Not only does Getopt have nasty global state, the user has no way to
reliably control it!</p>

<h4 id="error-printing">Error Printing</h4>

<p>I mentioned that Getopt will automatically print an error message
unless disabled with <code class="language-plaintext highlighter-rouge">opterr</code>. There’s no way to get at this error
message, should you want to redirect it somewhere else. It’s more
hidden, internal state. You could write your own message, but you’d
lose out on the automatic locale support.</p>

<h4 id="fixing-getopt">Fixing Getopt</h4>

<p>The way Getopt <em>should</em> have been designed was to accept a context
argument and store all state on that context. Following other POSIX
APIs (pthreads, regex), the context itself would be an opaque object.
In typical use it would have automatic (i.e. stack) duration. The
context would either be zero initialized or a function would be
provided to initialize it. It might look something like this (in the
zero-initialized case).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">getopt</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">optstring</span><span class="p">);</span>
</code></pre></div></div>

<p>Instead of <code class="language-plaintext highlighter-rouge">optarg</code> and <code class="language-plaintext highlighter-rouge">optopt</code> global variables, these values would
be obtained by interrogating the context. The same applies for
<code class="language-plaintext highlighter-rouge">optind</code> and the diagnostic message.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">getopt_optarg</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
<span class="kt">int</span>         <span class="nf">getopt_optopt</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
<span class="kt">int</span>         <span class="nf">getopt_optind</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">getopt_opterr</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
</code></pre></div></div>

<p>Alternatively, instead of <code class="language-plaintext highlighter-rouge">getopt_optind()</code> the API could have a
function that continues processing, but returns non-option arguments
instead of options. It would return NULL when no more arguments are
left. This is the API I’d prefer, because it would allow for argument
permutation (allow options to come after non-options, per GNU Getopt)
without actually modifying the argument vector. This common extension
to Getopt could be added cleanly. The real Getopt isn’t designed well
for extension.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">getopt_next_arg</span><span class="p">(</span><span class="n">getopt_t</span> <span class="o">*</span><span class="n">ctx</span><span class="p">);</span>
</code></pre></div></div>

<p>This API eliminates the global state and, as a result, solves <em>all</em> of
the problems listed above. It’s essentially the same API defined by
<a href="http://linux.die.net/man/3/popt">Popt</a> and my own embeddable <a href="https://github.com/skeeto/optparse">Optparse</a>. They’re much
better options if the limitations of POSIX-style Getopt are an issue.</p>
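
<p>To make the idea concrete, here is a minimal sketch of a context-based parser. It is not the API of Popt or Optparse. The names <code class="language-plaintext highlighter-rouge">struct optctx</code>, <code class="language-plaintext highlighter-rouge">optctx_init()</code>, and <code class="language-plaintext highlighter-rouge">optctx_next()</code> are invented for illustration, and for brevity the context is a plain struct whose fields are read directly rather than an opaque object with accessor functions. It handles grouped options and <code class="language-plaintext highlighter-rouge">--</code>, and nothing else.</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical context: all parser state lives here, none is global. */
struct optctx {
    char **argv;
    int optind;         /* next argv element to examine */
    int subopt;         /* offset inside a grouped argument (-adb) */
    int optopt;         /* most recent option character */
    const char *optarg; /* argument of the most recent option */
};

void optctx_init(struct optctx *ctx, char **argv)
{
    ctx->argv = argv;
    ctx->optind = 1;
    ctx->subopt = 0;
    ctx->optopt = 0;
    ctx->optarg = NULL;
}

/* Returns the next option character, '?' for an unknown option or a
 * missing argument, or -1 when there are no more options. */
int optctx_next(struct optctx *ctx, const char *optstring)
{
    const char *arg = ctx->argv[ctx->optind];
    ctx->optarg = NULL;
    if (arg == NULL || arg[0] != '-' || arg[1] == '\0')
        return -1;
    if (arg[1] == '-' && arg[2] == '\0') {
        ctx->optind++;  /* "--" explicitly ends the options */
        return -1;
    }
    const char *p = arg + 1 + ctx->subopt;
    const char *spec = strchr(optstring, *p);
    ctx->optopt = (unsigned char)*p;
    if (*p == ':' || spec == NULL || spec[1] != ':') {
        /* Unknown option or a plain flag: step within the group. */
        if (p[1] != '\0') {
            ctx->subopt++;
        } else {
            ctx->subopt = 0;
            ctx->optind++;
        }
        return (*p == ':' || spec == NULL) ? '?' : *p;
    }
    /* Option takes an argument: rest of this element, or the next. */
    ctx->subopt = 0;
    ctx->optind++;
    if (p[1] != '\0') {
        ctx->optarg = p + 1;
    } else if (ctx->argv[ctx->optind] != NULL) {
        ctx->optarg = ctx->argv[ctx->optind++];
    } else {
        return '?';
    }
    return *p;
}
```

<p>Two contexts can be interleaved freely, so a program can pause halfway through its own arguments, parse a sub-option string with a second context, and then carry on where it left off. Resetting is just another call to <code class="language-plaintext highlighter-rouge">optctx_init()</code>.</p>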

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>The Billion Pi Challenge</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/09/18/"/>
    <id>urn:uuid:4e6ada8b-9f8a-3ba6-03d4-5617a34c49f1</id>
    <updated>2014-09-18T02:32:01Z</updated>
    <category term="c"/><category term="math"/>
    <content type="html">
      <![CDATA[<p><em>The challenge</em>: As quickly as possible, find all occurrences of a
given sequence of digits in the first one billion digits of pi. You
<a href="https://stuff.mit.edu/afs/sipb/contrib/pi/">don’t have to compute pi yourself</a> for this challenge. For
example, “141592653” appears 4 times: at positions 1, 427,238,911,
570,434,346, and 678,096,434.</p>

<p>To my surprise, this turned out to be harder than I expected. A
straightforward scan with <a href="http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm">Boyer-Moore-Horspool</a> across the
entire text file is already pretty fast. On modern, COTS hardware it
takes about 6 seconds. Comparing bytes is cheap and it’s largely an
I/O-bound problem. This means building fancy indexes tends to make it
<em>slower</em> because it’s more I/O demanding.</p>

<p>The challenge was inspired by <a href="http://www.angio.net/pi/piquery.html">The Pi-Search Page</a>, which
offers a search on the first 200 million digits. There’s also a little
write-up about how their pi search works. I wanted to try to invent my
own solution. I did eventually come up with something that worked,
which can be found here. It’s written in plain old C.</p>

<ul>
  <li><a href="https://github.com/skeeto/pi-pattern">https://github.com/skeeto/pi-pattern</a></li>
</ul>

<p>You might want to give the challenge a shot on your own before
continuing!</p>

<h3 id="sqlite">SQLite</h3>

<p>The first thing I tried was SQLite. I thought an index (B-tree) over
fixed-length substrings would be efficient. A <code class="language-plaintext highlighter-rouge">LIKE</code> condition with a
right-hand wildcard is <a href="http://en.wikipedia.org/wiki/Sargable">sargable</a> and would work well with the
index. Here’s the schema I tried.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">digits</span>
<span class="p">(</span><span class="k">position</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span> <span class="n">sequence</span> <span class="nb">TEXT</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">)</span>
</code></pre></div></div>

<p>There will be 1 row for each position, i.e. 1 billion rows. Using
<code class="language-plaintext highlighter-rouge">INTEGER PRIMARY KEY</code> means <code class="language-plaintext highlighter-rouge">position</code> will be used directly for row
IDs, saving some database space.</p>

<p><em>After</em> the data has been inserted by sliding a window along pi, I
build an index. It’s better to build an index after data is in the
database than before.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">sequence_index</span> <span class="k">ON</span> <span class="n">digits</span> <span class="p">(</span><span class="n">sequence</span><span class="p">,</span> <span class="k">position</span><span class="p">)</span>
</code></pre></div></div>

<p>This takes several hours to complete. When it’s done the database is
<em>a whopping 60GB!</em> Remember I said that this is very much an I/O-bound
problem? I wasn’t kidding. This doesn’t work well at all. Here’s
a search for the example sequence.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">position</span><span class="p">,</span> <span class="n">sequence</span> <span class="k">FROM</span> <span class="n">digits</span>
<span class="k">WHERE</span> <span class="n">sequence</span> <span class="k">LIKE</span> <span class="s1">'141592653%'</span>
</code></pre></div></div>

<p>You get your answers after about 15 minutes of hammering on the disk.</p>

<p>Sometime later I realized that sequences of up to 18 digits could be
encoded into an integer, so that <code class="language-plaintext highlighter-rouge">TEXT</code> column could be a much simpler
<code class="language-plaintext highlighter-rouge">INTEGER</code>. Unfortunately this doesn’t really improve anything. I also
tried this in PostgreSQL but it was even worse. I gave up after 24
hours of waiting on it. These databases are not built for such long,
skinny tables, at least not without beefy hardware.</p>

<h3 id="offset-db">Offset DB</h3>

<p>A couple weeks later I had another idea. A query is just a sequence of
digits, so it can be trivially converted into a unique number. As
before, pick a fixed length for sequences (<code class="language-plaintext highlighter-rouge">n</code>) for the index and an
appropriate stride. The database would be one big file. To look up a
sequence, treat that sequence as an offset into the database and seek
into the database file to that offset times the stride. The total size
of the database is <code class="language-plaintext highlighter-rouge">10^n * stride</code>.</p>

<p>In this quick and dirty illustration, n=4 and stride=4 (far too small
for that n).</p>

<p><img src="/img/diagram/pi-stride.png" alt="" /></p>

<p>For example, if the fixed-length for sequences is 6 and the stride is
4,000 bytes, looking up “141592” is just a matter of seeking to byte
<code class="language-plaintext highlighter-rouge">141,592 * 4,000</code> and reading in positions until some sort of
sentinel. The stride must be long enough to store all the positions
for any indexed sequence.</p>
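
<p>The offset arithmetic is simple enough to sketch in a few lines. The function name <code class="language-plaintext highlighter-rouge">seq_offset</code> is hypothetical, and this is an illustration of the idea rather than code from the repository.</p>

```c
#include <stdint.h>

/* Byte offset of a sequence's slot in a fixed-stride database. The
 * digits are parsed as a decimal number, then scaled by the stride.
 * Returns -1 if the string contains a non-digit. */
int64_t seq_offset(const char *digits, int64_t stride)
{
    int64_t value = 0;
    for (const char *p = digits; *p; p++) {
        if (*p < '0' || *p > '9')
            return -1;
        value = value * 10 + (*p - '0');
    }
    return value * stride;
}
```

<p>For “141592” and a 4,000 byte stride this gives byte 566,368,000, which is the offset to hand to a seek function before reading positions.</p>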

<p>For this purpose, the digits of pi are practically random numbers. The
good news is that it means a fixed stride will work well. Any
particular sequence appears just as often as any other. The chance a
specific n-length sequence begins at a specific position is <code class="language-plaintext highlighter-rouge">1 /
10^n</code>. There are 1 billion positions, so a particular sequence will
have <code class="language-plaintext highlighter-rouge">1e9 / 10^n</code> positions associated with it, which is a good place
to start for picking a stride.</p>
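
<p>As a quick sanity check on those numbers, here is the same arithmetic as code. Both function names are made up for this sketch, and it assumes the digits really are uniformly distributed.</p>

```c
#include <stdint.h>

/* Expected number of positions for any one n-digit sequence among
 * `total` digits, assuming uniformly random digits: total / 10^n. */
uint64_t expected_positions(uint64_t total, int n)
{
    uint64_t sequences = 1;
    for (int i = 0; i < n; i++)
        sequences *= 10;
    return total / sequences;
}

/* Total size in bytes of a fixed-stride database: 10^n * stride. */
uint64_t database_size(int n, uint64_t stride)
{
    uint64_t sequences = 1;
    for (int i = 0; i < n; i++)
        sequences *= 10;
    return sequences * stride;
}
```

<p>With n=6 each sequence is expected about 1,000 times. If a position takes 4 bytes (an assumption, not something stated above), that is 4,000 bytes per sequence and a 4GB database. Actual counts scatter around the average, so a usable stride needs some headroom on top of this.</p>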

<p>The bad news is that building the index means jumping around the
database essentially at random for each write. This will break any
sort of cache between the program and the hard drive. It’s incredibly
slow, even mmap()ed. The workaround is to either do it entirely in RAM
(needs at least 6GB of RAM for 1 billion digits!) or to build it up
over many passes. I didn’t try it on an SSD but maybe the random
access is more tolerable there.</p>

<h4 id="adding-an-index">Adding an Index</h4>

<p>Doing all the work in memory makes it easier to improve the database
format anyway. It can be broken into an index section and a tables
section. Instead of a fixed stride for the data, front-load the
database with a similar index that points to the section (table) of
the database file that holds that sequence’s pi positions. Each of the
<code class="language-plaintext highlighter-rouge">10^n</code> possible sequences gets a single integer in the index at the front of
the file. Looking up the positions for a sequence means parsing the
sequence as a number, seeking to that offset into the beginning of the
database, reading in another offset integer, and then seeking to that
new offset. Now the database is compact and there are no concerns
about stride.</p>

<p>No sentinel mark is needed either. The tables are concatenated in
order in the table part of the database. To determine where to stop,
take a peek at the <em>next</em> sequence’s start offset in the index. Its
table immediately follows, so this doubles as an end offset. For
convenience, one final integer in the index will point just beyond the
end of the database, so the last sequence (99999…) doesn’t require
special handling.</p>
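
<p>The index-plus-tables layout can be sketched in memory with a counting sort. This is an illustration under my own assumptions, not the on-disk format used by the repository: <code class="language-plaintext highlighter-rouge">struct pidb</code>, <code class="language-plaintext highlighter-rouge">pidb_build()</code>, and <code class="language-plaintext highlighter-rouge">pidb_lookup()</code> are hypothetical names, positions are 32-bit and 1-based, and the input is assumed to contain only digits.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pidb {
    uint32_t nseq;    /* number of indexed sequences, 10^n */
    uint32_t *index;  /* nseq + 1 start offsets into table */
    uint32_t *table;  /* pi positions, grouped by sequence */
};

/* Parse the n digits starting at s as a number. */
static uint32_t seq_at(const char *s, int n)
{
    uint32_t v = 0;
    for (int i = 0; i < n; i++)
        v = v * 10 + (uint32_t)(s[i] - '0');
    return v;
}

/* Build the database from a digit string. Returns 0 on success. */
int pidb_build(struct pidb *db, const char *digits, int n)
{
    size_t len = strlen(digits);
    if (n < 1 || n > 9 || len < (size_t)n)
        return -1;
    size_t count = len - (size_t)n + 1;
    db->nseq = 1;
    for (int i = 0; i < n; i++)
        db->nseq *= 10;
    db->index = calloc((size_t)db->nseq + 1, sizeof(*db->index));
    db->table = malloc(count * sizeof(*db->table));
    uint32_t *next = malloc(db->nseq * sizeof(*next));
    if (!db->index || !db->table || !next) {
        free(db->index);
        free(db->table);
        free(next);
        return -1;
    }
    /* Count occurrences one slot ahead, then prefix-sum the counts
     * so that index[s] is where the table for s begins. */
    for (size_t i = 0; i < count; i++)
        db->index[seq_at(digits + i, n) + 1]++;
    for (uint32_t s = 0; s < db->nseq; s++)
        db->index[s + 1] += db->index[s];
    /* Fill each table with 1-based positions in scan order. */
    memcpy(next, db->index, db->nseq * sizeof(*next));
    for (size_t i = 0; i < count; i++) {
        uint32_t s = seq_at(digits + i, n);
        db->table[next[s]++] = (uint32_t)(i + 1);
    }
    free(next);
    return 0;
}

/* Return the table for seq and store its length in *count. The next
 * sequence's start offset doubles as this one's end offset. */
const uint32_t *pidb_lookup(const struct pidb *db, uint32_t seq,
                            size_t *count)
{
    *count = db->index[seq + 1] - db->index[seq];
    return db->table + db->index[seq];
}
```

<p>The extra entry at <code class="language-plaintext highlighter-rouge">index[nseq]</code> is the “one final integer” described above, so <code class="language-plaintext highlighter-rouge">pidb_lookup()</code> needs no special case for the last sequence.</p>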

<h4 id="searching-shorter-and-longer">Searching Shorter and Longer</h4>

<p>If the database is built for fixed-length sequences, how is a sequence of
a different length searched? The two cases, shorter and longer, are
handled differently.</p>

<p>If the sequence is <em>shorter</em>, fill in the remaining digits, …000 to
…999, and look up each sequence. For example, if n=6 and we’re
searching for “1415”, get all the positions for “141500”, “141501”,
“141502”, …, “141599” and concatenate them. Fortunately the database
already has them stored this way! Look up the offsets for “141500” and
“141600” and grab everything in between. The downside is that the pi
positions are only partially sorted, so they may require sorting
before presenting to the user.</p>
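
<p>The range of indexed sequences covered by a short query can be computed directly. A small sketch, with the hypothetical name <code class="language-plaintext highlighter-rouge">prefix_range</code>:</p>

```c
#include <stdint.h>
#include <string.h>

/* For a query of at most n digits, compute the half-open range of
 * indexed sequences [*lo, *hi) that begin with it. Returns 0 on
 * success, -1 if the query is empty, too long, or not all digits. */
int prefix_range(const char *query, int n, uint32_t *lo, uint32_t *hi)
{
    size_t k = strlen(query);
    if (k == 0 || n > 9 || k > (size_t)n)
        return -1;
    uint32_t value = 0;
    for (size_t i = 0; i < k; i++) {
        if (query[i] < '0' || query[i] > '9')
            return -1;
        value = value * 10 + (uint32_t)(query[i] - '0');
    }
    uint32_t scale = 1;  /* 10^(n - k): indexed sequences per prefix */
    for (size_t i = k; i < (size_t)n; i++)
        scale *= 10;
    *lo = value * scale;
    *hi = (value + 1) * scale;
    return 0;
}
```

<p>For “1415” with n=6 this yields 141500 and 141600. The positions are everything between those two index entries. For “9999” the upper bound is 1000000, which is exactly the extra index entry that points just past the end of the database.</p>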

<p>If the sequence is <em>longer</em>, the original digits file will be needed.
Get the table for the sequence’s fixed-length prefix, then seek into
the digits file checking each of the pi positions for a full match.
This requires lots of extra seeking, but a long sequence will
naturally have fewer positions to test. For example, if n=7 and we’re
looking for “141592653”, look up the “1415926” table in the database
and check each of its 106 positions.</p>
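
<p>The checking step might look something like the sketch below. To keep it self-contained it compares against digits already in memory, where a real program would seek into the digits file instead. The name <code class="language-plaintext highlighter-rouge">filter_matches</code> is hypothetical, and positions are assumed to be 1-based.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Keep only the candidate positions (1-based) where the full query
 * appears in digits. Writes the survivors to out, which must have
 * room for count entries, and returns how many there were. */
size_t filter_matches(const char *digits, size_t ndigits,
                      const uint32_t *candidates, size_t count,
                      const char *query, uint32_t *out)
{
    size_t qlen = strlen(query);
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        size_t start = candidates[i] - 1;  /* to a 0-based offset */
        if (start + qlen > ndigits)
            continue;  /* query would run past the end */
        if (memcmp(digits + start, query, qlen) == 0)
            out[kept++] = candidates[i];
    }
    return kept;
}
```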

<p>With this database, searches take only a few milliseconds (though very
much subject to cache misses). Here’s my program in action, from the
repository linked above.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time ./pipattern 141592653
1: 14159265358979323
427238911: 14159265303126685
570434346: 14159265337906537
678096434: 14159265360713718

real	0m0.004s
user	0m0.000s
sys	0m0.000s
</code></pre></div></div>

<p>I call that challenge completed!</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>C11 Lock-free Stack</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/09/02/"/>
    <id>urn:uuid:743811a4-aaf7-32e3-8a0c-62f1e8dbaf66</id>
    <updated>2014-09-02T03:10:01Z</updated>
    <category term="c"/><category term="tutorial"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>C11, the <a href="http://en.wikipedia.org/wiki/C11_(C_standard_revision)">latest C standard revision</a>, hasn’t received anywhere
near the same amount of fanfare as C++11. I’m not sure why this is.
Some of the updates to each language are very similar, such as formal
support for threading and atomic object access. Three years have
passed and some parts of C11 still haven’t been implemented by any
compilers or standard libraries yet. Since there’s not yet a lot of
discussion online about C11, I’m basing much of this article on my own
understanding of the <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">C11 draft</a>. I <em>may</em> be under-using the
<code class="language-plaintext highlighter-rouge">_Atomic</code> type specifier and not paying enough attention to memory
ordering constraints.</p>

<p>Still, this is a good opportunity to break new ground with a
demonstration of C11. I’m going to use the new
<a href="http://en.cppreference.com/w/c/atomic"><code class="language-plaintext highlighter-rouge">stdatomic.h</code></a> portion of C11 to build a lock-free data
structure. To compile this code you’ll need a C compiler and C library
with support for both C11 and the optional <code class="language-plaintext highlighter-rouge">stdatomic.h</code> features. As
of this writing, as far as I know only <a href="https://gcc.gnu.org/gcc-4.9/changes.html">GCC 4.9</a>, released April
2014, supports this. It’s in Debian unstable but not in Wheezy.</p>

<p>If you want to take a look before going further, here’s the source.
The test code in the repository uses plain old pthreads because C11
threads haven’t been implemented by anyone yet.</p>

<ul>
  <li><a href="https://github.com/skeeto/lstack">https://github.com/skeeto/lstack</a></li>
</ul>

<p>I was originally going to write this article a couple weeks ago, but I
was having trouble getting it right. Lock-free data structures are
trickier and nastier than I expected, more so than traditional mutex
locks. Getting it right requires very specific help from the hardware,
too, so it won’t run just anywhere. I’ll discuss all this below. So
sorry for the long article. It’s just a lot more complex a topic than
I had anticipated!</p>

<h3 id="lock-free">Lock-free</h3>

<p>A lock-free data structure doesn’t require the use of mutex locks.
More generally, it’s a data structure that can be accessed from
multiple threads without blocking. This is accomplished through the
use of atomic operations — transformations that cannot be
interrupted. Lock-free data structures will generally provide better
throughput than mutex locks. They’re usually safer, too, because there’s
no risk of getting stuck on a lock that will never be freed, such as a
deadlock situation. On the other hand there’s additional risk of
starvation (livelock), where a thread is unable to make progress.</p>
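
<p>To give a feel for what an atomic operation looks like in C11, here is a small sketch that is separate from the stack itself. It doubles a shared counter with a compare-and-swap retry loop. The function name <code class="language-plaintext highlighter-rouge">atomic_double</code> is made up; <code class="language-plaintext highlighter-rouge">atomic_load()</code> and <code class="language-plaintext highlighter-rouge">atomic_compare_exchange_weak()</code> come from <code class="language-plaintext highlighter-rouge">stdatomic.h</code>.</p>

```c
#include <stdatomic.h>

/* Atomically double the counter using a compare-and-swap retry loop.
 * Returns the value that was replaced. */
int atomic_double(atomic_int *counter)
{
    int old = atomic_load(counter);
    /* If another thread changed the counter in the meantime, the
     * exchange fails, old is refreshed with the current value, and
     * the loop retries with a newly computed replacement. */
    while (!atomic_compare_exchange_weak(counter, &old, old * 2))
        ;
    return old;
}
```

<p>No thread ever waits on a lock here, but a thread that keeps losing the race keeps retrying, which is the starvation risk mentioned above.</p>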

<p>As a demonstration, I’ll build up a lock-free stack, a sequence with
last-in, first-out (LIFO) behavior. Internally it’s going to be
implemented as a linked-list, so pushing and popping is O(1) time,
just a matter of consing a new element on the head of the list. It
also means there’s only one value to be updated when pushing and
popping: the pointer to the head of the list.</p>

<p>Here’s what the API will look like. I’ll define <code class="language-plaintext highlighter-rouge">lstack_t</code> shortly.
I’m making it an opaque type because its fields should never be
accessed directly. The goal is to completely hide the atomic
semantics from the users of the stack.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>     <span class="nf">lstack_init</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">max_size</span><span class="p">);</span>
<span class="kt">void</span>    <span class="nf">lstack_free</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
<span class="kt">size_t</span>  <span class="nf">lstack_size</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
<span class="kt">int</span>     <span class="nf">lstack_push</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
<span class="kt">void</span>   <span class="o">*</span><span class="nf">lstack_pop</span> <span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
</code></pre></div></div>

<p>Users can push void pointers onto the stack, check the size of the
stack, and pop void pointers back off the stack. Except for
initialization and destruction, these operations are all safe to use
from multiple threads. Two different threads will never receive the
same item when popping. No elements will ever be lost if two threads
attempt to push at the same time. Most importantly a thread will never
block on a lock when accessing the stack.</p>

<p>Notice there’s a maximum size declared at initialization time. While
<a href="http://www.research.ibm.com/people/m/michael/pldi-2004.pdf">lock-free allocation is possible</a> [PDF], C makes no
guarantees that <code class="language-plaintext highlighter-rouge">malloc()</code> is lock-free, so being truly lock-free
means not calling <code class="language-plaintext highlighter-rouge">malloc()</code>. An important secondary benefit to
pre-allocating the stack’s memory is that this implementation doesn’t
require the use of <a href="http://en.wikipedia.org/wiki/Hazard_pointer">hazard pointers</a>, which would be far more
complicated than the stack itself.</p>

<p>The declared maximum size should actually be the desired maximum size
plus the number of threads accessing the stack. This is because a
thread might remove a node from the stack and, before the node can be
freed for reuse, another thread attempts a push. This other thread
might not find any free nodes, causing it to give up without the stack
actually being “full.”</p>
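<p>As a rough sketch of that sizing rule, a caller could compute the
argument for <code class="language-plaintext highlighter-rouge">lstack_init()</code> with a small helper. The name
<code class="language-plaintext highlighter-rouge">padded_capacity</code> is hypothetical and not part of lstack. It only
adds the thread count to the desired capacity and refuses to overflow.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of lstack: returns the value to pass as
 * max_size, or 0 if the sum would not fit in a size_t. */
static size_t
padded_capacity(size_t desired, size_t nthreads)
{
    assert(nthreads > 0);               /* at least one thread uses it */
    if (desired > SIZE_MAX - nthreads)
        return 0;                       /* would overflow */
    return desired + nthreads;          /* one spare node per thread */
}
```

<p>With 8 threads and room for 1000 values, this asks for 1008 nodes.</p>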

<p>The <code class="language-plaintext highlighter-rouge">int</code> return value of <code class="language-plaintext highlighter-rouge">lstack_init()</code> and <code class="language-plaintext highlighter-rouge">lstack_push()</code> is for
error codes, returning 0 for success. The only way these can fail is
by running out of memory. This is an issue regardless of being
lock-free: systems can simply run out of memory. In the push case it
means the stack is full.</p>

<h3 id="structures">Structures</h3>

<p>Here’s the definition for a node in the stack. Neither field needs to
be accessed atomically, so they’re not special in any way. In fact,
the fields are <em>never</em> updated while on the stack and visible to
multiple threads, so it’s effectively immutable (outside of reuse).
Users never need to touch this structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">lstack_node</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Internally a <code class="language-plaintext highlighter-rouge">lstack_t</code> is composed of <em>two</em> stacks: the value stack
(<code class="language-plaintext highlighter-rouge">head</code>) and the free node stack (<code class="language-plaintext highlighter-rouge">free</code>). These will be handled
identically by the atomic functions, so it’s really a matter of
convention which stack is which. All nodes are initially placed on the
free stack and the value stack starts empty. Here’s what an internal
stack looks like.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">lstack_head</span> <span class="p">{</span>
    <span class="kt">uintptr_t</span> <span class="n">aba</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>There’s still no atomic declaration here because the struct is going
to be handled as an entire unit. The <code class="language-plaintext highlighter-rouge">aba</code> field is critically
important for correctness and I’ll go over it shortly. It’s declared
as a <code class="language-plaintext highlighter-rouge">uintptr_t</code> because it needs to be the same size as a pointer.
Now, this is not guaranteed by C11 — it’s only guaranteed to be large
enough to hold any valid <code class="language-plaintext highlighter-rouge">void *</code> pointer, so it could be even larger
— but this will be the case on any system that has the required
hardware support for this lock-free stack. This struct is therefore
the size of two pointers. If that’s not true for any reason, this code
will not link. Users will never directly access or handle this struct
either.</p>
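<p>If you would rather find out at compile time than at link time, the
size assumption can be spelled out with static assertions. This is only
a sketch: the <code class="language-plaintext highlighter-rouge">demo_</code> structs are stand-ins that mirror the ones above
so the snippet compiles on its own, and the checks are my addition, not
something the library does.</p>

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Stand-ins mirroring the article's structs; redefined here only so the
 * sketch is self-contained. */
struct demo_node {
    void *value;
    struct demo_node *next;
};

struct demo_head {
    uintptr_t aba;
    struct demo_node *node;
};

/* Fail the build early if the counter and the pointer differ in size,
 * or if padding makes the pair wider than two pointers. */
static_assert(sizeof(uintptr_t) == sizeof(void *),
              "aba counter must be pointer-sized");
static_assert(sizeof(struct demo_head) == 2 * sizeof(void *),
              "head must be exactly two pointers wide");
```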

<p>Finally, here’s the actual stack structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node_buffer</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">head</span><span class="p">,</span> <span class="n">free</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
<span class="p">}</span> <span class="n">lstack_t</span><span class="p">;</span>
</code></pre></div></div>

<p>Notice the use of the new <code class="language-plaintext highlighter-rouge">_Atomic</code> qualifier. Atomic values may have
different size, representation, and alignment requirements in order to
satisfy atomic access. These values should never be accessed directly,
even just for reading (use <code class="language-plaintext highlighter-rouge">atomic_load()</code>).</p>
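<p>Here’s a minimal sketch of what “never access directly” looks like in
practice, using a made-up <code class="language-plaintext highlighter-rouge">struct counter</code> rather than the stack
itself. Every read and write of the atomic field goes through a
<code class="language-plaintext highlighter-rouge">stdatomic.h</code> function.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical example type, unrelated to lstack. */
struct counter {
    _Atomic size_t hits;
};

static void
counter_init(struct counter *c)
{
    assert(c != NULL);
    atomic_init(&c->hits, 0);           /* not atomic; do it before sharing */
}

static void
counter_hit(struct counter *c)
{
    atomic_fetch_add(&c->hits, 1);      /* atomic read-modify-write */
}

static size_t
counter_read(struct counter *c)
{
    return atomic_load(&c->hits);       /* atomic read */
}
```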

<p>The <code class="language-plaintext highlighter-rouge">size</code> field is for convenience to check the number of elements on
the stack. It’s accessed separately from the stack nodes themselves,
so it’s not safe to read <code class="language-plaintext highlighter-rouge">size</code> and use the information to make
assumptions about future accesses (e.g. checking if the stack is empty
before popping off an element). Since there’s no way to lock the
lock-free stack, there’s otherwise no way to estimate the size of the
stack during concurrent access without completely disassembling it via
<code class="language-plaintext highlighter-rouge">lstack_pop()</code>.</p>

<p>There’s <a href="https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt">no reason to use <code class="language-plaintext highlighter-rouge">volatile</code> here</a>. That’s a
separate issue from atomic operations. The C11 <code class="language-plaintext highlighter-rouge">stdatomic.h</code> macros
and functions will ensure atomic values are accessed appropriately.</p>

<h3 id="stack-functions">Stack Functions</h3>

<p>As stated before, all nodes are initially placed on the internal free
stack. During initialization they’re allocated in one solid chunk,
chained together, and pinned on the <code class="language-plaintext highlighter-rouge">free</code> pointer. The initial
assignments to atomic values are done through <code class="language-plaintext highlighter-rouge">atomic_init()</code>, the
C11 function for initializing an atomic object that already exists
(<code class="language-plaintext highlighter-rouge">ATOMIC_VAR_INIT</code> is only meant for initializers in a declaration).
Initialization itself is not atomic, which is fine because no other
thread may touch the stack until <code class="language-plaintext highlighter-rouge">lstack_init()</code> returns. The <code class="language-plaintext highlighter-rouge">aba</code>
counters don’t <em>actually</em> need to be initialized. Garbage, indeterminate
values are just fine, but not initializing them would probably look like
a mistake.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lstack_init</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">max_size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">head_init</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">};</span>
    <span class="n">atomic_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span><span class="p">,</span> <span class="n">head_init</span><span class="p">);</span>
    <span class="n">atomic_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="cm">/* Pre-allocate all nodes. */</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">max_size</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">lstack_node</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">max_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">next</span> <span class="o">=</span> <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">+</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">[</span><span class="n">max_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">].</span><span class="n">next</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">free_init</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">};</span>
    <span class="n">atomic_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">,</span> <span class="n">free_init</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The free nodes will not necessarily be used in the same order that
they’re placed on the free stack. Several threads may pop off nodes
from the free stack and, as a separate operation, push them onto the
value stack in a different order. Over time with multiple threads
pushing and popping, the nodes are likely to get shuffled around quite
a bit. This is why a linked list is still necessary even though
allocation is contiguous.</p>

<p>The reverse of <code class="language-plaintext highlighter-rouge">lstack_init()</code> is simple, and it’s assumed concurrent
access has terminated. The stack is no longer valid, at least not
until <code class="language-plaintext highlighter-rouge">lstack_init()</code> is used again. This one is declared <code class="language-plaintext highlighter-rouge">inline</code> and
put in the header.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">lstack_free</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To read an atomic value we need to use <code class="language-plaintext highlighter-rouge">atomic_load()</code>. Given a
pointer to an atomic value, it dereferences the pointer and returns
the value. This is used in another inline function for reading the
size of the stack.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">size_t</span>
<span class="nf">lstack_size</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_load</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="push-and-pop">Push and Pop</h4>

<p>For operating on the two stacks there will be two internal, static
functions, <code class="language-plaintext highlighter-rouge">push</code> and <code class="language-plaintext highlighter-rouge">pop</code>. These deal directly in nodes, accepting
and returning them, so they’re not suitable to expose in the API
(users aren’t meant to be aware of nodes). This is the most complex
part of lock-free stacks. Here’s <code class="language-plaintext highlighter-rouge">pop()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span>
<span class="nf">pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="o">*</span><span class="n">head</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">next</span><span class="p">,</span> <span class="n">orig</span> <span class="o">=</span> <span class="n">atomic_load</span><span class="p">(</span><span class="n">head</span><span class="p">);</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">orig</span><span class="p">.</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
            <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>  <span class="c1">// empty stack</span>
        <span class="n">next</span><span class="p">.</span><span class="n">aba</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">aba</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">node</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">atomic_compare_exchange_weak</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">orig</span><span class="p">,</span> <span class="n">next</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s centered around the new C11 <code class="language-plaintext highlighter-rouge">stdatomic.h</code> function
<code class="language-plaintext highlighter-rouge">atomic_compare_exchange_weak()</code>. This is an atomic operation more
generally called <a href="http://en.wikipedia.org/wiki/Compare-and-swap">compare-and-swap</a> (CAS). On x86 there’s an
instruction specifically for this, <code class="language-plaintext highlighter-rouge">cmpxchg</code>. Give it a pointer to the
atomic value to be updated (<code class="language-plaintext highlighter-rouge">head</code>), a pointer to the value it’s
expected to be (<code class="language-plaintext highlighter-rouge">orig</code>), and a desired new value (<code class="language-plaintext highlighter-rouge">next</code>). If the
expected and actual values match, it’s updated to the new value. If
not, it reports a failure and updates the expected value to the latest
value. In the event of a failure we start all over again, which
requires the <code class="language-plaintext highlighter-rouge">while</code> loop. This is an <em>optimistic</em> strategy.</p>

<p>The “weak” part means it will sometimes spuriously fail where the
“strong” version would otherwise succeed. In exchange for those extra
failures, the weak version can be cheaper on platforms that build CAS
out of several instructions. On x86 both compile to the same
instruction, so there’s no difference. Use the weak version
when the body of your <code class="language-plaintext highlighter-rouge">do ... while</code> loop is fast and the strong
version when it’s slow (when trying again is expensive), or if you
don’t need a loop at all. You usually want to use weak.</p>

<p>The alternative to CAS is <a href="http://en.wikipedia.org/wiki/Load-link/store-conditional">load-link/store-conditional</a>. It’s a
stronger primitive that doesn’t suffer from the ABA problem described
next, but it’s also not available on x86-64. On other platforms, one
or both of <code class="language-plaintext highlighter-rouge">atomic_compare_exchange_*()</code> will be implemented using
LL/SC, but we still have to code for the worst case (CAS).</p>

<h5 id="the-aba-problem">The ABA Problem</h5>

<p>The <code class="language-plaintext highlighter-rouge">aba</code> field is here to solve <a href="http://en.wikipedia.org/wiki/ABA_problem">the ABA problem</a> by counting
the number of changes that have been made to the stack. It will be
updated atomically alongside the pointer. Reasoning about the ABA
problem is where I got stuck last time writing this article.</p>

<p>Suppose <code class="language-plaintext highlighter-rouge">aba</code> didn’t exist and it was just a pointer being swapped.
Say we have two threads, A and B.</p>

<ul>
  <li>
    <p>Thread A copies the current <code class="language-plaintext highlighter-rouge">head</code> into <code class="language-plaintext highlighter-rouge">orig</code>, enters the loop body
to update <code class="language-plaintext highlighter-rouge">next.node</code> to <code class="language-plaintext highlighter-rouge">orig.node-&gt;next</code>, then gets preempted
before the CAS. The scheduler pauses the thread.</p>
  </li>
  <li>
    <p>Thread B comes along and performs a <code class="language-plaintext highlighter-rouge">pop()</code>, changing the value pointed
to by <code class="language-plaintext highlighter-rouge">head</code>. At this point A’s CAS will fail, which is fine. It
would reconstruct a new updated value and try again. While A is
still asleep, B puts the popped node back on the free node stack.</p>
  </li>
  <li>
    <p>Some time passes with A still paused. The freed node gets re-used
and pushed back on top of the stack, which is likely given that the
free stack hands nodes out LIFO, most recently freed first. Now
<code class="language-plaintext highlighter-rouge">head</code> has its original value again,
but the <code class="language-plaintext highlighter-rouge">head-&gt;node-&gt;next</code> pointer is pointing somewhere completely
new! <em>This is very bad</em> because A’s CAS will now succeed despite
<code class="language-plaintext highlighter-rouge">next.node</code> having the wrong value.</p>
  </li>
  <li>
    <p>A wakes up and its CAS succeeds. At least one stack value has been
lost and at least one node struct was leaked (it will be on neither
stack, nor currently being held by a thread). This is the ABA
problem.</p>
  </li>
</ul>
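<p>The interleaving above can be replayed deterministically on a single
thread, which makes it easier to stare at. This is a sketch with made-up
names (<code class="language-plaintext highlighter-rouge">naive_head</code>, <code class="language-plaintext highlighter-rouge">replay_aba</code>) and a bare atomic pointer for the
head, with the steps of “thread A” and “thread B” simply written out in
the order they would happen.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct aba_node {
    int id;
    struct aba_node *next;
};

/* A stack head with no counter: just an atomic pointer. */
typedef _Atomic(struct aba_node *) naive_head;

static void
naive_push(naive_head *head, struct aba_node *node)
{
    assert(node != NULL);
    struct aba_node *orig = atomic_load(head);
    do {
        node->next = orig;
    } while (!atomic_compare_exchange_weak(head, &orig, node));
}

static struct aba_node *
naive_pop(naive_head *head)
{
    struct aba_node *orig = atomic_load(head);
    do {
        if (orig == NULL)
            return NULL;
    } while (!atomic_compare_exchange_weak(head, &orig, orig->next));
    return orig;
}

/* Returns the id of the node left at the head after A's stale CAS,
 * or -1 if that CAS was rejected. */
static int
replay_aba(void)
{
    struct aba_node n[3] = {{0, NULL}, {1, NULL}, {2, NULL}};
    naive_head head;
    atomic_init(&head, NULL);
    naive_push(&head, &n[2]);
    naive_push(&head, &n[1]);
    naive_push(&head, &n[0]);                     /* stack: 0 -> 1 -> 2 */

    /* Thread A: read the head, compute the replacement, then "sleep". */
    struct aba_node *a_orig = atomic_load(&head); /* node 0 */
    struct aba_node *a_next = a_orig->next;       /* node 1 */

    /* Thread B: pop 0, pop 1 (and keep it), then push 0 back. */
    struct aba_node *recycled = naive_pop(&head);
    (void)naive_pop(&head);
    naive_push(&head, recycled);                  /* stack: 0 -> 2 */

    /* Thread A wakes up. The head pointer matches what it read, so the
     * CAS succeeds and installs node 1, which B already removed. */
    if (!atomic_compare_exchange_strong(&head, &a_orig, a_next))
        return -1;
    struct aba_node *top = atomic_load(&head);
    return top->id;
}
```

<p>In this replay the stale CAS goes through: node 1 ends up back on the
stack even though B still holds it, and node 0 is no longer reachable.</p>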

<p>The core problem is that, unlike integral values, pointers have
meaning beyond their intrinsic numeric value. The meaning of a
particular pointer changes when the pointer is reused, making it
suspect when used in CAS. The unfortunate effect is that, <strong>by itself,
atomic pointer manipulation is nearly useless</strong>. They’ll work with
append-only data structures, where pointers are never recycled, but
that’s it.</p>

<p>The <code class="language-plaintext highlighter-rouge">aba</code> field solves the problem because it’s incremented every time
the pointer is updated. Remember that this internal stack struct is
two pointers wide? That’s 16 bytes on a 64-bit system. The entire 16
bytes is compared by CAS and they all have to match for it to succeed.
Since B, or other threads, will increment <code class="language-plaintext highlighter-rouge">aba</code> at least twice (once
to remove the node, and once to put it back in place), A will never
mistake the recycled pointer for the old one. There’s a special
double-width CAS instruction specifically for this purpose,
<code class="language-plaintext highlighter-rouge">cmpxchg16b</code>. This is generally called DWCAS. It’s available on most
x86-64 processors. On Linux you can check <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code> for support.
It will be listed as <code class="language-plaintext highlighter-rouge">cx16</code>.</p>
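<p>To watch a change counter reject a recycled reference without needing
DWCAS, here’s a scaled-down sketch of the same idea. It is <em>not</em> how
lstack does it: instead of a pointer plus a pointer-sized counter, it
packs a 32-bit node index and a 32-bit counter into a single 64-bit
atomic word, which I’m assuming is lock-free on the target. All names
are hypothetical.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NIL UINT32_MAX                  /* index meaning "no node" */

struct idx_node {
    uint32_t next;                      /* index of next node, or NIL */
};

/* Low 32 bits: node index. High 32 bits: change counter. */
static uint64_t
pack(uint32_t aba, uint32_t index)
{
    return ((uint64_t)aba << 32) | index;
}

static uint32_t head_index(uint64_t h) { return (uint32_t)h; }
static uint32_t head_aba(uint64_t h)   { return (uint32_t)(h >> 32); }

static void
tagged_push(_Atomic uint64_t *head, struct idx_node *nodes, uint32_t i)
{
    assert(i != NIL);
    uint64_t next, orig = atomic_load(head);
    do {
        nodes[i].next = head_index(orig);
        next = pack(head_aba(orig) + 1, i);
    } while (!atomic_compare_exchange_weak(head, &orig, next));
}

static uint32_t
tagged_pop(_Atomic uint64_t *head, struct idx_node *nodes)
{
    uint64_t next, orig = atomic_load(head);
    do {
        if (head_index(orig) == NIL)
            return NIL;
        next = pack(head_aba(orig) + 1, nodes[head_index(orig)].next);
    } while (!atomic_compare_exchange_weak(head, &orig, next));
    return head_index(orig);
}

/* Replays the ABA interleaving on one thread. Returns true if thread
 * A's stale CAS was rejected and the head was left alone. */
static bool
replay_with_counter(void)
{
    struct idx_node nodes[3];
    _Atomic uint64_t head;
    atomic_init(&head, pack(0, NIL));
    tagged_push(&head, nodes, 2);
    tagged_push(&head, nodes, 1);
    tagged_push(&head, nodes, 0);                 /* stack: 0 -> 1 -> 2 */

    /* Thread A: read the head and compute the replacement. */
    uint64_t a_orig = atomic_load(&head);
    uint64_t a_next = pack(head_aba(a_orig) + 1,
                           nodes[head_index(a_orig)].next);

    /* Thread B: pop 0, pop 1, push 0 back. */
    uint32_t recycled = tagged_pop(&head, nodes);
    (void)tagged_pop(&head, nodes);
    tagged_push(&head, nodes, recycled);          /* stack: 0 -> 2 */

    /* The index matches (0) but the counter moved from 3 to 6. */
    bool swapped = atomic_compare_exchange_strong(&head, &a_orig, a_next);
    return !swapped && head_index(atomic_load(&head)) == 0;
}
```

<p>The trade-off is that 32-bit indices only work with a pre-allocated
node array, and a 32-bit counter wraps sooner than a 64-bit one.</p>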

<p>If it’s not available at compile time, this program won’t link. The
function that wraps <code class="language-plaintext highlighter-rouge">cmpxchg16b</code> won’t be there. You can tell GCC to
<em>assume</em> it’s there with the <code class="language-plaintext highlighter-rouge">-mcx16</code> flag. The same rule here applies
to C++11’s new std::atomic.</p>

<p>There’s still a tiny, tiny possibility of the ABA problem
cropping up. On 32-bit systems A may get preempted for over 4 billion
(2^32) stack operations, such that the ABA counter wraps around to the
same value. There’s nothing we can do about this, but if you witness
this in the wild you need to immediately stop what you’re doing and go
buy a lottery ticket. Also avoid any lightning storms on the way to
the store.</p>
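<p>For a sense of scale, here’s a back-of-the-envelope sketch of how long
a counter takes to come back around. The function name and the operation
rate are assumptions of mine, not measurements of this stack.</p>

```c
#include <assert.h>

/* Seconds needed to perform 2^bits operations at a given rate.
 * The rate is a made-up input, not a benchmark result. */
static double
seconds_to_wrap(unsigned bits, double ops_per_second)
{
    assert(ops_per_second > 0.0);
    double ops = 1.0;
    for (unsigned i = 0; i < bits; i++)
        ops *= 2.0;                     /* ops = 2^bits */
    return ops / ops_per_second;
}
```

<p>At an assumed 100 million operations per second,
<code class="language-plaintext highlighter-rouge">seconds_to_wrap(32, 1e8)</code> comes out to about 43 seconds, while a
64-bit counter would take thousands of years. Wrapping alone isn’t
enough, either: the counter and the pointer both have to match at the
exact moment A resumes.</p>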

<h5 id="hazard-pointers-and-garbage-collection">Hazard Pointers and Garbage Collection</h5>

<p>Another problem in <code class="language-plaintext highlighter-rouge">pop()</code> is dereferencing <code class="language-plaintext highlighter-rouge">orig.node</code> to access its
<code class="language-plaintext highlighter-rouge">next</code> field. By the time we get to it, the node pointed to by
<code class="language-plaintext highlighter-rouge">orig.node</code> may have already been removed from the stack and freed. If
the stack was using <code class="language-plaintext highlighter-rouge">malloc()</code> and <code class="language-plaintext highlighter-rouge">free()</code> for allocations, it may
even have had <code class="language-plaintext highlighter-rouge">free()</code> called on it. If so, the dereference would be
undefined behavior — a segmentation fault, or worse.</p>

<p>There are three ways to deal with this.</p>

<ol>
  <li>
    <p>Garbage collection. If memory is automatically managed, the node
will never be freed as long as we can access it, so this won’t be a
problem. However, if we’re interacting with a garbage collector
we’re not really lock-free.</p>
  </li>
  <li>
    <p>Hazard pointers. Each thread keeps track of what nodes it’s
currently accessing and other threads aren’t allowed to free nodes
on this list. This is messy and complicated.</p>
  </li>
  <li>
    <p>Never free nodes. This implementation recycles nodes, but they’re
never truly freed until <code class="language-plaintext highlighter-rouge">lstack_free()</code>. It’s always safe to
dereference a node pointer because there’s always a node behind it.
It may point to a node that’s on the free list or one that was even
recycled since we got the pointer, but the <code class="language-plaintext highlighter-rouge">aba</code> field deals with
any of those issues.</p>
  </li>
</ol>

<p>Reference counting on the node won’t work here because we can’t get to
the counter fast enough (atomically). It too would require
dereferencing in order to increment. The reference counter could
potentially be packed alongside the pointer and accessed by a DWCAS,
but we’re already using those bytes for <code class="language-plaintext highlighter-rouge">aba</code>.</p>

<h5 id="push">Push</h5>

<p>Push is a lot like pop.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="o">*</span><span class="n">head</span><span class="p">,</span> <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">next</span><span class="p">,</span> <span class="n">orig</span> <span class="o">=</span> <span class="n">atomic_load</span><span class="p">(</span><span class="n">head</span><span class="p">);</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">aba</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">aba</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">node</span> <span class="o">=</span> <span class="n">node</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">atomic_compare_exchange_weak</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">orig</span><span class="p">,</span> <span class="n">next</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s counter-intuitive, but adding a <a href="http://blog.memsql.com/common-pitfalls-in-writing-lock-free-algorithms/">few microseconds of
sleep</a> after CAS failures would probably <em>increase</em>
throughput. Under high contention, threads wouldn’t take turns
clobbering each other as fast as possible. It would be a bit like
exponential backoff.</p>

<h4 id="api-push-and-pop">API Push and Pop</h4>

<p>The API push and pop functions are built on these internal atomic
functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lstack_push</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="n">node</span><span class="o">-&gt;</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span><span class="p">,</span> <span class="n">node</span><span class="p">);</span>
    <span class="n">atomic_fetch_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Push removes a node from the free stack. If the free stack is empty it
reports an out-of-memory error. It assigns the value and pushes it
onto the value stack where it will be visible to other threads.
Finally, the stack size is incremented atomically. This means there’s
an instant where the stack size is listed as one shorter than it
actually is. However, since there’s no way to access both the stack
size and the stack itself at the same instant, this is fine. The stack
size is really only an estimate.</p>

<p>Popping is the same thing in reverse.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">lstack_pop</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">atomic_fetch_sub</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">value</span> <span class="o">=</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
    <span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">,</span> <span class="n">node</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Remove the top node, decrement the size estimate atomically, put the
node on the free list, and return the pointer. It’s really simple given
the primitive push and pop.</p>

<h3 id="sha1-demo">SHA1 Demo</h3>

<p>The lstack repository linked at the top of the article includes a demo
that searches for patterns in SHA-1 hashes (sort of like Bitcoin
mining). It fires off one worker thread for each core and the results
are all collected into the same lock-free stack. It’s not <em>really</em>
exercising the library thoroughly because there are no contended pops,
but I couldn’t think of a better example at the time.</p>

<p>The next thing to try would be implementing a C11, bounded, lock-free
queue. It would also be more generally useful than a stack,
particularly for common producer-consumer scenarios.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Stabilizing C's Quicksort</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/08/29/"/>
    <id>urn:uuid:078a1683-1484-3532-d231-5c5e805fc9a6</id>
    <updated>2014-08-29T18:39:01Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>The C standard library includes a quicksort function called <code class="language-plaintext highlighter-rouge">qsort()</code>.
It sorts homogeneous arrays of arbitrary type. The interface is exactly
what you’d expect given the constraints of the language.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">qsort</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span>
           <span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">compar</span><span class="p">)(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">));</span>
</code></pre></div></div>

<p>It takes a pointer to the first element of the array, the number of
members, the size of each member, and a comparator function. The
comparator has to operate on <code class="language-plaintext highlighter-rouge">void *</code> pointers because C doesn’t have
templates or generics or anything like that. That’s two interfaces
where type safety is discarded: the arguments passed to <code class="language-plaintext highlighter-rouge">qsort()</code> and
again when it calls the comparator function.</p>

<p>One of the significant flaws of this interface is the lack of context
for the comparator. C doesn’t have closures, which in other languages
would cover this situation. If the sort function depends on some
additional data, such as in <a href="http://en.wikipedia.org/wiki/Graham_scan">Graham scan</a> where points are
sorted relative to a selected point, the extra information needs to be
smuggled in through a global variable. This is not reentrant and
wouldn’t be safe in a multi-threaded environment. There’s a GNU
extension here, <code class="language-plaintext highlighter-rouge">qsort_r()</code>, that takes an additional context
argument, allowing for reentrant comparators.</p>
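<p>To make the global variable workaround concrete, here’s a minimal
sketch using only standard <code class="language-plaintext highlighter-rouge">qsort()</code>. It sorts integers by their distance
from a chosen origin, a one-dimensional stand-in for the Graham scan
case. The names <code class="language-plaintext highlighter-rouge">sort_origin</code>, <code class="language-plaintext highlighter-rouge">compare_by_distance()</code>, and
<code class="language-plaintext highlighter-rouge">sort_by_distance()</code> are made up for illustration.</p>

```c
#include <stdlib.h>

/* Context for the comparator. qsort() offers no parameter for it, so
 * it is smuggled in through a file-scope variable (hypothetical name). */
static int sort_origin;

/* Absolute distance from the shared origin, widened to avoid overflow. */
static long long distance_from_origin(int x)
{
    long long d = (long long)x - sort_origin;
    return d < 0 ? -d : d;
}

static int compare_by_distance(const void *a, const void *b)
{
    long long da = distance_from_origin(*(const int *)a);
    long long db = distance_from_origin(*(const int *)b);
    return (da > db) - (da < db);
}

/* Sort values by distance from origin. Because the context lives in a
 * global, this is neither reentrant nor thread-safe. */
void sort_by_distance(int *values, size_t n, int origin)
{
    sort_origin = origin;
    qsort(values, n, sizeof(values[0]), compare_by_distance);
}
```

<p>Two threads calling <code class="language-plaintext highlighter-rouge">sort_by_distance()</code> with different origins would
race on <code class="language-plaintext highlighter-rouge">sort_origin</code>, which is exactly the problem a context argument
avoids.</p>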

<p>Quicksort has some really nice properties. It’s in-place, so no
temporary memory needs to be allocated. If implemented properly it
only consumes O(log n) space, which is the stack growth during
recursion. Memory usage is localized, so it plays well with caching.</p>
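<p>As a rough illustration of what “implemented properly” can mean, the
sketch below sorts an <code class="language-plaintext highlighter-rouge">int</code> array and only recurses into the smaller
partition, looping over the larger one. Each recursive call handles at
most half of the current range, so the depth stays within about log2(n)
frames. This is my own toy version with hypothetical names, not how any
particular libc does it.</p>

```c
#include <stddef.h>

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Lomuto partition of a[0..n-1] around its last element (n >= 2).
 * Returns the final index of the pivot. */
static size_t partition(int *a, size_t n)
{
    int pivot = a[n - 1];
    size_t store = 0;
    for (size_t i = 0; i + 1 < n; i++)
        if (a[i] < pivot)
            swap_int(&a[i], &a[store++]);
    swap_int(&a[store], &a[n - 1]);
    return store;
}

void quicksort_int(int *a, size_t n)
{
    while (n > 1) {
        size_t p = partition(a, n);
        size_t left = p;          /* elements before the pivot */
        size_t right = n - p - 1; /* elements after the pivot */
        if (left < right) {
            quicksort_int(a, left);          /* recurse on smaller side */
            a += p + 1;                      /* loop on the larger side */
            n = right;
        } else {
            quicksort_int(a + p + 1, right); /* recurse on smaller side */
            n = left;                        /* loop on the larger side */
        }
    }
}
```

<p>The fixed last-element pivot still allows quadratic running time on
adversarial input. Only the stack depth is bounded here.</p>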

<p>That being said, <code class="language-plaintext highlighter-rouge">qsort()</code> is also a classic example of an API naming
mistake. <a href="http://calmerthanyouare.org/2013/05/31/qsort-shootout.html">Few implementations actually use straight
quicksort</a>. For example, glibc’s <code class="language-plaintext highlighter-rouge">qsort()</code> is merge sort (in
practice), and the other major libc implementations use a <a href="http://en.wikipedia.org/wiki/Timsort">hybrid
approach</a>. Programs using their language’s sort function
shouldn’t be concerned with how it’s implemented. All that matters is
the interface and whether or not it’s a stable sort. OpenBSD made the
exact same mistake when they introduced <a href="http://www.openbsd.org/cgi-bin/man.cgi?query=arc4random&amp;sektion=3"><code class="language-plaintext highlighter-rouge">arc4random()</code></a>,
which <a href="http://marc.info/?l=openbsd-cvs&amp;m=138065251627052&amp;w=2">no longer uses RC4</a>.</p>

<p>Since quicksort is an unstable sort — there are multiple possible
results when the array contains equivalent elements — this means
<code class="language-plaintext highlighter-rouge">qsort()</code> is not guaranteed to be stable, even if internally the C
library <em>is</em> using a stable sort like merge sort. The C standard
library has no stable sort function.</p>

<h3 id="comparator-composability">Comparator Composability</h3>

<p>The unfortunate side effect of unstable sorts is that they hurt
composability. For example, let’s say we have a <code class="language-plaintext highlighter-rouge">person</code> struct like
this,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">person</span> <span class="p">{</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">first</span><span class="p">,</span> <span class="o">*</span><span class="n">last</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">age</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Here are a couple of comparators to sort either by name or by age. As
a side note, <code class="language-plaintext highlighter-rouge">strcmp()</code> automatically works correctly with UTF-8 so
this program isn’t limited to old-fashioned ASCII names.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">compare_name</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="n">pa</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="p">)</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="n">pb</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="p">)</span> <span class="n">b</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">last</span> <span class="o">=</span> <span class="n">strcmp</span><span class="p">(</span><span class="n">pa</span><span class="o">-&gt;</span><span class="n">last</span><span class="p">,</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">last</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">last</span> <span class="o">!=</span> <span class="mi">0</span> <span class="o">?</span> <span class="n">last</span> <span class="o">:</span> <span class="n">strcmp</span><span class="p">(</span><span class="n">pa</span><span class="o">-&gt;</span><span class="n">first</span><span class="p">,</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">first</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">compare_age</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="n">pa</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="p">)</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="n">pb</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="p">)</span> <span class="n">b</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">pa</span><span class="o">-&gt;</span><span class="n">age</span> <span class="o">-</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">age</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And since we’ll need it later, here’s a <code class="language-plaintext highlighter-rouge">COUNT_OF</code> macro to get the
length of arrays at compile time.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define COUNT_OF(x) (sizeof(x) / sizeof(0[x]))
</span></code></pre></div></div>

<p>Say we want to sort by age, <em>then</em> name. When using a stable sort,
this is accomplished by sorting on each field separately in reverse
order of preference: a composition of individual comparators. Here’s
an attempt at using quicksort to sort an array of people by age then
name.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">person</span> <span class="n">people</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">{</span><span class="s">"Joe"</span><span class="p">,</span> <span class="s">"Shmoe"</span><span class="p">,</span> <span class="mi">24</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"John"</span><span class="p">,</span> <span class="s">"Doe"</span><span class="p">,</span> <span class="mi">30</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"Alan"</span><span class="p">,</span> <span class="s">"Smithee"</span><span class="p">,</span> <span class="mi">42</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"Jane"</span><span class="p">,</span> <span class="s">"Doe"</span><span class="p">,</span> <span class="mi">30</span><span class="p">}</span>
<span class="p">};</span>

<span class="n">qsort</span><span class="p">(</span><span class="n">people</span><span class="p">,</span> <span class="n">COUNT_OF</span><span class="p">(</span><span class="n">people</span><span class="p">),</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">person</span><span class="p">),</span> <span class="n">compare_name</span><span class="p">);</span>
<span class="n">qsort</span><span class="p">(</span><span class="n">people</span><span class="p">,</span> <span class="n">COUNT_OF</span><span class="p">(</span><span class="n">people</span><span class="p">),</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">person</span><span class="p">),</span> <span class="n">compare_age</span><span class="p">);</span>
</code></pre></div></div>

<p>But this doesn’t always work. J<strong>a</strong>ne should come before J<strong>o</strong>hn,
but the original sort was completely lost in the second sort.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Joe Shmoe, 24
John Doe, 30
Jane Doe, 30
Alan Smithee, 42
</code></pre></div></div>

<p>This could be fixed by defining a new comparator that operates on both
fields at once, <code class="language-plaintext highlighter-rouge">compare_age_name()</code>, and performing a single sort.
But what if later you want to sort by name then age? Now you need
<code class="language-plaintext highlighter-rouge">compare_name_age()</code>. If a third field was added, there would need to
be 6 (3!) different comparator functions to cover all the
possibilities. If you had 6 fields, you’d need 720 comparators!
Composability has been lost to a combinatorial nightmare.</p>
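<p>For reference, one of those single-purpose comparators might look like
the sketch below. It assumes the three-field <code class="language-plaintext highlighter-rouge">person</code> struct from
earlier, and the body of <code class="language-plaintext highlighter-rouge">compare_age_name()</code> is my guess at a reasonable
definition: age is the primary key, with last and first name as
tie-breakers.</p>

```c
#include <stdlib.h>
#include <string.h>

struct person {
    const char *first, *last;
    int age;
};

/* Age is the primary key; last name, then first name, break ties. */
int compare_age_name(const void *a, const void *b)
{
    const struct person *pa = a;
    const struct person *pb = b;
    if (pa->age != pb->age)
        return (pa->age > pb->age) - (pa->age < pb->age);
    int last = strcmp(pa->last, pb->last);
    return last != 0 ? last : strcmp(pa->first, pb->first);
}
```

<p>A single <code class="language-plaintext highlighter-rouge">qsort()</code> call with this comparator puts Jane ahead of John,
but the field ordering is baked into the function.</p>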

<h3 id="pointer-comparison">Pointer Comparison</h3>

<p>The GNU libc documentation <a href="http://www.gnu.org/software/libc/manual/html_node/Array-Sort-Function.html">claims that <code class="language-plaintext highlighter-rouge">qsort()</code> can be made
stable</a> by using pointer comparison as a fallback. That is, when
the relevant fields are equivalent, use their array position to
resolve the difference.</p>

<blockquote>
  <p>If you want the effect of a stable sort, you can get this result by
writing the comparison function so that, lacking other reason
distinguish between two elements, it compares them by their
addresses.</p>
</blockquote>

<p>This is not only false, it’s dangerous! Because elements may be sorted
in-place, even in glibc, their position will change during the sort.
The comparator will be using their <em>current</em> positions, not the
starting positions. What makes it dangerous is that the comparator
will return different orderings throughout the sort as elements are
moved around the array. This could result in an infinite loop, or
worse.</p>

<h3 id="making-it-stable">Making it Stable</h3>

<p>The most direct way to work around the unstable sort is to eliminate
any equivalencies. Equivalent elements can be distinguished by adding
an intrusive <code class="language-plaintext highlighter-rouge">order</code> field that records each element’s position
before the next sort. The comparators fall back on this field to
preserve the existing ordering.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">person</span> <span class="p">{</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">first</span><span class="p">,</span> <span class="o">*</span><span class="n">last</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">age</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">order</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>And the new comparators.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">compare_name_stable</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="n">pa</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="p">)</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="n">pb</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="p">)</span> <span class="n">b</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">last</span> <span class="o">=</span> <span class="n">strcmp</span><span class="p">(</span><span class="n">pa</span><span class="o">-&gt;</span><span class="n">last</span><span class="p">,</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">last</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">last</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">last</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">first</span> <span class="o">=</span> <span class="n">strcmp</span><span class="p">(</span><span class="n">pa</span><span class="o">-&gt;</span><span class="n">first</span><span class="p">,</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">first</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">first</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">first</span><span class="p">;</span>
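    <span class="cm">/* Compare instead of subtracting: a size_t difference may not fit in an int. */</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">pa</span><span class="o">-&gt;</span><span class="n">order</span> <span class="o">&gt;</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">order</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">pa</span><span class="o">-&gt;</span><span class="n">order</span> <span class="o">&lt;</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">order</span><span class="p">);</span>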
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">compare_age_stable</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="n">pa</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="p">)</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="n">pb</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">person</span> <span class="o">*</span><span class="p">)</span> <span class="n">b</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">age</span> <span class="o">=</span> <span class="n">pa</span><span class="o">-&gt;</span><span class="n">age</span> <span class="o">-</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">age</span><span class="p">;</span>
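    <span class="k">if</span> <span class="p">(</span><span class="n">age</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">age</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">pa</span><span class="o">-&gt;</span><span class="n">order</span> <span class="o">&gt;</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">order</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="n">pa</span><span class="o">-&gt;</span><span class="n">order</span> <span class="o">&lt;</span> <span class="n">pb</span><span class="o">-&gt;</span><span class="n">order</span><span class="p">);</span>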
<span class="p">}</span>
</code></pre></div></div>

<p>The first sort doesn’t need to be stable, but there’s not much reason
to keep around two definitions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">qsort</span><span class="p">(</span><span class="n">people</span><span class="p">,</span> <span class="n">COUNT_OF</span><span class="p">(</span><span class="n">people</span><span class="p">),</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">people</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">compare_name_stable</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">COUNT_OF</span><span class="p">(</span><span class="n">people</span><span class="p">);</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
    <span class="n">people</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">order</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="n">qsort</span><span class="p">(</span><span class="n">people</span><span class="p">,</span> <span class="n">COUNT_OF</span><span class="p">(</span><span class="n">people</span><span class="p">),</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">people</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">compare_age_stable</span><span class="p">);</span>
</code></pre></div></div>

<p>And the result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Joe Shmoe, 24
Jane Doe, 30
John Doe, 30
Alan Smithee, 42
</code></pre></div></div>

<p>Without defining any new comparators I can sort by name then age just
by swapping the calls to <code class="language-plaintext highlighter-rouge">qsort()</code>. At the cost of an extra
bookkeeping field, the number of comparator functions needed as fields
are added is O(n) and not O(n!) despite using an unstable sort.</p>
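<p>The bookkeeping between sorts is easy to forget, so it could be folded
into a small wrapper. The sketch below is one way to do that. The
helper <code class="language-plaintext highlighter-rouge">people_sort_stable()</code> and the shared tie-breaker
<code class="language-plaintext highlighter-rouge">compare_order()</code> are hypothetical names, and the tie-breaker compares
rather than subtracts so that large <code class="language-plaintext highlighter-rouge">size_t</code> values can’t be truncated.</p>

```c
#include <stdlib.h>
#include <string.h>

struct person {
    const char *first, *last;
    int age;
    size_t order;
};

/* Shared tie-breaker: the element that was earlier stays earlier. */
static int compare_order(const struct person *a, const struct person *b)
{
    return (a->order > b->order) - (a->order < b->order);
}

int compare_name_stable(const void *a, const void *b)
{
    const struct person *pa = a;
    const struct person *pb = b;
    int last = strcmp(pa->last, pb->last);
    if (last != 0)
        return last;
    int first = strcmp(pa->first, pb->first);
    return first != 0 ? first : compare_order(pa, pb);
}

int compare_age_stable(const void *a, const void *b)
{
    const struct person *pa = a;
    const struct person *pb = b;
    if (pa->age != pb->age)
        return (pa->age > pb->age) - (pa->age < pb->age);
    return compare_order(pa, pb);
}

/* Hypothetical helper: stamp current positions, then sort. */
void people_sort_stable(struct person *people, size_t n,
                        int (*compar)(const void *, const void *))
{
    for (size_t i = 0; i < n; i++)
        people[i].order = i;
    qsort(people, n, sizeof(people[0]), compar);
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">people_sort_stable()</code> with <code class="language-plaintext highlighter-rouge">compare_name_stable</code> and then with
<code class="language-plaintext highlighter-rouge">compare_age_stable</code> should reproduce the result above, and swapping the
two calls sorts by name then age.</p>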

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Fast Monte Carlo Method with JavaScript</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2013/02/25/"/>
    <id>urn:uuid:0208230e-3f57-334e-5d57-7a18f3794288</id>
    <updated>2013-02-25T00:00:00Z</updated>
    <category term="emacs"/><category term="lisp"/><category term="c"/><category term="javascript"/>
    <content type="html">
      <![CDATA[<blockquote>
  <p>How many times should a random number from <code class="language-plaintext highlighter-rouge">[0, 1]</code> be drawn to have
it sum over 1?</p>
</blockquote>

<p>If you want to figure it out for yourself, stop reading now and come
back when you’re done.</p>

<p><a href="http://bayesianthink.blogspot.com/2013/02/the-expected-draws-to-sum-over-one.html">The answer</a> is <em>e</em>. When I came across this question I took
the lazy programmer route and, rather than work out the math, I
estimated the answer using the Monte Carlo method. I used the language
I always use for these scratchpad computations: Emacs Lisp. All I need
to do is switch to the <code class="language-plaintext highlighter-rouge">*scratch*</code> buffer and start hacking. No
external program needed.</p>
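<p>For the curious, the math is short. In my own notation, let <em>N</em> be
the number of draws needed and let the draws be independent uniform
values on <code class="language-plaintext highlighter-rouge">[0, 1]</code>. The first <em>n</em> draws sum to at most 1 with probability
1/<em>n</em>!, the volume of the corresponding simplex, and summing those tail
probabilities gives the expectation.</p>

```latex
\[
P(N > n) = P(U_1 + \dots + U_n \le 1) = \frac{1}{n!},
\qquad
E[N] = \sum_{n=0}^{\infty} P(N > n) = \sum_{n=0}^{\infty} \frac{1}{n!} = e.
\]
```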

<p>The downside is that Elisp is incredibly slow. Fortunately, Elisp is
so similar to Common Lisp that porting to it is almost trivial. My
preferred Common Lisp implementation, SBCL, is very, very fast so it’s
a huge speed upgrade with little cost, should I need it. As far as I
know, SBCL is the fastest Common Lisp implementation.</p>

<p>Even though Elisp was fast enough to determine that the answer is
probably <em>e</em>, I wanted to play around with it. This little test
program doubles as a way to estimate the value of <em>e</em>,
<a href="http://math.fullerton.edu/mathews/n2003/montecarlopimod.html">similar to estimating <em>pi</em></a>. The more trial runs I give it the
more accurate my answer will get — to a point.</p>

<p>Here’s the Common Lisp version. (I love the loop macro, obviously.)</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">trial</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">loop</span> <span class="nv">for</span> <span class="nb">count</span> <span class="nv">upfrom</span> <span class="mi">1</span>
     <span class="nv">sum</span> <span class="p">(</span><span class="nb">random</span> <span class="mf">1.0</span><span class="p">)</span> <span class="nv">into</span> <span class="nv">total</span>
     <span class="nv">until</span> <span class="p">(</span><span class="nb">&gt;</span> <span class="nv">total</span> <span class="mi">1</span><span class="p">)</span>
     <span class="nv">finally</span> <span class="p">(</span><span class="nb">return</span> <span class="nb">count</span><span class="p">)))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">monte-carlo</span> <span class="p">(</span><span class="nv">n</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">loop</span> <span class="nv">repeat</span> <span class="nv">n</span>
     <span class="nv">sum</span> <span class="p">(</span><span class="nv">trial</span><span class="p">)</span> <span class="nv">into</span> <span class="nv">total</span>
     <span class="nv">finally</span> <span class="p">(</span><span class="nb">return</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">total</span> <span class="mf">1.0</span> <span class="nv">n</span><span class="p">))))</span>
</code></pre></div></div>

<p>Using SBCL 1.0.57.0.debian on an Intel Core i7-2600 CPU, once
everything’s warmed up this takes about 9.4 seconds with 100 million
trials.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(time (monte-carlo 100000000))
Evaluation took:
  9.423 seconds of real time
  9.388587 seconds of total run time (9.380586 user, 0.008001 system)
  99.64% CPU
  31,965,834,356 processor cycles
  99,008 bytes consed
2.7185063
</code></pre></div></div>

<p>Since this makes for an interesting benchmark I gave it a whirl in
JavaScript,</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">trial</span><span class="p">()</span> <span class="p">{</span>
    <span class="kd">var</span> <span class="nx">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="nx">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="nx">sum</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">sum</span> <span class="o">+=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">random</span><span class="p">();</span>
        <span class="nx">count</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nx">count</span><span class="p">;</span>
<span class="p">}</span>

<span class="kd">function</span> <span class="nx">monteCarlo</span><span class="p">(</span><span class="nx">n</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">var</span> <span class="nx">total</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o">&lt;</span> <span class="nx">n</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">total</span> <span class="o">+=</span> <span class="nx">trial</span><span class="p">();</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nx">total</span> <span class="o">/</span> <span class="nx">n</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I ran this on Chromium 24.0.1312.68 Debian 7.0 (180326) which uses V8,
currently the fastest JavaScript engine. With 100 million trials,
<strong>this only took about 2.7 seconds</strong>!</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">monteCarlo</span><span class="p">(</span><span class="mi">100000000</span><span class="p">);</span> <span class="c1">// ~2.7 seconds, according to Skewer</span>
<span class="c1">// =&gt; 2.71850356</span>
</code></pre></div></div>

<p>Whoa! It beat SBCL! I was shocked. Let’s try using C as a
baseline. Surely C will be the fastest.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">trial</span><span class="p">()</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">double</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">sum</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">sum</span> <span class="o">+=</span> <span class="n">rand</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span> <span class="n">RAND_MAX</span><span class="p">;</span>
        <span class="n">count</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">double</span> <span class="nf">monteCarlo</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">,</span> <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">total</span> <span class="o">+=</span> <span class="n">trial</span><span class="p">();</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">total</span> <span class="o">/</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span> <span class="n">n</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">monteCarlo</span><span class="p">(</span><span class="mi">100000000</span><span class="p">));</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I used the highest optimization setting on the compiler.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -ansi -W -Wall -Wextra -O3 temp.c
$ time ./a.out
2.718359

real	0m3.782s
user	0m3.760s
sys	0m0.000s
</code></pre></div></div>

<p>Incredible! <strong>JavaScript was faster than C!</strong> That was completely
unexpected.</p>

<h3 id="the-circumstances">The Circumstances</h3>

<p>Both the Common Lisp and C code could probably be carefully tweaked to
improve performance. In Common Lisp’s case I could attach type
information and turn down safety. For C I could use more compiler
flags to squeeze out a bit more performance. Then <em>maybe</em> they could
beat JavaScript.</p>
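
<p>As a rough sketch of the kind of tweak meant here, <code class="language-plaintext highlighter-rouge">rand()</code> could be
swapped for a small inline generator so that each sample avoids a library
call. This is an assumption about where the time goes, not something
measured in this post. The xorshift64* generator and the <code class="language-plaintext highlighter-rouge">uniform()</code>
helper are illustrative choices, and the code needs C99 rather than
<code class="language-plaintext highlighter-rouge">-ansi</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Hypothetical replacement for rand(): xorshift64* with a fixed seed. */
static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;

/* Return a uniform double in [0, 1) built from the top 53 bits. */
static double uniform(void)
{
    rng_state ^= rng_state &gt;&gt; 12;
    rng_state ^= rng_state &lt;&lt; 25;
    rng_state ^= rng_state &gt;&gt; 27;
    return ((rng_state * 0x2545F4914F6CDD1DULL) &gt;&gt; 11) / 9007199254740992.0;
}

/* Count how many uniform draws it takes for the sum to exceed 1. */
static int trial(void)
{
    int count = 0;
    double sum = 0;
    while (sum &lt;= 1.0) {
        sum += uniform();
        count++;
    }
    return count;
}

/* Average the counts over n trials; the mean approaches e. */
double monteCarlo(int n)
{
    int i;
    long total = 0;
    for (i = 0; i &lt; n; i++) {
        total += trial();
    }
    return total / (double) n;
}
</code></pre></div></div>

<p>Whether this actually closes the gap would have to be timed on the same
machine as the other runs.</p>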

<p>In contrast, as far as I can tell the JavaScript code is already as
optimized as it can get. There just aren’t many knobs to tweak. Note
that minifying the code will make no difference, especially since I’m
not measuring the parsing time. Except for the functions themselves,
the variables are all local, so they are never “looked up” at
run-time. Their name length doesn’t matter. Remember, in JavaScript
<em>global</em> variables are expensive, because they’re (generally) hash
table lookups on the global object at run-time. For any decent
compiler, local variables are basically precomputed memory offsets —
very fast.</p>

<p>The function names themselves are global variables, but the V8
compiler appears to eliminate this cost (inlining?). Wrapping the
entire thing in another function, turning the two original functions
into local variables, makes no difference in performance.</p>

<p>While Common Lisp and C <em>may</em> be able to beat JavaScript if time is
invested in optimizing them — something to be done rarely — in a
casual implementation of this algorithm, JavaScript beats them both. I
find this really exciting.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Revisiting an N-body Simulator</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/09/19/"/>
    <id>urn:uuid:4e450cd9-5541-3ff4-0c15-351a77926825</id>
    <updated>2012-09-19T00:00:00Z</updated>
    <category term="c"/><category term="video"/>
    <content type="html">
      <![CDATA[<p>Ten years ago I was a high school senior taking my second year of
physics. Having recently reviewed vectors and gravity, as well as
being an avid Visual Basic programmer at the time, I decided to create
my own n-body simulation. I recently came across this old project and
fortunately (since I can no longer compile it) I had left a compiled
version with the source code. Here it is in Wine,</p>

<p><img src="/img/screenshot/galsim.png" alt="" /></p>

<p>It’s really not worth downloading, but I’m putting a link here
for my own archival purposes.</p>

<ul>
  <li><a href="https://nullprogram.s3.amazonaws.com/galaxy/galsim.zip">galsim.zip</a> (712 kB)</li>
</ul>

<p>I didn’t quite understand what I was doing so I screwed up the
math. All the vector computations were done independently. Integration
was done by Euler method — a sin I continue to commit regularly to
this day but now I’m at least aware of the limitations. Despite this,
it was still accurate enough to look interesting.</p>

<p>Probably the most advanced thing to come out of it, and something I
<em>did</em> do correctly, was the display. I worked out my own graphics
engine to project three-dimensional star coordinates onto the
two-dimensional drawing surface, re-inventing perspective projection.</p>
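
<p>The original Visual Basic code isn’t shown here, so the following is only a
minimal C sketch of what perspective projection amounts to: divide the
horizontal and vertical coordinates by the depth. The <code class="language-plaintext highlighter-rouge">vec3</code>/<code class="language-plaintext highlighter-rouge">vec2</code> types, the
<code class="language-plaintext highlighter-rouge">project()</code> name, and the <code class="language-plaintext highlighter-rouge">focal</code> parameter are all hypothetical.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical types: a point in camera space and a point on screen. */
struct vec3 { double x, y, z; };
struct vec2 { double x, y; };

/* Project p onto a screen plane sitting focal units in front of the
 * eye. Returns 0 if the point is at or behind the eye (z &lt;= 0) and so
 * cannot be drawn, 1 otherwise. */
static int project(struct vec3 p, double focal, struct vec2 *out)
{
    if (p.z &lt;= 0.0) {
        return 0;
    }
    out-&gt;x = focal * p.x / p.z;
    out-&gt;y = focal * p.y / p.z;
    return 1;
}
</code></pre></div></div>

<p>Stars twice as far away land half as far from the center of the screen,
which is all it takes to get a sense of depth.</p>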

<p>As I said, I recently came across it again while digging around my
digital archives. Now that I’m a professional developer I wondered how
much faster I could do the same thing with just a few hours of
coding. I did it in C and my implementation was about an order of
magnitude faster. Not as much as I hoped, but it’s something!</p>

<ul>
  <li><a href="https://gist.github.com/3204862">https://gist.github.com/3204862</a></li>
</ul>

<p>It’s still Euler method integration, the bodies are still point
masses, and there are no collisions so there’s numerical instability
when they get close. However, I did get the vector math right! My goal
was to make something that looked interesting rather than an accurate
simulation, so all of this is alright.</p>
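
<p>For reference, one explicit Euler step for point masses can be sketched
roughly like this. It is not the code from the gist: the <code class="language-plaintext highlighter-rouge">struct body</code>
layout, the <code class="language-plaintext highlighter-rouge">euler_step()</code> name, and the choice of units where G is 1 are
assumptions made for illustration.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;math.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical layout; the gist's actual structures may differ. */
struct body {
    double pos[3], vel[3], acc[3];
    double mass;
};

/* Advance every body by one explicit Euler step of length dt,
 * in units where the gravitational constant G is 1. */
void euler_step(struct body *b, size_t n, double dt)
{
    size_t i, j;
    int k;

    /* Pass 1: accelerations computed from the current positions only. */
    for (i = 0; i &lt; n; i++) {
        b[i].acc[0] = b[i].acc[1] = b[i].acc[2] = 0.0;
        for (j = 0; j &lt; n; j++) {
            double d[3], r2 = 0.0, s;
            if (i == j) continue;
            for (k = 0; k &lt; 3; k++) {
                d[k] = b[j].pos[k] - b[i].pos[k];
                r2 += d[k] * d[k];
            }
            /* Point masses: this blows up as r2 approaches 0. */
            s = b[j].mass / (r2 * sqrt(r2));
            for (k = 0; k &lt; 3; k++) {
                b[i].acc[k] += s * d[k];
            }
        }
    }

    /* Pass 2: x += v*dt using the old velocity, then v += a*dt. */
    for (i = 0; i &lt; n; i++) {
        for (k = 0; k &lt; 3; k++) {
            b[i].pos[k] += b[i].vel[k] * dt;
            b[i].vel[k] += b[i].acc[k] * dt;
        }
    }
}
</code></pre></div></div>

<p>The division by <code class="language-plaintext highlighter-rouge">r2 * sqrt(r2)</code> is where close encounters go wrong: with
no collisions or softening, two nearby bodies get an enormous kick in a
single step.</p>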

<p>I only wrote the simulation, not a display. To display the output I
just had GNU Octave plot it for me, which I turned into videos. This
first video is a static view of the origin of the coordinate
system. If you watch (or skip) all the way to the end you’ll see that
the galaxy drifts out of view. This is due to a bias in the random
number generator — the galaxy’s mass was lopsided.</p>

<video src="https://nullprogram.s3.amazonaws.com/galaxy/attempt-1.webm" controls="controls" width="480" height="360">
  Video requires WebM support with HTML5.
</video>

<p>After seeing this drift I added dynamic pan and zoom, so that the
camera follows the action. It’s a bit excessive at the beginning (the
camera is <em>too</em> dynamic) and the end (the camera is too far out).</p>

<video src="https://nullprogram.s3.amazonaws.com/galaxy/attempt-2.webm" controls="controls" width="480" height="360">
  Video requires WebM support with HTML5.
</video>

<p>A bit more tweaking of the galaxy start state (normal distribution,
adding initial velocities) and the camera and I got this interesting
result. The galaxy initially bunches into two globs, which then merge.</p>

<video src="https://nullprogram.s3.amazonaws.com/galaxy/v10-z9.webm" controls="controls" width="480" height="360">
  Video requires WebM support with HTML5.
</video>

<p>I wouldn’t have bothered with a post about this but I think these
videos turned out to be interesting.</p>
]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Few Tricky C Questions</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/07/30/"/>
    <id>urn:uuid:29f59a9d-86b9-387d-60bf-16417ea9888c</id>
    <updated>2012-07-30T00:00:00Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>At work I recently came across an abandoned copy of the first edition
of <em>The C Programming Language</em> by Brian Kernighan and Dennis Ritchie —
often lovingly abbreviated as <em>K&amp;R</em>. It’s a
<a href="http://en.wikipedia.org/wiki/The_C_Programming_Language">significant piece of computer science history</a>
and I highly recommend it to anyone who writes software. As far as
computing manuals go, it’s a thin book (228 pages) so I got through
the whole thing in about a week.</p>

<p>I’ve been programming in C for seven years now but it seems there’s
always something new for me to learn about it. The book cleared up
some incomplete concepts I had about C, particularly the relationship
between pointers and arrays as well as operator precedence — the
reason why function pointers look so weird. By the end I re-gained an
appreciation for the simplicity and power of C. All of the examples in
the book are written without heap allocation (no <code class="language-plaintext highlighter-rouge">malloc()</code>), just
static memory, and it manages to get by with rather few limitations.</p>

<p>As I was reading I realized a handful of “tricky” questions that I
wouldn’t have been able to answer with confidence before reading the
book. If you’re a C developer, pause and reflect just after each chunk
of example code and try to answer the question as correctly as you
can. Pretend you’re a compiler and think about what you need to do in
each situation.</p>

<h3 id="register-variables">Register variables</h3>

<p>What is the output of this program?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">register</span> <span class="kt">int</span> <span class="n">foo</span><span class="p">;</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">foo</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">register</code> keyword hints to the compiler that the automatic
variable should be stored in a register rather than memory, making
access to the variable faster. This is only a hint so the compiler is
free to ignore it.</p>

<p>In the example we take a pointer to the variable. However, we declared
this variable to be stored in a register. Addresses only point to
locations in memory so registers can’t be addressed by a
pointer. While the compiler can ignore the optimization hint and
provide an address, this is ultimately an inconsistent request. The
compiler will produce an error and the code will not compile.</p>
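
<p>By contrast, a <code class="language-plaintext highlighter-rouge">register</code> variable whose address is never taken is
fine. A minimal sketch, using a hypothetical <code class="language-plaintext highlighter-rouge">sum_to()</code> function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Sum the integers 1..n with a register loop counter. This compiles
 * because the address of i is never taken. */
int sum_to(int n)
{
    register int i;
    int total = 0;
    for (i = 1; i &lt;= n; i++) {
        total += i;
    }
    return total;
}
</code></pre></div></div>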

<h3 id="pointers-to-struct-fields">Pointers to struct fields</h3>

<p>Is this program valid?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">;</span>
<span class="p">}</span> <span class="n">baz</span><span class="p">;</span>

<span class="kt">int</span> <span class="o">*</span><span class="nf">example</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">baz</span><span class="p">.</span><span class="n">foo</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><del>Here we’re creating a struct called <code class="language-plaintext highlighter-rouge">baz</code> and take a pointer to one of
its fields. According to K&amp;R C, this is invalid.</del> (<strong>Update</strong>: I
misunderstood. This is allowed.) Overall, structs are really limited in
K&amp;R C: they can’t be function arguments, nor can they be returned from
functions, <del>nor can pointers be taken to their fields</del>. Only
<em>pointers</em> to structs are first-class. They acknowledged that this was
limiting and said they planned on fixing it in the future.</p>

<p>Fortunately, this <em>was</em> fixed with ANSI C and structs are first-class
objects. This means the above program is <strong>valid</strong> in ANSI C.</p>

<p>How about this one?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">foo</span> <span class="o">:</span> <span class="mi">4</span><span class="p">;</span>
<span class="p">}</span> <span class="n">baz</span><span class="p">;</span>

<span class="kt">int</span> <span class="o">*</span><span class="nf">example</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="n">baz</span><span class="p">.</span><span class="n">foo</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">foo</code> field is a 4-bit wide bit-field — smaller than a single
byte. Pointers can only address whole bytes, so this is
<strong>invalid</strong>. Even if <code class="language-plaintext highlighter-rouge">foo</code> was 8 or 32 bits wide (full/aligned bytes
on modern architectures) this would still be invalid.</p>
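
<p>If a pointer really is needed, one workaround is to copy the bit-field
into an ordinary <code class="language-plaintext highlighter-rouge">int</code> and point at that instead. This is only a sketch:
the <code class="language-plaintext highlighter-rouge">tmp</code> variable is hypothetical, the field is declared <code class="language-plaintext highlighter-rouge">unsigned</code> here
to keep its range unambiguous, and writes through the pointer change the
copy, not <code class="language-plaintext highlighter-rouge">baz.foo</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct {
    unsigned int foo : 4;
} baz;

/* Copy the bit-field into an addressable int and return its address.
 * The pointer refers to the copy, not to the bit-field itself. */
int *example(void)
{
    static int tmp;
    tmp = baz.foo;
    return &amp;tmp;
}
</code></pre></div></div>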

<h3 id="pointer-arithmetic">Pointer arithmetic</h3>

<p>We want to average two pointers to get a pointer in-between them. Is
this reasonable code?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="nf">foo</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">start</span> <span class="o">=</span> <span class="s">"hello"</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">end</span> <span class="o">=</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">5</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">start</span> <span class="o">+</span> <span class="n">end</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A thoughtful programmer should notice that adding together pointers is
likely to be disastrous. Pointers tend to be very large, addressing
high areas of memory. Adding two pointers together is very likely to
lead to an overflow. When I posed this question to
<a href="http://www.50ply.com/">Brian</a>, he realized this and came up with this
solution to avoid the overflow.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">return</span> <span class="n">start</span> <span class="o">/</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">end</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span>
</code></pre></div></div>

<p>However, this is still <strong>invalid</strong>. As a complete precaution against
overflowing pointer arithmetic, adding two pointers is forbidden, as is
dividing a pointer, so neither of these will compile. Pointer subtraction
is perfectly valid, so it can be done like so.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">return</span> <span class="p">(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">start</span><span class="p">;</span>
</code></pre></div></div>

<p>Subtracting two pointers produces an integer. Adding <em>integers</em> to
pointers is not only valid but also essential, so this is only a
restriction about adding <em>pointers</em> together.</p>
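
<p>Written out as a complete function, the valid version might look like
this. The <code class="language-plaintext highlighter-rouge">midpoint()</code> name is made up for the example, and both
pointers are assumed to point into the same array.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/* Return a pointer halfway between start and end, which must both
 * point into (or one past the end of) the same array. */
const char *midpoint(const char *start, const char *end)
{
    ptrdiff_t len = end - start;   /* pointer - pointer: an integer */
    return start + len / 2;        /* pointer + integer: a pointer */
}
</code></pre></div></div>

<p>The difference of two pointers has the signed integer type
<code class="language-plaintext highlighter-rouge">ptrdiff_t</code>, which is why it can be halved and added back.</p>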

<h3 id="arrays">Arrays</h3>

<p>Is this valid?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">hello</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"hello"</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">foo</span> <span class="o">=</span> <span class="n">hello</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">hello</code> is an array of <code class="language-plaintext highlighter-rouge">char</code>s and <code class="language-plaintext highlighter-rouge">foo</code> is a pointer to a <code class="language-plaintext highlighter-rouge">char</code>. In
most expressions an array is converted to a pointer to its first element, so
this is <strong>valid</strong>. Now how about this one?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">hello</span><span class="p">[</span><span class="mi">6</span><span class="p">];</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">foo</span> <span class="o">=</span> <span class="s">"hello"</span><span class="p">;</span>
    <span class="n">hello</span> <span class="o">=</span> <span class="n">foo</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here we’ve inverted the relationship and are trying to assign the
array as a pointer. This is <strong>invalid</strong>. Arrays are like pointer
constants in that they can’t be used as <em>lvalues</em> — they can’t be
reassigned to point to somewhere else. The closest you can get is to
<em>copy</em> the contents of <code class="language-plaintext highlighter-rouge">foo</code> into <code class="language-plaintext highlighter-rouge">hello</code>.</p>
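
<p>A minimal sketch of that copy, using <code class="language-plaintext highlighter-rouge">strcpy()</code> from the standard
library. The <code class="language-plaintext highlighter-rouge">copy_example()</code> wrapper and its return value exist only so
the result can be checked.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;string.h&gt;

/* Copy the string into the array instead of assigning to the array,
 * then return the length of what was copied. */
size_t copy_example(void)
{
    char hello[6];
    const char *foo = "hello";
    strcpy(hello, foo);   /* six bytes: five letters plus the null */
    return strlen(hello);
}
</code></pre></div></div>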

<p>I think that about sums up my questions. I (foolishly) didn’t write them
down as I came up with them and this is everything I can remember.</p>
]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Perlin Noise With Octave, Java, and OpenCL</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/06/03/"/>
    <id>urn:uuid:830cc950-634a-3661-135a-b932c8c5399e</id>
    <updated>2012-06-03T00:00:00Z</updated>
    <category term="java"/><category term="c"/><category term="octave"/><category term="video"/>
    <content type="html">
      <![CDATA[<p>I recently discovered that I’m an idiot and that my
<a href="/blog/2007/11/20/">old Perlin noise post</a> was not actually describing
Perlin noise at all, but fractional Brownian motion. Perlin noise is
slightly more complicated but much more powerful. To learn the correct
algorithm, I wrote three different implementations
(<a href="https://github.com/skeeto/perlin-noise">perlin-noise</a>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone git://github.com/skeeto/perlin-noise.git
</code></pre></div></div>

<p>In short, Perlin noise is based on a grid of randomly-generated
gradient vectors which describe how the arbitrarily-dimensional
“surface” is sloped at that point. The noise at the grid points is
always 0, though you’d never know it. When sampling the noise at some
point between grid points, a weighted interpolation of the surrounding
gradient vectors is calculated. Vectors are reduced to a single noise
value by dot product.</p>
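
<p>To make that description a little more concrete, here is a rough 2D
sketch in C. It is not code from the repository: the names are
hypothetical, the gradients come from a generic integer hash picked for
illustration, and the weight is the common 3t² − 2t³ smoothing curve.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Hypothetical integer hash: mixes a grid point and an axis index. */
static uint32_t hash(int32_t x, int32_t y, uint32_t axis)
{
    uint32_t h = (uint32_t)x * 0x8da6b343u
               ^ (uint32_t)y * 0xd8163841u
               ^ axis * 0xcb1ab31fu;
    h ^= h &gt;&gt; 16; h *= 0x7feb352du;
    h ^= h &gt;&gt; 15; h *= 0x846ca68bu;
    h ^= h &gt;&gt; 16;
    return h;
}

/* One component of the gradient at a grid point, in [-1, 1). */
static double gradient(int32_t x, int32_t y, uint32_t axis)
{
    return hash(x, y, axis) / 2147483648.0 - 1.0;
}

/* Floor for doubles that fit in an int32_t, without math.h. */
static int32_t ifloor(double x)
{
    int32_t i = (int32_t)x;
    return i - (x &lt; i);
}

static double absd(double v)   { return v &lt; 0 ? -v : v; }
static double smooth(double t) { return 3*t*t - 2*t*t*t; }

/* Sum, over the four surrounding grid points, of the dot product of
 * that corner's gradient with the offset to the sample point,
 * weighted so that each corner fades out toward its neighbors. */
double perlin2d(double px, double py)
{
    int32_t x0 = ifloor(px), y0 = ifloor(py);
    double v = 0.0;
    int dx, dy;
    for (dy = 0; dy &lt;= 1; dy++) {
        for (dx = 0; dx &lt;= 1; dx++) {
            int32_t qx = x0 + dx, qy = y0 + dy;
            double ox = px - qx, oy = py - qy;
            double m = gradient(qx, qy, 0) * ox + gradient(qx, qy, 1) * oy;
            v += m * smooth(1.0 - absd(ox)) * smooth(1.0 - absd(oy));
        }
    }
    return v;
}
</code></pre></div></div>

<p>At a grid point the offset to that corner is zero and the weights of the
other three corners are zero, which is why the noise there is always 0.</p>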

<p>Rather than waste time trying to explain it myself, I’ll link to an
existing, great tutorial: <a href="https://web.archive.org/web/20150304163452/http://webstaff.itn.liu.se/~stegu/TNM022-2005/perlinnoiselinks/perlin-noise-math-faq.html">The Perlin noise math FAQ</a>. There’s
also the original presentation by Ken Perlin, <a href="http://www.noisemachine.com/talk1/">Making Noise</a>,
which is more concise but harder to grok.</p>

<p>When making my own implementation, I started with Octave. It’s my
“go to language” for creating a prototype when I’m doing something
with vectors or matrices since it has the most concise syntax for
these things. I wrote a two-dimensional generator and it turned out to
be a lot simpler than I thought it would be!</p>

<ul>
  <li><a href="https://github.com/skeeto/perlin-noise/blob/master/octave/perlin2d.m">perlin2d.m</a></li>
</ul>

<p>Because it’s 2D, there are four surrounding grid points to consider
and these are all hard-coded. This leads to an interesting property:
there are no loops. The code is entirely vectorized, which makes it
quite fast. It actually keeps up with my generalized Java solution
(next) when given a grid of points, such as from <code class="language-plaintext highlighter-rouge">meshgrid()</code>.</p>

<p>The grid gradient vectors are generated on the fly by a hash
function. The integer x and y positions of the point are hashed using
a bastardized version of Robert Jenkins’ 96 bit mix function (the one
I used in my <a href="/blog/2011/06/13/">infinite parallax starfield</a>) to
produce a vector. This turned out to be the trickiest part to write,
because any weaknesses in the hash function become very apparent in
the resulting noise.</p>

<p>Using Octave, this took two seconds to generate on my laptop. You
can’t really tell by looking at it, but, as with all Perlin noise,
there is actually a grid pattern.</p>

<p><img src="/img/noise/octave-perlin2d.png" alt="" /></p>

<p>I then wrote a generalized version, <code class="language-plaintext highlighter-rouge">perlin.m</code>, that can generate
arbitrarily-dimensional noise. This one is a lot shorter, but it’s not
vectorized, can only sample one point at a time, and is incredibly
slow. For a hash function, I use Octave’s <code class="language-plaintext highlighter-rouge">hashmd5()</code>, so this one
won’t work in Matlab (which provides no hash function
whatsoever). However, it <em>is</em> a lot shorter!</p>

<div class="language-matlab highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">%% Returns the Perlin noise value for an arbitrary point.</span>
<span class="k">function</span> <span class="n">v</span> <span class="o">=</span> <span class="n">perlin</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
  <span class="n">v</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="c1">%% Iterate over each corner</span>
  <span class="k">for</span> <span class="n">dirs</span> <span class="o">=</span> <span class="p">[</span><span class="nb">dec2bin</span><span class="p">(</span><span class="mi">0</span><span class="p">:(</span><span class="mi">2</span> <span class="o">^</span> <span class="nb">length</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="mi">48</span><span class="p">]</span><span class="o">'</span>
    <span class="n">q</span> <span class="o">=</span> <span class="nb">floor</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="o">+</span> <span class="n">dirs</span><span class="o">'</span><span class="p">;</span> <span class="c1">% This iteration's corner</span>
    <span class="n">g</span> <span class="o">=</span> <span class="n">qgradient</span><span class="p">(</span><span class="n">q</span><span class="p">);</span> <span class="c1">% This corner's gradient</span>
    <span class="n">m</span> <span class="o">=</span> <span class="nb">dot</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">p</span> <span class="o">-</span> <span class="n">q</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="nb">abs</span><span class="p">(</span><span class="n">p</span> <span class="o">-</span> <span class="n">q</span><span class="p">);</span>
    <span class="n">v</span> <span class="o">+=</span> <span class="n">m</span> <span class="o">*</span> <span class="nb">prod</span><span class="p">(</span><span class="mi">3</span> <span class="o">*</span> <span class="n">t</span> <span class="o">.^</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">t</span> <span class="o">.^</span> <span class="mi">3</span><span class="p">);</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="c1">%% Return the gradient at the given grid point.</span>
<span class="k">function</span> <span class="n">v</span> <span class="o">=</span> <span class="n">qgradient</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
  <span class="n">v</span> <span class="o">=</span> <span class="nb">zeros</span><span class="p">(</span><span class="nb">size</span><span class="p">(</span><span class="n">q</span><span class="p">));</span>
  <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">:</span><span class="nb">length</span><span class="p">(</span><span class="n">q</span><span class="p">);</span>
      <span class="n">v</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">=</span> <span class="n">hashmd5</span><span class="p">([</span><span class="n">i</span> <span class="n">q</span><span class="p">])</span> <span class="o">*</span> <span class="mf">2.0</span> <span class="o">-</span> <span class="mf">1.0</span><span class="p">;</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>It took Octave an entire day to generate this “fire” video, which is
ridiculously long. An old graphics card could probably do this in real
time.</p>

<video src="https://nullprogram.s3.amazonaws.com/noise/fire.webm" width="300" height="300" controls="controls">
  Your browser doesn't support HTML5 video with WebM. :-(
</video>

<p>This was produced by viewing a slice of 3D noise. For animation, the
viewing area moves in two dimensions (z and y). One dimension makes
the fire flicker, the other makes it look like it’s rising. A simple
gradient was applied to the resulting noise to fade away towards the
top.</p>

<p>I wanted to achieve this same effect faster, so next I made a
generalized Java implementation, which is the bulk of the
repository. I wrote my own Vector class (completely unlike Java’s
obsolete Vector but more like Apache Commons Math’s RealVector), so
it looks very similar to the Octave version. It’s much, much faster
than the generalized Octave version. It doesn’t use a hash function
for gradients — instead randomly generating them as needed and
keeping track of them for later with a Map.</p>

<p>I wanted to go faster yet, so next I looked at OpenCL for the first
time. OpenCL is an API that allows you to run C-like programs on your
graphics processing unit (GPU), among other things. I was sticking to
Java so I used <a href="http://www.lwjgl.org/">lwjgl</a>’s OpenCL bindings. In
order to use this code you’ll need an OpenCL implementation available
on your system, which, unfortunately, is usually proprietary. My
OpenCL noise generator only generates 3D noise.</p>

<p>Why use the GPU? GPUs have a highly-parallel structure that makes them
faster than CPUs at processing large blocks of data in parallel. This
is really important when it comes to computer graphics, but it can be
useful for other purposes as well, like generating Perlin noise.</p>

<p>I had to change my API a little to make this effective. Before, to
generate noise samples, I passed points in individually to
PerlinNoise. To properly parallelize this for OpenCL, an entire slice
is specified by setting its width, height, step size, and
z-level. This information, along with pre-computed grid gradients, is
sent to the GPU.</p>

<ul>
  <li><a href="https://github.com/skeeto/perlin-noise/blob/opencl/src/com/nullprogram/noise/perlin3d.cl">perlin3d.cl</a></li>
</ul>

<p>This is all in the <code class="language-plaintext highlighter-rouge">opencl</code> branch in the repository. When run, it
will produce a series of slices of 3D noise in a manner similar to the
fire example above. For comparison, it will use the CPU by default,
generating a series of <code class="language-plaintext highlighter-rouge">simple-*.png</code>. Give the program one argument,
“opencl”, and it will use OpenCL instead, generating a series of
<code class="language-plaintext highlighter-rouge">opencl-*.png</code>. You should notice a massive increase in speed when
using OpenCL. In fact, it’s even faster than this. The vast majority
of the time is spent creating these output PNG images. When I disabled
image output for both, OpenCL was 200 times faster than the
(single-core) CPU implementation, still spending a significant amount
of time just loading data off the GPU.</p>

<p>And finally, I turned the OpenCL output into a video,</p>

<video src="https://nullprogram.s3.amazonaws.com/noise/opencl.webm" width="400" height="400" controls="controls">
  Your browser doesn't support HTML5 video with WebM. :-(
</video>

<p>That’s pretty cool!</p>

<p>I still don’t really have a use for Perlin noise, especially not under
constraints that require I use OpenCL to generate it. The big thing I
got out of this project was my first experience with OpenCL, something
that really <em>is</em> useful at work.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Pseudo-terminals</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/04/23/"/>
    <id>urn:uuid:269799fd-3a67-3a22-433a-c5224447e614</id>
    <updated>2012-04-23T00:00:00Z</updated>
    <category term="c"/><category term="posix"/>
    <content type="html">
      <![CDATA[<p>My dad recently had an interesting problem at work related to serial
ports. Since I use serial ports at work, he asked me for advice. They
have third-party software which reads and analyzes sensor data from
the serial port. It’s the only method this program has of inputting a
stream of data and they’re unable to patch it. Unfortunately, they
have another piece of software that needs to massage the data before
this final program gets it. The data needs to be intercepted coming on
the serial port somehow.</p>

<p><img src="/img/diagram/pseudo-terminals.png" alt="" /></p>

<p>The solution they were aiming for was to create a pair of virtual
serial ports. The filter software would read data in on the real
serial port, output the filtered data into a virtual serial port which
would be virtually connected to a second virtual serial port. The
analysis software would then read from this second serial port. They
couldn’t figure out how to set this up, short of buying a couple of
USB/serial port adapters and plugging them into each other.</p>

<p>It turns out this is very easy to do on Unix-like systems. POSIX
defines two functions, <code class="language-plaintext highlighter-rouge">posix_openpt(3)</code> and <code class="language-plaintext highlighter-rouge">ptsname(3)</code>. The first
one creates a pseudo-terminal — a virtual serial port — and returns
a “master” <em>file descriptor</em> used to talk to it. The second provides
the name of the pseudo-terminal device on the filesystem, usually
named something like <code class="language-plaintext highlighter-rouge">/dev/pts/5</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define _GNU_SOURCE
#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
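    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">posix_openpt</span><span class="p">(</span><span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_NOCTTY</span><span class="p">);</span>
    <span class="n">grantpt</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>   <span class="cm">/* set ownership and mode of the slave device */</span>
    <span class="n">unlockpt</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>  <span class="cm">/* allow the slave device to be opened */</span>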
    <span class="n">printf</span><span class="p">(</span><span class="s">"%s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">ptsname</span><span class="p">(</span><span class="n">fd</span><span class="p">));</span>
    <span class="cm">/* ... read and write to fd ... */</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The printed device name can be opened by software that’s expecting to
access a serial port, such as
<a href="http://en.wikipedia.org/wiki/Minicom">minicom</a>, and it can be
communicated with as if by a pipe. This could be useful in testing a
program’s serial port communication logic virtually.</p>

<p>The name is unusually long because the function was a late
addition: it wasn’t added to POSIX until 2001 (POSIX.1-2001). They were
probably afraid of name collisions with software already using
<code class="language-plaintext highlighter-rouge">openpt()</code> as a function name. The GNU C Library provides an
extension, <code class="language-plaintext highlighter-rouge">getpt(3)</code>, which is just shorthand for the
<code class="language-plaintext highlighter-rouge">posix_openpt()</code> call above.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="n">getpt</span><span class="p">();</span>
</code></pre></div></div>

<p>Pseudo-terminal functionality was available much earlier, of
course. It could be done through the poorly designed <code class="language-plaintext highlighter-rouge">openpty(3)</code>,
added in BSD Unix.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">openpty</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">amaster</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">aslave</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span>
            <span class="k">const</span> <span class="k">struct</span> <span class="n">termios</span> <span class="o">*</span><span class="n">termp</span><span class="p">,</span>
            <span class="k">const</span> <span class="k">struct</span> <span class="n">winsize</span> <span class="o">*</span><span class="n">winp</span><span class="p">);</span>
</code></pre></div></div>

<p>It accepts <code class="language-plaintext highlighter-rouge">NULL</code> for the last three arguments, allowing the user to
ignore them. What makes it so bad is that string <code class="language-plaintext highlighter-rouge">name</code>. The user
passes it a chunk of allocated space and hopes it is long enough
for the file name. If not, <code class="language-plaintext highlighter-rouge">openpty()</code> writes past the end of the
buffer and corrupts adjacent memory. The name is highly unlikely to ever exceed
something like 32 bytes, but it’s still a correctness problem.</p>

<p>The newer <code class="language-plaintext highlighter-rouge">ptsname()</code> is only slightly better, however. It returns a
string that doesn’t need to be <code class="language-plaintext highlighter-rouge">free()</code>d, because it’s static
memory. That means the function is not re-entrant; it has
issues in multi-threaded programs, since that string could be overwritten
at any instant by another call to <code class="language-plaintext highlighter-rouge">ptsname()</code>. Consider this case,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">fd0</span> <span class="o">=</span> <span class="n">getpt</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">fd1</span> <span class="o">=</span> <span class="n">getpt</span><span class="p">();</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%s %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">ptsname</span><span class="p">(</span><span class="n">fd0</span><span class="p">),</span> <span class="n">ptsname</span><span class="p">(</span><span class="n">fd1</span><span class="p">));</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">ptsname()</code> returns the same <code class="language-plaintext highlighter-rouge">char *</code> pointer each time it’s
called, merely filling the pointed-to space before returning. Rather
than printing two different device filenames, the above would print
the same filename twice. The GNU C Library provides an extension to
correct this flaw, <code class="language-plaintext highlighter-rouge">ptsname_r()</code>, where the caller provides the
buffer, as with <code class="language-plaintext highlighter-rouge">openpty()</code>, but also passes its size.</p>
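<p>As a rough sketch of how <code class="language-plaintext highlighter-rouge">ptsname_r()</code> avoids the shared static buffer, the helper below opens a pseudo-terminal and copies the device name into memory owned by the caller. The name <code class="language-plaintext highlighter-rouge">open_pty()</code> and the 64-byte buffers in the usage note are assumptions made for illustration, and the code relies on the GNU extension, so it is not portable.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: open a new pseudo-terminal master and store the
 * name of its slave device in buf (at most len bytes, including the
 * terminating null). Returns the master descriptor, or -1 on failure. */
int open_pty(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    int fd = posix_openpt(O_RDWR | O_NOCTTY);
    if (fd < 0)
        return -1;
    /* ptsname_r() returns 0 on success and fails, rather than
     * overflowing, when the buffer is too small. */
    if (grantpt(fd) != 0 || unlockpt(fd) != 0 ||
        ptsname_r(fd, buf, len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

<p>Calling it twice, each time with its own buffer (say <code class="language-plaintext highlighter-rouge">char a[64], b[64];</code>), leaves two distinct names that can safely be printed together, since neither call touches the other’s memory.</p>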

<p>To make a one-way virtual connection between our pseudo-terminals,
create two of them and do the typical buffer thing between the file
descriptors (for succinctness, with only minimal error checking),</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">buffer</span><span class="p">[</span><span class="mi">4096</span><span class="p">];</span>
    <span class="kt">ssize_t</span> <span class="n">in</span> <span class="o">=</span> <span class="n">read</span><span class="p">(</span><span class="n">pt0</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">in</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">break</span><span class="p">;</span>  <span class="cm">/* end of file or read error */</span>
    <span class="n">write</span><span class="p">(</span><span class="n">pt1</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Making a two-way connection would require the use of threads or
<code class="language-plaintext highlighter-rouge">select(2)</code>, but it wouldn’t be much more complicated.</p>
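<p>A minimal sketch of the <code class="language-plaintext highlighter-rouge">select(2)</code> approach might look like the following. The function names <code class="language-plaintext highlighter-rouge">forward()</code> and <code class="language-plaintext highlighter-rouge">relay_step()</code> are made up for this example, and both descriptors are assumed to be open already; it shows the general shape, not a hardened implementation.</p>

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Copy whatever is currently readable on `from` over to `to`.
 * Returns the number of bytes forwarded, 0 at end of file, -1 on error. */
ssize_t forward(int from, int to)
{
    char buf[4096];
    ssize_t n = read(from, buf, sizeof buf);
    if (n <= 0)
        return n;
    ssize_t off = 0;
    while (off < n) {                 /* handle partial writes */
        ssize_t w = write(to, buf + off, (size_t)(n - off));
        if (w <= 0)
            return -1;
        off += w;
    }
    return n;
}

/* One iteration of a two-way relay: block until either descriptor is
 * readable, then forward the pending data in the matching direction.
 * Returns 1 to keep going, 0 on end of file or any error. */
int relay_step(int a, int b)
{
    assert(a >= 0 && b >= 0 && a < FD_SETSIZE && b < FD_SETSIZE);
    fd_set rd;
    FD_ZERO(&rd);
    FD_SET(a, &rd);
    FD_SET(b, &rd);
    int maxfd = a > b ? a : b;
    if (select(maxfd + 1, &rd, NULL, NULL, NULL) < 0)
        return 0;
    if (FD_ISSET(a, &rd) && forward(a, b) <= 0)
        return 0;
    if (FD_ISSET(b, &rd) && forward(b, a) <= 0)
        return 0;
    return 1;
}
```

<p>With two masters <code class="language-plaintext highlighter-rouge">pt0</code> and <code class="language-plaintext highlighter-rouge">pt1</code>, the whole relay would then be <code class="language-plaintext highlighter-rouge">while (relay_step(pt0, pt1));</code>, which runs until one side closes or fails.</p>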

<p>While all this was new and interesting to me, it didn’t help my dad at
all because they’re using Windows. These functions don’t exist there
and creating virtual serial ports is a highly non-trivial,
less-interesting process. Buying the two adapters and connecting them
together is my recommended solution for Windows.</p>
]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Pth and ncurses</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2010/12/21/"/>
    <id>urn:uuid:6164246d-d4c0-3a3f-7842-86cbdad69b87</id>
    <updated>2010-12-21T00:00:00Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<!-- 21 December 2010 -->
<p>
I've learned how to use the
curses/<a href="http://www.gnu.org/software/ncurses/">ncurses</a>
library recently, and I decided to experiment with threading while
using ncurses. This posed another learning opportunity: a chance to
learn more about <a href="http://www.gnu.org/software/pth/">GNU
Pth</a>, a non-preemptive, userspace threading library. It's a really
cool trick to get threading into any C program on any platform. I've
used POSIX threads (Pthreads) before, but Pth is totally new to me.
</p>
<p>
The idea was this: a ticking timestamp string that I can move around
the screen with the arrow keys. One thread will keep the clock up to
date, and the other listens for user input and changes the clock
coordinates.
</p>
<p>
The ncurses function for getting a key from the user
is <code>getch()</code>. By default, it blocks waiting for the user to
press a key, which is returned. Unfortunately, this doesn't interact
with Pth well at all.
</p>
<p>
Pth threads are userspace threads rather than system threads. This
means the operating system kernel is completely unaware of them, so
they are managed by a userspace scheduler and the Pth threads all run
inside a single system thread. Because of this, Pth threads can never
take advantage of hardware parallelism. This disadvantage comes with
the advantage of portability (hence the name "portable threads"). It
can be used on systems that provide no threading support.
</p>
<p>
Pth threads are also non-preemptive. This means the thread currently in
control must eventually choose to yield control to other threads,
cooperating with them. Preemptive threads <i>take</i> control from
each other, so they never have to yield. Fortunately, the Pth library
sneaks in implicit yielding, so you usually don't have to worry about
this when using Pth. You can generally treat the Pth threads as if
they were preemptive.
</p>
<p>
As I was saying, Pth wasn't behaving well with ncurses. When I
called <code>getch()</code> my entire program, all threads, were
getting blocked too, which defeats the whole purpose of threading. I
switched to Pthreads to see if it was a mistake on my part, or an
issue with Pth. Pthreads was working just fine, so I had to figure out
what I was doing wrong with Pth.
</p>
<p>
I did manage to get Pth to behave the same way as the Pthreads version, but it
took two significant extra changes. Here's a code listing. There's
also a mutex-synchronized <code>draw_clock()</code> function not shown
here. I've written it so that I could easily switch back and forth
between the two threading libraries.
</p>

<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;ncurses.h&gt;</span><span class="cp">
#if USE_PTH
#include</span> <span class="cpf">&lt;pth.h&gt;</span><span class="cp">
#else
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;pthread.h&gt;</span><span class="cp">
#endif
</span>
<span class="kt">int</span> <span class="n">clockx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">clocky</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">run_clock</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="p">{</span>
      <span class="n">draw_clock</span> <span class="p">();</span>
      <span class="n">sleep</span> <span class="p">(</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
  <span class="k">return</span> <span class="n">v</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="cm">/* ncurses init */</span>
  <span class="n">initscr</span> <span class="p">();</span>
  <span class="n">noecho</span> <span class="p">();</span>
  <span class="n">cbreak</span> <span class="p">();</span>
  <span class="n">keypad</span> <span class="p">(</span><span class="n">stdscr</span><span class="p">,</span> <span class="n">TRUE</span><span class="p">);</span>

<span class="cp">#if USE_PTH
</span>  <span class="n">halfdelay</span> <span class="p">(</span><span class="mi">1</span><span class="p">);</span>
  <span class="n">pth_init</span> <span class="p">();</span>
  <span class="n">pth_spawn</span> <span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">run_clock</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="cp">#else
</span>  <span class="n">pthread_t</span> <span class="kr">thread</span><span class="p">;</span>
  <span class="n">pthread_create</span> <span class="p">(</span><span class="o">&amp;</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">run_clock</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="cp">#endif
</span>
  <span class="kt">int</span> <span class="n">ch</span><span class="p">;</span>
  <span class="k">while</span> <span class="p">((</span><span class="n">ch</span> <span class="o">=</span> <span class="n">getch</span> <span class="p">())</span> <span class="o">!=</span> <span class="sc">'q'</span><span class="p">)</span>
    <span class="p">{</span>
<span class="cp">#if USE_PTH
</span>      <span class="n">pth_yield</span> <span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
<span class="cp">#endif
</span>      <span class="k">switch</span> <span class="p">(</span><span class="n">ch</span><span class="p">)</span>
        <span class="p">{</span>
        <span class="k">case</span> <span class="n">ERR</span><span class="p">:</span>
          <span class="k">continue</span><span class="p">;</span>
        <span class="k">case</span> <span class="sc">'h'</span><span class="p">:</span>
        <span class="k">case</span> <span class="n">KEY_LEFT</span><span class="p">:</span>
        <span class="cm">/* more switch code here. */</span>
        <span class="p">}</span>
    <span class="p">}</span>

<span class="cp">#if USE_PTH
</span>  <span class="n">pth_kill</span> <span class="p">();</span>
<span class="cp">#endif
</span>  <span class="n">endwin</span> <span class="p">();</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>
Notice the two differences in the code between Pth and
Pthreads, in the <code>USE_PTH</code> sections. The Pth implementation
uses the <code>halfdelay()</code> function, which
tells <code>getch()</code> to be less blocking. The argument tells it
to return with an error if nothing happens within one tenth of a
second. This means our main polling loop will execute about 10 times a
second when nothing is happening.
</p>
<p>
The second change is an explicit yield, because Pth doesn't place any
implicit yields in the loop. Without the yield the same problem
remains.
</p>
<p>
I'm not happy with either of these
changes. Making <code>getch()</code> behave like non-blocking input is
very hackish. If I extend my program I'll always have to be careful
when I call <code>getch()</code>, since it's (essentially)
non-blocking. I'll also have to make sure I always yield when polling
with <code>getch()</code>. So how do I fix this? First, let's see why
we need that explicit yield. Pth is supposed to be hiding that from
me.
</p>
<p>
How does Pth insert implicit yielding? I've been aware of Pth for
years, and I've always wondered about this, but never bothered to
look. I dug into the Pth sources.
</p>
<p>
Pth inserts yields before some of the common blocking operations. It
has its own definitions for functions such as <code>read()</code>,
<code>write()</code>, <code>fork()</code>,
and <code>system()</code>. This is where the implicit yielding is
injected. It steals your calls to these functions, using its own
functions. Here's the relevant section of <code>pth.h</code>, where
the "soft" version of this happens (the "hard" version
uses <code>syscall()</code>, operating at a lower level),
</p>

<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#define fork          pth_fork
#define waitpid       pth_waitpid
#define system        pth_system
#define nanosleep     pth_nanosleep
#define usleep        pth_usleep
#define sleep         pth_sleep
#define sigprocmask   pth_sigmask
#define sigwait       pth_sigwait
#define select        pth_select
#define pselect       pth_pselect
#define poll          pth_poll
#define connect       pth_connect
#define accept        pth_accept
#define read          pth_read
#define write         pth_write
#define readv         pth_readv
#define writev        pth_writev
#define recv          pth_recv
#define send          pth_send
#define recvfrom      pth_recvfrom
#define sendto        pth_sendto
#define pread         pth_pread
#define pwrite        pth_pwrite</span></code></pre></figure>

<p>
Its own functions wrap the real deal, but suspend themselves on an
awaiting event and yield. Any time the scheduler runs, it polls for
these events, using <code>select()</code> in the case of input/output,
and will wake these threads back up if the required event occurs.
</p>
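<p>
To give a feel for the polling step, here is a small sketch of how a
userspace scheduler could ask, without blocking, whether a descriptor
has input waiting. This is not Pth's actual code;
<code>fd_readable()</code> is a hypothetical name, and the zero timeout
is what turns <code>select()</code> into a pure poll.
</p>

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: report whether fd can be read without blocking.
 * Returns 1 if input is waiting, 0 if not, -1 if select() fails. */
int fd_readable(int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    fd_set rd;
    struct timeval zero = {0, 0};   /* return immediately, never wait */
    FD_ZERO(&rd);
    FD_SET(fd, &rd);
    if (select(fd + 1, &rd, NULL, NULL, &zero) < 0)
        return -1;
    return FD_ISSET(fd, &rd) ? 1 : 0;
}
```

<p>
A scheduler built along these lines would keep a thread suspended while
<code>fd_readable()</code> returns 0 and only let it call the real
blocking function once the answer is 1.
</p>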
<p>
The problem with <code>getch()</code> is that Pth doesn't know about
it, so it doesn't get a chance to handle it properly. After taking a
look at the implementation for <code>pth_read()</code>, I
fixed <code>getch()</code> for my program,
</p>

<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#if USE_PTH
</span><span class="kt">int</span> <span class="nf">pth_getch</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="n">pth_event_t</span> <span class="n">ev</span><span class="p">;</span>
  <span class="n">ev</span> <span class="o">=</span> <span class="n">pth_event</span> <span class="p">(</span><span class="n">PTH_EVENT_FD</span> <span class="o">|</span> <span class="n">PTH_UNTIL_FD_READABLE</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
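  <span class="n">pth_wait</span> <span class="p">(</span><span class="n">ev</span><span class="p">);</span>
  <span class="n">pth_event_free</span> <span class="p">(</span><span class="n">ev</span><span class="p">,</span> <span class="n">PTH_FREE_THIS</span><span class="p">);</span>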
  <span class="k">return</span> <span class="n">getch</span> <span class="p">();</span>
<span class="p">}</span>
<span class="cp">#undef getch
#define getch pth_getch
#endif</span></code></pre></figure>

<p>
It tells the Pth scheduler that the thread wants to be suspended until
there's something to read on <code>stdin</code> (file descriptor
<code>0</code>). This prevents the system thread that everyone is
counting on from blocking. With this redefinition I went back and
removed the two Pth additions, <code>halfdelay()</code> and the yield,
and it now behaves exactly the same way as the Pthreads version. Fixed!
</p>
<p>
If you use any other libraries, you'll need to do this for any
long-blocking function that Pth doesn't already catch.
</p>
<p>
If you want to see this in action, here's the full
source: <a href="/download/clock.c"><code>clock.c</code></a>. You can
choose the threading library to use,
</p>
<pre>
gcc -DUSE_PTH clock.c -o clock_pth -lncurses -lpth
gcc -pthread clock.c -o clock_pthread -lncurses
</pre>
<p>
Update: I've just discovered that Debian and Debian-based
systems <a href="http://lists.gnupg.org/pipermail/gnupg-devel/2006-November/023372.html">
have implicit yielding disabled</a> by default. You may need to
add this before the <code>pth.h</code> include.
</p>

<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#define PTH_SYSCALL_SOFT 1</span></code></pre></figure>

<p>
This is now in the linked source.
</p>
]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Not So Stupid C Mistake</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2009/04/20/"/>
    <id>urn:uuid:2137866c-6d55-35e5-77a0-8cbd154e6b71</id>
    <updated>2009-04-20T00:00:00Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<!-- 20 April 2009 -->
<p>
I was reading through a website of
"<a href="http://www.rinkworks.com/stupid/cs_programming.shtml">computer
stupidities</a>" today when I came across this,
</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">if</span> <span class="p">(</span><span class="n">a</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="cm">/* do something */</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
  <span class="p">}</span>
<span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="o">!</span><span class="n">a</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="cm">/* do something else */</span>
    <span class="k">return</span> <span class="n">y</span><span class="p">;</span>
  <span class="p">}</span>
<span class="k">else</span>
  <span class="p">{</span>
    <span class="cm">/* do something entirely different */</span>
    <span class="k">return</span> <span class="n">z</span><span class="p">;</span>
  <span class="p">}</span></code></pre></figure>
<p>
This was quickly dismissed as being an obvious beginner mistake. I
don't think this can be dismissed so quickly without thinking it
through for a moment. Yes, in the example above we will never reach
the last condition where we <code>return z</code>, but consider the
following,
</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">if</span> <span class="p">(</span><span class="n">a</span> <span class="o">&lt;</span> <span class="n">b</span><span class="p">)</span>
  <span class="n">printf</span> <span class="p">(</span><span class="s">"foo</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">a</span> <span class="o">&gt;</span> <span class="n">b</span><span class="p">)</span>
  <span class="n">printf</span> <span class="p">(</span><span class="s">"bar</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">a</span> <span class="o">==</span> <span class="n">b</span><span class="p">)</span>
  <span class="n">printf</span> <span class="p">(</span><span class="s">"baz</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">else</span>
  <span class="nf">printf</span> <span class="p">(</span><span class="s">"faz</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span></code></pre></figure>
<p>
The same quick dismissal might drop the last "faz" print statement as
being an impossible condition. Can you think of a situation where the
program would print "faz"?
</p>
<p>
Our final condition will be reached if <code>a</code> or
<code>b</code> is a floating-point <code>NAN</code> ("not a number"), which is defined by the
<a href="http://en.wikipedia.org/wiki/IEEE_floating-point_standard">
IEEE floating-point standard</a>. It is available in C99 from
<code>math.h</code>. Each of the comparisons above evaluates to false
when either operand is <code>NAN</code>, so control falls through to the final <code>else</code>.
</p>
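<p>
Here is a small sketch that makes this concrete. It assumes
<code>a</code> and <code>b</code> are of type <code>double</code>, and
the helper names <code>compare()</code> and
<code>check_compare()</code> are invented for the example. Instead of
printing, the chain returns the word it would have printed.
</p>

```c
#include <assert.h>
#include <math.h>
#include <string.h>

/* Hypothetical helper: the same if/else chain as above, returning the
 * word it would print. The operands are assumed to be doubles. */
const char *compare(double a, double b)
{
    if (a < b)
        return "foo";
    else if (a > b)
        return "bar";
    else if (a == b)
        return "baz";
    else
        return "faz";   /* only reachable when a or b is NaN */
}

/* The assertions document which branch each input takes. */
void check_compare(void)
{
    assert(strcmp(compare(1.0, 2.0), "foo") == 0);
    assert(strcmp(compare(2.0, 1.0), "bar") == 0);
    assert(strcmp(compare(1.0, 1.0), "baz") == 0);
    assert(strcmp(compare(NAN, 1.0), "faz") == 0);
    assert(strcmp(compare(NAN, NAN), "faz") == 0);
}
```

<p>
Note that even <code>compare(NAN, NAN)</code> lands in the last branch,
since a NaN does not compare equal to itself.
</p>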
<p>
So don't be so quick to dismiss code like this.
</p>
]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Sudoku Solver</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2008/07/20/"/>
    <id>urn:uuid:11f2985b-375e-39d4-7510-1daf1b97cfea</id>
    <updated>2008-07-20T00:00:00Z</updated>
    <category term="c"/><category term="game"/>
    <content type="html">
      <![CDATA[<!-- 20 July 2008 -->
<p>
I was at my fiancée's parents' house over Fourth of July weekend. Her
family likes to leave plenty of reading material right by the toilet,
which is something fairly new to me. They take their time on the john
quite seriously.
</p>
<p>
While I was in there I saw a large book
of <a href="http://en.wikipedia.org/wiki/Sudoku"> Sudoku</a>
puzzles. Since the toilet is a good spot to think (I like to call it
my
"<a href="http://everything-more.blogspot.com/2008/04/that-t-shirt-is-wrong-color.html">
thinking chair</a>"), I thought out an algorithm for solving
Sudokus. I then left the bathroom and implemented it in order to
verify that it worked.
</p>
<p>
The method is recursive trial and error: fill in the
next available spot with a valid number as defined by the rules
(no repeated number in a column, row, or partition), and
recurse. The function reports success (true) when a solution is
found, or failure (false), in which case the caller tries the next available
number. If no more valid numbers are available at the
current position, then the puzzle is not solvable from this state (we made an error at
a previous position), so we stop recursing and return failure.
</p>
<p>
More formally,
</p>
<ul>
  <li>Find an open position.</li>
  <li>Look at that position's row, column, and partition to find valid
  numbers to fill in.</li>
  <li>Fill the position with one of the valid choices.</li>
  <li>Recurse using the new change.</li>
  <li>If the recursion reports a problem (returns false), try the next
  valid number and repeat.</li>
  <li>If recursion reports success (true), stop guessing and return
  success.</li>
  <li>If the list of valid numbers is exhausted, return failure (false).</li>
</ul>
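<p>
The validity rule in the second step can also be written as a
standalone predicate. This is only an illustrative sketch:
<code>is_legal()</code> is a made-up name, and it assumes a 9x9 matrix
in which 0 marks an empty square and the first index selects the row.
</p>

```c
#include <assert.h>

/* Hypothetical predicate: may the number n (1-9) be placed at
 * matrix[x][y]? A value of 0 in the matrix means the square is empty. */
int is_legal(char matrix[9][9], int x, int y, int n)
{
    assert(x >= 0 && x < 9 && y >= 0 && y < 9 && n >= 1 && n <= 9);
    int px = (x / 3) * 3;   /* first row of the 3x3 partition */
    int py = (y / 3) * 3;   /* first column of the 3x3 partition */
    for (int i = 0; i < 9; i++) {
        if (matrix[x][i] == n)
            return 0;       /* already in this row */
        if (matrix[i][y] == n)
            return 0;       /* already in this column */
        if (matrix[px + i / 3][py + i % 3] == n)
            return 0;       /* already in this partition */
    }
    return 1;
}
```

<p>
The integer division <code>(x / 3) * 3</code> rounds a coordinate down
to the start of its partition, so squares 3, 4 and 5 all map to 3.
</p>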
<p>
Note that the recursion depth does not exceed 81, as it only recurses
once per blank square. The "game tree" is broad rather than deep. It
doesn't have to duplicate the puzzle matrix in memory either because
all operations can be done in place.
</p>
<p>
Here is the implementation in C I typed up just after I left the
bathroom,
</p>

<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">int</span> <span class="nf">solve</span><span class="p">(</span><span class="kt">char</span> <span class="n">matrix</span><span class="p">[</span><span class="mi">9</span><span class="p">][</span><span class="mi">9</span><span class="p">])</span>
<span class="p">{</span>
    <span class="cm">/* Find an empty spot. */</span>
    <span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">9</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">s</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">9</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">s</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">matrix</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">x</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span> <span class="n">y</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>

    <span class="cm">/* No empty spots, we found a solution! */</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">s</span><span class="p">)</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>

    <span class="cm">/* Determine legal numbers for this spot. */</span>
    <span class="kt">char</span> <span class="n">nums</span><span class="p">[</span><span class="mi">10</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span> <span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">j</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">9</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">nums</span><span class="p">[(</span><span class="kt">int</span><span class="p">)</span> <span class="n">matrix</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>   <span class="cm">/* Vertically */</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">9</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>
        <span class="n">nums</span><span class="p">[(</span><span class="kt">int</span><span class="p">)</span> <span class="n">matrix</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>   <span class="cm">/* Horizontally */</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>
            <span class="n">nums</span><span class="p">[(</span><span class="kt">int</span><span class="p">)</span> <span class="n">matrix</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">x</span> <span class="o">/</span> <span class="mi">3</span><span class="p">))</span> <span class="o">*</span> <span class="mi">3</span><span class="p">]</span>
                             <span class="p">[</span><span class="n">j</span> <span class="o">+</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">y</span> <span class="o">/</span> <span class="mi">3</span><span class="p">))</span> <span class="o">*</span> <span class="mi">3</span><span class="p">]</span>
                <span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>          <span class="cm">/* Within the partition */</span>

    <span class="cm">/* Try each possible number and recurse for each. */</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">matrix</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="n">y</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">solve</span><span class="p">(</span><span class="n">matrix</span><span class="p">))</span>
                <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>

    <span class="cm">/* Each attempt failed: reset this position and report failure. */</span>
    <span class="n">matrix</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="n">y</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>
I assumed that it would be slow solving the puzzles, having to search
a wide tree, but it turns out to be very fast. It solves normal
human-solvable puzzles in a couple of milliseconds. Wikipedia has a
near-worst case Sudoku that is designed to make algorithms like mine
perform their worst.
</p>
<p>
  <img src="/img/sudoku/worst-case.png" alt="Worst-case Sudoku"/>
</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">char</span> <span class="n">worst_case</span><span class="p">[</span><span class="mi">9</span><span class="p">][</span><span class="mi">9</span><span class="p">]</span> <span class="o">=</span>
  <span class="p">{</span>
    <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">},</span>
    <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>

    <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>

    <span class="p">{</span><span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">3</span><span class="p">},</span>
    <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>   <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">9</span><span class="p">}</span>
  <span class="p">};</span></code></pre></figure>
<p>
On my laptop, my program solves this in 15 seconds. Since the puzzle is
near the worst case for this kind of search, it should take roughly 15
seconds at most to solve any given Sudoku puzzle. This gives me a nice
practical upper limit.
</p>
<p>
There is a way to "defeat" this particular puzzle. For example, say an
attacker was trying to perform a
<a href="http://en.wikipedia.org/wiki/Denial-of-service_attack">
denial-of-service</a> (DoS) attack on your Sudoku solver by giving it
puzzles like this one (making your server spend lots of time solving
only a few puzzles). However, these puzzles assume a certain guessing
order. By simply randomizing the order of guessing, both in choosing
positions and the order that numbers are guessed, the attacker will
have a much harder time creating a difficult puzzle. The worst case
could very well be the best case. This is very similar to how
Perl <a href="http://www.ayni.com/perldoc/perlsec.html#Algorithmic-Complexity-Attacks">
randomizes the hash function used by its hashes</a>.
</p>
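<p>
As a minimal sketch of the number-guessing half of that idea, the digits
can be shuffled with a Fisher-Yates pass before each round of guesses.
The name <code>shuffled_digits</code> is hypothetical and not part of
the solver above, and <code>rand()</code> is only a placeholder: a real
defense would need a seed the attacker cannot predict.
</p>

```c
#include <stdlib.h>

/* Fill order[0..8] with the digits 1-9 in random order (Fisher-Yates). */
static void shuffled_digits(int order[9])
{
    for (int i = 0; i < 9; i++)
        order[i] = i + 1;

    for (int i = 8; i > 0; i--) {
        int j = rand() % (i + 1);   /* slight modulo bias; fine for a sketch */
        int tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}
```

<p>
The solver's guessing loop would then walk <code>order[0]</code>
through <code>order[8]</code> instead of counting from 1 to 9.
</p>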
<p>
Now suppose we kept our guess order random then "solved" an empty
Sudoku puzzle. What we have is a solution to a randomly generated
Sudoku. To turn it into a puzzle, we just back it off a bit. A Sudoku
is only supposed to have a single unambiguous solution, so we can only
back off until just before the point where two solutions become
possible. If you imagine a solution tree, this would be backing up a
branch until you hit a fork.
</p>
<p>
Normally, Sudokus are symmetric (in the matrix sense), but completely
randomizing the position guessing order won't achieve this. To make
this work, the randomizing process can be adjusted to only select
random points on the upper triangle (including the diagonal). For each
point it selects <i>not</i> on the diagonal, the mirror point is
automatically selected next. This will preserve symmetry when
generating puzzles.
</p>
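<p>
Here is a rough sketch of that selection order, under my own assumptions
about the data layout. <code>struct pos</code>
and <code>symmetric_order</code> are hypothetical names, and the shuffle
again uses plain <code>rand()</code>. It shuffles the 45 points of the
upper triangle, then emits each one followed immediately by its mirror
when it is off the diagonal.
</p>

```c
#include <stdlib.h>

struct pos { int x, y; };

/* Fill order[] with all 81 cells so that every off-diagonal cell is
 * followed immediately by its mirror. Returns the number of entries. */
static int symmetric_order(struct pos order[81])
{
    struct pos upper[45];
    int n = 0;

    /* Collect the upper triangle, diagonal included: 45 points. */
    for (int x = 0; x < 9; x++)
        for (int y = x; y < 9; y++)
            upper[n++] = (struct pos){x, y};

    /* Shuffle those points (Fisher-Yates). */
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        struct pos tmp = upper[i];
        upper[i] = upper[j];
        upper[j] = tmp;
    }

    /* Emit each point, then its mirror if it is not on the diagonal. */
    int count = 0;
    for (int i = 0; i < n; i++) {
        order[count++] = upper[i];
        if (upper[i].x != upper[i].y)
            order[count++] = (struct pos){upper[i].y, upper[i].x};
    }
    return count;
}
```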
<p>
One issue remains: there seems to be no way to control the difficulty
of the puzzles it generates. Maybe the number of open spaces left behind
is a good metric? This will require some further study (and another
post!).
</p>
]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Variable Declarations and the C Call Stack</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2008/07/18/"/>
    <id>urn:uuid:6f6765dd-b2e8-333e-2908-72b969ce7bdf</id>
    <updated>2008-07-18T00:00:00Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<!-- 18 July 2008 -->
<p>
A co-worker asked me a question today about C/C++ pointers,
</p>
<blockquote>
  <p>
    If a pointer is declared inside a function with no explicit
    initialization, can I assume that the pointer is initialized
    to <code>NULL</code>?
  </p>
</blockquote>
<p>
We were down in the lab and, therefore, he had no Internet access to
look it up himself, which is why he asked. When I code C, it is just a
sort of mental habit to not use a non-static function variable without
first initializing it, but is this accurate? I <i>knew</i> the answer
was "no", but I wanted to be able to explain the "why".
</p>
<p>
Anyway, I quickly recalled some of my experimental C programs and
thought carefully about the mechanics of what is going on
behind-the-scenes, allowing me to confidently give him a "no"
answer. I then threw this together in a few seconds to prove it,
</p>

<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="nf">a</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="kt">int</span>   <span class="o">*</span> <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span> <span class="mh">0x012345FF</span><span class="p">;</span>
  <span class="kt">double</span>  <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mi">63454</span><span class="p">;</span>
  <span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="n">x</span><span class="p">;</span>
  <span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">b</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="kt">int</span>   <span class="o">*</span> <span class="n">x</span><span class="p">;</span>
  <span class="kt">double</span>  <span class="n">y</span><span class="p">;</span>
  <span class="n">printf</span> <span class="p">(</span><span class="s">"%p, %f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">main</span> <span class="p">()</span>
<span class="p">{</span>
  <span class="n">a</span> <span class="p">();</span>
  <span class="n">b</span> <span class="p">();</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>
When you compile it, make sure you don't use the optimization options
(<code>-O</code>, <code>-O2</code>, or <code>-O3</code>
for <code>gcc</code>) because they change the inner workings of the
program. It might do things like make those functions inline (so they
won't be on the stack as I am intending), or even toss
out <code>a()</code>, as it appears to do nothing. The compiler sees
that, even though I "used" variables in <code>a()</code> by casting
them to <code>void</code>, nothing is really happening,
so <code>a()</code> can be ignored. We can probably get around this
with a tacked
on <code> <a href="http://en.wikipedia.org/wiki/Volatile_variable">
volatile</a></code> declaration, which you might see a lot of in a
micro-controller program. In a micro-controller, some memory addresses
are mapped to registers external to the software, so, from the
compiler's point of view, access to these locations may look like
nothing is really happening. Optimizing away variables that point to
these memory locations will lead to an incorrect binary, so your robot
or laser-guided shark or whatever won't work.
</p>
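<p>
To make the micro-controller case concrete, here is a small sketch. The
variable <code>fake_status</code> is a made-up stand-in for a
memory-mapped register, and <code>pulse</code>
and <code>wait_ready</code> are hypothetical helpers. On real hardware
the pointer would come from an address in the datasheet rather than from
an ordinary variable.
</p>

```c
#include <stdint.h>

/* Stand-in for a memory-mapped status register (an assumption made for
 * this sketch; real hardware would use a fixed address). */
static uint32_t fake_status;

/* Write 1 then 0 through the pointer. Because the target is volatile,
 * the compiler must perform both stores, even though the first one
 * looks dead from the program's point of view. */
static void pulse(volatile uint32_t *reg)
{
    *reg = 1;
    *reg = 0;
}

/* Poll the low bit. The volatile qualifier forces a fresh read on every
 * pass instead of one read hoisted out of the loop. Returns the number
 * of polls before the bit was seen, or -1 if it never was. */
static int wait_ready(const volatile uint32_t *reg, int max_polls)
{
    for (int i = 0; i < max_polls; i++)
        if (*reg & 1u)
            return i;
    return -1;
}
```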
<p>
Anyway, compiling with optimization will break my example! So don't do
it here.
</p>
<p>
When compiling, you should get some warnings about using uninitialized
variables, which is kind of the point of my example. Ignore them. Those
warnings alone give away the answer to the main question, really, but
this example is a bit more fun!
</p>
<p>
Before you run it, study it and think about what the output should
look like. When <code>a()</code> is called, its stack frame goes into
the call stack, which contains the two declared variables. These
variables are then assigned as part of the function
execution. When <code>a()</code> returns, the frame is popped off the
stack. Then <code>b()</code> is called, and, as the variable
declarations are exactly the same, it will fit right on top
of <code>a()</code>'s old stack frame, and its variables will line
up. <code>x</code> and <code>y</code> are not assigned any value, so
they pick up whatever junk was lying around, which happens to be the
values assigned in <code>a()</code>.
</p>
<p>
When you run the program, this is the output,
</p>
<pre>
0x12345ff, -63454.000000
</pre>
<p>
The pointer is <i>not</i> initialized
to <code>NULL</code>. If <code>x</code> is passed back uninitialized
under the assumption that a <code>NULL</code> is being passed, some
other poor function that handles the return value may dereference it,
resulting in possibly
some <a href="http://www.catb.org/jargon/html/N/nasal-demons.html">
nasal demons</a>, but most likely an annoying segmentation
fault. Worse, this error may occur far, far away from where the actual
problem is, and even worse than that, only sometimes (depending on the
state of the call stack at just the right moment).
</p>
<p>
Note here that I am talking about non-static function variable
declarations. Global variables and static function variables will not
be on the stack. They are placed in a fixed location (in the data or
BSS segment), and the C standard guarantees that they are implicitly
initialized to 0 <i>before the program starts</i>.
</p>
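<p>
A short sketch of that contrast, with made-up
names (<code>global_ptr</code>, <code>calls</code>). Unlike the
variables in <code>b()</code>, these can be relied on to start out
zeroed.
</p>

```c
#include <stddef.h>

int *global_ptr;            /* file scope: starts out as a null pointer */

/* Counts how many times it has been called. */
static int calls(void)
{
    static int count;       /* static storage: starts at 0, not junk */
    return ++count;
}
```

<p>
Here <code>global_ptr == NULL</code> holds at program start, and the
first call to <code>calls()</code> returns 1.
</p>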
]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>The 3n + 1 Conjecture</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2008/01/29/"/>
    <id>urn:uuid:147c1170-2d8b-38b3-bfbe-0652dc2e5e9a</id>
    <updated>2008-01-29T00:00:00Z</updated>
    <category term="c"/><category term="lua"/>
    <content type="html">
      <![CDATA[<!-- 29 January 2008 -->
<p>
The 3n + 1 conjecture, also known as
the <a href="http://en.wikipedia.org/wiki/Collatz_conjecture">Collatz
conjecture</a>, is based around this recursive function,
</p>
<p class="center">
  <img src="/img/misc/collatz.png" alt=""/>
</p>
<p>
The conjecture is this,
</p>
<blockquote>
  <p>
    This process will eventually reach the number 1, regardless of which
    positive integer is chosen initially.
  </p>
</blockquote>
<p>
The way I am defining this may not be entirely accurate, as I took a
shortcut to make it a bit simpler. I am not a mathematician (IANAM) —
but sometimes I pretend to be one. For a really solid definition,
click through to the Wikipedia article in the link above.
</p>
<p>
A sample run, starting at 7, would look like this: <code>7, 22, 11,
34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1</code>. The sequence
starting at 7 contains 17 numbers. So 7 has a <i>cycle-length</i> of
17. Currently, there is no known positive integer that does not
eventually lead to 1. If the conjecture is true, then none exists to
be found.
</p>
<p>
I first found out about the problem when I saw it
on <a href="http://icpcres.ecs.baylor.edu/onlinejudge/">UVa Online
Judge</a>. UVa Online Judge is a system that has a couple thousand
programming problems to do. Users can submit solution programs written
in C, C++, Java, or Pascal. For normal submissions, the fastest
program wins.
</p>
<p>
Anyway, the way UVa Online Judge runs this problem is by providing the
solution program pairs of integers on <code>stdin</code> as text. The
integers define an inclusive range of integers over which the program
must return the longest Collatz cycle-length among all the
integers inside that range. They don't tell you which ranges they are
checking, except that all integers will be less than 1,000,000 and the
sequences will never overflow a 32-bit integer (allowing shortcuts to
be made to increase performance).
</p>
<p>
The simple approach would be defining a function that returns the
cycle length (<a href="http://www.lua.org/">Lua</a> programming
language),
</p>
<pre>
function collatz_len (n)
   local c = 1

   while n > 1 do
      c = c + 1
      if n % 2 == 0 then
         n = n / 2
      else
         n = 3 * n + 1
      end
   end

   return c
end
</pre>
<p>
Then we have a function check over a range (assuming n &lt;= m here),
</p>
<pre>
function check_range (n, m)
   local largest = 0

   for i = n, m do
      local len = collatz_len (i)

      if len > largest then
         largest = len
      end

   end

   return largest
end
</pre>
<p>
And top it off with the i/o. (I am just learning Lua, so I hope I did
this part properly!)
</p>
<pre>
while true do
   local n, m = io.stdin:read("*number", "*number")

   -- check for eof
   if n == nil or m == nil then
      break
   end

   print (n .. " " .. m .. " " .. check_range(n, m))
end
</pre>
<p>
Notice anything extremely inefficient? We are doing the same work over
and over again! Take, for example, this range: 7, 22. When we start
with 7, we get the sequence shown above: <code>7, 22, 11, 34, 17, 52,
26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1</code>. Eight of the numbers after
the 7 are part of the range that we are looking at. When we get up to
22, we are going to walk down the same sequence again, less the 7. To
make things more efficient, we apply
some <a href="http://en.wikipedia.org/wiki/Dynamic_programming">
dynamic programming</a> and store previously calculated cycle-lengths in
an array. Once we get to a value we already calculated, we just look
it up.
</p>
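<p>
As a rough illustration of the caching idea in C, here is a minimal
sketch. It is not the submission described in this post.
<code>CACHE_SIZE</code> and the function names are my own choices for
the example, the cache size simply mirrors the problem's input limit,
and inputs are assumed to be at least 1.
</p>

```c
#include <stdint.h>

#define CACHE_SIZE 1000000  /* assumed bound, matching the input limit */

static uint16_t cache[CACHE_SIZE];  /* 0 means "not computed yet" */

/* Cycle-length of n (n >= 1), counting both n and the final 1. */
static unsigned collatz_len(uint64_t n)
{
    if (n == 1)
        return 1;
    if (n < CACHE_SIZE && cache[n])
        return cache[n];            /* already known: just look it up */

    uint64_t next = (n % 2 == 0) ? n / 2 : 3 * n + 1;
    unsigned len = collatz_len(next) + 1;

    if (n < CACHE_SIZE)
        cache[n] = (uint16_t)len;   /* remember it for next time */
    return len;
}

/* Longest cycle-length over the inclusive range [lo, hi]. */
static unsigned check_range(uint64_t lo, uint64_t hi)
{
    unsigned largest = 0;
    for (uint64_t i = lo; i <= hi; i++) {
        unsigned len = collatz_len(i);
        if (len > largest)
            largest = len;
    }
    return largest;
}
```

<p>
With this, <code>collatz_len(7)</code> gives 17, matching the sample run
above, and values that fall outside the array are simply computed
without being stored.
</p>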
<p>
I used dynamic programming in my submission, which I wrote up in
C. You can grab my
source <a href="/download/collatz/collatz.c">
here</a>. It fills in a large array (1000000 entries) as values are
found, so no cycle-length is calculated twice. When I submitted this
program, it ranked 60 out of about 300,000 entries. There are probably
a number of tweaks that can increase performance, such as increasing
the size of the array, but I didn't care much about inching closer to
the top. I would bet that the very top entries did some
trial-and-error and determined what ranges are tested, using the
results to seed their program accordingly. You could take my code and
submit it yourself, but that wouldn't be very honest, would it?
</p>
<p>
So why am I going through all of this describing such a simple
problem? Well, it is because of this neat feature of Lua that applies
well to this problem. Lua is kind of like Lisp. In Lisp, everything is
a list ("list processing" --&gt; Lis<i>p</i>). In Lua, (almost)
everything is an associative array. (Maybe they should have called it
Assp? Or Hashp? I am kidding.) An object is a hash with fields
containing function references. There is even
some <a href="http://en.wikipedia.org/wiki/Syntactic_sugar"> syntactic
sugar</a> to help this along.
</p>
<p>
The cool thing is that we can create a hash with default entries that
reference a function that calculates the Collatz cycle-length of its
key. Once the cycle-length is calculated, the function reference is
replaced with the value, so the function is never called again from
that point. The function only actually determines the next integer,
then references the hash to get the cycle-length of that next integer.
</p>
<p>
Now this hash looks like it is infinitely large. This is really a form
of <a href="http://en.wikipedia.org/wiki/Lazy_evaluation"> lazy
evaluation</a>: no values are calculated until they are needed (this
is one of my favorite things about <a href="http://www.haskell.org/">
Haskell</a>). We don't need to explicitly ask for it to be calculated,
either. We just go along looking up values in the array as if they
were always there. Here is how you do it,
</p>
<pre>
collatz_len = { 1 }

setmetatable (collatz_len, {
   __index = function (name, n)
      if n % 2 == 0 then
         name[n] = name[n / 2] + 1
      else
         name[n] = name[3 * n + 1] + 1
      end
      return name[n]
   end
})
</pre>
<p>
So we replace the <code>collatz_len</code> function with this array
(and replace the call to an array reference) and we have applied
dynamic programming to our old program. If I run the two programs with
this sample input,
</p>
<pre>
10 1000
1000 3000
300 500
</pre>
<p>
and look at average running times, the dynamic programming version
runs 87% faster than the original.
</p>
<p>
One problem with this, though, is the use of recursion. In Lua, it is
really easy to hit recursion limits. For example, accessing element
10000 will cause the program to crash. This will probably get fixed
someday, or in some implementation of Lua.
</p>
<p>
I thought there might be a way to do this in Perl, by changing the
default hash value from <code>undef</code> to something else, but I
was mildly disappointed to find out that this is not true.
</p>
<p>
Here is the source for the original program and the one with dynamic
programming (BSD licenced):
<a href="/download/collatz/collatz_simple.lua">
  collatz_simple.lua</a> and
<a href="/download/collatz/collatz.lua">
  collatz.lua</a>
</p>
]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Optimizing, Multi-threading Brainfuck to C Converter</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2008/01/20/"/>
    <id>urn:uuid:48a1f67d-f523-38a6-015e-d023ba32a34c</id>
    <updated>2008-01-20T00:00:00Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<!-- 20 January 2008 -->
<p>
<a href="https://github.com/skeeto/bfc">wbf2c</a> converts an esoteric
programming language
called <a href="http://en.wikipedia.org/wiki/Brainfuck">brainfuck</a>
into C code which can be machine compiled. Several optimizations are
done to make the resulting program extremely fast and efficient. The
converter supports either a static (standard 30,000 cells) array or a
dynamically-sized array. It also supports many different cell types,
from the standard <code>char</code> to multi-precision cells
using <a href="http://gmplib.org/">GMP</a>.
</p>
<p>
The converter can also run several brainfuck programs on the same
memory array at once by running each one in a thread. To make sure
each brainfuck operation is atomic, each cell gets a mutex lock. The
only other multi-threading brainfuck implementation I know of
is <a href="http://www.clanpogo.dk/pogocms/index.php?action=bf">
Brainfork</a>.
</p>
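<p>
The per-cell locking could look something like the following sketch. It
is not the code wbf2c actually emits. The names <code>struct
cell</code>, <code>tape_init</code>, <code>cell_add</code>,
and <code>cell_get</code> are hypothetical, and the cells are
plain <code>unsigned char</code> for simplicity.
</p>

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_CELLS 30000  /* the standard brainfuck array size */

/* One lock per cell, so threads working on different cells do not
 * block each other. */
struct cell {
    pthread_mutex_t lock;
    unsigned char value;
};

static struct cell tape[NUM_CELLS];

static void tape_init(void)
{
    for (size_t i = 0; i < NUM_CELLS; i++) {
        pthread_mutex_init(&tape[i].lock, NULL);
        tape[i].value = 0;
    }
}

/* The '+' and '-' operations: a locked read-modify-write. */
static void cell_add(size_t pos, int delta)
{
    pthread_mutex_lock(&tape[pos].lock);
    tape[pos].value = (unsigned char)(tape[pos].value + delta);
    pthread_mutex_unlock(&tape[pos].lock);
}

/* A locked read, as a loop test on '[' or ']' would need. */
static unsigned char cell_get(size_t pos)
{
    pthread_mutex_lock(&tape[pos].lock);
    unsigned char v = tape[pos].value;
    pthread_mutex_unlock(&tape[pos].lock);
    return v;
}
```

<p>
Each brainfuck program would run in its own thread with its own cell
pointer, calling helpers like these instead of touching the array
directly.
</p>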
<p>
For an example of some brainfuck code I wrote,
</p>
<pre>
+&gt;+&lt;
[
[-&gt;&gt;+&gt;+&lt;&lt;&lt;]&gt;&gt;&gt;
[-&lt;&lt;&lt;+&gt;&gt;&gt;]&lt;&lt;

[-&gt;+&gt;+&lt;&lt;]&gt;&gt;
[-&lt;&lt;+&gt;&gt;]&lt;&lt;
]
</pre>
<p>
This program fills the memory with the Fibonacci series. Make sure you
use the dynamically sized array, along with the bignum cell
type. After two or three seconds of running, my laptop (unmodified
Dell Inspiron 1000) can calculate and spit out a 140MB text file
containing the first 50,000 numbers in the series. I used
the <code>-d</code> dump option to see this output.
</p>
<p>
Download information and some more examples, including a
multi-threaded one, are on the project website.
</p>
]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Mandelbrot Set on a Cluster</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/10/01/"/>
    <id>urn:uuid:fafbfeda-69a0-3d89-abc5-3fee63c83c9f</id>
    <updated>2007-10-01T00:00:00Z</updated>
    <category term="c"/>
    <content type="html">
      <![CDATA[<p>The project page with source code and documentation,</p>

<ul>
  <li><a href="https://github.com/skeeto/mandelbrot">https://github.com/skeeto/mandelbrot</a></li>
</ul>

<p>This is the second part of my post about clusters. Finally, here are
some more pretty <strong>pictures and video</strong> to look at! It is an extremely
parallel Mandelbrot set generator written in C.</p>

<p>This image was generated by my program on a cluster at my university,
and it is my favorite image generated so far.</p>

<p><a href="/img/fractal/mandelexp-huge.png"><img src="/video/fractal/mandel.jpg" alt="" /></a></p>

<p>The reason I built my own cluster was to run this program. The
generator forks off an arbitrary number of jobs (defined by a config
file) to generate a single fractal, or a fractal zoom sequence. The
cluster then automatically moves these jobs around to different nodes,
making the fractal generation fast.</p>

<p>I wrote it with two goals in mind. I wanted it to be parallel so that
it could easily take advantage of a cluster. I also wanted it to not
use any external libraries. This is because a cluster is often a
shared resource. Programs and libraries may only be available if
installed by an administrator, meaning that extra libraries like
<a href="http://www.libpng.org/">libpng</a> may not be available. For inter-process communication
the generator uses simple pipes. So all you need here is a POSIX
interface to the operating system, rather than some MPI
implementation.</p>
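
<p>To give a feel for the fork-and-pipe approach, here is a heavily
simplified sketch. It is not the generator's actual code. The tiny image
size, the one-job-per-row split, and the names <code>mandel</code> and
<code>render</code> are all assumptions made for the example, and error
handling is minimal.</p>

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define WIDTH 16      /* tiny image, for illustration only */
#define ROWS 8        /* one job per row */
#define MAX_ITER 255

/* Escape-time iteration count for the point cr + ci*i. */
static unsigned char mandel(double cr, double ci)
{
    double zr = 0, zi = 0;
    int i = 0;
    while (i < MAX_ITER && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2 * zr * zi + ci;
        zr = t;
        i++;
    }
    return (unsigned char)i;
}

/* Fork one job per row. Each child computes its row and sends it back
 * through its own pipe. Returns 0 on success, -1 on failure. */
static int render(unsigned char img[ROWS][WIDTH])
{
    int fds[ROWS];
    pid_t pids[ROWS];

    for (int y = 0; y < ROWS; y++) {
        int p[2];
        if (pipe(p) == -1)
            return -1;
        pids[y] = fork();
        if (pids[y] == -1)
            return -1;
        if (pids[y] == 0) {             /* child: compute and send one row */
            unsigned char row[WIDTH];
            close(p[0]);
            for (int x = 0; x < WIDTH; x++)
                row[x] = mandel(-2.0 + 3.0 * x / WIDTH,
                                -1.0 + 2.0 * y / ROWS);
            ssize_t w = write(p[1], row, sizeof(row));
            _exit(w == (ssize_t)sizeof(row) ? 0 : 1);
        }
        close(p[1]);                    /* parent keeps only the read end */
        fds[y] = p[0];
    }

    int ok = 0;
    for (int y = 0; y < ROWS; y++) {    /* collect the rows in order */
        size_t got = 0;
        while (got < WIDTH) {
            ssize_t r = read(fds[y], img[y] + got, WIDTH - got);
            if (r <= 0)
                break;
            got += (size_t)r;
        }
        close(fds[y]);
        int status;
        if (waitpid(pids[y], &status, 0) == -1 || got != WIDTH)
            ok = -1;
    }
    return ok;
}
```

<p>On a cluster that migrates processes, the children are what would get
moved to other nodes, while the parent only waits on the pipes.</p>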

<p>I used <a href="http://ultra-premium.com/b">Andy Owen’s</a> handy <a href="http://www.ultra-premium.com/b/applications/image-source-0.1.tar.gz">bitmap library</a> for writing out
the bitmaps. I don’t know how I could have done without it!</p>

<p>The only thing you need in order to run the fractal generator is a C
compiler and a POSIX interface (GNU/Linux, *BSD, and other Unix-like
systems). Extra capabilities are available if gzip is installed.</p>

<p>I use <a href="http://matek.hu/xaos/doku.php">GNU Xaos</a> to find good locations in the set for zoom
sequences. It lets you zoom in real-time. Once I find a good spot, I
tell my generator to render some nice images as a zoom sequence to it.
Someday I hope to add an algorithm for auto-zooming. This way
the generator could create zoom sequences automatically for hours on
end. Xaos already has this capability for its real-time zooming.</p>

<p>An interesting thing I discovered: for these fractals at least, the
gzipped bitmaps are (barely) smaller than the equivalent PNG versions.
For the image above, the PNG version (produced by ImageMagick
defaults) is 11586185 bytes. The gzipped bitmap version is 11586074
bytes. Plus, gzipping is faster. On my laptop, BMP to PNG
(ImageMagick’s convert) took 13.678s. Gzipping (default options) took
5.167s. I am as surprised as you are.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>PNG Archiver - Share Files Within Images</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/09/08/"/>
    <id>urn:uuid:1e1c84f7-c43b-3cef-80a2-d78b48a2e51b</id>
    <updated>2007-09-08T00:00:00Z</updated>
    <category term="c"/><category term="trick"/>
    <content type="html">
      <![CDATA[<p>This is one of my projects.</p>

<ul>
  <li><a href="https://github.com/skeeto/pngarch">PNG Archiver</a></li>
</ul>

<p>The original idea for this project came
from Sean Howard’s <a href="http://www.squidi.net/three/entry.php?id=12">Gameplay Mechanic’s #012</a>. The basic
idea here is that image files are the second easiest type of data to
share on the Internet (the first being text). Sharing anything other
than images may be difficult, so why not store files within an image
as the image data? This is not <a href="http://en.wikipedia.org/wiki/Steganography">steganography</a> as the data is
not being hidden. In fact, the data is quite obvious because we are
trying to make the data as compact as possible in the image.</p>

<p>My “PNG Archiver” is usable but should still be considered alpha
quality software. I am adding support for different types of PNGs
(currently it does 8-bit RGB only), but I have found that using the
libpng library gives me headaches. The archiver can actually only
store a single file (just as gzip doesn’t know what a file is). This
is because I do not want to duplicate the work of real file archivers
like <code class="language-plaintext highlighter-rouge">tar</code>. To store multiple files, make a “png-tarball”.</p>

<p>The PNG Archiver stores a checksum in the image that allows it to
verify that the data was received correctly. This also allows it to
automatically scan the image for data. When it reads in a piece that
fulfills the checksum it assumes that it found the data you are
looking for. You can decorate the image with text or a border and the
archiver should still find the data as long as you didn’t disturb it.
(There are examples of this on the project page.)</p>
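
<p>The scanning idea can be sketched as follows. Everything about the
layout here is hypothetical: a 4-byte big-endian length, then a 4-byte
checksum, then the payload. Adler-32 stands in for whatever checksum the
archiver really uses, and <code>find_payload</code> is a name invented
for this example.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Adler-32, used here only as a stand-in checksum. */
static uint32_t adler32(const unsigned char *p, size_t n)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < n; i++) {
        a = (a + p[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

static uint32_t read_be32(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

/* Hypothetical layout: [4-byte length][4-byte checksum][payload].
 * Try every offset in the pixel data and return the payload offset of
 * the first candidate whose checksum matches, or -1 if none does. */
static long find_payload(const unsigned char *buf, size_t n, size_t *len)
{
    for (size_t off = 0; off + 8 <= n; off++) {
        uint32_t size = read_be32(buf + off);
        if (size == 0 || size > n - off - 8)
            continue;               /* cannot fit: not a real header */
        if (adler32(buf + off + 8, size) == read_be32(buf + off + 4)) {
            *len = size;
            return (long)(off + 8);
        }
    }
    return -1;
}
```

<p>Border or text pixels in front of the data just fail the checksum
test and get skipped. A short payload could in principle match by
accident, which a stronger checksum would make less likely.</p>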

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  

</feed>
