Articles tagged cpp at null program

I have officially retired from Emacs

2026-04-26T00:00:00Z

This article was discussed on reddit and on Hacker News.

This past Tuesday I typed C-x C-c in Emacs for the last time after 20 years of daily use. Though nearly half that time was spent gradually retiring it, switching to modal editing, then to Vim. Emacs is a platform, and I’d grown accustomed to its applications, especially those I built myself. There was no particular hurry, so replacements came slowly. With my newly-acquired superpowers I could knock out the last two pieces in a few days’ work, namely M-x calc with stackcalc and Elfeed with Elfeed2. I’m especially excited about the latter because it already exceeds the original. Both are multi-platform, native C++ GUI applications using native UI components.

These actively-in-use packages require new maintainers (apply on the project’s issues/discussion):

No wonder it took so long for me to move on! I’m not handing these off to just anyone, and you’ll need to establish your reputation. Having already made contributions is a good sign, even if never merged. I’m willing to transfer them off my namespace, though you’ll need to manage the Melpa hand-off (on which I’ll sign-off). If there are no takers, these projects will be archived but not deleted.

Trying out wxWidgets

The Emacs Calculator is amazing and the best calculator I’ve ever used, which is why nothing I could find was going to replace it. My clone uses GMP and MPFR for multi-precision, so it’s far faster, as to be expected, but it’s not nearly at feature parity. It’s missing esoteric features including symbolic processing. Though it’s enough to cover all of my own usage. I can add more features later. The Emacs Calculator manual served as a specification when building stackcalc.

Elfeed has been a cornerstone of my daily routines for the past 13 years. Nothing else I’ve found scratches that itch for me, so I’ve always known it would require a rewrite someday. Knowing it would take a few weeks of work, and that I already had the feed reader I wanted, made motivation difficult to find. Though now that I can accomplish ~3 weeks of old-way work in a new-way day, this sort of project becomes that much easier to start and finish. Though it’s not yet at a 1.0 release, after a couple days Elfeed2 was working well enough to replace the original Elfeed.

While Dear ImGui was the right choice for dcmake, it would not be so for these two applications. Active rendering doesn’t suit a feed reader left running all day, and I needed a richer toolkit. Professionally I work in Qt, but I wanted something lighter-weight for my projects, accessible via CMake FetchContent. That naturally led to wxWidgets. While it has issues — mitigatable character encoding problems, accidental quadratic time in many places — it’s worked better than I anticipated, letting me rapidly produce native-looking applications on Windows, macOS, and Linux.

Unlike Dear ImGui, wxWidgets is a platform, including sane I/O and path handling. I mostly don’t need platform layers when building applications like these. I can simply rely on wxWidgets’ utilities.

Both of these projects build out-of-the-box on w64devkit thanks to the dependencies being FetchContent-compatible. On all platforms you just need a C++ toolchain and CMake:

$ cmake -B build
$ cmake --build build

Now that I have experience with wxWidgets, learning its limitations and capabilities, it’s likely to be a foundation of most of my GUI projects to come, except where something like Dear ImGui is a better fit.

My brave new code-signing world

2026-04-25T18:12:29Z

The new w64devkit release two weeks ago is the first to be code-signed with my identity, verified by Microsoft’s certificate chain. Currently only the release packaging is signed — the self-extracting archive and its payload — but I will soon code-sign individual EXEs and DLLs within the distribution. In fact, all Windows builds of my project releases have been code-signed the past two weeks, including dcmake, and so should everything going forward. My signing identity builds reputation with each download, so users will have an easier time with SmartScreen, and security software generally. Azure Artifact Signing creates the actual signature, but the rest is done with new infrastructure I built myself, aas-sign. As is often the case, the existing options were deficient for my needs, so I had to build it myself.

This code-signing is not free, and simply having aas-sign on hand, or using the GitHub Actions action, is insufficient. You must be serious enough to spend US$10/month for the Azure subscription. After that you are subjected to the labyrinth that is the Azure portal, the most confusing UI I’ve ever used. Luckily we live in an age of wonders, and I could describe to Claude in Chrome what I wanted and it would happen (Sonnet works better than Opus for this). It took as much time to figure out Azure as I spent creating a fully-functional, native debugger front-end. Clear your schedule if you’re going to try it yourself. If it weren’t for AI assistance I would have given up.

The one-time setup process is only open to North America, and involves sharing identity documents (i.e. driver’s license) with Microsoft. Unlike the rest of Azure, that part was streamlined and fairly painless. Between the cost and this requirement, this is a niche space.

However, if this is your niche, aas-sign is currently the best software available. It’s the tool Microsoft should have written, but didn’t due to ongoing institutional failures. The alternatives are a pair of tools: Azure CLI (Python) combined with either Jsign (Java) or SignTool.exe (Windows only). All impose artificial runtime constraints hostile to build pipeline composability. Poor engineering. In contrast, aas-sign is a native, multi-platform, single-file application.

If you know this space, osslsigncode probably comes to mind, but it produces signatures itself. It doesn’t interface with Azure and so has no role here aside from semi-reliable validation. The most popular use case is code-signing with self-signed certificates, but that actually makes everything worse.

There are two modes for aas-sign: Laptop and Action. Laptop mode is the most compelling, so we’ll start with that, but Action mode is the most useful in practice.

Laptop/desktop mode

Suppose you built an EXE or DLL, and would like to code-sign and publish it. Typically that looks like this:

$ aas-sign sign myapp.exe myapp.dll

It computes an Authenticode for each (concurrently), sends it off to Azure, gets back a signature, then a countersignature, and embeds the signatures in the images. If you have multiple signing identities then you might use --as (“sign as”):

$ aas-sign sign --as eus:contoso:jdoe myapp.exe myapp.dll

The colon-delimited triple is my own invention to combine region (East US), tenant (Contoso), and profile (J. Doe) into one string. The first time you use it, and every ~90 days thereafter, you’ll need to authenticate with Azure first:

$ aas-sign login

This will open a browser (just like az login) to log in, from which it will obtain a token that can be used to obtain signing tokens. (Yes, a token to get tokens; I’m concealing as much complexity as possible.) You might also want to establish a default identity, as typically you’d only have one:

$ aas-sign config eus:contoso:jdoe

Or all at once:

$ aas-sign login eus:contoso:jdoe

My goal was, after enduring the Azure portal sign-up, to maximally streamline code-signing.
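The colon-delimited identity triple used with --as and config is simple enough to sketch in a few lines of C++. This is only an illustration of the format described above, not code from aas-sign: the Identity struct, its field names, and parse_identity are hypothetical, and it assumes all three fields must be present and non-empty.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical type: the field names are assumptions for illustration.
struct Identity {
    std::string region;
    std::string tenant;
    std::string profile;
};

// Split "region:tenant:profile" into exactly three non-empty fields.
// Returns std::nullopt for any other shape.
std::optional<Identity> parse_identity(std::string_view s)
{
    std::size_t a = s.find(':');
    if (a == std::string_view::npos) return std::nullopt;
    std::size_t b = s.find(':', a + 1);
    if (b == std::string_view::npos) return std::nullopt;
    if (s.find(':', b + 1) != std::string_view::npos) return std::nullopt;

    std::string_view region  = s.substr(0, a);
    std::string_view tenant  = s.substr(a + 1, b - a - 1);
    std::string_view profile = s.substr(b + 1);
    if (region.empty() || tenant.empty() || profile.empty()) {
        return std::nullopt;
    }
    return Identity{
        std::string(region), std::string(tenant), std::string(profile)
    };
}
```

Given "eus:contoso:jdoe" this yields region "eus", tenant "contoso", and profile "jdoe". Strings with too few or too many colons are rejected.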

Action mode

Manually building, signing, and publishing releases is easy and might be fine if you’re not releasing too frequently (or so infrequently that you forget how to do it), but likely you’d want to automate this process. I was stubborn about it myself, until Peter0x44 pushed me hard enough to take it seriously, for which I’m grateful. There’s an official GitHub Action to code-sign with Azure, but it requires a Windows runner, fatally limiting for my own needs. So aas-sign also defines a code-signing action. The previous example would have this in its own action:

  - name: Sign
    uses: skeeto/aas-sign@v1.0.0
    with:
      endpoint:  ${{ secrets.TRUSTED_SIGNING_ENDPOINT }}
      account:   ${{ secrets.TRUSTED_SIGNING_ACCOUNT }}
      profile:   ${{ secrets.CERTIFICATE_PROFILE }}
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      files: |
        myapp.exe
        myapp.dll

The secrets are a bunch of strings you (or your AI agent) retrieve from the Azure portal. You also need to create a Federated Identity Credential (FIC) for each repository, which I suggest triggering on an environment. (This all may sound like a joke but it’s real.) Again, just ask an AI to do all this stuff. The mandatory Azure interfacing limits how much I can streamline this process. Then aas-sign combines these with per-job tokens GitHub injects into the runner to authenticate (via the FIC) and sign.

I’ve gone through this a number of times, and the AI breezes through the GitHub UI, but struggles through the Azure portal — objective evidence of how awful it is. Idea for a UI benchmark: How many AI tokens does it take to accomplish typical activities?

For w64devkit, my plan is to run aas-sign inside the Docker build and sign executables in the container before it’s SFX-packaged. This is impossible with SignTool.exe and needlessly frictional with Jsign (requires at least a JRE if not a JDK). The easiest path forward was to literally build my own tool from scratch.

I’m considering aas-sign as a new w64devkit command, but it’s so niche that I’m likely to be its sole user. On the other hand, those already running w64devkit in GitHub Actions could use it in Action mode to code-sign their builds without any additional tools.

dcmake: a new CMake debugger UI

2026-04-07T03:04:02Z

CMake has had a --debugger mode since 3.27 (July 2023), allowing software to manipulate it interactively through the Debug Adapter Protocol (DAP), an HTTP-like protocol passing JSON messages. Debugger front-ends can start, stop, step, breakpoint, query variables, etc. a live CMake. When I came across this mode, I immediately conceived a project putting it to use. Thanks to recent leaps in software engineering productivity, I had a working prototype in 30 minutes, and by the end of that same day, a complete, multi-platform, native, GUI application. I named it dcmake (“debugger for CMake”). I’ve tested it on macOS, Windows, and Linux. Despite only being a couple days old, it’s one of the coolest things I’ve ever built. Prior to 2026, I estimate it would have taken me a month to get the tool to this point.
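To make “HTTP-like” concrete: DAP's base protocol prefixes each JSON message with a Content-Length header and a blank line, using CRLF line endings. Below is a minimal sketch of that framing in C++. It is not dcmake's code; dap_frame and dap_unframe are hypothetical names, and the reader is simplified in that it treats a malformed header the same as an incomplete message.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Wrap a JSON body in the DAP base-protocol header.
std::string dap_frame(std::string_view json)
{
    std::string out = "Content-Length: ";
    out += std::to_string(json.size());
    out += "\r\n\r\n";
    out += json;
    return out;
}

// Remove and return one complete message body from the front of buf.
// Returns std::nullopt if buf does not yet hold a complete message
// (or, in this simplified sketch, if the header is malformed).
std::optional<std::string> dap_unframe(std::string &buf)
{
    constexpr std::string_view key = "Content-Length: ";
    std::size_t hdr_end = buf.find("\r\n\r\n");
    if (hdr_end == std::string::npos) return std::nullopt;

    std::string_view hdr(buf.data(), hdr_end);
    std::size_t k = hdr.find(key);
    if (k == std::string_view::npos) return std::nullopt;

    std::size_t len = 0;
    const char *first = hdr.data() + k + key.size();
    const char *last  = hdr.data() + hdr.size();
    std::from_chars_result r = std::from_chars(first, last, len);
    if (r.ec != std::errc{}) return std::nullopt;

    std::size_t body = hdr_end + 4;  // skip the "\r\n\r\n" separator
    if (buf.size() - body < len) return std::nullopt;  // body incomplete

    std::string msg = buf.substr(body, len);
    buf.erase(0, body + len);
    return msg;
}
```

A front-end would append bytes read from the debugger's pipe or socket to buf, then call dap_unframe in a loop until it returns nothing, handing each body to a JSON parser.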

It has a Dear ImGui interface, which I’ve experienced as a user but never built on myself before. Specifically the docking branch. In a sense it’s a toolkit for building debuggers, so it’s playing an enormous role in how quickly I put this project together. All of the “windows” tear out and may be free-floating or docked wherever you like, closely matching the classic Visual Studio UI. I borrowed all the same keybindings: F10 to step over, F11 to step in, F5 to start/continue, shift+F5 to stop. Click on line numbers to toggle breakpoints, right click to run-to-line, hover over variables with the mouse to see their values. Nearly every UI state persists across sessions, and it opens nearly instantly.

This is just one of many situations I’ve used AI the past month for UI development, and it’s been shockingly effective. I can describe roughly the interface I want, and the AI makes it happen in a matter of minutes. It understands what I mean, filling in the details, sometimes anticipating what I’ll ask for next. If I’m unsure how I want a UI to work, it also offers good advice. If I need simple icons and such, it can draw those, too. It’s all incredibly empowering.

On macOS and Linux it runs on top of GLFW with OpenGL 3 rendering, and on Windows it uses native Win32 windowing and DirectX 11 rendering.

Program arguments given to dcmake populate the top-left arguments text input, which go straight into CMake on start. So you can prepend d to your CMake configuration command to run it inside the debugger. Passing no arguments sets it up for “standard” -B build configuration.

In general, if you don’t have anywhere in particular to look, likely the first thing to do after starting dcmake (in a project) is press F10. It starts CMake paused on the first line of CMakeLists.txt, or whatever script you’re debugging. If you’re trying out dcmake for the first time, that’s a good place to start. Keep pressing F10 to step through that script, watching it run through its configuration. If you F11 through the script then you’ll dive deeper and deeper into CMake itself, which can be insightful.

There is no point in trying to debug --build invocations. It’s just a uniform interface to the underlying build tool, and there is no CMake left to debug at that point. However, it does work with -P script mode invocations. CMake can operate as a platform-agnostic shell script-like tool, but unlike shell scripts you can step through them with a debugger like dcmake.

On Windows it supports Unicode paths all the way through, without a UTF-8 manifest. This took some special care, in particular avoiding any C++ standard library I/O functionality. Current frontier AI cannot handle this detail on their own. The macOS platform required a bit of Objective-C, as it often does, and I’m happy I didn’t have to figure that part out myself.

The next release of w64devkit will include dcmake, complementing its recent addition of CMake. This new tool has already proven useful in its own development.

2026 has been the most pivotal year in my career… and it's only March

2026-03-29T21:38:22Z

In February I left my employer after nearly two decades of service. In the moment I was optimistic, yet unsure I made the right choice. Dust settled, I’m now absolutely sure I chose correctly. I’m happier and better for it. There were multiple factors, but it’s not mere chance it coincides with these early months of the automation of software engineering. I left an employer that is years behind adopting AI for one actively supporting and encouraging it. As of March, in my professional capacity I no longer write code myself. My current situation was unimaginable to me only a year ago. Like it or not, this is the future of software engineering. Turns out I like it, and having tasted the future I don’t want to go back to the old ways.

In case you’re worried, this is still me. These are my own words. Writing is thinking, and it would defeat the purpose for an AI to write in my place on my personal blog. That’s not going to change.

I still spend much time reading and understanding code, and using most of the same development tools. It’s more like being a manager, orchestrating a nebulous team of inhumanly-fast, nameless assistants. Instead of dicing the vegetables, I conjure a helper to do it while I continue to run the kitchen. I haven’t managed people in some 20 years now, but I can feel those old muscles being put to use again as I improve at this new role. Will these kitchens still need human chefs like me by the end of the decade? Unclear, and it’s something we all need to prepare for.

My situation gave me an experience onboarding with AI assistance, a fast process given a near-instant, infinitely-patient helper answering any question about the code. By the second week I was making substantial, wide contributions to the large C++ code base. It’s difficult to attach a quantifiable factor like 2x, 5x, 10x, etc. faster, but I can say for certain this wouldn’t have been possible without AI. The bottlenecks have shifted from producing code, which now takes relatively no time at all, to other points, and we’re all still trying to figure it out.

My personal programming has transformed as well. Everything I said about AI in late 2024 is, as I predicted, utterly obsolete. There’s a huge, growing gap between open weight models and the frontier. Models you can run yourself are toys. In general, almost any AI product or service worth your attention costs money. The free stuff is, at minimum, months behind. Most people only use limited, free services, so there’s a broad unawareness of just how far AI has advanced. AI is now highly skilled at programming, and better than me at almost every programming task, with inhumanly-low defect rates. The remaining issues are mainly steering problems: If AI code doesn’t do what I need, likely the AI writing it didn’t understand what I needed.

I’ll still write code myself from time to time for fun — minimalist, with my style and techniques — the same way I play shogi on the weekends for fun. However, artisan production is uneconomical in the presence of industrialization. AI makes programming so cheap that only the rich will write code by hand.

A small part of me is sad at what is lost. A bigger part is excited about the possibilities of the future. I’ve always had more ideas than time or energy to pursue them. With AI at my command, the problem changes shape. I can comfortably take on complexity from which I previously shied away, and I can take a shot at any idea sufficiently formed in my mind to prompt an AI — a whole skill of its own that I’m actively developing.

For instance, a couple weeks ago I put AI to work on a problem, and it produced a working solution for me after ~12 hours of continuous, autonomous work, literally while I slept. The past month w64devkit has burst with activity, almost entirely AI-driven. Some of it architectural changes I’ve wanted for years, but would require hours of tedious work, and so I never got around to it. AI knocked it out in minutes, with the new architecture opening new opportunities. It’s also taken on most of the cognitive load of maintenance.

Quilt.cpp

So far my biggest successful undertaking is Quilt.cpp, a C++ clone of Quilt, an early, actively-used source control system for patch management. Git is a glaring omission from the almost complete w64devkit, due to platform and build issues. I’ve thought Quilt could fill some of that source control hole, except the original is written in Bash, Perl, and GNU Coreutils, even more of a challenge than Git. Since Quilt is conceptually simple, and I could lean on busybox-w32 diff and patch, I’ve considered writing my own implementation, just as I did pkg-config, but I never found the energy to do it.

Then I got good enough with AI to knock out a near feature-complete clone in about four days, including a built-in diff and patch so it doesn’t actually depend on external tools (except invoking $EDITOR). On Windows it’s a ~1.6MB standalone EXE, to be included in future w64devkit releases. The source is distributed as an amalgamation, a single file quilt.cpp per its namesake:

$ c++ -std=c++20 -O2 -s -o quilt.exe quilt.cpp
$ ./quilt.exe --help
Usage: quilt [--quiltrc file] <command> [options] [args]

Commands:
  new        Create a new empty patch
  add        Add files to the topmost patch
  push       Apply patches to the source tree
  pop        Remove applied patches from the stack
  refresh    Regenerate a patch from working tree changes
  diff       Show the diff of the topmost or a specified patch
  series     List all patches in the series
  applied    List applied patches
  unapplied  List patches not yet applied
  top        Show the topmost applied patch
  next       Show the next patch after the top or a given patch
  previous   Show the patch before the top or a given patch
  delete     Remove a patch from the series
  rename     Rename a patch
  import     Import an external patch into the series
  header     Print or modify a patch header
  files      List files modified by a patch
  patches    List patches that modify a given file
  edit       Add files to the topmost patch and open an editor
  revert     Discard working tree changes to files in a patch
  remove     Remove files from the topmost patch
  fold       Fold a diff from stdin into the topmost patch
  fork       Create a copy of the topmost patch under a new name
  annotate   Show which patch modified each line of a file
  graph      Print a dot dependency graph of applied patches
  mail       Generate an mbox file from a range of patches
  grep       Search source files (not implemented)
  setup      Set up a source tree from a series file (not implemented)
  shell      Open a subshell (not implemented)
  snapshot   Save a snapshot of the working tree for later diff
  upgrade    Upgrade quilt metadata to the current format
  init       Initialize quilt metadata in the current directory

Use "quilt <command> --help" for details on a specific command.

It supports Windows and POSIX, and runs ~5x faster than the original. AI developed it on Windows, Linux, and macOS: It’s best when the AI can close the debug loop and tackle problems autonomously without involving a human slowpoke. The handful of “not implemented” parts aren’t because they’re too hard — each would probably take an AI ~10 minutes — but deliberate decisions of taste.

There’s an irony that the reason I could produce Quilt.cpp with such ease is also a reason I don’t really need it anymore.

I changed the output of quilt mail to be more Git-compatible. The mbox produced by Quilt.cpp can be imported into Git with a plain git am:

$ quilt mail --mbox feature-branch.mbox
$ git am feature-branch.mbox

The idea being that I could work on a machine without Git (e.g. Windows XP), and copy/mail the mbox to another machine where Git can absorb it as though it were in Git the whole time. git format-patch to quilt import sends commits in the opposite direction, useful for manually testing Quilt.cpp on real change sets.

To be clear, I could not have done this if the original Quilt did not exist as a working program. I began with an AI generating a conformance suite based on the original, its documentation, and other online documentation, validating that suite against the original implementation (see -DQUILT_TEST_EXECUTABLE). Then I had another AI code to the tests, on architectural guidance from me, with -D_GLIBCXX_DEBUG and sanitizers as guardrails. That was day one. The next three days were lots of refining and iteration as I discovered the gaps in the test suite. I’d prompt AI to compare Quilt.cpp to the original Quilt man page, add tests for missing features, validate the new tests against the original Quilt, then run several agents to fix the tests. While they worked I’d try the latest build and note any bugs. As of this writing, the result is about equal parts test and non-test, ~9KLoC each.

I’m likely to use this technique to clone other tools with implementations unsuitable for my purposes. I learned quite a bit from this first attempt.

Why C++ instead of my usual choice of C? As we know, conventional C is highly error-prone. Even AI has trouble with it. In the ~9k lines of C++ that is Quilt.cpp, I am only aware of three memory safety errors by the AI. Two were null-terminated string issues with strtol, where the AI was essentially writing C instead of C++, after which I directed the AI to use std::from_chars and drop as much direct libc use as possible. (The other was an unlikely branch with std::vector::back on an empty vector.) We can rescue C with better techniques like arena allocation, counted strings, and slices, but while (current) state of the art AI understands these things, it cannot work effectively with them in C. I’ve tried. So I picked C++, and from my professional work I know AI is better at C++ than me.
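To illustrate the strtol hazard: strtol reads until it finds a non-digit or a null terminator, so pointing it at a substring of a larger buffer can read past the intended end. std::from_chars takes an explicit end pointer instead. The following is a minimal sketch of that style, not code taken from Quilt.cpp; parse_long is a hypothetical helper, and requiring the entire input to be consumed is an assumption made for this example.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Parse an entire string_view as a base-10 long. The view need not be
// null-terminated: from_chars never reads at or beyond `last`.
std::optional<long> parse_long(std::string_view s)
{
    long value = 0;
    const char *first = s.data();
    const char *last  = s.data() + s.size();
    std::from_chars_result r = std::from_chars(first, last, value);
    if (r.ec != std::errc{} || r.ptr != last) {
        return std::nullopt;  // empty, non-numeric, trailing junk, or overflow
    }
    return value;
}
```

With a line such as "42,17", parse_long(line.substr(0, 2)) sees only the two characters "42", even though a comma and more digits follow in memory. Note that from_chars, unlike strtol, does not skip leading whitespace or accept a leading plus sign.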

Also like a manager, I have not read most of the code, and instead focused on results, so you might say this was “vibe-coded.” It is thoroughly tested, though I’m sure there are still bugs to be ironed out, especially on the more esoteric features I haven’t tried by hand yet.

Let’s discuss tools

After opposing CMake for years, you may have noticed the latest w64devkit now includes CMake and Ninja. What happened? Preparing for my anticipated employment change, this past December I read Professional CMake. I realized that my practical problems with CMake were that nearly everyone uses it incorrectly. Most CMake builds are a disaster, but my new-found knowledge allows me to navigate the common mistakes. Only high profile open source projects manage to put together proper CMake builds. Otherwise the internet is loaded with CMake misinformation. Similar to AI, if you’re not paying for CMake knowledge then it’s likely wrong or misleading. So I highly recommend that book!

Frontier AI is very good with CMake. When a project has a CMake build that isn’t too badly broken, just tell AI to fix it, without any specifics, and build problems disappear in mere minutes without having to think about it. It’s awesome. Combine it with the previous discussion about tests making AI so much more effective, and that it also knows CTest well, and you’ve got a killer formula. I’m more effective with CTest myself merely from observing how AI uses it. AI (currently) cannot use debuggers, so putting powerful, familiar testing tools in its hands helps a lot, versus the usual bespoke, debugger-friendly solutions I prefer.

Similar to solving CMake problems: Have a hairy merge conflict? Just ask AI to resolve it. It’s like magic. I no longer fear merge conflicts.

So part of my motivation for adding CMake to w64devkit was anticipation of projects like Quilt.cpp, where they’d be available to AI, or at least so I could use the tools the AI used to build/test myself. It’s already paid for itself, and there’s more to come.

For agent software, on personal projects I’m using Claude Code. It’s a great value, cheaper than paying API rates but requires working around 5-hour limit windows. I started with Pro (US$20/mo), but I’m getting so much out of it that as of this writing I’m on 5x Max (US$100/mo) simply to have enough to explore all my ideas. Be warned: Anthropic software is quite buggy, more so than industry average, and it’s obvious that they never even start, let alone test, some of their released software on disfavored platforms (Windows, Android). Don’t expect to use Claude Code effectively for native Windows platform development, which sadly includes w64devkit. Hopefully that’s fixed someday. I suspect Anthropic hit a bottleneck on QA, and unable to fit AI in that role they don’t bother. You can theoretically report bugs on GitHub, but they’re just ignored and closed. (Why don’t they have AI agents jumping on this wealth of bug reports?)

At work I’m using Cursor where I get a choice of models. My favorite for March has been GPT-5.4, which in my experience beats Opus 4.6 on Claude Code by a small margin. It’s immediately obvious that Cursor is better agent software than Claude Code. It’s more robust, more featureful, and with a clearer UI than Claude Code. It has no trouble on Windows and can drive w64devkit flawlessly. It’s also more expensive than Claude Code. My employer currently spends ~US$250/mo on my AI tokens, dirt cheap considering what they’re getting out of it. I have bottlenecks elsewhere that keep me from spending even more.

As a general rule, for software engineering always use the smartest model available. The cheaper, dumber models cost more in the long run. It takes more tokens to achieve worse results, which costs more human time to sort out.

Neither Cursor nor Claude Code is open source, so what are the purists to do, even if they’re willing to pay API rates for tokens? Sadly I have no answers for you. I haven’t gotten any open source agent software actually working, and it seems they may lack the necessary secret sauce.

Update: Several folks suggested I give OpenCode another shot, and this time I got over the configuration hurdle. Single executable, slick interface, and unlike Claude Code, I observed no bugs in my brief trial. Give that a shot if you’re looking for an open source client.

The future is going to be weird. My experience is only a peek at what’s to come, and my head is still spinning. However, the more I adapt to the changes, the better I feel. If you’re feeling anxious like I was, don’t flinch from improving your own AI knowledge and experience.

Speculations on arenas and non-trivial destructors

2025-10-16T20:11:22Z

As I continue to reflect on arenas and lifetimes in C++, I realized that dealing with destructors is not so onerous. In fact, it does not even impact my established arena usage! That is, implicit RAII-style deallocation at scope termination, which works even in plain old C. With a small change we can safely place resource-managing objects in arenas, such as those owning file handles, sockets, threads, etc. (Though the ideal remains resource management avoidance when possible.) We can also place traditional, memory-managing C++ objects in arenas, too. Their own allocations won’t come from the arena — either because they lack the interfaces to do so, or they’re simply ineffective at it (pmr) — but they will reliably clean up after themselves. It’s all exception-safe, too. In this article I’ll update my arena allocator with this new feature. The change requires one additional arena pointer member, a bit of overhead for objects with non-trivial destructors, and no impact for other objects.

I continue to title this “speculations” because, unlike arenas in C, I have not (yet?) put these C++ techniques into practice in real software. I haven’t refined them through use. Even ignoring its standard library as I do here, C++ is an enormously complex programming language — far more so than C — and I’m less confident that I’m not breaking a rule by accident. I only want to break rules with intention!

As a reminder here’s where we left things off:

struct Arena {
    char *beg;
    char *end;
};

template<typename T>
T *raw_alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    if (count >= (a->end - a->beg - pad)/size) {
        throw std::bad_alloc{};  // OOM policy
    }
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return new(r) T[count]{};
}

I used throw when out of memory mainly to emphasize that this works, but you’re free to pick whatever is appropriate for your program. Remember, that’s the entire allocator, including implicit deallocation, sufficient to fulfill the allocation needs for most programs, though they must be designed for it. Also note that it’s now raw_alloc, as we’ll be writing a new, enhanced alloc that builds upon this one.
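As a quick, self-contained illustration of raw_alloc and of implicit deallocation, here is a sketch that repeats the allocator and adds one caller. The sum_squares function is hypothetical, invented for this example. It receives its arena by value, so its allocations disappear from the caller's point of view when it returns.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <new>

struct Arena {
    char *beg;
    char *end;
};

template<typename T>
T *raw_alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    if (count >= (a->end - a->beg - pad)/size) {
        throw std::bad_alloc{};  // OOM policy
    }
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return new(r) T[count]{};
}

// Hypothetical example function. The arena is passed by value, so
// bumping scratch.beg here does not affect the caller's arena.
ptrdiff_t sum_squares(ptrdiff_t n, Arena scratch)
{
    int *tmp = raw_alloc<int>(&scratch, n);  // zero-initialized
    ptrdiff_t sum = 0;
    for (ptrdiff_t i = 0; i < n; i++) {
        tmp[i] = (int)(i*i);
        sum += tmp[i];
    }
    return sum;  // tmp is implicitly freed: the caller's beg never moved
}
```

A caller might place an arena over a stack or static buffer, e.g. Arena a{mem, mem + sizeof(mem)}, allocate a few long-lived objects with raw_alloc(&a, ...), then call sum_squares(5, a). After the call, a.beg is exactly where it was before.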

Also a reminder on usage, I’ll draw on an old example, updated for C++:

wchar_t   *towidechar(Str, Arena *);           // convert to UTF-16
Str        slurpfile(wchar_t *path, Arena *);  // read an entire file
Slice<Str> split(Str, char, Arena *);          // split on delimiter

Slice<Str> getlines(Str path, Arena *perm, Arena scratch)
{
    // Use scratch for path conversion, auto-free on return
    wchar_t *wpath = towidechar(path, &scratch);

    // Use perm for file contents, which are returned
    Str buf = slurpfile(wpath, perm);

    // Use perm for the slice, pointing into buf
    return split(buf, '\n', perm);
}

Changes to scratch do not persist after getlines returns, so objects allocated from that arena are automatically freed on return. So far this doesn’t rely on C++ RAII features, just simple value semantics. It works well because all the objects in question have trivial destructors. But suppose there’s a resource to manage:

struct TcpSocket {
    int socket = ::socket(AF_INET, SOCK_STREAM, 0);
    TcpSocket() = default;
    TcpSocket(TcpSocket &) = delete;
    void operator=(TcpSocket &) = delete;
    // TODO: move ctor/operator
    ~TcpSocket() { if (socket >= 0) close(socket); }
    operator int() { return socket; }
};

If we allocate a TcpSocket in an arena, including as a member of another object, the destructor will never run unless we call it manually. To deal with this we’ll need to keep track of objects requiring destruction, which we’ll do with a linked list of destructors, forming a LIFO stack:

struct Dtor {
    Dtor     *next;
    void     *objects;
    ptrdiff_t count;
    void     (*dtor)(void *objects, ptrdiff_t count);
};

Each Dtor holds a pointer to a homogeneous array, a count (typically one), and a pointer to a function that knows how to destroy these objects. The linked list itself is heterogeneous, with dynamic type. The function pointer acts as a kind of type tag. The dtor functions will be generated using a template function:

template<class T>
void destroy(void *ptr, ptrdiff_t count)
{
    T *objects = (T *)ptr;
    for (ptrdiff_t i = count-1; i >= 0; i--) {
        objects[i].~T();
    }
}

Notice it destroys end-to-beginning, the reverse of the order in which placement new[] constructs these objects. It’s essentially a placement delete[]. An arena initializes with an empty list of Dtors as a new member:

struct Arena {
    char *beg;
    char *end;
    Dtor *dtors = 0;

    // ...

};

There are two different ways to construct an arena: over a block of raw memory (unowned), or from an existing arena to borrow a scratch arena over its free space. So that’s two constructors:

struct Arena {
    // ...

    Arena(char *mem, ptrdiff_t len) : beg{mem}, end{mem+len} {}
    Arena(Arena &a) : beg{a.beg}, end{a.end} {}

    // ...
};

Finally a destructor that pops the Dtor linked list until empty, which runs the destructors in reverse order when the arena is destroyed:

struct Arena {
    // ...

    void operator=(Arena &) = delete;  // rule of three

    ~Arena()
    {
        while (dtors) {
            Dtor *dead = dtors;
            dtors = dead->next;
            dead->dtor(dead->objects, dead->count);
        }
    }
};

(Note: This should probably use a local variable instead of manipulating the dtors member directly. Updates to dtors are potentially visible to destructors, inhibiting optimization.) The new, enhanced alloc building upon raw_alloc:

template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    if (__has_trivial_destructor(T) || !count) {
        return raw_alloc<T>(a, count);
    }

    Dtor *dtor    = raw_alloc<Dtor>(a);  // allocate first
    T    *r       = raw_alloc<T>(a, count);
    dtor->next    = a->dtors;
    dtor->objects = r;
    dtor->count   = count;
    dtor->dtor    = destroy<T>;

    a->dtors = dtor;
    return r;
}

I’m using the non-standard __has_trivial_destructor built-in supported by all major C++ implementations, meaning we still don’t need the C++ standard library, but std::is_trivially_destructible is the usual tool here. LLVM is pushing __is_trivially_destructible instead, but it’s not supported by GCC until GCC 16.

Since it’s so simple to do, if the count is zero then it doesn’t care about non-trivial destruction, as there’s nothing to destroy. Things get more interesting for a non-zero number of non-trivially destructible objects. First allocate a Dtor. The order is important: if the Dtor were allocated second and that allocation failed, the already-constructed objects would leak, with no Dtor entry in place to destroy them. Then allocate the array, attach it to the Dtor, and attach the Dtor to the arena, registering the objects for cleanup.

If a constructor throws, placement new[] will automatically destroy objects that have been created so far — i.e. the real placement delete[] — before returning, so that case was already covered at the start.

With a little more cleverness we could omit the objects pointer and discover the array using pointer arithmetic off the Dtor object itself. That’s tricky (consider alignment), and generally unnecessary, so I didn’t worry about it. With arenas, allocator overhead is already well below that of conventional allocation, so slack is plentiful. Chances are we will also never need an array of non-trivially destructible objects, and so we could probably omit count, then write a single-object allocator that forwards constructor arguments (e.g. a handle to the resource to be managed). That involves no new concepts, and I leave it as an exercise for the reader.
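For reference, here is one way that exercise might go. It’s a minimal sketch under my own assumptions, not a definitive solution: make, reserve, and Handle are hypothetical names that appear nowhere else in this article, the Dtor drops count and tracks a single object, std::is_trivially_destructible_v stands in for the built-in, and the assert is a placeholder OOM policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <type_traits>
#include <utility>

struct Dtor {               // no count: one object per entry
    Dtor  *next;
    void  *object;
    void (*dtor)(void *object);
};

struct Arena {
    char *beg;
    char *end;
    Dtor *dtors = nullptr;

    Arena(char *mem, std::ptrdiff_t len) : beg{mem}, end{mem+len} {}
    Arena(Arena &a) : beg{a.beg}, end{a.end} {}
    void operator=(Arena &) = delete;

    ~Arena()
    {
        while (dtors) {
            Dtor *dead = dtors;
            dtors = dead->next;
            dead->dtor(dead->object);
        }
    }
};

// Hypothetical helper: carve out aligned storage, starting no lifetime.
inline void *reserve(Arena *a, std::ptrdiff_t size, std::ptrdiff_t align)
{
    std::ptrdiff_t pad = -(std::uintptr_t)a->beg & (align - 1);
    assert(size <= a->end - a->beg - pad);  // OOM policy (an assumption)
    void *r = a->beg + pad;
    a->beg += pad + size;
    return r;
}

// Hypothetical single-object allocator forwarding constructor arguments.
template<class T, class... Args>
T *make(Arena *a, Args &&...args)
{
    Dtor *d = nullptr;
    if (!std::is_trivially_destructible_v<T>) {
        // Allocate the Dtor first, as with the array version
        d = new(reserve(a, sizeof(Dtor), alignof(Dtor))) Dtor{};
    }
    T *r = new(reserve(a, sizeof(T), alignof(T))) T(std::forward<Args>(args)...);
    if (d) {
        d->next   = a->dtors;
        d->object = r;
        d->dtor   = [](void *p) { static_cast<T *>(p)->~T(); };
        a->dtors  = d;
    }
    return r;
}

struct Handle {             // hypothetical resource wrapper
    int  id;
    int *log;
    Handle(int id, int *log) : id{id}, log{log} {}
    ~Handle() { *log = *log*10 + id; }  // append id to the log on destruction
};

int demo_make()
{
    alignas(16) static char mem[256];
    int log = 0;
    {
        Arena a(mem, sizeof(mem));
        make<Handle>(&a, 1, &log);
        make<int>(&a, 42);          // trivial destructor: no Dtor entry
        make<Handle>(&a, 2, &log);
        assert(log == 0);           // nothing destroyed yet
    }                               // arena destroyed: Handle 2, then Handle 1
    return log;
}
```

Calling demo_make() returns 21: the two Handles are destroyed newest-first when the arena goes out of scope, and the int in between never gets a Dtor entry. The rest of this article sticks with the array-capable alloc.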

With that in place, we could now allocate an array of TcpSockets:

void example(Arena scratch)
{
    TcpSocket *sockets = alloc<TcpSocket>(&scratch, 100);
    // ...
}

These sockets will all be closed when example exits via their singular Dtor entry on scratch. When calling this example with an arena:

void caller(Arena *perm)
{
    example(*perm);  // creates a scratch arena
    // ...
}

This invokes the copy constructor, creating a scratch arena with an empty dtors list to be passed into example. Objects existing in *perm will not be destroyed by example because dtors isn’t passed in. If we had instead passed a pointer to an arena, no Arena constructor would be invoked, so the callee would use the caller’s arena, pushing its Dtors onto the caller’s list.

In other words, the interface hasn’t changed! That’s the most exciting part for me. This by-copy, by-pointer interfacing has really grown on me the past two years.
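To watch both calling conventions at work, here is a self-contained sketch assembling the pieces above. Some assumptions of mine: Tracked is a hypothetical stand-in for TcpSocket that records destruction order in a global, helper and demo are made-up names, raw_alloc asserts on OOM instead of throwing, and std::is_trivially_destructible_v replaces the built-in.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <type_traits>

struct Dtor {
    Dtor          *next;
    void          *objects;
    std::ptrdiff_t count;
    void         (*dtor)(void *objects, std::ptrdiff_t count);
};

struct Arena {
    char *beg;
    char *end;
    Dtor *dtors = nullptr;
    Arena(char *mem, std::ptrdiff_t len) : beg{mem}, end{mem+len} {}
    Arena(Arena &a) : beg{a.beg}, end{a.end} {}  // scratch: dtors stays empty
    void operator=(Arena &) = delete;
    ~Arena()
    {
        while (dtors) {
            Dtor *dead = dtors;
            dtors = dead->next;
            dead->dtor(dead->objects, dead->count);
        }
    }
};

template<class T>
T *raw_alloc(Arena *a, std::ptrdiff_t count = 1)
{
    std::ptrdiff_t size = sizeof(T);
    std::ptrdiff_t pad  = -(std::uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - pad)/size);  // assumed OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return new(r) T[count]{};
}

template<class T>
void destroy(void *ptr, std::ptrdiff_t count)
{
    T *objects = (T *)ptr;
    for (std::ptrdiff_t i = count-1; i >= 0; i--) objects[i].~T();
}

template<class T>
T *alloc(Arena *a, std::ptrdiff_t count = 1)
{
    if (std::is_trivially_destructible_v<T> || !count) {
        return raw_alloc<T>(a, count);
    }
    Dtor *dtor    = raw_alloc<Dtor>(a);  // allocate first
    T    *r       = raw_alloc<T>(a, count);
    dtor->next    = a->dtors;
    dtor->objects = r;
    dtor->count   = count;
    dtor->dtor    = destroy<T>;
    a->dtors      = dtor;
    return r;
}

static int order[8];        // ids, in the order their destructors ran
static int norder = 0;

struct Tracked {            // hypothetical stand-in for TcpSocket
    int id = 0;
    ~Tracked() { order[norder++] = id; }
};

void example(Arena scratch)     // by copy: a private, empty Dtor list
{
    Tracked *t = alloc<Tracked>(&scratch, 2);
    t[0].id = 1;
    t[1].id = 2;
}                               // scratch destroyed here: 2, then 1

// By pointer: the object is registered on the caller's Dtor list
void helper(Arena *perm) { alloc<Tracked>(perm)->id = 3; }

int demo()
{
    alignas(16) static char mem[1<<10];
    norder = 0;
    {
        Arena perm(mem, sizeof(mem));
        helper(&perm);
        assert(norder == 0);    // object 3 outlives helper
        example(perm);          // copy constructor makes the scratch arena
        assert(norder == 2);    // objects 2 and 1 died with the copy
    }                           // perm destroyed: object 3
    return norder;
}
```

After demo() returns, order holds 2, 1, 3: the array allocated from the by-copy scratch arena is destroyed back-to-front when example returns, while the object allocated through the pointer survives until perm itself is destroyed.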

More speculations on arenas in C++

2025-09-30T11:46:16Z

Update October 2025: further enhancements.

Patrice Roy’s new book, C++ Memory Management, has made me more conscious of object lifetimes. C++ is stricter than C about lifetimes, and common, textbook memory management that’s sound in C is less so in C++ — more than I realized. The book also presents a form of arena allocation so watered down as to enjoy none of the benefits. (Despite its precision otherwise, the second half is also littered with integer overflows lacking the appropriate checks, and near the end has some pointer overflows invalidating the check.) However, I’m grateful for the new insights, and it’s made me revisit my own C++ arena allocation. In this new light I see I got it subtly wrong myself!

Surprising to most C++ programmers, but not language lawyers, idiomatic C memory allocation had undefined behavior in C++ until recently:

int *newint(int v)
{
    int *r = (int *)malloc(sizeof(*r));
    if (r) {
        *r = v;  // <-- undefined behavior before C++20
    }
    return r;
}

This program allocates memory for an object but never starts a lifetime. Assignment without a lifetime is invalid. Pointer casts are that much more suspicious in C++, and due to lifetime semantics, in many cases indicate incorrect code. (To be clear, I’m not arguing in favor of these semantics, but reasoning about the facts on the ground.) C++20 carved out special exceptions for malloc and friends, but addressing this kind of thing in general is the purpose of the brand new start_lifetime_as (and similar), the slightly older construct_at, or a classic placement new. They all start lifetimes. The last looks like:

int *newint(int v)
{
    void *r = malloc(sizeof(int));
    if (r) {
        return new(r) int{v};
    }
    return nullptr;
}

That’s no good as a C/C++ polyglot, though per the differing old semantics that was impossible anyway without macros. Which is basically cheating. An important detail: The corrected version has no casts, and it returns the result of new. That’s important because only the pointer returned by new is imbued as a pointer to the new lifetime, not r. There are no side effects affecting the provenance of r, which still points to raw memory as far as the language is concerned.

With that in mind let’s revisit my arena from last time, which does not necessarily benefit from the recent changes, not being one of the special case C standard library functions:

struct Arena {
    char *beg;
    char *end;
};

template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - pad)/size);  // OOM policy
    T *r = (T *)(a->beg + pad);
    a->beg += pad + count*size;
    for (ptrdiff_t i = 0; i < count; i++) {
        new((void *)&r[i]) T{};
    }
    return r;
}

Hey, look, placement new! I did that to produce a nicer interface, but I lucked out also starting lifetimes appropriately. Except it returns the wrong pointer. This allocator discards the pointer blessed with the new lifetime. Both pointers have the same address but different provenance. That matters. But I’m calling new many times, so how do I fix this? Array new, duh.

template<typename T>
T *alloc(Arena *a, ptrdiff_t count = 1)
{
    ptrdiff_t size = sizeof(T);
    ptrdiff_t pad  = -(uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - pad)/size);  // OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return new(r) T[count]{};
}

Wow… that’s actually much better anyway. No explicit casts, no loop. Why didn’t I think of this in the first place? The catch is I can’t forward constructor arguments, emplace-style — the part that gave me the trouble with perfect forwarding — but that’s for the best. Forwarding more than once was unsound, made more obvious by new[].

Caveat: This only works starting in C++20, and strictly with operator new[](size_t, void *). Any other placement new[] may require array overhead — e.g. it prepends an array size so that delete[] can run non-trivial destructors — which is unknowable and therefore impossible to provide or align correctly. Overhead for placement new[] is nonsense, of course, but as of this writing, all three major C++ compilers do it and essentially have broken custom placement new[].
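A small sketch of the revised alloc in use, assuming GCC or Clang. Point and demo_alloc are hypothetical names for illustration. The arena deliberately starts at an odd address over stale bytes, so the checks exercise both the padding arithmetic and the value-initialization done by new[].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

struct Arena {
    char *beg;
    char *end;
};

template<typename T>
T *alloc(Arena *a, std::ptrdiff_t count = 1)
{
    std::ptrdiff_t size = sizeof(T);
    std::ptrdiff_t pad  = -(std::uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - pad)/size);  // OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return new(r) T[count]{};  // the pointer blessed by new[] is returned
}

struct Point { double x, y; };  // hypothetical payload type

bool demo_alloc()
{
    alignas(16) static char mem[256];
    for (char &b : mem) b = 0x55;               // stale junk in the block
    Arena a = {mem + 1, mem + sizeof(mem)};     // deliberately misaligned

    char  *c = alloc<char>(&a);
    Point *p = alloc<Point>(&a, 3);

    bool aligned = (std::uintptr_t)p % alignof(Point) == 0;
    bool zeroed  = !*c && !p[0].x && !p[1].y && !p[2].x;
    return aligned && zeroed;
}
```

If the padding and initialization behave as described, demo_alloc() returns true: the Point array lands on an aligned address despite the odd starting point, and none of the 0x55 bytes show through.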

Since I’m thinking about lifetimes, what about the other end? My arena does not call destructors, by design, and starts new lifetimes on top of objects that are technically still alive. Is that undefined behavior? As far as I can tell this is allowed, even for non-trivial destructors, with the caveat that it might leak resources. In this case the resource is memory managed by the arena, so that’s fine of course.

So addressing pointer provenance also produced a nicer definition. What a great result from reading that book! While researching, I noticed Jonathan Müller, who personally gave me great advice and feedback on my previous article, talked about lifetimes just a couple weeks later. I recommend both.

Tips for more effective fuzz testing with AFL++

2025-02-05T18:03:55Z

Fuzz testing is incredibly effective for mechanically discovering software defects, yet remains underused and neglected. Pick any program that must gracefully accept complex input, written in any language, which has not yet been fuzzed, and fuzz testing usually reveals at least one bug. At least one program currently installed on your own computer certainly qualifies. Perhaps even most of them. Everything is broken and low-hanging fruit is everywhere. After fuzz testing ~1,000 projects over the past six years, I’ve accumulated tips for picking that fruit. The checklist format has worked well in the past (1, 2), so I’ll use it again. This article discusses AFL++ on source-available C and C++ targets, running on glibc-based Linux distributions, currently the indisputable best fuzzing platform for C and C++.

My tips complement the official, upstream documentation, so consult them, too:

Performance Tips on the AFL++ website
Technical “whitepaper” for afl-fuzz

Even if a program has been fuzz tested, applying the techniques in this article may reveal defects missed by previous fuzz testing.

(1) Configure sanitizers and assertions

More assertions means more effective fuzzing, and sanitizers are a kind of automatically-inserted assertions. By default, fuzz with both Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan):

$ afl-gcc-fast -g3 -fsanitize=address,undefined ...

ASan’s default configuration is not ideal, and should be adjusted via the ASAN_OPTIONS environment variable. If customized at all, AFL++ requires at least these options:

export ASAN_OPTIONS="abort_on_error=1:halt_on_error=1:symbolize=0"

Except symbolize=0, this ought to be the ASan default. When debugging a discovered crash, you’ll want UBSan set up the same way so that it behaves under a debugger. To improve fuzzing, make ASan even more sensitive to defects by detecting use-after-return bugs. It slows fuzzing slightly, but it’s well worth the cost:

ASAN_OPTIONS+=":detect_stack_use_after_return=1"

By default ASan fills the first 4KiB of fresh allocations with a pattern, to help detect use-after-free bugs. That’s not nearly enough for fuzzing. Crank it up to completely fill virtually all allocations with a pattern:

ASAN_OPTIONS+=":max_malloc_fill_size=$((1<<30))"

In the default configuration, if a program allocates more than 4KiB with malloc then, say, uses strlen on the uninitialized memory, no bug will be detected. There’s almost certainly a zero somewhere after 4KiB. Until I noticed it, the 4KiB limit hid a number of bugs from my fuzz testing. Per (4), fully filling allocations with a pattern better isolates tests when using persistent mode.

When fuzzing C++ and linking GCC’s libstdc++, consider -D_GLIBCXX_DEBUG. ASan cannot “see” out-of-bounds accesses within a container’s capacity, and the extra assertions fill in the gaps. Mind that it changes the ABI, though fuzz testing will instantly highlight such mismatches.

(2) Prefer the persistent mode

While AFL++ can fuzz many programs in-place without writing a single line of code (afl-gcc, afl-clang), prefer AFL++’s persistent mode (afl-gcc-fast, afl-clang-fast). It’s typically an order of magnitude faster and worth the effort. Though it also has pitfalls (see (4), (5)). I keep a file on hand, fuzztmpl.c — the progenitor of all my fuzz testers:

#include <unistd.h>

__AFL_FUZZ_INIT();

int main(void)
{
    __AFL_INIT();
    char *src = 0;
    unsigned char *buf = __AFL_FUZZ_TESTCASE_BUF;
    while (__AFL_LOOP(10000)) {
        int len = __AFL_FUZZ_TESTCASE_LEN;
        src = realloc(src, len);
        memcpy(src, buf, len);
        // ... send src to target ...
    }
}

I :r this into my Vim buffer, then modify as needed. It’s a stripped and improved version of the official template, which itself has a serious flaw (see (5)). There are unstated constraints about the position of buf and len in the code, so if in doubt, refer to the original template.

(3) Include source files, not header files

We’re well into the 21st century. Nobody is compiling software on 16-bit machines anymore. Don’t get hung up on the one translation unit (TU) per source file mindset. When fuzz testing, we need at most two TUs: One TU for instrumented code and one TU for uninstrumented code. In most cases the latter takes the form of a library (libc, libstdc++, etc.) and we don’t need to think about it.

Fuzz testing typically requires only a subset of the program. Including just those sources straight in the template is both effective and simple. In my template I put includes just above unistd.h so that the header isn’t visible to the sources unless they include it themselves.

#include "src/utils.c"
#include "src/parser.c"
#include <unistd.h>

I know, if you’ve never seen this before it looks bonkers. This isn’t what they taught you in college. Trust me, this simple technique will save you a thousand lines of build configuration. Otherwise you’ll need to manage different object files between fuzz testing and otherwise.

Perhaps more importantly, you can now fuzz test any arbitrary function in the program, including static functions! They’re all right there in the same TU. You’re not limited to public-facing interfaces. Perhaps you can skip (7) and test against a better internal interface. It also gives you direct access to static variables so that you can clear/reset them between tests, per (4).

Programs are often not designed for fuzz testing, or testing generally, and it may be difficult to tease apart tightly-coupled components. Many of the programs I’ve fuzz tested look like this. This technique lets you take a hacksaw to the program and substitute troublesome symbols just for fuzz testing without modifying a single original source line. For example, if the source I’m testing contains a main function, I can remove it:

#define main oldmain
#  include "src/utils.c"
#  include "src/parser.c"
#undef main
#include <unistd.h>

Sure, better to improve the program so that such hacks are unnecessary, but in most cases I’m fuzz testing as part of a drive-by review of some open source project. It allows me to quickly discover defects in the original, unmodified program, and produces simpler bug reports like, “Compile with ASan, open this 50-byte file, and then the program will crash.”

(4) Isolate fuzz tests from each other

Tests should be unaffected by previous tests. This is challenging in persistent mode, sometimes even impractical. That means resetting all global state, even something like the internal strtok buffer if that function is used. Add fuzz testing to your list of reasons to eschew global variables.

It’s mitigated by (1), but otherwise uninitialized heap memory may hold contents from previous tests, breaking isolation. Besides interference with fuzzing instrumentation, bugs found this way are wickedly difficult to reproduce.

Don’t pass uninitialized memory into a test, e.g. an output parameter allocated on the stack. Zero-initialize or fill it with a pattern. If it accepts an arena, fill it with a pattern before each test.

Typically you have little control over heap addresses, which likely vary across tests and depend on the behavior of previous tests. If the program depends on address values, this may affect the results and make reproduction difficult, so watch for that.

(5) Do not test directly on the fuzz test buffer

Passing buf and len straight into the target is the most common mistake, especially when fuzzing better-designed C programs, and particularly because the official template encourages it.

    myprogram(buf, len);  // BAD!

While it’s a great sign the program doesn’t depend on null termination, it creates a subtle trap. The underlying buffer allocated by AFL++ is larger than len, and ASan will not detect read overflows on inputs! Instead pass a copy sized to fit, which is the purpose of src in my template. Adjust the type of src as needed.

If the program expects null-terminated input then you’ll need to do this anyway in order to append the null byte. If it accepts an “owning” type like std::string, then it’s also already done on your behalf. With “non-owning” views like std::string_view you’ll still want your own size-fit copy.

If you see a program’s checked-in fuzz test using buf directly, make this change and see if anything new pops out. It’s worked for me on a number of occasions.
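As a sketch of the size-fit copy for a target taking a non-owning view: count_lines is a hypothetical target, and run_one stands in for the body of the __AFL_LOOP, with buf and len playing the roles of __AFL_FUZZ_TESTCASE_BUF and __AFL_FUZZ_TESTCASE_LEN. Unlike my template, it allocates a fresh, exact-size buffer each iteration rather than reusing one via realloc, which is an assumption made to keep the example standalone.

```cpp
#include <cstdlib>
#include <cstring>
#include <string_view>

// Hypothetical target taking a non-owning view: counts '\n' bytes.
static int count_lines(std::string_view s)
{
    int n = 0;
    for (char c : s) n += c == '\n';
    return n;
}

// One iteration of the fuzz loop. The target only ever sees a heap copy
// sized exactly to len, so ASan can flag reads one byte past the input.
int run_one(const unsigned char *buf, int len)
{
    char *src = len ? (char *)std::malloc(len) : nullptr;
    if (len && !src) std::abort();
    if (len) std::memcpy(src, buf, len);

    int r = count_lines(std::string_view(src, len));  // not buf!

    std::free(src);  // optional per (6); keeps the sketch leak-free
    return r;
}
```

The important part is only that the view points at src, never at buf. Everything else is scaffolding to adapt to the target.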

(6) Don’t bother freeing memory

In general, avoid doing work irrelevant to the fuzz test. The official tips say to “use a simpler target” and “instrument just what you need,” and keeping destructors out of the tests helps in both cases. Unless the program is especially memory-hungry, you won’t run out of memory before AFL++ resets the target process.

If not for (1), it also helps with isolation (4), as different tests are less likely to be contaminated with uninitialized memory from previous tests.

As an exception, if you want your destructor included in the fuzz test, then use it in the test. Also, it’s easy to exhaust non-memory resources, particularly file descriptors, and you may need to clean those up in order to fuzz test reliably.

Of course, if the target uses arena allocation then none of this matters! It also makes for perfect isolation, as even addresses won’t vary between tests.

(7) Use a memory file descriptor to back named paths

Many interfaces are, shall we say, not so well-designed and only accept input from a named file system path, insisting on opening and reading the file themselves. Testing such interfaces presents challenges, especially if you’re interested in parallel fuzzing. Fortunately there’s usually an easy out: Create a memory file descriptor and use its /proc name.

int fd = memfd_create("fuzz", 0);
assert(fd == 3);
while (...) {
    // ...
    ftruncate(fd, 0);
    pwrite(fd, buf, len, 0);
    myprogram("/proc/self/fd/3");
}

With standard input as 0, output as 1, and error as 2, I’ve assumed the memory file descriptor will land on 3, which makes the test code a little simpler. If it’s not 3 then something’s probably gone wrong anyway, and aborting is the best option. If you don’t want to assume, use snprintf or whatever to construct the path name from fd.

Using pwrite (instead of write) leaves the file description offset at the beginning of the file.

Thanks to the memory file descriptor, fuzz test data doesn’t land in permanent storage, so less wear and tear on your SSD from the occasional flush. Because of /proc, the file is unique to the process despite the common path name, so no problems parallel fuzzing. No cleanup needed, either.
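If you’d rather not assume descriptor 3, a sketch along these lines builds the path from fd with snprintf. It assumes Linux with glibc and a mounted /proc. count_bytes is a hypothetical target that insists on opening a path itself, and run_one and demo_memfd are made-up names standing in for the fuzz loop body and its setup.

```cpp
#include <cstdio>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical target that insists on opening a named path itself.
static long count_bytes(const char *path)
{
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return -1;
    long n = 0;
    while (std::fgetc(f) != EOF) n++;
    std::fclose(f);
    return n;
}

// One fuzz iteration against an already-created memory file descriptor.
long run_one(int fd, const void *buf, size_t len)
{
    char path[64];
    std::snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    if (ftruncate(fd, 0)) return -1;  // drop the previous test's contents
    if (pwrite(fd, buf, len, 0) != (ssize_t)len) return -1;
    return count_bytes(path);         // the target opens its own description
}

long demo_memfd()
{
    int fd = memfd_create("fuzz", 0);
    if (fd < 0) return -1;
    long first  = run_one(fd, "hello world", 11);
    long second = run_one(fd, "abc", 3);  // shorter input: truncation matters
    close(fd);
    return first == 11 ? second : -1;
}
```

The second call is there to show why ftruncate comes first: without it, a shorter input would leave the tail of the previous test in the file.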

If the program wants a file descriptor — i.e. it wants a socket because you’re fuzzing some internal function — pass the file descriptor directly:

    myprogram(fd);

If it accepts a FILE *, you could fopen the /proc path, but better to use fmemopen to create a FILE * on the object:

    myprogram(fmemopen(buf, len, "rb"));

Note how, per (6), we don’t need to bother with fclose because it’s not associated with a file descriptor.
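A slightly fuller sketch of the FILE * case, assuming a POSIX libc that provides fmemopen. sum_bytes is a hypothetical target and run_one a made-up name for the loop body. I’ve included fclose so that the example is leak-free outside a fuzzer, though per (6) it could be dropped in a real fuzz test.

```cpp
#include <cstdio>

// Hypothetical target reading from a stream: sums every byte.
static long sum_bytes(std::FILE *f)
{
    long sum = 0;
    for (int c; (c = std::fgetc(f)) != EOF;) sum += c;
    return sum;
}

// One fuzz iteration: wrap the size-fit buffer in a read-only stream.
long run_one(unsigned char *buf, size_t len)
{
    std::FILE *f = fmemopen(buf, len, "rb");
    if (!f) return -1;  // e.g. some libcs reject a zero-length buffer
    long r = sum_bytes(f);
    std::fclose(f);     // optional per (6)
    return r;
}
```

As with (5), buf here should be the size-fit copy, not the AFL++ test case buffer itself.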

(8) Configure the target for smaller buffers

A common sight in diseased programs is “generous” fixed buffer sizes:

#define MY_MAX_BUFFER_LENGTH 65536

void example(...)
{
    char path[PATH_MAX];  // typically 4,096
    char buf[MY_MAX_BUFFER_LENGTH];
    // ...
}

These huge buffers tend to hide bugs. Turn those stones over! It takes a lot of fuzzing time to max them out and excite the unhappy paths — or the super-unhappy paths, overflows. Better if the fuzz test can reach worst case conditions quickly and explore the execution paths out of it.

So when you see these, cut them way down, possibly using (3). Change 65536 to, say, 16 and see what happens. If fuzzing finds a crash on the short buffer, typically extending the input to crash on the original buffer size is straightforward, e.g. repeat one of the bytes even more than it already repeats.

Conclusion and samples

Hopefully something here will help you catch a defect that would have otherwise gone unnoticed. Even better, perhaps awareness of these fuzzing techniques will prevent the bug in the first place. Thanks to my template, some solid tooling, and the know-how in this article, I can whip up a fuzz test in a couple of minutes. But that ease means I discard it just as casually, and so I don’t take time to capture and catalog most. If you’d like to see some samples, I do have an old, short list. Perhaps after another kiloproject of fuzz testing I’ll pick up more techniques.

Rules to avoid common extended inline assembly mistakes

2024-12-20T19:46:48Z

GCC and Clang inline assembly is an interface between high and low level programming languages. It is subtle and treacherous. Many are ensnared in its traps, usually unknowingly. As such, the asm keyword is essentially the unsafe keyword of C and C++. Nearly every inline assembly tutorial, including the awful ibiblio page at the top of search engines for decades, propagates fundamental, serious mistakes, and most examples are incorrect. The dangerous part is that the examples usually produce the expected results! The situation is dire. This article isn’t a tutorial, but basic rules to avoid the most common mistakes, or to spot them in code review.

The focus is entirely extended assembly, and not basic assembly, which has different rules. The former is any inline assembly statement with constraints or clobbers. That is, there’s a colon : token between the asm parentheses. Basic assembly is blunt and has fewer uses, mostly at the top level or in “naked” functions, making misuse less likely.

(1) Avoid inline assembly if possible

Because it’s so treacherous, the first rule is to avoid it if at all possible. Modern compilers are loaded with intrinsics and built-ins that replace nearly all the old inline assembly use cases. They allow access to low level features from the high level language. No need to bridge the gap between low and high yourself when there’s an intrinsic.

Compilers do not have built-ins for system calls, and occasionally lack a useful intrinsic. Other times you might be building foundational infrastructure. These remaining cases are mostly about interacting with external interfaces, not optimization nor performance.

(2) It should nearly always be volatile

Falling right out of rule (1), the remaining inline assembly cases nearly always have side effects beyond output constraints. That includes memory accesses, and it certainly includes system calls. Because of this, inline assembly should usually have the volatile qualifier.

asm volatile ( ... );

This prevents compilers from eliding or re-ordering the assembly. As a special rule, inline assembly lacking output constraints is implicitly volatile. Despite this, please use volatile anyway! When I do not see volatile it’s likely a defect. Stopping to consider if it’s this special case slows understanding and impedes code review.

Tutorials often use __volatile__. Do not do this. It is an ancient alias keyword to support pre-standard compilers lacking the volatile keyword. This is not your situation. When I see __volatile__ it likely means you copy-pasted the inline assembly from somewhere without understanding it. It’s a red flag that draws my attention for even more careful review.

Side note: __asm or __asm__ is fine, and even required in some cases (e.g. -std=cXX). I usually write it asm.

(3) It probably needs a memory clobber

The "memory" clobber is orthogonal to volatile, each serving different purposes. It’s less often needed than volatile, but typical remaining inline assembly cases require it. If memory is accessed in any way while executing the assembly, you need a memory clobber. This includes most system calls, and definitely a generic syscall wrapper.

    asm volatile (... : "memory");

In code review, if you do not see a "memory" clobber, give it extra scrutiny. It’s probably missing. If it’s truly unnecessary, I suggest documenting such in a comment so that reviewers know the omission is considered and intentional.

The clobber prevents compilers from re-ordering loads and stores around the assembly. It would be disastrous, for example, if a write(2) system call occurred before the program populated the output buffer! In this case, volatile would prevent a follow-up write(2) from being optimized out while "memory" forces memory stores to occur before the system call.
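To make rules (2) and (3) concrete, here is a minimal sketch of a write(2) wrapper for x86-64 Linux. sys_write and demo_sys_write are hypothetical names, and the non-x86-64 branch simply falls back to libc so that the example builds elsewhere. The register assignments and the rcx/r11 clobbers follow the x86-64 Linux system call convention.

```cpp
#include <unistd.h>

// Hypothetical wrapper: write(2) through a raw system call.
long sys_write(int fd, const void *buf, long len)
{
#if defined(__linux__) && defined(__x86_64__)
    long r;
    asm volatile (      // volatile: the side effect must not be elided
        "syscall"
        : "=a"(r)
        : "a"(1L), "D"((long)fd), "S"(buf), "d"(len)  // 1 = SYS_write
        : "rcx", "r11", "memory"  // memory: stores to buf must land first
    );
    return r;           // negative errno on failure
#else
    return write(fd, buf, len);   // fallback, not the point of the sketch
#endif
}

// Round-trip three bytes through a pipe to exercise the wrapper.
bool demo_sys_write()
{
    int fds[2];
    if (pipe(fds)) return false;
    char got[3] = {};
    bool ok = sys_write(fds[1], "hey", 3) == 3
           && read(fds[0], got, 3) == 3
           && got[0] == 'h' && got[1] == 'e' && got[2] == 'y';
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

None of the input registers are modified by the instruction, so this also stays within rule (4).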

(4) Never modify input constraints

It’s easy not to modify inputs, so this is mostly about ignorance, but this rule is broken with shocking frequency. Most of the time you can get away with it, right up until certain configurations have a heisenbug. In most cases this can be fixed by changing an input into a read-write output constraint with "+":

asm volatile ("..." :: "r"(x) : ...);  // before
asm volatile ("..." : "+r"(x) : ...);  // after

If you hadn’t been using volatile (in violation of rule 2) then now suddenly you’d need it because there’s an output constraint. This happens often.

(5) Never call functions from inline assembly

Many things can go wrong because the semantics cannot be expressed using inline assembly constraints. The stack may not be aligned, and you’ll clobber the redzone. (Yes, there’s a "redzone" clobber, but it’s insufficient to actually make a function call.) Do not do it. Tutorials like to show it because it makes for a simple demonstration, but all those examples are littered with defects.

System calls are fine. Basic assembly may call functions when used outside of non-naked functions, that is, at the top level or inside naked functions. The goto qualifier, used correctly, allows jumps to be safely expressed to the compiler. Just don’t use call in extended assembly.

(6) Do not define absolute assembly labels

That is, if you need to jump within your assembly block, such as for a loop, do not write a named label:

myloop:
    ...
    jz myloop

Your inline assembly is part of a function, and that function may be cloned or inlined, in which case there will be multiple copies of your assembly block in the translation unit. The assembler will see duplicate label names and reject the program. Until that function is inlined, perhaps at a high optimization level, this will likely work as expected. On the plus side it’s a loud compile time error when it doesn’t work.

In inline assembly you can have the compiler generate a unique label with %=, but my preferred solution is the local labels feature of the assembler:

0:
    ...
    jz 0b

In this case the assembler generates unique labels, and the number 0 isn’t the literal label name. 0b (“backward”) refers to the previous 0 label, and 0f (“forward”) would refer to the next 0 label. Perfectly unambiguous.
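Putting rules (4) and (6) together, a sketch of a loop inside an extended assembly block, assuming x86 and AT&T syntax. sum_to is a hypothetical function for illustration, with a plain C++ fallback on other architectures. Both operands are modified by the assembly, so both are "+r" outputs, and the jumps use local labels.

```cpp
// Hypothetical example: compute n + (n-1) + ... + 1.
unsigned sum_to(unsigned n)
{
    unsigned total = 0;
#if defined(__x86_64__) || defined(__i386__)
    asm volatile (
        "    testl %[n], %[n]\n"
        "    jz    1f\n"            // forward to the next "1" label
        "0:\n"
        "    addl  %[n], %[total]\n"
        "    decl  %[n]\n"
        "    jnz   0b\n"            // backward to the previous "0" label
        "1:\n"
        : [total] "+r"(total), [n] "+r"(n)  // both modified: read-write
        :
        : "cc"
    );
#else
    for (; n; n--) total += n;      // portable fallback
#endif
    return total;
}
```

Because the labels are local, the compiler is free to inline or clone sum_to as many times as it likes without the assembler seeing duplicate names.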

Naturally occurring practice problems

Now that you’ve made it this far, here’s an exercise for practice: Search online for “inline assembly tutorial” and count the defects you find by applying my 6 rules. You’ll likely find at least one per result that isn’t official compiler documentation. Besides tutorials and reviewing real programs, you could ask an LLM to generate inline assembly, as they’ve been trained to produce these common defects.

Slim Reader/Writer Locks are neato

2024-10-03T22:40:13Z

I’m 18 years late, but Slim Reader/Writer Locks have a fantastic interface: pointer-sized (“slim”), zero-initialized, and non-allocating. Lacking cleanup, they compose naturally with arena allocation. Sounds like a futex? That’s because they’re built on futexes introduced at the same time. They’re also complemented by condition variables with the same desirable properties. My only quibble is that slim locks could easily have been 32-bit objects, but it hardly matters. This article, while treating Win32 as a foreign interface, discusses a paper-thin C++ wrapper interface around lock and condition variables, in my own style.

If you’d like to see/try a complete, working demonstration before diving into the details: demo.cpp. We’re going to build this from the ground up, so let’s establish a few primitive integer definitions:

using b32 = signed;
using i32 = signed;
using uz  = decltype(0uz);

Think of uz as like uintptr_t. This implementation will support both 32-bit and 64-bit targets, and we’ll need it as the basis for locks and condition variables:

enum Lock : uz;
enum Cond : uz;

Opaque enums provide additional type safety: They have the properties of an integer, including trivial destruction, but are distinct types which compilers forbid mixing with other integers. We can’t, say, accidentally cross condition variable and lock parameters — my main concern. Aside from zero-initialization, we do not actually care about the values of these variables, so enumerators are unnecessary. (Caveat: GDB cannot display opaque enums, which is slightly irritating.)
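Those properties can be checked at compile time. The following is a portable sketch rather than part of the wrapper: it spells uz as std::size_t so that it builds before C++23, and is_unlocked and demo_lock are hypothetical names standing in for the Win32 imports.

```cpp
#include <cstddef>
#include <type_traits>

using uz = std::size_t;     // same type as decltype(0uz) in C++23

enum Lock : uz;
enum Cond : uz;

// Pointer-sized (on mainstream targets), trivially destructible, and
// not interchangeable with each other or with plain integers.
static_assert(sizeof(Lock) == sizeof(void *));
static_assert(std::is_trivially_destructible_v<Lock>);
static_assert(!std::is_same_v<Lock, Cond>);
static_assert(!std::is_convertible_v<Cond *, Lock *>);
static_assert(!std::is_convertible_v<uz, Lock>);

// Hypothetical function standing in for the Win32 imports.
inline bool is_unlocked(Lock *l) { return *l == 0; }

bool demo_lock()
{
    Lock lock = {};             // zero-initialized, ready for use
    // Cond cond = {};
    // is_unlocked(&cond);      // error: cannot convert Cond * to Lock *
    return is_unlocked(&lock);
}
```

Uncommenting the two lines in demo_lock produces exactly the kind of compile error I’m after when parameters get crossed.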

The documentation doesn’t explicitly mention zero initialization, but the official *_INIT constants are defined as zero. That locks in zero at the ABI level, so we can count on it.

All the functions we’ll need are exported by kernel32.dll. Locks have two variations on lock/unlock: “exclusive” (write) and “shared” (read). There are also “try” versions, but I won’t be using them.

#define W32(r, p) extern "C" __declspec(dllimport) r __stdcall p noexcept
W32(void, AcquireSRWLockExclusive(Lock *));
W32(void, AcquireSRWLockShared(Lock *));
W32(void, ReleaseSRWLockExclusive(Lock *));
W32(void, ReleaseSRWLockShared(Lock *));

Declaring Win32 functions in C++ is a mouthful, and everything must be written in just the right order, but it’s mostly tucked away in a macro. Usually there’s a stack discipline to these locks, so an RAII scoped guard is in order:

struct Guard {
    Lock *l;
    Guard(Lock *l) : l{l} { AcquireSRWLockExclusive(l); }
    ~Guard()              { ReleaseSRWLockExclusive(l); }
};

struct RGuard {
    Lock *l;
    RGuard(Lock *l) : l{l} { AcquireSRWLockShared(l); }
    ~RGuard()              { ReleaseSRWLockShared(l); }
};

Dead simple. (What about rule of three? Instead of working around this language design flaw, reach into the distant future where it’s been fixed: -Werror=deprecated-copy-dtor.) Usage might look like:

struct Example {
    Lock lock = {};
    i32  value;
};

i32 incr(Example *e)
{
    Guard g(&e->lock);
    return ++e->value;
}

Note the = {} to guarantee the lock is always ready for use. It gets more interesting with condition variables in the mix. That’s three more functions:

W32(b32,  SleepConditionVariableSRW(Cond *, Lock *, i32, b32));
W32(void, WakeAllConditionVariable(Cond *));
W32(void, WakeConditionVariable(Cond *));

The last parameter on SleepConditionVariableSRW indicates if the lock was acquired shared. Why do locks have distinct acquire and release functions while condition variables use a flag for the same purpose? Beats me. I’ll unfold it into two functions, selected by type, with a default infinite wait:

b32 wait(Cond *c, Guard *g, i32 ms = -1)
{
    return SleepConditionVariableSRW(c, g->l, ms, 0);
}

b32 wait(Cond *c, RGuard *g, i32 ms = -1)
{
    return SleepConditionVariableSRW(c, g->l, ms, 1);
}

Usage might look like:

for (RGuard g(&lock); remaining;) {
    wait(&done, &g);
}

The other side is nothing more than a rename (but could also be accomplished through linking):

void signal(Cond *c)
{
    WakeConditionVariable(c);
}

void broadcast(Cond *c)
{
    WakeAllConditionVariable(c);
}

And a couple examples of its usage:

if (Guard g(&lock); !--remaining) {
    signal(&done);
}

// Or:

Guard g(&lock);
ready = true;
broadcast(&init);
while (remaining) {
    wait(&done, &g);
}

A satisfying, powerful synchronization interface with hardly any code!

Giving C++ std::regex a C makeover

2024-09-04T17:15:07Z

Suppose you’re working in C using one of the major toolchains — that is, it’s mainly a C++ implementation — and you need regular expressions. You could integrate a library, but there’s a regex implementation in the C++ standard library included with your compiler, just within reach. As a resourceful engineer, using an asset already in hand seems prudent. But it’s a C++ interface, and you’re using C instead of C++ for a reason, perhaps to avoid dealing with C++. Have no worries. This article is about wrapping std::regex in a tidy C interface which not only hides all the C++ machinery, but utterly tames it. It’s not so much practical as a potpourri of interesting techniques.

If you’d like to skip ahead, here’s the full source up front. Tested with w64devkit, MSVC cl, and clang-cl: scratch/regex-wrap

Interface design

The C interface I came up with, regex.h:

#pragma once
#include <stddef.h>

#define S(s) (str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} str;

typedef struct {
    char *beg;
    char *end;
} arena;

typedef struct regex regex;

typedef struct {
    str      *data;
    ptrdiff_t len;
} strlist;

regex  *regex_new(str, arena *);
strlist regex_match(regex *, str, arena *);

Longtime readers will find it familiar: my favorite non-owning, counted string form in place of null-terminated strings (similar to C++ std::string_view), plus arena allocation. Yes, such fundamental types wouldn’t “belong” to a regex library like this, but imagine they’re standardized by the project or whatever. Also, this is purely a C header, not a C/C++ polyglot, and will not be used by the C++ portion.

In particular note the lack of “free” functions. The regex engine allocates everything in the arena, including all temporary working memory used while compiling, matching, etc. So in a sense, it could be called a non-allocating library. This requires a bit of C++ abuse: I will not call some C++ regex destructors. It shouldn’t matter because they only redundantly manage memory in the arena. (If regex objects are holding file handles or something else unnecessary then the implementation is so poor as to not be worth using, and we should just use a better regex library.)

Now’s a good time to mention a caveat: In order to pull this off the regex library lives in its own Dynamic-Link Library with its own copy of the C++ standard library, i.e. statically linked. My demo is Windows-only, but this concept theoretically extends to shared objects on Linux. Since it’s a C interface that doesn’t expose standard library objects, the DLL can be used by programs compiled with different toolchains. Though that wouldn’t apply to my inciting hypothetical.

Example usage:

regex  *re = regex_new(S("(\\w+)"), perm);
str     s  = S("Hello, world! This is a test.");
strlist m  = regex_match(re, s, perm);
for (ptrdiff_t i = 0; i < m.len; i++) {
    printf("%2td = %.*s\n", i, (int)m.data[i].len, m.data[i].data);
}

This program prints:

 0 = Hello
 1 = world
 2 = This
 3 = is
 4 = a
 5 = test

If matching lots of source strings, scope the arena to the loop and then the results, and any regex working memory, are automatically freed in O(1) at the end of each iteration:

for (ptrdiff_t i = 0; i < ninputs; i++) {
    arena   scratch = *perm;
    strlist matches = regex_match(re, inputs[i], &scratch);
    // ... consume matches ...
}

C++ implementation

On the C++ side the first thing I do is replace new and delete, which is how I force it to allocate from the arena. This replaces new/delete globally, but recall that the regex library has its own, private C++ implementation. Replacements apply only to itself even if there’s other C++ present in the process. If this is the only C++ in the process then it doesn’t require such careful isolation.

I can’t tell std::regex about the arena — it calls operator new the usual way, without extra arguments — so I have to smuggle it in through a thread-local variable:

static thread_local arena *perm;

If I’m sure the library is only used by a single thread then I can omit thread_local, but it’s useful here to demonstrate and measure. Using it in my operator replacements:

void *operator new(size_t size, std::align_val_t align)
{
    arena    *a     = perm;
    ptrdiff_t ssize = size;
    ptrdiff_t pad   = (uintptr_t)a->end & ((int)align - 1);
    if (ssize < 0 || ssize > a->end - a->beg - pad) {
        throw std::bad_alloc{};
    }
    return a->end -= size + pad;
}

void *operator new(size_t size)
{
    return operator new(
        size,
        std::align_val_t(__STDCPP_DEFAULT_NEW_ALIGNMENT__)
    );
}

Starting in C++17, replacing the global allocator requires definitions for both plain new/delete and aligned new/delete. The many other variants, including arrays, call these four and so may be skipped. Allocating over-aligned objects isn’t a special case for arenas, so I implemented plain new by calling aligned new. I’d prefer to allocate through a template so that I can “see” the type, but that’s not an option in this case.

After converting to signed sizes because they’re simpler, it’s the usual from-the-end allocation. I prefer -fno-exceptions but std::regex is inherently exceptional — and I mean that in at least two bad ways — so they’re required. The good news is this library gracefully and reliably handles out-of-memory errors. (The arena makes this trivial to test, so try it for yourself!)
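As a sketch of how one might test the out-of-memory path in isolation, the same from-the-end arithmetic can be lifted into an ordinary function that returns null instead of throwing. The names `alloc_end` and `oom_is_reported` are hypothetical and not part of the library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct arena {
    char *beg;
    char *end;
};

// Hypothetical helper: from-the-end allocation returning null on failure
// instead of throwing. align must be a power of two.
void *alloc_end(arena *a, ptrdiff_t size, ptrdiff_t align)
{
    assert(align > 0 && !(align & (align - 1)));
    ptrdiff_t pad = (uintptr_t)a->end & (align - 1);
    if (size < 0 || size > a->end - a->beg - pad) {
        return nullptr;  // out of memory: arena left untouched
    }
    return a->end -= size + pad;
}

// Returns true if a deliberately tiny arena reports exhaustion cleanly.
bool oom_is_reported()
{
    alignas(16) char mem[64];
    arena a = {mem, mem + sizeof(mem)};
    void *p = alloc_end(&a, 48, 16);  // fits
    void *q = alloc_end(&a, 32, 16);  // only 16 bytes remain
    return p && !q && a.end == (char *)p;
}
```

Shrinking the backing array is all it takes to exercise the failure branch, which is much harder to arrange with a general-purpose allocator.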

I added a little extra flair replacing delete:

void operator delete(void *) noexcept {}
void operator delete(void *, std::align_val_t) noexcept {}

void operator delete(void *p, size_t size) noexcept
{
    arena *a = perm;
    if (a->end == (char *)p) {
        a->end += size;
    }
}

The two mandatory replacements are no-ops because that’s simply how arenas work. We don’t free individual objects, but many at once. It’s completely optional, but I also replaced sized delete for little other reason than sized deallocation is cool. C++ destructs in reverse order, so this is likely to work out. At least with GCC libstdc++, it freed about a third of the workspace memory before returning to C. I’d rather it didn’t try to free anything at all, but since it’s going to call delete anyway I can get some use out of it.
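To illustrate why reverse-order destruction matters for this trick, here is a small model with padding left out. `take`, `give`, and `leftover` are hypothetical stand-ins for the replaced operators:

```cpp
#include <cassert>
#include <cstddef>

struct arena {
    char *beg;
    char *end;
};

// Hypothetical stand-in for operator new, without alignment padding.
char *take(arena *a, ptrdiff_t size)
{
    assert(size >= 0 && size <= a->end - a->beg);
    return a->end -= size;
}

// Hypothetical stand-in for sized delete.
void give(arena *a, void *p, ptrdiff_t size)
{
    if (a->end == (char *)p) {  // only the newest allocation can be undone
        a->end += size;
    }
}

// Returns the number of bytes still in use after releasing x, y, z.
ptrdiff_t leftover(bool reverse_order)
{
    char mem[64];
    arena a = {mem, mem + sizeof(mem)};
    char *x = take(&a, 8);
    char *y = take(&a, 8);
    char *z = take(&a, 8);
    if (reverse_order) {
        give(&a, z, 8); give(&a, y, 8); give(&a, x, 8);
    } else {
        give(&a, x, 8); give(&a, y, 8); give(&a, z, 8);
    }
    return (mem + sizeof(mem)) - a.end;
}
```

Released newest-first, everything is reclaimed. Released oldest-first, only the final release lines up with the bump pointer and the rest stays allocated until the arena itself is reset.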

Interesting side note: In a rough benchmark these replacements made MSVC std::regex matching four times faster! I expected a small speedup, but not that. In the typical case it appears to be wasting most of its time on allocation. On the other hand, libstdc++ std::regex is overall quite a bit slower than MSVC, and my replacements had no performance effect. It’s spending its time elsewhere, and the small gains are lost interacting with the thread-local.

Finally the meat:

extern "C" std::regex *regex_new(str re, arena *a)
{
    perm = a;
    try {
        return new std::regex(re.data, re.data+re.len);
    } catch (...) {
        return {};
    }
}

It sets the thread-local to the arena, then constructs with “iterators” at each end of the input. All exceptions are caught and turned into a null return. Depending on need, we may want to indicate why it failed — out of memory, invalid regex, etc. — by returning an error value of some sort. An exercise for the reader.
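One possible shape for that exercise, as a sketch only: classify the exception into a status code. The `regex_status` enum and `regex_check` function are hypothetical and not part of regex.h, and this version uses the default allocator instead of the arena so that it stands alone:

```cpp
#include <cassert>
#include <new>
#include <regex>

// Hypothetical status codes; not part of the article's regex.h.
enum regex_status { REGEX_OK, REGEX_INVALID, REGEX_OOM, REGEX_OTHER };

// Classify a compile attempt. This sketch discards the compiled regex,
// so it only demonstrates the error mapping.
regex_status regex_check(char const *beg, char const *end)
{
    assert(beg <= end);
    try {
        std::regex re(beg, end);
        return REGEX_OK;
    } catch (std::regex_error const &) {
        return REGEX_INVALID;  // malformed pattern
    } catch (std::bad_alloc const &) {
        return REGEX_OOM;      // thrown by operator new
    } catch (...) {
        return REGEX_OTHER;
    }
}
```

In the real wrapper the status could travel through an extra out-parameter while the return value remains the regex pointer.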

The matcher is a little more complicated:

extern "C" strlist regex_match(std::regex *re, str s, arena *a)
{
    perm = a;
    try {
        std::cregex_iterator it(s.data, s.data+s.len, *re);
        std::cregex_iterator end;

        strlist r = {};
        r.len  = std::distance(it, end);
        r.data = new str[r.len]();
        for (ptrdiff_t i = 0; it != end; it++, i++) {
            r.data[i].data = s.data + it->position();
            r.data[i].len  = it->length();
        }
        return r;

    } catch (...) {
        return {};
    }
}

I create a char * “cregex” iterator, again giving it each end of the input. I hope it’s not just making a copy (MSVC std::regex does grumble grumble). The result is allocated out of the arena. As before, exceptions convert to a null return. Callers can distinguish errors because no-match results have a non-null pointer. The iterator, being a local variable, is destroyed before returning, uselessly calling delete. I could avoid this by allocating it with new, but in practice it doesn’t matter.

You might have noticed the lack of __declspec(dllexport). DEF files are great, and I’ve come to appreciate and prefer them. GCC and MSVC accept them as another input on the command line, and the source need not be aware of its exports. My regex.def:

LIBRARY regex
EXPORTS
regex_new
regex_match

In w64devkit, the command to build the DLL:

$ g++ -shared -std=c++17 -o regex.dll regex.cpp regex.def

The MSVC command almost maps 1:1 to the GCC command:

$ cl /LD /std:c++17 /EHsc regex.cpp regex.def

In either case only the C interface is exported (via peports):

$ peports -e regex.dll
EXPORTS
        1       regex_match
        2       regex_new

Reasons against

Though this library is conveniently on hand, and my minimalist C wrapper interface is nicer than a typical C regex library interface, and even hides some std::regex problems, trade-offs must be considered:

No Unicode support, particularly UTF-8
std::regex implementations are universally poor and slow
libstdc++ std::regex is especially slow to compile
Isolating in a DLL (if needed) is inconvenient
DLL is 200K (MSVC) to 700K (GCC) or so

Depending on what I’m doing, some of these may have me looking elsewhere.

Arenas and the almighty concatenation operator

2024-05-25T00:00:00Z

I continue to streamline an arena-based paradigm, and stumbled upon a concise technique for dynamic growth — an efficient, generic “concatenate anything to anything” within an arena built atop a core of 9-ish lines of code. The key insight originated from a reader suggestion about dynamic arrays. The subject of concatenation can be a string, dynamic array, or even something else. The “system” is extensible, and especially useful for path handling.

Continuing from last time, the examples are in light, C-style C++. I chose it because templates and function overloading express the concepts succinctly. It uses no standard library functionality, so converting to C, or similar, should be straightforward. The core concatenation “operator”:

template<typename T>
T concat(arena *a, T head, T tail)
{
    if ((char *)(head.data+head.len) != a->beg) {
        head = T{a, head};
    }
    head.len += T{a, tail}.len;
    return head;
}

This concatenates two objects of the same type in the arena, and does so in place if possible. That is, we can efficiently build a value piece by piece. The type T must have data and len members, and a “copy” constructor that makes a copy of the given object at the front of the arena. Size integer overflows and out-of-memory errors are, as usual, handled by the arena. In particular, note that the len addition happens after allocation.

Since the front-of-the-arena business is implicit, consider asserting it if you’re worried. I’ve also considered declaring a clone “operator” where that behavior is an explicit part of its interface.

// Make a copy of the object at the front of the arena.
template<typename T> T clone(arena *, T);

// In concat, replace the T{} constructors with clone:
    head = clone(a, head);
    head.len += clone(a, tail).len;

Strings are perhaps the most interesting subject of concatenation. Here’s a compatible string, str, definition from my previous article:

struct str {
    union {
        uint8_t    *data = 0;
        char const *cdata;
    };
    ptrdiff_t len = 0;

    str() = default;

    str(uint8_t *beg, uint8_t *end) : data{beg}, len{end-beg} {}

    template<ptrdiff_t N>
    constexpr str(char const (&s)[N]) : cdata{s}, len{N-1} {}

    str(arena *, str);  // TODO

    uint8_t &operator[](ptrdiff_t i) { return data[i]; }
};

This has data, len, and the necessary constructor declaration. Before showing the constructor definition, here’s an arena following the usual formula, which should be familiar to those who’ve been following along:

struct arena {
    char *beg;
    char *end;
};

template<typename T, typename ...A>
T *makefront(ptrdiff_t count, arena *a, A ...args)
{
    ptrdiff_t size  = sizeof(T);
    ptrdiff_t align = -(uintptr_t)a->beg & (alignof(T) - 1);
    assert(count < (a->end - a->beg - align)/size);  // OOM
    T *r = (T *)(a->beg + align);
    a->beg += align + size*count;
    for (ptrdiff_t i = 0; i < count; i++) {
        new (r+i) T(args...);
    }
    return r;
}

Note how it bumps beg, not end, because it’s allocated at the front. That opens the end of the object for concatenation. When it returns, beg points just past the end of the new object, aligned to it. Later, concat inspects beg to see if it can extend in place. That will be true if nothing else has been allocated at the front in the meantime. That is, we can allocate objects at the end — such as hash map nodes — while efficiently growing an object at the front through concatenation. If it’s not true for whatever reason, concatenation still works, just with reduced efficiency.

With that out of the way, the “copy” constructor is simple:

str::str(arena *a, str s)
{
    data = makefront<uint8_t>(s.len, a);
    len = s.len;
    for (ptrdiff_t i = 0; i < len; i++) {
        data[i] = s[i];
    }
}

That’s everything we need to put it into action. For example, a function that deletes a file at a path following a path template.

char *tocstr(arena *a, str s)
{
    return (char *)concat(a, s, str{"\0"}).data;
}

bool removeconfig(str home, str program, arena scratch)
{
    str path = {};
    path = concat(&scratch, path, home);
    path = concat(&scratch, path, str{"/.config/"});
    path = concat(&scratch, path, program);
    path = concat(&scratch, path, str{"/rc"});
    return !unlink(tocstr(&scratch, path));
}

First, concat does all the heavy lifting in a null-terminated “C string” conversion function that operates in place if possible. In removeconfig I construct a path from path components, starting from a zero-initialized null string. In the first concat, this null string is “copied” into the arena, laying a foundation for additional concatenations. Each path component is copied in place, so unlike a dumb strcat, it’s not quadratic.

Even more, notice it supports arbitrary path lengths. No PATH_MAX, MAX_PATH, etc., it grows into the arena as needed. No huge stack variables necessary, and the scratch arena automatically frees the path on return. Fancier yet, imagine a variadic function that glues path components together with the proper path delimiter, and it wouldn’t involve a single, error-prone size calculation.
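A rough sketch of what such a variadic join might look like, using a simplified, self-contained `str` and arena in place of the definitions above. The names `lit` and `joinpath` are hypothetical, the delimiter is hard-coded to `/`, and out-of-memory is handled by assertion:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct arena { char *beg; char *end; };
struct str   { uint8_t *data; ptrdiff_t len; };

// Wrap a string literal without its null terminator.
template<ptrdiff_t N>
str lit(char const (&s)[N]) { return {(uint8_t *)s, N - 1}; }

// Copy s to the front of the arena (simplified: asserts on exhaustion).
str clone(arena *a, str s)
{
    assert(s.len <= a->end - a->beg);
    str r = {(uint8_t *)a->beg, s.len};
    for (ptrdiff_t i = 0; i < s.len; i++) {
        r.data[i] = s.data[i];
    }
    a->beg += s.len;
    return r;
}

// Same shape as the core concat, specialized to str.
str concat(arena *a, str head, str tail)
{
    if ((char *)(head.data + head.len) != a->beg) {
        head = clone(a, head);
    }
    head.len += clone(a, tail).len;
    return head;
}

// Hypothetical variadic join: base case, then one component at a time.
str joinpath(arena *, str path) { return path; }

template<typename ...Rest>
str joinpath(arena *a, str path, str next, Rest ...rest)
{
    uint8_t slash = '/';
    path = concat(a, path, str{&slash, 1});
    path = concat(a, path, next);
    return joinpath(a, path, rest...);
}
```

A call such as `joinpath(&scratch, home, lit(".config"), program, lit("rc"))` copies the first component once and then extends it in place, with no length arithmetic at the call site.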

The str{} business is unfortunate. The char array constructor normally kicks in in these situations, but compilers can’t resolve the template without an explicit str object. Perhaps there’s a workaround, but I’m not yet savvy enough with C++ to figure it out. In the C version you’d always need to wrap those literals in the string macro.

Extending concatenation

The “operator” can be extended by defining more overloads. For example, to concatenate 32-bit integers to a string:

str concat(arena *a, str s, int32_t x)
{
    uint8_t  buf[16];
    uint8_t *end = buf + countof(buf);
    uint8_t *beg = end;
    int32_t  neg = x<0 ? x : -x;
    do {
        *--beg = '0' - neg%10;
    } while (neg /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    return concat(a, s, {beg, end});
}

Now we can, say, construct a randomly-generated temporary path:

str path = {};
path = concat(&scratch, path, tempdir);
path = concat(&scratch, path, str{"/temp"});
int32_t id = rand32(&rng);
path = concat(&scratch, path, id);

Keep adding more definitions like this and you’ll have something like, or complementing, buffered output. It doesn’t stop there. Code points concatenated as UTF-8:

str concat(arena *a, str s, char32_t rune)
{
    enum { REPLACEMENT_CHARACTER = 0xfffd };
    if (rune>=0xd800 && rune<=0xdfff) {
        rune = REPLACEMENT_CHARACTER;
    }

    uint8_t  buf[4];
    uint8_t *end = 0;
    if (rune < 0x80) {
        buf[0] = rune;
        end = buf + 1;
    } else if (rune < 0x800) {
        buf[0] =  (rune >>  6)         | 0xc0;
        buf[1] = ((rune >>  0) & 0x3f) | 0x80;
        end = buf + 2;
    } else if (rune < 0x10000) {
        buf[0] =  (rune >> 12)         | 0xe0;
        buf[1] = ((rune >>  6) & 0x3f) | 0x80;
        buf[2] = ((rune >>  0) & 0x3f) | 0x80;
        end = buf + 3;
    } else {
        buf[0] =  (rune >> 18)         | 0xf0;
        buf[1] = ((rune >> 12) & 0x3f) | 0x80;
        buf[2] = ((rune >>  6) & 0x3f) | 0x80;
        buf[3] = ((rune >>  0) & 0x3f) | 0x80;
        end = buf + 4;
    }
    return concat(a, s, {buf, end});
}

That composes well for general UTF-8 handling. For example, to ingest Win32 strings (arguments, paths, etc.):

str convert(arena *perm, char16_t *s)
{
    str r = {};
    while (*s) {
        char32_t rune = decode(&s);
        r = concat(perm, r, rune);
    }
    return r;
}
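The decode function is not shown above. A minimal sketch of one way to write it, assuming null-terminated UTF-16 input and assuming unpaired surrogates should decode as U+FFFD (both are assumptions on my part):

```cpp
#include <cassert>

// Hypothetical decode(): read one code point from a null-terminated
// UTF-16 string and advance the pointer past it. Unpaired surrogates
// decode as U+FFFD.
char32_t decode(char16_t **s)
{
    char16_t *p = *s;
    assert(*p);  // the caller stops at the terminator
    char32_t hi = *p++;
    char32_t r  = hi;
    if (hi >= 0xd800 && hi <= 0xdbff) {
        char32_t lo = *p;
        if (lo >= 0xdc00 && lo <= 0xdfff) {
            p++;  // consume the low half of the pair
            r = 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
        } else {
            r = 0xfffd;  // high surrogate without a partner
        }
    } else if (hi >= 0xdc00 && hi <= 0xdfff) {
        r = 0xfffd;      // stray low surrogate
    }
    *s = p;
    return r;
}
```

A high surrogate directly before the terminator is replaced without consuming the terminator, so the loop in convert still ends correctly.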

Beyond strings

One of my most useful C++ templates has been a span structure:

template<typename T>
struct span {
    T        *data = 0;
    ptrdiff_t len  = 0;

    span() = default;

    span(T *beg, T *end) : data{beg}, len{end-beg} {}

    span(arena *, span);  // for concat

    T &operator[](ptrdiff_t i) { return data[i]; }
};

The span::span definition looks exactly like str::str. In fact, we could nearly define strings as uint8_t spans:

typedef span<uint8_t> str;  // hypothetical

Though I’ve found strings to be just special enough not to be worth it.

This span definition is now fleshed out sufficiently to use concat with no additional definitions! However, outside of strings, concatenating spans is unusual. More often we want to append individual elements. Again, we can build on that core concat template:

template<typename T>
span<T> concat(arena *a, span<T> s, T v)
{
    return concat(a, s, span{&v, &v+1});
}

Now span is ready for 99% of its use cases. For example:

    span<int32_t> squares;
    for (int32_t i = 1; i <= 1000; i++) {
        squares = concat(&scratch, squares, i*i);
    }

It’s often good enough, but it’s not ideal as a general purpose dynamic array. Each append makes a trip through arena allocation, and this span cannot efficiently shrink and then grow again. Sometimes we’d like to track capacity, covering both those cases.

template<typename T>
struct list {
    T        *data = 0;
    ptrdiff_t len  = 0;
    ptrdiff_t cap  = 0;

    list() = default;

    list(arena *, list);  // for concat

    T &operator[](ptrdiff_t i) { return data[i]; }
};

Unfortunately cap is a curve ball that the core template can’t handle, requiring a slightly more complex definition. Since concatenating whole list objects is unusual, a definition for appending single elements:

template<typename T>
list<T> concat(arena *a, list<T> s, T v)
{
    if (s.len == s.cap) {
        if ((char *)(s.data+s.len) != a->beg) {
            s = list<T>{a, s};
        }
        ptrdiff_t extend = s.cap ? s.cap : 4;
        makefront<T>(extend, a);
        s.cap += extend;
    }
    s[s.len++] = v;
    return s;
}

Note how inside the if it’s basically the same core definition. As before, this definition extends in place if possible, but otherwise handles it correctly anyway. In addition to the above concerns, this list is more suited to having multiple “open” dynamic arrays at once.

This concatenative concept has been a useful way to think about a variety of situations in order to solve them effectively with arena allocation.

Update: NRK sharply points out that “extend in place” as expressed in concat is incompatible with the alloc_size and malloc GCC function attributes, which I’ve suggested in the past. While considering how to mitigate this, we’ve also discovered that alloc_size has always been fundamentally broken in GCC. Correct use is impossible, and so it must not be used.

Guidelines for computing sizes and subscripts

2024-05-24T22:25:10Z

Occasionally we need to compute the size of an object that does not yet exist, or a subscript that may fall out of bounds. It’s easy to miss the edge cases where results overflow, creating a nasty, subtle bug, even in the presence of type safety. Ideally such computations happen in specialized code, such as inside an allocator (calloc, reallocarray) and not outside by the allocatee (i.e. malloc). Mitigations exist with different trade-offs: arbitrary precision, or using a wider fixed integer — i.e. 128-bit integers on 64-bit hosts. In the typical case, working only with fixed size-type integers, I’ve come up with a set of guidelines to avoid overflows in the edge cases.

1. Range check before computing a result. No exceptions.
2. Do not cast unless you know a priori the operand is in range.
3. Never mix unsigned and signed operands. Prefer signed. If you need to convert an operand, see (2).
4. Do not add unless you know a priori the result is in range.
5. Do not multiply unless you know a priori the result is in range.
6. Do not subtract unless you know a priori both signed operands are non-negative. For unsigned, that the second operand is not larger than the first (treat it like (4)).
7. Do not divide unless you know a priori the denominator is positive.
8. Make it correct first. Make it fast later, if needed.

These guidelines are also useful when reviewing code, tracking in your mind whether the invariants are held at each step. If not, you’ve likely found a bug. If in doubt, use assertions to document and check invariants. I compiled this list during code review, so for me that’s where it’s most useful.

Range check, then compute

Not strictly necessary when overflow is well-defined, i.e. wraparound, but it’s like defensive driving. It’s simpler and clearer to check with basic arithmetic rather than reason from a wraparound, i.e. a negative result. Checked math functions are fine, too, if you check the overflow boolean before accessing the result.

// bad
len++;
if (len <= 0) error();

// good
if (len == MAX) error();
len++;

Casting

Casting from signed to unsigned, it’s as simple as knowing the value is non-negative, which is likely if you’re following (1). If a negative size has appeared, there’s already been a bug earlier in the program, and the only reasonable course of action is to abort, not handle it like an error.

Addition

To check if addition will overflow, subtract one of the operands from the maximum value.

if (b > MAX - a) error();
r = a + b;

In pointer arithmetic addition, it’s a common mistake to compute the result pointer then compare it to the bounds. If the check failed, then the pointer already overflowed, i.e. undefined behavior. Major pieces of software, like glibc, are riddled with such pointer overflows. (Now that you’re aware of it, you’ll start noticing it everywhere. Sorry.)

// bad: never do this
beg += size;
if (beg > end) error();

To do this correctly, check integers not pointers. Like before, subtract before adding.

available = end - beg;
if (size > available) error();
beg += size;

Mind mixing signed and unsigned operands for the comparison operator (3), e.g. an unsigned size on the left and signed difference on the right.
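Wrapped into a function, the pointer case might look like the following sketch. The name `advance` is hypothetical, and the signed `size` parameter is a choice made here to sidestep the mixed-sign comparison:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: advance *beg by size, or fail without moving it.
bool advance(char **beg, char *end, ptrdiff_t size)
{
    assert(*beg <= end);               // invariant held by the caller
    ptrdiff_t available = end - *beg;  // subtract first: cannot overflow
    if (size < 0 || size > available) {
        return false;                  // decided with integers only
    }
    *beg += size;                      // known to land within [beg, end]
    return true;
}
```

The out-of-bounds pointer is never formed, so a failed check leaves nothing undefined behind it.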

Multiplication and division

If you’re working this out on your own, multiplication seems tricky until you’ve internalized a simple pattern. Just as we subtracted before adding, we need to divide before multiplying. Divide the maximum value by one of the operands:

if (a>0 && b>MAX/a) error();
r = a * b;

It’s often permitted for one or both to be zero, so mind divide-by-zero, which is handled above by the first condition. Sometimes size must be positive, e.g. the result of the sizeof operator in C, in which case we should prefer it as the denominator.

assert(size  >  0);
assert(count >= 0);
if (count > MAX/size) error();
total = count * size;

With arena allocation there are usually two concerns. First, will it overflow when computing the total size, i.e. count * size? Second, is the total size within the arena capacity? Naively that’s two checks, but we can kill two birds with one stone: Check both at once by using the current arena capacity as the maximum value when considering overflow.

if (count > (end - beg)/size) error();
total = count * size;

One condition pulling double duty.
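Both forms of the check can be packaged as small functions. In this sketch `mul_ok` and `fits` are hypothetical names, and `PTRDIFF_MAX` plays the role of `MAX`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: compute count*size into *total, or fail on overflow.
bool mul_ok(ptrdiff_t count, ptrdiff_t size, ptrdiff_t *total)
{
    assert(size  >  0);
    assert(count >= 0);
    if (count > PTRDIFF_MAX/size) {
        return false;  // the product would not fit
    }
    *total = count * size;
    return true;
}

// Same check with the remaining capacity standing in for the maximum.
bool fits(ptrdiff_t count, ptrdiff_t size, ptrdiff_t capacity)
{
    assert(size > 0 && count >= 0 && capacity >= 0);
    return count <= capacity/size;
}
```

Because capacity never exceeds `PTRDIFF_MAX`, any count that passes `fits` also passes the overflow check, so the multiplication that follows is safe.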

Subtraction

With signed sizes, the negative range is a long “runway” allowing a single unchecked subtraction before overflow might occur. In essence, we were exploiting this in order to check addition. The most common mistake with unsigned subtraction is not accounting for overflow when going below zero.

// note: signed "i" only
for (i = end - stride; i >= beg; i -= stride) ...

This loop will go awry if i is unsigned and beg <= stride.

In special cases we can get away with a second subtraction without an overflow check if we know some properties of our operands. For example, my arena allocators look like this:

padding = -beg & (align - 1);
if (count >= (end - beg - padding)/size) error();

That’s two subtractions in a row. However, end - beg describes the size of a realized object, and align is a small constant (e.g. 2^(0–6)). It could only overflow if the entirety of memory was occupied by the arena.

Bonus, advanced note: This check is actually pulling triple duty. Notice that I used >= instead of >. The arena can’t fill exactly to the brim, but it handles the extreme edge case where count is zero, the arena is nearly full, but the bump pointer is unaligned. The result of subtracting padding is negative, which rounds to zero by integer division, and would pass a > check. That wouldn’t be a problem except that aligning the bump pointer would break the invariant beg <= end.
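The edge case is easier to see with concrete numbers. This sketch models the arena with integer addresses so the situation can be staged directly. `rejects` is a hypothetical helper, and its `strict` flag switches between the two comparisons:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model the arena with integer addresses so the edge case is easy to stage.
// Returns true if the request is rejected. strict selects >= over >.
bool rejects(uintptr_t beg, uintptr_t end, ptrdiff_t count,
             ptrdiff_t size, ptrdiff_t align, bool strict)
{
    assert(size > 0 && align > 0 && beg <= end);
    ptrdiff_t padding   = -beg & (align - 1);
    ptrdiff_t available = (ptrdiff_t)(end - beg) - padding;
    ptrdiff_t limit     = available / size;  // rounds toward zero
    return strict ? count >= limit : count > limit;
}
```

With beg at 0x1001, end at 0x1004, and 8-byte size and alignment, the padding is 7 and the available space is -4, which divides to zero. A zero-count request is rejected by `>=` but would slip past `>`, after which aligning beg to 0x1008 would move it beyond end.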

Try it for yourself

Next time you’re reviewing code that computes sizes or subscripts, bring the list up and see how well it follows the guidelines. If it misses one, try to contrive an input that causes an overflow. If it follows guidelines and you can still contrive such an input, then perhaps the list could use another item!

Speculations on arenas and custom strings in C++

2024-04-14T00:39:18Z

Update September 2025: This article has a followup with corrections.

My techniques with arena allocation and strings are oriented around C. I’m always looking for a better way, and lately I’ve been experimenting with building them using C++ features. What are the trade-offs? Are the benefits worth the costs? In this article I lay out my goals, review implementation possibilities, and discuss my findings. Following along will require familiarity with those previous two articles.

Some of C++ is beyond my mental capabilities, and so I cannot wield those parts effectively. Other parts I can wrap my head around, but it requires substantial effort and the inevitable mistakes are difficult to debug. So a general goal is to minimize contact with that complexity, only touching a few higher-value features that I can use confidently.

Existing practice is unimportant. I’ve seen where that goes. Like the C standard library, the C++ standard library offers me little. Its concepts regarding ownership and memory management are irreconcilable (move semantics, smart pointers, etc.), so I have to build from scratch anyway. So absolutely no including C++ headers. The most valuable features are built right into the language, so I won’t need to include library definitions.

No public or private. Still no const beyond what is required to access certain features. This means I can toss out a bunch of keywords like class, friend, etc. It eliminates noisy, repetitive code and interfaces — getters, setters, separate const and non-const — which in my experience means fewer defects.

No references beyond mandatory cases. References hide addresses being taken (or merely imply it, when it’s actually an expensive copy), which is an annoying experience when reading unfamiliar C++. After all, for arenas the explicit address-taking (permanent) or copying (scratch) is a critical part of communicating the interfaces.

In theory constexpr could be useful, but it keeps falling short when I try it out, so I’m ignoring it. I’ll elaborate in a moment.

Minimal template use. They blow up compile times and code size, they’re noisy, and in practice they make debug builds (i.e. -O0) much slower (typically ~10x) because there’s no optimization to clean up the mess. I’ll only use them for a few foundational purposes, such as allocation. (Though this article is about the fundamental stuff.)

No methods aside from limited use of operator overloads. I want to keep a C style, plus methods just look ugly without references: obj->func() vs. func(obj). (Why are we still writing -> in the 21st century?) Function overloading can instead differentiate “methods.” Overloads are acceptable in moderation, especially because I’m paying for it (symbol decoration) whether or not I take advantage.

Finally, no exceptions of course. I assume -fno-exceptions, or the local equivalent, is active.

Allocation

Let’s start with allocation. Since writing that previous article, I’ve streamlined arena allocation in C:

#define new(a, t, n)  (t *)alloc(a, sizeof(t), _Alignof(t), n)

typedef struct {
    byte *beg;
    byte *end;
} arena;

static byte *alloc(arena *a, size objsize, size align, size count)
{
    assert(count >= 0);
    size pad = (uptr)a->end & (align - 1);
    assert(count < (a->end - a->beg - pad)/objsize);  // oom
    return memset(a->end -= objsize*count + pad, 0, objsize*count);
}

(As needed, replace the second assert with whatever out of memory policy is appropriate.) Then allocating, say, a 10k-element hash table (i.e. to keep it off the stack):

    i16 *seen = new(&scratch, i16, 1<<14);

With C++, I initially tried placement new with the arena as the “place” for the allocation:

void *operator new(size_t, arena *);  // avoid this

Then to create a single object:

    object *o = new (&scratch) object{};

This exposes the constructor, but everything else about it is poor. It relies on complex, finicky rules governing new overloads, especially for alignment handling. It’s difficult to tell what’s happening, and it’s too easy to make mistakes that compile. That doesn’t even count the mess that is array new[].

I soon learned it’s better to replace the new macro with a template, which can actually see what it’s doing. I can’t call it new in C++, so I settled on make instead:

template<typename T>
static T *make(arena *a, size count = 1)
{
    assert(count >= 0);
    size objsize = sizeof(T);
    size align   = alignof(T);
    size pad     = (uptr)a->end & (align - 1);
    assert(count < (a->end - a->beg - pad)/objsize);  // oom
    a->end -= objsize*count + pad;
    T *r = (T *)a->end;
    for (size i = 0; i < count; i++) {
        new ((void *)&r[i]) T{};
    }
    return r;
}

Then allocating that hash table becomes:

    i16 *seen = make<i16>(&scratch, 1<<14);

Or a single object, relying on the default argument:

    object *o = make<object>(&scratch);

Due to placement new, merely for invoking the constructor, these objects aren’t just zero-initialized, but value-initialized. It can only construct objects that define an empty initializer, but in exchange unlocks some interesting possibilities:

struct mat3 {
    f32 data[9] = {
        1, 0, 0,
        0, 1, 0,
        0, 0, 1,
    };
};

struct list {
    node  *head = 0;
    node **tail = &head;
};

When a zero-initialized state isn’t ideal, objects can still initialize to a more useful state straight out of the arena. The second case is even self-referencing, which is specifically supported through placement new. Otherwise you’d need a special-written copy or move constructor.
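To make the self-referencing case concrete, here is a minimal sketch that constructs a list in raw storage the same way make does. It uses the standard placement new from the <new> header rather than a hand-written definition, and the construct_list and push helpers are hypothetical names introduced only for illustration.

```cpp
#include <assert.h>
#include <new>  // standard placement new, assumed here for brevity

struct node {
    node *next  = 0;
    int   value = 0;
};

struct list {
    node  *head = 0;
    node **tail = &head;  // refers to this object's own head
};

// Construct a list directly in caller-provided storage, as make() does.
static list *construct_list(void *storage)
{
    return new (storage) list{};
}

// Hypothetical helper: append by writing through tail, so an empty
// list needs no special case.
static void push(list *l, node *n)
{
    *l->tail = n;
    l->tail  = &n->next;
}
```

Because the object is constructed in place, tail points at the head living in the arena. A plain memberwise copy of an empty list would instead leave the copy's tail pointing into the original, which is the situation that calls for a custom copy or move constructor.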

make could accept constructor arguments and perfect forward them to a constructor. However, that’s too far into the dark arts for my comfort, plus it requires a correct definition of std::forward. In practice that means #include-ing it, and whatever comes in with it. Or ask an expert capable of writing such a definition from scratch, though both are probably too busy.

Update 1: One of those experts, Jonathan Müller, kindly reached out to say that a static cast is sufficient. This is easy to do:

template<typename T, typename ...A>
static T *make(arena *a, size count = 1, A &&...args)
{
    // ...
        new ((void *)&r[i]) T{(A &&)args...};
    // ...
}

Update 2: I later realized that because I do not care about copy or move semantics, I also don’t care about perfect forwarding. I can simply expand the parameter pack without casting or &&. I also don’t want the extra restrictions on braced initializer conversions, so better to use parentheses with new.

template<typename T, typename ...A>
static T *make(arena *a, size count = 1, A ...args)
{
    // ...
        new ((void *)&r[i]) T(args...);
    // ...
}

One small gotcha: placement new doesn’t work out of the box, and you need to provide a definition. That means including or writing one out. Fortunately it’s trivial, but the prototype must exactly match, including size_t:

void *operator new(size_t, void *p) { return p; }

Overall I feel the template is a small improvement over the macro.
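For reference, a self-contained sketch of the final template, under a few assumptions: the typedefs (byte, size, uptr, i16) are guesses at the definitions from the earlier article, the standard <new> header stands in for the one-line placement new definition, and the point struct is hypothetical, present only to exercise constructor arguments.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>  // standard placement new instead of a hand-written one

// Assumed typedefs, in the style of the earlier article.
typedef char      byte;
typedef ptrdiff_t size;
typedef uintptr_t uptr;
typedef int16_t   i16;

struct arena {
    byte *beg;
    byte *end;
};

// Allocate count objects from the end of the arena, constructing each
// one with the given arguments (or value-initializing if there are none).
template<typename T, typename ...A>
static T *make(arena *a, size count = 1, A ...args)
{
    assert(count >= 0);
    size objsize = sizeof(T);
    size align   = alignof(T);
    size pad     = (uptr)a->end & (align - 1);
    assert(count < (a->end - a->beg - pad)/objsize);  // oom
    a->end -= objsize*count + pad;
    T *r = (T *)a->end;
    for (size i = 0; i < count; i++) {
        new ((void *)&r[i]) T(args...);
    }
    return r;
}

// Hypothetical type with a non-default constructor.
struct point {
    int x = 0;
    int y = 0;
    point() = default;
    point(int x, int y) : x(x), y(y) {}
};
```

One consequence of this signature: because count comes before the parameter pack, passing constructor arguments means spelling out the count, e.g. make<point>(&scratch, 1, 4, 5) for a single point.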

Strings

Recall my basic C string type, with a macro to wrap literals:

#define countof(a)  (size)(sizeof(a) / sizeof(*(a)))
#define s8(s)       (s8){(u8 *)s, countof(s)-1}

typedef struct {
    u8  *data;
    size len;
} s8;

Since it doesn’t own the underlying buffer (region-based allocation has already solved the ownership problem), this is what C++ long-windedly calls a std::string_view. In C++ we won’t need the countof macro for strings, but it’s still generally useful. Converting it to a template is theoretically more robust (it rejects pointers), but comes with a non-zero cost:

template<typename T, size N>
size countof(T (&)[N])
{
    return N;
}

The reference — here a reference to an array — is unavoidable, so it’s one of the rare cases. The same concept applies as an s8 constructor to replace the macro:

struct s8 {
    u8  *data = 0;
    size len  = 0;

    s8() = default;

    template<size N>
    s8(const char (&s)[N]) : data{(u8 *)s}, len{N-1} {}
};

I’ve explicitly asked to keep a default zero-initialized (empty) string since it’s useful — and necessary to directly allocate strings using make, e.g. an array of strings. const is required because string literals are const in C++, but it’s immediately stripped off for the sake of simplicity. The new constructor allows:

    s8 version = "1.2.3";

Or even more usefully:

    void print(bufout *, s8);
    // ...
    print(stdout, "hello world\n");

Define operator== and it’s more useful yet:

    b32 operator==(s8 s)
    {
        return len==s.len && (!len || !memcmp(data, s.data, len));
    }

Now this works, and it’s cheap and fast even in debug builds:

    s8 key = ...;
    if (key == "HOME") {
        // ...
    }

That’s more ergonomic than the macro and comparison function. operator[] also improves ergonomics, to subscript a string without going through the data member:

    u8 &operator[](size i)
    {
        assert(i >= 0);
        assert(i < len);
        return data[i];
    }

The reference is again necessary to make subscripts assignable. Since s8span — make a string spanning two pointers — so often appears in my programs, a constructor seems appropriate, too:

    s8(u8 *beg, u8 *end)
    {
        assert(beg <= end);
        data = beg;
        len = end - beg;
    }

By the way, these assertions I’ve been using are great for catching mistakes quickly and early, and they complement fuzz testing.
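Assembled into one definition, the pieces so far might look like the following sketch. The typedefs are assumptions standing in for the ones from the earlier article; everything else is taken from the fragments above.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Assumed typedefs, in the style of the earlier article.
typedef unsigned char u8;
typedef ptrdiff_t     size;
typedef int           b32;

struct s8 {
    u8  *data = 0;
    size len  = 0;

    s8() = default;

    // Wrap a string literal, dropping the null terminator.
    template<size N>
    s8(const char (&s)[N]) : data{(u8 *)s}, len{N-1} {}

    // Span two pointers into the same buffer.
    s8(u8 *beg, u8 *end)
    {
        assert(beg <= end);
        data = beg;
        len  = end - beg;
    }

    b32 operator==(s8 s)
    {
        return len==s.len && (!len || !memcmp(data, s.data, len));
    }

    u8 &operator[](size i)
    {
        assert(i >= 0);
        assert(i < len);
        return data[i];
    }
};
```

Note that a string built from a literal points at read-only storage, so assigning through operator[] only makes sense for strings backed by writable buffers.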

I’m not sold on it, but an idea for the future: C++23’s multi-index operator[] as a slice operator:

    s8 operator[](size beg, size end)
    {
        assert(beg >= 0);
        assert(beg <= end);
        assert(end <= len);
        return {data+beg, data+end};
    }

Then:

    s8 msg = "foo bar baz";
    msg = msg[4,7];  // msg = "bar"

I could keep going with, say, iterators and such, but each will be more specialized and less useful. (I don’t care about range-based for loops.)

Downside: static initialization

The new string stuff is neat, but I hit a wall trying it out: These fancy constructors do not reliably construct at compile time, not even with a constexpr qualifier in two of the three major C++ implementations. A static lookup table that contains a string is likely constructed at run time in at least some builds. For example, this table:

static s8 keys[] = {"foo", "bar", "baz"};

Requires run-time construction in real world cases I care about, requiring C++ magic and linking runtime gunk. The constructor is therefore a strict downgrade from the macro, which works perfectly in these lookup tables. Once a non-default constructor is defined, I’ve been unable to find an escape hatch back to the original, dumb, reliable behavior.

Update: Jonathan Müller points out the reinterpret cast is forbidden in a constexpr function, so it’s not required to happen at compile time. After some thought, I’ve figured out a workaround using a union:

struct s8 {
    union {
        u8         *data = 0;
        const char *cdata;
    };
    size len = 0;

    template<size N>
    constexpr s8(const char (&s)[N]) : cdata{s}, len{N-1} {}

    // ...
};

In all three C++ implementations, in all configurations, this reliably constructs strings at compile time. The other semantics are unchanged.
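One way to check the compile-time claim is to declare the table constexpr and test it with static_assert, since neither compiles unless construction happens at compile time. This is a sketch under assumptions: the typedefs are guesses at the earlier article's definitions, and the checks read only cdata, the member the constructor actually initializes.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Assumed typedefs, in the style of the earlier article.
typedef unsigned char u8;
typedef ptrdiff_t     size;

struct s8 {
    union {
        u8         *data = 0;
        const char *cdata;
    };
    size len = 0;

    s8() = default;

    // No cast is needed to initialize cdata, so this can be constexpr.
    template<size N>
    constexpr s8(const char (&s)[N]) : cdata{s}, len{N-1} {}
};

// constexpr forces constant initialization; a run-time constructor
// would be a compile error here.
static constexpr s8 keys[] = {"foo", "bar", "baz"};

static_assert(keys[0].len == 3, "length excludes the terminator");
static_assert(keys[2].cdata[2] == 'z', "contents visible at compile time");
```

The rest of the program would still access the string through data, which relies on both pointer members sharing a representation.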

Other features

Having a generic dynamic array would be handy, and more ergonomic than my dynamic array macro:

template<typename T>
struct slice {
    T   *data = 0;
    size len  = 0;
    size cap  = 0;

    slice<T>() = default;

    template<size N>
    slice<T>(T (&a)[N]) : data{a}, len{N}, cap{N} {}

    T &operator[](size i) { ... }
};

template<typename T>
slice<T> append(arena *, slice<T>, T);
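As a rough sketch of how append could be implemented on top of the arena: the doubling growth policy, the starting capacity of 4, and the typedefs are all assumptions, not something the article specifies. Old backing arrays are simply abandoned in the arena.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <new>  // standard placement new

// Assumed typedefs, in the style of the earlier article.
typedef char      byte;
typedef ptrdiff_t size;
typedef uintptr_t uptr;

struct arena {
    byte *beg;
    byte *end;
};

template<typename T>
static T *make(arena *a, size count = 1)
{
    assert(count >= 0);
    size objsize = sizeof(T);
    size align   = alignof(T);
    size pad     = (uptr)a->end & (align - 1);
    assert(count < (a->end - a->beg - pad)/objsize);  // oom
    a->end -= objsize*count + pad;
    T *r = (T *)a->end;
    for (size i = 0; i < count; i++) {
        new ((void *)&r[i]) T{};
    }
    return r;
}

template<typename T>
struct slice {
    T   *data = 0;
    size len  = 0;
    size cap  = 0;

    T &operator[](size i)
    {
        assert(i >= 0);
        assert(i < len);
        return data[i];
    }
};

// Assumed policy: when full, double the capacity and copy the
// elements into a fresh arena allocation.
template<typename T>
static slice<T> append(arena *a, slice<T> s, T v)
{
    if (s.len == s.cap) {
        size cap  = s.cap ? s.cap*2 : 4;
        T   *data = make<T>(a, cap);
        for (size i = 0; i < s.len; i++) {
            data[i] = s.data[i];
        }
        s.data = data;
        s.cap  = cap;
    }
    s.data[s.len++] = v;
    return s;
}
```

Since the slice is passed and returned by value, usage follows the same pattern as the C macro version: s = append(&scratch, s, value).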

On the other hand, hash maps are mostly solved, so I wouldn’t bother with a generic map.

Function overloads would simplify naming. For example, this in C:

prints8(bufout *, s8);
printi32(bufout *, i32);
printf64(bufout *, f64);
printvec3(bufout *, vec3);

Would hide that stuff behind the scenes in the symbol decoration:

print(bufout *, s8);
print(bufout *, i32);
print(bufout *, f64);
print(bufout *, vec3);

Same goes for a hash() function on different types.
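A sketch of what that might look like for hash(). The hash functions themselves (FNV-1a for strings, a multiply and shift for integers) and the simplified s8 are illustrative choices, not taken from the article.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Assumed typedefs, in the style of the earlier article.
typedef unsigned char u8;
typedef int32_t       i32;
typedef uint32_t      u32;
typedef uint64_t      u64;
typedef ptrdiff_t     size;

// Simplified string type, without the constructors discussed earlier.
struct s8 {
    u8  *data;
    size len;
};

// FNV-1a over the bytes of the string (an assumed choice of hash).
static u64 hash(s8 s)
{
    u64 h = 0xcbf29ce484222325u;
    for (size i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 0x100000001b3u;
    }
    return h;
}

// Multiply by an odd constant, then fold the high bits down
// (also an assumed choice).
static u64 hash(i32 x)
{
    u64 h = (u64)(u32)x * 1111111111111111111u;
    return h ^ h>>32;
}
```

Callers write hash(key) regardless of the key type, and the compiler selects the definition from the argument.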

C++ has better null pointer semantics than C. Addition or subtraction of zero with a null pointer produces a null pointer, and subtracting null pointers results in zero. This eliminates some boneheaded special case checks required in C, though not all: memcpy, for instance, arbitrarily still does not accept null pointers even in C++.
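A small sketch of the difference, using a hypothetical copybytes helper: the pointer arithmetic needs no null check, while the memcpy call still does.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Assumed typedefs, in the style of the earlier article.
typedef unsigned char u8;
typedef ptrdiff_t     size;

// Hypothetical helper: copy len bytes and return the end of the
// destination. Null pointers are fine as long as len is zero.
static u8 *copybytes(u8 *dst, u8 *src, size len)
{
    assert(len >= 0);
    if (len) {  // the special case memcpy still requires
        memcpy(dst, src, len);
    }
    return dst + len;  // null + 0 is null in C++, so no check here
}
```

With an empty, null-backed buffer this returns null rather than invoking undefined behavior in the addition.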

Ultimately worth it?

The static data problem is a real bummer, but perhaps it’s worth it for the other features. I still need to put it all to the test in a real, sizable project.

An improved chkstk function on Windows

2024-02-05T17:56:05Z

If you’ve spent much time developing with Mingw-w64 you’ve likely seen the symbol ___chkstk_ms, perhaps in an error message. It’s a little piece of runtime provided by GCC via libgcc which ensures enough of the stack is committed for the caller’s stack frame. The “function” uses a custom ABI and is implemented in assembly. So is the subject of this article, a slightly improved implementation soon to be included in w64devkit as libchkstk (-lchkstk).

The MSVC toolchain has an identical (x64) or similar (x86) function named __chkstk. We’ll discuss that as well, and w64devkit will include x86 and x64 implementations, useful when linking with MSVC object files. The new x86 __chkstk in particular is also better than the MSVC definition.

A note on spelling: ___chkstk_ms is spelled with three underscores, and __chkstk is spelled with two. On x86, cdecl functions are decorated with a leading underscore, and so may be rendered, e.g. in error messages, with one fewer underscore. The true name is undecorated, and the raw symbol name is identical on x86 and x64. Further complicating matters, libgcc defines a ___chkstk with three underscores. As far as I can tell, this spelling arose from confusion regarding name decoration, but nobody’s noticed for the past 28 years. libgcc’s x64 ___chkstk is obviously and badly broken, so I’m sure nobody has ever used it anyway, not even by accident thanks to the misspelling. I’ll touch on that below.

When referring to a particular instance, I will use a specific spelling. Otherwise the term “chkstk” refers to the family. If you’d like to skip ahead to the source for libchkstk: libchkstk.S.

A gradually committed stack

The header of a Windows executable lists two stack sizes: a reserve size and an initial commit size. The first is the largest the main thread stack can grow, and the second is the amount committed when the program starts. A program gradually commits stack pages as needed up to the reserve size. Binutils objdump option -p lists the sizes. Typical output for a Mingw-w64 program:

$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve      0000000000200000
SizeOfStackCommit       0000000000001000

The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB initially committed. With the Binutils linker, ld, you can set them at link time using --stack. Via gcc, use -Xlinker. For example, to reserve an 8MiB stack and commit half of it:

$ gcc -Xlinker --stack=$((8<<20)),$((4<<20)) ...

MSVC link.exe similarly has /stack.

The purpose of this mechanism is to avoid paying the commit charge for unused stack. It made sense 30 years ago when stacks were a potentially large portion of physical memory. These days it’s a rounding error and silly we’re still dealing with it. Using the above options you can choose to commit the entire stack up front, at which point a chkstk helper is no longer needed (-mno-stack-arg-probe, /Gs2147483647). This requires link-time control of the main module, which isn’t always an option, like when supplying a DLL for someone else to run.

The program grows the stack by touching the singular guard page mapped between the committed and uncommitted portions of the stack. This action triggers a page fault, and the default fault handler commits the guard page and maps a new guard page just below. In other words, the stack grows one page at a time, in order.

In most cases nothing special needs to happen. The guard page mechanism is transparent and in the background. However, if a function stack frame exceeds the page size then there’s a chance that it might leap over the guard page, crashing the program. To prevent this, compilers insert a chkstk call in the function prologue. Before local variable allocation, chkstk walks down the stack — that is, towards lower addresses — nudging the guard page with each step. (As a side effect it provides stack clash protection — the only security aspect of chkstk.) For example:

void callee(char *);

void example(void)
{
    char large[1<<20];
    callee(large);
}

Compiled with 64-bit gcc -O:

example:
    movl    $1048616, %eax
    call    ___chkstk_ms
    subq    %rax, %rsp
    leaq    32(%rsp), %rcx
    call    callee
    addq    $1048616, %rsp
    ret

I used GCC, but this is practically identical to the code generated by MSVC and Clang. Note the call to ___chkstk_ms in the function prologue before allocating the stack frame (subq). Also note that it sets eax. As a volatile register, this would normally accomplish nothing because it’s done just before a function call, but recall that ___chkstk_ms has a custom ABI. That’s the argument to chkstk. Further note that it uses rax on the return. That’s not the value returned by chkstk, but rather that x64 chkstk preserves all registers.

Well, maybe. The official documentation says that registers r10 and r11 are volatile, but that information conflicts with Microsoft’s own implementation. Just in case, I choose a conservative interpretation that all registers are preserved.

Implementing chkstk

In a high level language, chkstk might look something like so:

// NOTE: hypothetical implementation
void ___chkstk_ms(ptrdiff_t frame_size)
{
    volatile char frame[frame_size];  // NOTE: variable-length array
    for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
        frame[i] = 0;  // touch the guard page
    }
}

This wouldn’t work for a number of reasons, but if it did, volatile would serve two purposes. First, forcing the side effect to occur. The second is more subtle: The loop must happen in exactly this order, from high to low. Without volatile, loop iterations would be independent — as there are no dependencies between iterations — and so a compiler could reverse the loop direction.

The store can happen anywhere within the guard page, so it’s not necessary to align frame to the page. Simply touching at least one byte per page is enough. This is essentially the definition of libgcc ___chkstk_ms.

How many iterations occur? In the example above, the stack frame will be around 1MiB (2²⁰). With pages of 4KiB (2¹²) that’s 256 iterations. The loop happens unconditionally, meaning every function call requires 256 iterations of this loop. Wouldn’t it be better if the loop ran only as needed, i.e. the first time? MSVC x64 __chkstk skips iterations if possible, and the same goes for my new ___chkstk_ms. Much like the command line string, the low address of the current thread’s guard page is accessible through the Thread Information Block (TIB). A chkstk can cheaply query this address, only looping during initialization or so. (In contrast to Linux, a thread’s stack is fundamentally managed by the operating system.)

Taking that into account, an improved algorithm:

1. Push registers that will be used
2. Compute the low address of the new stack frame (F)
3. Retrieve the low address of the committed stack (C)
4. Go to 7
5. Subtract the page size from C
6. Touch memory at C
7. If C > F, go to 5
8. Pop registers to restore them and return

A little unusual for an unconditional forward jump in pseudo-code, but this closely matches my assembly. The loop causes page faults, and it’s the slow, uncommon path. The common, fast path never executes 5–6. I also chose smaller instructions in order to keep the function small and reduce instruction cache pressure. My x64 implementation as of this writing:

___chkstk_ms:
    push %rax              // 1.
    push %rcx              // 1.
    neg  %rax              // 2. rax = frame low address
    add  %rsp, %rax        // 2. "
    mov  %gs:(0x10), %rcx  // 3. rcx = stack low address
    jmp  1f                // 4.
0:  sub  $0x1000, %rcx     // 5.
    test %eax, (%rcx)      // 6. page fault (very slow!)
1:  cmp  %rax, %rcx        // 7.
    ja   0b                // 7.
    pop  %rcx              // 8.
    pop  %rax              // 8.
    ret                    // 8.

I’ve labeled each instruction with its corresponding pseudo-code. Step 6 is unusual among chkstk implementations: It’s not a store, but a load, still sufficient to fault the page. That test instruction is just two bytes, and unlike other two-byte options, doesn’t write garbage onto the stack (which would be allowed) nor use an extra register. I searched through single byte instructions that can page fault, all of which involve implicit addressing through rdi or rsi, but they increment rdi or rsi, and would require another instruction to correct it.

Because of the return address and two push operations, the low stack frame address is technically too low by 24 bytes. That’s fine. If this exhausts the stack, the program is really cutting it close and the stack is too small anyway. I could be more precise — which, as we’ll soon see, is required for x86 __chkstk — but it would cost an extra instruction byte.

On x64, ___chkstk_ms and __chkstk have identical semantics, so name it __chkstk — which I’ve done in libchkstk — and it works with MSVC. The only practical difference between my chkstk and MSVC __chkstk is that mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking the optimization, is libgcc ___chkstk_ms, weighing 50 bytes, or in practice, due to an unfortunate Binutils default of padding sections, 64 bytes.

I’m no assembly guru, and I bet this can be even smaller without hurting the fast path, but this is the best I could come up with at this time.
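To see the effect of the fast path, the numbered steps can be modeled in portable C++. This is only a simulation over plain integers, not a usable chkstk: the probe function, its result struct, and the addresses used with it are hypothetical, and the "touch" is just a counter.

```cpp
#include <assert.h>
#include <stdint.h>

typedef uintptr_t uptr;

struct probe_result {
    uptr committed;  // new low address of the committed stack
    int  touches;    // guard page hits, i.e. slow page faults
};

// Model of the improved algorithm. Steps 1 and 8 (saving and
// restoring registers) have no equivalent here.
static probe_result probe(uptr sp, uptr frame_size, uptr committed)
{
    uptr f = sp - frame_size;  // 2. low address of the new frame (F)
    uptr c = committed;        // 3. low address of committed stack (C)
    int touches = 0;
    while (c > f) {            // 7. loop only while F lies below C
        c -= 0x1000;           // 5. step down one page
        touches++;             // 6. the real code loads from C here
    }
    return {c, touches};
}
```

Using the 1048616-byte frame from the earlier example with one page initially committed, the first call touches 256 pages, and a second call with the updated committed address touches none.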

Update: Stefan Kanthak, who has extensively explored this topic, points out that large stack frame requests might overflow my low frame address calculation at (2), effectively disabling the probe. Such requests might occur from alloca calls or variable-length arrays (VLAs) with untrusted sizes. As far as I’m concerned, such programs are already broken, but it only cost a two-byte instruction to deal with it. I have not changed this article, but the source in w64devkit has been updated.

32-bit chkstk

On x86 ___chkstk_ms has identical semantics to x64. Mine is a copy-paste of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC was ahead of the curve on this design.

However, x86 __chkstk is bonkers. It not only commits the stack, but also allocates the stack frame. That is, it returns with a different stack pointer. The return pointer is initially inside the new stack frame, so chkstk must retrieve it and return by other means. It must also precisely compute the low frame address.

__chkstk:
    push %ecx               // 1.
    neg  %eax               // 2.
    lea  8(%esp,%eax), %eax // 2.
    mov  %fs:(0x08), %ecx   // 3.
    jmp  1f                 // 4.
0:  sub  $0x1000, %ecx      // 5.
    test %eax, (%ecx)       // 6. page fault (very slow!)
1:  cmp  %eax, %ecx         // 7.
    ja   0b                 // 7.
    pop  %ecx               // 8.
    xchg %eax, %esp         // ?. allocate frame
    jmp  *(%eax)            // 8. return

The main differences are:

eax is treated as volatile, so it is not saved
The low frame address is precisely computed with lea (2)
The frame is allocated at step (?) by swapping F and the stack pointer
Post-swap F now points at the return address, so jump through it

MSVC x86 __chkstk does not query the TIB (3), and so unconditionally runs the loop. So there’s an advantage to my implementation besides size.

libgcc x86 ___chkstk has this behavior, and so it’s also a suitable __chkstk aside from the misspelling. Strangely, libgcc x64 ___chkstk also allocates the stack frame, which is never how chkstk was supposed to work on x64. I can only conclude it’s never been used.

Optimization in practice

Does the skip-the-loop optimization matter in practice? Consider a function using a large-ish, stack-allocated array, perhaps to process environment variables or long paths, each of which max out around 64KiB.

_Bool path_contains(wchar_t *name, wchar_t *path)
{
    wchar_t var[1<<15];
    GetEnvironmentVariableW(name, var, countof(var));
    // ... search for path in var ...
}

int64_t getfilesize(char *path)
{
    wchar_t wide[1<<15];
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wide, countof(wide));
    // ... look up file size via wide path ...
}

void example(void)
{
    if (path_contains(L"PATH", L"c:\\windows\\system32")) {
        // ...
    }

    int64_t size = getfilesize("π.txt");
    // ...
}

Each call to these functions with such large local arrays is also a call to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely detectable in a benchmark. If the function touches the file system, which is likely when processing paths, then chkstk doesn’t matter at all. My starting example had a 1MiB array, or 256 chkstk iterations. That starts to become measurable, though it’s also pushing the limits. At that point you ought to be using a scratch arena.
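The iteration counts are simple division, shown here as a compile-time check. The probes helper is a hypothetical name, and the element size is written out as 2 bytes because wchar_t is 16 bits on Windows but not everywhere else.

```cpp
#include <assert.h>

// Pages touched by an unconditional probe loop over a frame of the
// given size, assuming 4KiB pages by default.
constexpr long long probes(long long frame_bytes, long long page_size = 4096)
{
    return frame_bytes / page_size;
}

// wchar_t var[1<<15] with 2-byte elements is a 64KiB frame.
static_assert(probes((1LL<<15) * 2) == 16, "64KiB frame");

// char large[1<<20] is a 1MiB frame.
static_assert(probes(1LL<<20) == 256, "1MiB frame");
```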

So ultimately after writing an improved ___chkstk_ms I could only measure a tiny difference in contrived programs, and none in any real application. Though there’s still one more benefit I haven’t yet mentioned…

“The first thing we do, let’s kill all the lawyers”.

My original motivation for this project wasn’t the optimization — which I didn’t even discover until after I had started — but licensing. I hate software licenses, and the tools I’ve written for w64devkit are dedicated to the public domain. Both source and binaries (as distributed). I can do so because I don’t link runtime components, not even libgcc. Not even header files. Every byte of code in those binaries is my work or the work of my collaborators.

Every once in a while ___chkstk_ms rears its ugly head, and I have to make a decision. Do I re-work my code to avoid it? Do I take the reins of the linker and disable stack probes? I haven’t necessarily allocated a large local array: A bit of luck with function inlining can combine several smaller stack frames into one that’s just large enough to require chkstk.

Since libgcc falls under the GCC Runtime Library Exception, if it’s linked into my program through an “Eligible Compilation Process” — which I believe includes w64devkit — then the GPL-licensed functions embedded in my binary are legally siloed and the GPL doesn’t infect the rest of the program. These bits are still GPL in isolation, and if someone were to copy them out of the program then they’d be normal GPL code again. In other words, it’s not a 100% public domain binary if libgcc was linked!

(If some FSF lawyer says I’m wrong, then this is an escape hatch through which anyone can scrub the GPL from GCC runtime code, and then ignore the runtime exception entirely.)

MSVC is worse. Hardly anyone follows its license, but fortunately for most the license is practically unenforced. Its chkstk, which currently resides in a loose chkstk.obj, falls into what Microsoft calls “Distributable Code.” Its license requires “external end users to agree to terms that protect the Distributable Code.” In other words, if you compile a program with MSVC, you’re required to have a EULA including the relevant terms from the Visual Studio license. You’re not legally permitted to distribute software in the manner of w64devkit — no installer, just a portable zip distribution — if that software has been built with MSVC. At least not without special care which nobody does. (Don’t worry, I won’t tell.)

How to use libchkstk

To avoid libgcc entirely you need -nostdlib. Otherwise it’s implicitly offered to the linker, and you’d need to manually check if it picked up code from libgcc. If ld complains about a missing chkstk, use -lchkstk to get a definition. If you use -lchkstk when it’s not needed, nothing happens, so it’s safe to always include.

I also recently added a libmemory to w64devkit, providing tiny, public domain definitions of memset, memcpy, memmove, memcmp, and strlen. All compilers fabricate calls to these five functions even if you don’t call them yourself, which is how they were selected. (Not because I like them. I really don’t.) If a -nostdlib build complains about these, too, then add -lmemory.

$ gcc -nostdlib ... -lchkstk -lmemory

In MSVC the equivalent option is /nodefaultlib, after which you may see missing chkstk errors, and perhaps more. libchkstk.a is compatible with MSVC, and link.exe doesn’t care that the extension is .a rather than .lib, so supply it at link time. Same goes for libmemory.a if you need any of those, too.

$ cl ... /link /nodefaultlib libchkstk.a libmemory.a

While I despise licenses, I still take them seriously in the software I distribute. With libchkstk I have another tool to get it under control.

Big thanks to Felipe Garcia for reviewing and correcting mistakes in this article before it was published!

Two handy GDB breakpoint tricks

2024-01-28T21:56:07Z

Over the past couple months I’ve discovered a couple of handy tricks for working with GDB breakpoints. I figured these out on my own, and I’ve not seen either discussed elsewhere, so I really ought to share them.

Continuable assertions

The assert macro in typical C implementations leaves a lot to be desired, as does raise and abort, so I’ve suggested alternative definitions that behave better under debuggers:

#define assert(c)  while (!(c)) __builtin_trap()
#define assert(c)  while (!(c)) __builtin_unreachable()
#define assert(c)  while (!(c)) *(volatile int *)0 = 0

Each serves a slightly different purpose but still has the most important property: Immediately halt the program directly on the defect. None have an occasionally useful secondary property: Optionally allow the program to continue through the defect. If the program reaches the body of any of these macros then there is no reliable continuation. Even manually nudging the instruction pointer over the assertion isn’t enough. Compilers assume that the program cannot continue through the condition and generate code accordingly.

The MSVC ecosystem has a solution for this on x86: int3. The portable name is __debugbreak, a name I’ve borrowed elsewhere.

#define assert(c)  do if (!(c)) __debugbreak(); while (0)

On x86 it inserts an int3 instruction, which fires an interrupt, trapping in the attached debugger, or otherwise abnormally terminating the program. Because it’s an interrupt, it’s expected that the program might continue. It even leaves the instruction pointer on the next instruction. As of this writing, GCC has no matching intrinsic, but Clang recently added __builtin_debugtrap. In GCC you need some less portable inline assembly: asm ("int3").

However, regardless of how you get an int3 in your program, GDB does not currently understand it. The problem is that feature I mentioned: The instruction pointer does not point at the int3 but the next instruction. This confuses GDB, causing it to break in the wrong places, possibly even in the wrong scope. For example:

for (int i = 0; i < n; i++) {
    // ...
    int3_assert(...);
}

With int3 at the very end of the loop, GDB will break at the top of the next loop iteration, because that’s where the instruction pointer lands by the time GDB is involved. It’s a similar story when placed at the end of a function, leaving GDB to break in the caller. To resolve this, we need the instruction pointer to still be “inside” the breakpoint after the interrupt fires. Easy! Add a nop:

#define breakpoint()  asm ("int3; nop")

This behaves beautifully, eliminating all the problems GDB has with a plain int3. Not only is this a solid basis for a continuable assertion, it’s also useful as a fast conditional breakpoint, where conventional conditional breakpoints are far too slow.

for (int i = 0; i < 1000000000; i++) {
    if (/* rare condition */) breakpoint();
    // ...
}
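Putting the two pieces together, a continuable assertion might be sketched as follows. The name int3_assert is borrowed from the loop example above, the sum function is hypothetical, and the non-x86 fallback to __builtin_trap is an assumption that gives up continuability. This relies on GCC or Clang inline assembly.

```cpp
#include <assert.h>

#if defined(__i386__) || defined(__x86_64__)
#  define breakpoint()  asm ("int3; nop")
#else
#  define breakpoint()  __builtin_trap()  // assumed fallback, not continuable
#endif

// Traps in the debugger on failure, after which "continue" resumes
// the program just past the assertion.
#define int3_assert(c)  do if (!(c)) breakpoint(); while (0)

// Hypothetical function using the assertion.
static int sum(int *nums, int n)
{
    int3_assert(n >= 0);
    int total = 0;
    for (int i = 0; i < n; i++) {
        int3_assert(nums[i] >= 0);
        total += nums[i];
    }
    return total;
}
```

Outside a debugger a failed assertion still terminates the program abnormally, so the primary property is preserved.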

Could GDB handle int3 better? Yes! Visual Studio, for instance, does not require the nop instruction. As far as I know there is no ARM equivalent compatible with GDB (or even LLDB). The closest instruction, brk #0x1, does not behave as needed.

Named positions

GDB’s built-in user interface understands three classes of breakpoint positions: symbols, context-free line numbers, and absolute addresses. When you set some breakpoints and (re)start a program under GDB, each kind of breakpoint is handled differently:

Resolve each symbol, placing a breakpoint on its run-time address.
Map each file+lineno tuple to a run-time address, and place a breakpoint on that address. If the line does not exist (i.e. the file is shorter), skip it.
Place breakpoints exactly on each absolute address. If it’s not a mapped address, don’t start the program.

The first is the best case because it adapts to program changes. Modify the code, recompile, and the breakpoint generally remains where you want it.

The third is the least useful. These breakpoints rarely survive across rebuilds, and sometimes not even across reruns.

The second is in the middle between useful and useless. If you edit the source file which has the breakpoint — likely, because you placed the breakpoint there for a reason — chances are high that the line number is no longer correct. Instead it drifts, requiring manual replacement. This is tedious and GDB ought to do better. Think that’s unreasonable? The Visual Studio debugger does exactly that quite effectively through external code edits! GDB front ends tend to handle it better, especially when they’re also the code editor and so directly observe all edits.

As a workaround we can get the first kind by temporarily naming a line number. This requires editing the source, but remember, the very reason we need it is because the source in question is actively changing. How to name a line? C and C++ labels give a name to program position:

void example(double *nums, int n, ...)
{
    for (int i = 0; i < n; i++) {
        loop:  // named position at the start of the loop
        // ...
    }
}

The name loop is local to example, but the qualified example:loop is a global name, as suitable as any other symbol. I could, say, reliably trace the progress of this loop despite changes to its position in the source.

(gdb) dprintf example:loop,"nums[%d] = %g\n",i,nums[i]

One downside is dealing with -Wunused-label (enabled by -Wall), and so I’ve considered disabling the warning in my defaults. Update: Matthew Fernandez pointed out that the unused label attribute eliminates the warning, solving my problem:

    for (int i = 0; i < n; i++) {
        loop: __attribute((unused))
        // ...
    }

More often I use an assembly label, usually named b for convenience:

    for (int i = 0; i < n; i++) {
        asm ("b:");
        // ...
    }

Like int3, sometimes it’s necessary to give it a nop so that GDB has something on which to break. “Enabling” it at any time is quick:

(gdb) b b

Because it’s not .globl, it’s a local symbol, and I can place up to one per translation unit, all covered by the same GDB breakpoint item (less useful than it sounds). I haven’t actually checked, but I probably more often use dprintf with such named lines than actual breakpoints.

If you have similar tips and tricks of your own, I’d like to learn about them!

Hand-written Windows API prototypes: fast, flexible, and tedious

2023-05-31T01:38:31Z

I love fast builds, and for years I’ve been bothered by the build penalty for translation units including windows.h. This header has an enormous number of definitions and declarations and so, for C programs, it tends to dominate the build time of those translation units. Most programs, especially systems software, only need a tiny portion of it. For example, when compiling u-config with GCC, two thirds of the debug build was spent processing windows.h just for 4 types, 16 definitions, and 16 prototypes.

To give a sense of the numbers, here’s empty.c, which does nothing but include windows.h.

#include <windows.h>

With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):

$ gcc -E empty.c | grep -vc '^$'
82041

With w64devkit this takes my system ~450ms to compile with GCC:

$ time gcc -c empty.c
real    0m 0.45s
user    0m 0.00s
sys     0m 0.00s

Compiling an actually empty source file takes ~10ms, so it really is spending practically all that time processing headers. MSVC is a faster compiler, and this extends to processing an even larger windows.h that crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:

$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real    0m 0.12s
user    0m 0.09s
sys     0m 0.01s

That’s just low enough to be tolerable, but I’d like the situation with GCC to be better. Defining WIN32_LEAN_AND_MEAN reduces the number of included headers, which has a significant effect:

$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real    0m 0.30s
user    0m 0.00s
sys     0m 0.00s

$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real    0m 0.07s
user    0m 0.01s
sys     0m 0.01s

Precompiled headers

The official solution is precompiled headers. Put all the system header includes, or similar, into a dedicated header, then compile that header into a special format. For example, headers.h:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

Then main.c includes windows.h through this header:

#include "headers.h"

int mainCRTStartup(void)
{
    return 0;
}

If I ask GCC to compile headers.h:

$ gcc headers.h

It produces headers.h.gch. When a source includes headers.h, GCC first searches for an appropriate .gch. Not only must the name match, but so must all the definitions at the moment of inclusion: headers.h should always be the first included header, otherwise it may not work. Now when I compile main.c:

$ time gcc -c main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

Much better! MSVC has a conventional name for this header recognizable to every Visual Studio user: stdafx.h. It works a bit differently, and I’ve never used it myself, but I trust it has similar results.

Precompiled headers require some extra steps that vary by toolchain. Can we do better? That depends on your definition of “better!”

Artisan, handcrafted prototypes

As mentioned, systems software tends to need only a few declarations: open, read, write, stat, etc. What if I wrote these out manually? A bit tedious, but it doesn’t require special precompiled header handling. It also creates some new possibilities. To illustrate, a CRT-free “hello world” program:

#include <windows.h>

int mainCRTStartup(void)
{
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    char message[] = "Hello, world!\n";
    DWORD len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}

This takes my system half a second to compile — quite long to produce just 26 assembly instructions:

$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.50s
user    0m 0.00s
sys     0m 0.00s
$ ./hello.exe
Hello, world!

The program requires prototypes only for GetStdHandle and WriteFile, a definition for STD_OUTPUT_HANDLE, and some typedefs. Starting with the easy stuff, the definition and types look like this:

#define STD_OUTPUT_HANDLE ((DWORD)-11)

typedef int BOOL;
typedef void *HANDLE;
typedef unsigned long DWORD;

By the way, here’s a cheat code for quickly finding preprocessor definitions, faster than looking them up elsewhere:

$ echo '#include <windows.h>' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)

Did you catch the pattern? It’s -10 - fd, where fd is the conventional unix file descriptor number: a kind of mnemonic.
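As a small illustration of that mnemonic, the three identifiers can be computed from the descriptor number. This is only a sketch: the helper name std_handle_id is made up here, and uint32_t stands in for DWORD.

```c
#include <stdint.h>

// Hypothetical helper: map a unix-style descriptor (0, 1, 2) to the
// matching STD_*_HANDLE value. uint32_t stands in for DWORD.
static uint32_t std_handle_id(int fd)
{
    return (uint32_t)(-10 - fd);
}
```

So std_handle_id(1) produces the same value as STD_OUTPUT_HANDLE.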

Prototypes are a little trickier, especially if you care about 32-bit. The Windows API uses the “stdcall” calling convention, which is distinct from the “cdecl” calling convention on x86, though the same on x64. Of course, you must already be aware of this merely using the API, as your own callbacks must usually be stdcall themselves. Further, API functions are DLL imports and should be declared as such. Putting it together, here’s GetStdHandle:

__declspec(dllimport)
HANDLE __stdcall GetStdHandle(DWORD);

This works with both Mingw-w64 and MSVC. MSVC requires __stdcall between the return type and function name, so don’t get clever about it. If you only care about GCC then you can declare both at once using attributes:

HANDLE GetStdHandle(DWORD)
    __attribute__((dllimport,stdcall));

I like to hide all this behind a macro, with a “table” of all my imports listed just below:

#define W32(r) __declspec(dllimport) r __stdcall
W32(HANDLE) GetStdHandle(DWORD);
W32(BOOL)   WriteFile(HANDLE, const void *, DWORD, DWORD *, void *);

In WriteFile you may have noticed I’m taking shortcuts. The “official” definition uses an ugly pointer typedef, LPCVOID, instead of pointer syntax, but I skipped that type definition. I also replaced the last argument, an OVERLAPPED pointer, with a generic pointer. I only need to pass null. I can keep sanding it down to something more ergonomic:

W32(int)    WriteFile(void *, void *, int, int *, void *);

That’s how I typically write these prototypes. I dropped the const because it doesn’t help me. I used signed sizes because I like them better and it’s what I’m usually holding at the call site. But doesn’t changing the signedness potentially break compatibility? It makes no difference to any practical ABI: It’s passed the same way. In general, signedness is a matter for operators, and only some of them — mainly comparisons (<, >, etc.) and division. It’s a similar story for pointers starting with the 32-bit era, so I can choose whatever pointer types are convenient.
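To make the signedness point concrete, here is a minimal sketch using fixed-width types. The function names are invented for illustration. The same 32 bits go in either way, and only certain operators care how they are interpreted.

```c
#include <stdint.h>

// Addition produces the same bits either way, so caller and callee
// can disagree about signedness without consequence.
static uint32_t add_as_unsigned(uint32_t a, uint32_t b) { return a + b; }

// Comparison and division are where the interpretation matters.
static int      lt_signed(int32_t a, int32_t b)      { return a < b; }
static int      lt_unsigned(uint32_t a, uint32_t b)  { return a < b; }
static int32_t  div_signed(int32_t a, int32_t b)     { return a / b; }
static uint32_t div_unsigned(uint32_t a, uint32_t b) { return a / b; }
```

Passing -1 as a signed or unsigned argument delivers identical bits. It compares less than 1 only when the comparison itself is signed.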

In general, I can do anything I want so long as I know my compiler will produce an appropriate function call. These are not standard functions, like printf or memcpy, which are implemented in part by the compiler itself, but foreign functions. It’s no different than teaching an FFI how to make a call. This is also, in essence, how OpenGL and Vulkan work, with applications defining the API for themselves.

Considering all this, my new hello world:

#define W32(r) __declspec(dllimport) r __stdcall
W32(void *) GetStdHandle(int);
W32(int)    WriteFile(void *, void *, int, int *, void *);

int mainCRTStartup(void)
{
    void *stdout = GetStdHandle(-10 - 1);
    char message[] = "Hello, world!\n";
    int len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}

You know, there’s a kind of beauty to a program that requires no external definitions. It builds quickly and produces a binary bit-for-bit identical to the original:

$ time cc -nostartfiles -o hello.exe main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real    0m 0.03s
user    0m 0.00s
sys     0m 0.00s

I’ve also been using this to patch over API rough edges. For example, WSARecvFrom takes WSAOVERLAPPED, but GetQueuedCompletionStatus takes OVERLAPPED. These types are explicitly compatible, and only defined separately for annoying technical reasons. I must use the same overlapped object with both APIs at once, meaning I would normally need ugly pointer casts on my Winsock calls, or vice versa with I/O completion ports. But because I’m writing all these definitions myself, I can define a common overlapped structure for both!

Perhaps you’re worried that this would be too fragile. Well, as a legacy software aficionado, I enjoy building and running my programs on old platforms. So far these programs still work properly going back 30 years to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag, it’s always been a bug (now long fixed) in the old operating system, not in my programs or these prototypes. So, in effect, this technique has worked well for the past 30 years!

Writing out these definitions is a bit of a chore, but after paying that price I’ve been quite happy with the results. I will likely continue doing it in the future, at least for non-graphical applications.

My favorite C compiler flags during development

2023-04-29T22:55:25Z

This article was discussed on Hacker News and on reddit.

The major compilers have an enormous number of knobs. Most are highly specialized, but others are generally useful even if uncommon. For warnings, the venerable -Wall -Wextra is a good start, but circumstances improve by tweaking this warning set. This article covers high-hitting development-time options in GCC, Clang, and MSVC that ought to get more consideration.

There’s an irony that the more you use these options, the less useful they become. Given a reasonable workflow, they are a harsh mistress in a fast, tight feedback loop, quickly breaking the habits that cause warnings and errors. It’s a kind of self-improvement, where eventually most findings will be false positives. With heuristics internalized, you will be able to spot the same issues just reading code, a handy skill during code review.

Static warnings

Traditionally, C and C++ compilers are by default conservative with warnings. Unless configured otherwise, they only warn about the most egregious issues where they’re highly confident. That’s too conservative. For gcc and clang, the first order of business is turning on more warnings with -Wall. Despite the name, this doesn’t actually enable all warnings. (clang has -Weverything which does literally this, but trust me, you don’t want it.) However, that still falls short, and you’re better served enabling extra warnings with -Wextra.

$ cc -Wall -Wextra ...

That should be the baseline on any new project, and closer to what these compilers should do by default. Not using these means leaving value on the table. If you come across such a project, there’s a good chance you can find bugs statically just by using this baseline. Some warnings only occur at higher optimization levels, so leave these on for your release builds, too.

For MSVC, including clang-cl, a similar baseline is /W4. Though it goes a bit far, warning about use of unary minus on unsigned types (C4146), and sign conversions (C4245). If you’re using a CRT, also disable the bogus and irresponsible “security” warnings. Putting it together, the warning baseline becomes:

$ cl /W4 /wd4146 /wd4245 /D_CRT_SECURE_NO_WARNINGS ...

As for gcc and clang, I dislike unused parameter warnings, so I often turn them off, at least while I’m working: -Wno-unused-parameter. Rarely is it a defect to not use a parameter. It’s common for a function to fit a fixed prototype but not need all its parameters (e.g. WinMain). Were it up to me, this would not be part of -Wextra.
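A portable sketch of the fixed-prototype situation, using a made-up callback type visit_fn rather than WinMain. The visitor below has no use for one of its parameters, which is perfectly reasonable, yet -Wextra flags it.

```c
// Hypothetical callback type: every visitor receives both arguments
// whether or not it needs them.
typedef int visit_fn(void *ctx, int value);

// Needs only the context, so "value" goes unused. -Wextra warns here
// unless -Wno-unused-parameter is given.
static int count_visits(void *ctx, int value)
{
    int *count = ctx;
    return ++*count;
}

// Calls fn once per element and returns the result of the last call.
static int visit_all(visit_fn *fn, void *ctx, int *values, int n)
{
    int r = 0;
    for (int i = 0; i < n; i++) {
        r = fn(ctx, values[i]);
    }
    return r;
}
```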

I also dislike unused function warnings: -Wno-unused-function. I can’t say this is wrong for the baseline since, in most cases, ultimately I do want to know if there are unused functions, e.g. to be deleted. But while I’m working it’s usually noise.

If I’m working with OpenMP, I may also disable warnings about unknown pragmas: -Wno-unknown-pragmas. One cool feature of OpenMP is that the typical case gracefully degrades to single-threaded behavior when not enabled. That is, compiling without -fopenmp. I’ll test both ways to ensure I get deterministic results, or just to ease debugging, and I don’t want warnings when it’s disabled. It’s fine for the baseline to have this warning, but sometimes it’s a poor match.
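Here is a minimal sketch of that graceful degradation, with an invented function sum_to. Built with -fopenmp the loop is split across threads. Built without it, the pragma is unknown to the compiler (hence the warning) and the loop simply runs on one thread.

```c
// Sums the integers 1..n. With -fopenmp the iterations are divided
// among threads; without it the pragma is ignored and the loop runs
// on a single thread. Integer addition gives the same total either way.
static long long sum_to(int n)
{
    long long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}
```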

When working with single-precision floats, perhaps on games or graphics, it’s easy to accidentally introduce promotion to double precision, which can hurt performance. It could be neglecting an f suffix on a constant or using sin instead of sinf. Use -Wdouble-promotion to catch such mistakes. Honestly, this is important enough that it should go into the baseline.

#define PI 3.141592653589793
float degs = ...;
float rads = degs * PI / 180;  // warns about promotion
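One way to avoid the promotion, sketched with made-up names PI_F and to_radians: give the constants an f suffix so the whole expression stays in single precision.

```c
#define PI_F 3.14159265f

// Every operand is a float, so nothing is promoted to double and
// -Wdouble-promotion has nothing to report.
static float to_radians(float degs)
{
    return degs * PI_F / 180.0f;
}
```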

It can be awkward around variadic functions, particularly printf, which cannot receive float arguments, and so implicitly converts. You’ll need an explicit cast to disable the warning. I imagine this is the main reason the warning is not part of -Wextra.

float x = ...;
printf("%.17g\n", (double)x);

Finally, an advanced option: -Wconversion -Wno-sign-conversion. It warns about implicit conversions that may result in data loss. Sign conversions do not have data loss, the implicit conversions are useful, and in my experience they’re not a source of defects, so I disable that part using the second flag (like MSVC /wd4245). The important warning here is truncation of size values, warning about unsound uses of sizes and subscripts. For example:

// NOTE: would be declared/defined via windows.h
typedef uint32_t DWORD;
BOOL WriteFile(HANDLE, const void *, DWORD, DWORD *, OVERLAPPED *);

void logmsg(char *msg, size_t len)
{
    HANDLE err = GetStdHandle(STD_ERROR_HANDLE);
    DWORD out;
    WriteFile(err, msg, len, &out, 0);  // len truncation warning
}

On 64-bit targets, it will warn about truncating the 64-bit len for the 32-bit parameter. To dismiss the warning, you must either address it by using a loop to call WriteFile multiple times, or acknowledge the truncation with an explicit cast and accept the consequences. In this case I may know from context it’s impossible for the program to even construct such a large message, so I’d use an assertion and truncate.

void logmsg(char *msg, size_t len)
{
    HANDLE err = GetStdHandle(STD_ERROR_HANDLE);
    DWORD out;
    assert(len <= 0xffffffff);
    WriteFile(err, msg, (DWORD)len, &out, 0);
}
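The other option, looping, might be sketched as follows. To keep it portable, a hypothetical sink_fn callback with a 32-bit length stands in for WriteFile, and the chunk limit is a parameter so that the loop can be exercised with small buffers. None of these names come from a real API.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical sink with a 32-bit length, like WriteFile. Returns
// nonzero on success.
typedef int sink_fn(void *ctx, const char *buf, uint32_t len);

// Pass the whole buffer to the sink in pieces no larger than max,
// which must be nonzero. Returns nonzero if every piece was accepted.
static int write_all(sink_fn *sink, void *ctx, const char *buf,
                     size_t len, uint32_t max)
{
    while (len) {
        uint32_t n = len < max ? (uint32_t)len : max;
        if (!sink(ctx, buf, n)) {
            return 0;
        }
        buf += n;
        len -= n;
    }
    return 1;
}

// Example sink for demonstration: tallies calls and bytes.
struct tally { int calls; size_t bytes; };

static int tally_sink(void *ctx, const char *buf, uint32_t len)
{
    struct tally *t = ctx;
    (void)buf;
    t->calls++;
    t->bytes += len;
    return 1;
}
```

The cast is still there, but it only executes when the value is known to fit, so no data is lost. In a real program max would be 0xffffffff.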

You might consider changing the interface instead:

void logmsg(char *msg, uint32_t len);

That probably passes the buck and doesn’t solve the underlying problem. The caller may be holding a size_t length, so the truncation happens there instead. Or maybe you keep propagating this change backwards until it, say, dissipates on a known constant. -Wconversion leads to these ripple effects that improve the overall program, which is why I like it.

The catch is that the above warning only happens for 64-bit targets. So you might miss it. The inverse is true in other cases. This is one area where cross-architecture testing can pay off.

Unfortunately since this warning is off the beaten path, it seems like it doesn’t quite get the attention it could use. It warns about simple cases where truncation has been explicitly handled/avoided. For example:

int x = ...;
char digit = '0' + x%10;  // false warning

The '0' is a known constant. The operation x%10 has a known range (-9 to 9). Therefore the addition result has a known range, and all results can be represented in a char. Yet it still warns. This often comes up dealing with character data like this.
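One way to dismiss the warning in cases like this is an explicit cast. A small sketch, with an invented helper name:

```c
// Returns the character for the last decimal digit of a non-negative x.
// The explicit cast acknowledges the int-to-char narrowing so that
// -Wconversion stays quiet.
static char last_digit(int x)
{
    return (char)('0' + x%10);
}
```

The cast adds noise without adding safety, which is why the false positive is unfortunate.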

In my logmsg fix I had used an assertion to check that no truncation actually occurred. But wouldn’t it be nice if the compiler could generate that for us somehow? That brings us to dynamic checks.

Dynamic run-time checks

Sanitizers have been around for nearly a decade but are still criminally underused. They insert run-time assertions into programs at the flip of a switch typically at a modest performance cost — less than the cost of a debug build. All three major compilers support at least one sanitizer on all targets. In most cases, failing to use them is practically the same as not even trying to find defects. Every beginner tutorial ought to be using sanitizers from page 1 where they teach how to compile a program with gcc. (That this is universally not the case, and that these same tutorials also do not begin with teaching a debugger, is a major, on-going education failure.)

There are multiple different sanitizers with lots of overlap, but Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan) are the most general. They are compatible with each other and form a solid, general baseline. To use address sanitizer, at both compile and link time do:

$ cc ... -fsanitize=address ...

It’s even spelled the same way in MSVC. It’s needed at link time because it includes a runtime component. When working properly it’s aware of all allocations and checks all memory accesses that might be out of bounds, producing a run-time error if that occurs. It’s not always appropriate, but most projects that can use it probably should.

UBSan is enabled similarly:

$ cc ... -fsanitize=undefined ...

It adds checks around operations that might be undefined, emitting a run-time error if it occurs. It has an optional runtime component to produce a helpful diagnostic. You can instead insert a trap instruction, which is how I prefer to use it: -fsanitize-trap=undefined. (Until recently it was -fsanitize-undefined-trap-on-error.) This works on platforms where the UBSan runtime is unsupported. Some instrumentation is only inserted at higher optimization levels.

For me, the most useful UBSan check is signed overflow — e.g. computing the wrong result — and it’s instrumentation I miss when not working in C. In programs where this might be an issue, combine it with a fuzzer to search for inputs that cause overflows. This is yet another argument in favor of signed sizes, as UBSan can detect such overflows. (Yes, UBSan optionally instruments unsigned overflow, too, but then you must somehow distinguish intentional from unintentional overflow.)
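A classic example of the kind of defect this check catches, sketched with made-up function names: the naive midpoint silently computes a wrong result for large inputs, while the rearranged form cannot overflow for non-negative, ordered inputs.

```c
// Overflows when lo + hi exceeds INT_MAX, which is undefined behavior.
// With -fsanitize=undefined the addition is instrumented, so the
// overflow is caught at run time instead of yielding a wrong answer.
static int midpoint_naive(int lo, int hi)
{
    return (lo + hi) / 2;
}

// For 0 <= lo <= hi no intermediate result can overflow.
static int midpoint(int lo, int hi)
{
    return lo + (hi - lo)/2;
}
```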

On Linux, ASan and UBSan strangely do not have debugger-oriented defaults. Fortunately that’s easy to address with a couple of environment variables, which cause them to break on error instead of uselessly exiting:

export ASAN_OPTIONS=abort_on_error=1:halt_on_error=1
export UBSAN_OPTIONS=abort_on_error=1:halt_on_error=1

Also, when compiling you can combine sanitizers like so:

$ cc ... -fsanitize=address,undefined ...

As of this writing, MSVC does not have UBSan, but it does have a similar feature, run-time error checks. Three sub-flags (c, s, u) enable different checks, and /RTCcsu turns them all on. The c flag generates the assertion I had manually written with -Wconversion, and traps any truncation at run time. There’s nothing quite like this in UBSan! It’s so extreme that it’s compatible with neither standard runtime libraries (fortunately not a big deal) nor with ASan.

Caveat: Explicit casts aren’t enough, you must actually truncate variables using a mask in order to pass the check. For example, to accept truncation in the logmsg function:

    WriteFile(err, msg, len&0xffffffff, &out, 0);

Thread Sanitizer (TSan) is occasionally useful for finding — or, more often, proving the presence of — data races. It has a runtime component and so must be used at compile time and link time.

$ cc ... -fsanitize=thread ...

Unfortunately it only works in a narrow context. The target must use pthreads, not C11 threads, OpenMP, nor direct cloning. It must only synchronize through code that was compiled with TSan. That means no synchronization through system calls, especially no futexes. Most non-trivial programs do not meet the criteria.

Debug information

Another common mistake in tutorials is using plain old -g instead of -g3 (read: “debug level 3”). That’s like using -O instead of -O3. It adds a lot more debug information to the output, particularly enums and macros. The extra information is useful and you’re better off having it!

$ cc ... -g3 ...
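As a small illustration of the macro part, consider this sketch. The file name demo.c and the names in it are placeholders.

```c
// Build with: cc -g3 -o demo demo.c
#define BUFLEN (1 << 12)

static int buffer_bytes(void)
{
    return BUFLEN;
}

// When stopped inside buffer_bytes in a -g3 build, GDB can evaluate
// the macro in expressions:
//   (gdb) p BUFLEN
// With plain -g the debug information has no record of BUFLEN.
```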

All the major build systems — CMake, Autotools, Meson, etc. — get this wrong in their standard debug configurations. Producing a fully-featured debug build from these systems is a constant battle for me. Often it’s easier to ignore the build system entirely and cc -g3 **/*.c (plus sanitizers, etc.).

(Short term note: GCC 11, released in March 2021, switched to DWARF5 by default. However, GDB could not access the extra -g3 debug information in DWARF5 until GDB 13, released February 2023. If you have a toolchain from that two year window — except mine because I patched it — then you may also need -gdwarf-4 to switch back to DWARF4.)

What about -Og? In theory it enables optimizations that do not interfere with debugging, and potentially some additional warnings. In practice I still get far too many “optimized out” messages from GDB when I use it, so I don’t bother. Fortunately C is such a simple language that debug builds are nearly as fast as release builds anyway.

On MSVC I like having debug information embedded in binaries, as GCC does, which is done using /Z7.

$ cl ... /Z7 ...

Though I certainly understand the value of separate debug information, /Zi, in some cases. Sometimes I wish the GNU toolchain made this easier.

Summary

My personal rigorous baseline for development using gcc and clang looks like this (all platforms):

$ cc -g3 -Wall -Wextra -Wconversion -Wdouble-promotion
     -Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion
     -fsanitize=undefined -fsanitize-trap ...

While ASan is great for quickly reviewing and evaluating other people’s projects, I don’t find it useful for my own programs. I avoid that class of defects through smarter paradigms (region-based allocation, no null terminated strings, etc.). I also prefer the behavior of trap instruction UBSan versus a diagnostic, as it behaves better under debuggers.

For cl and clang-cl, my personal baseline looks like this:

$ cl /Z7 /W4 /wd4146 /wd4245 /RTCcsu ...

I don’t normally need /D_CRT_SECURE_NO_WARNINGS since I don’t use a CRT anyway.

Update: Peter0x44 points out -D_GLIBCXX_DEBUG if you’re working in C++ with libstdc++, including on Windows with Mingw-w64. I agree, this is an excellent option! ASan does not “see” C++ containers, and it fills in some of those gaps.

My new debugbreak command

2022-07-31T12:59:59Z

I previously mentioned the Windows feature where pressing F12 in a debuggee window causes it to break in the debugger. It works with any debugger — GDB, RemedyBG, Visual Studio, etc. — since the hotkey simply raises a breakpoint structured exception. It’s been surprisingly useful, and I’ve wanted it available in more contexts, such as console programs or even on Linux. The result is a new debugbreak command, now included in w64devkit. Though, of course, you already have everything you need to build it and try it out right now. I’ve also worked out a Linux implementation.

It’s named after an MSVC intrinsic and Win32 function. It takes no arguments, and its operation is indiscriminate: It raises a breakpoint exception in all debuggee processes system-wide. Reckless? Perhaps, but certainly convenient. You don’t need to tell it which process you want to pause. It just works, and a good debugging experience is one of ease and convenience.

The linchpin is DebugBreakProcess. The command walks the process list and fires this function at each process. Nothing happens for programs without a debugger attached, so it doesn’t even bother checking if it’s a debuggee. It couldn’t be simpler. I’ve used it on everything from Windows XP to Windows 11, and it’s worked flawlessly.

HANDLE s = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
PROCESSENTRY32W p = {sizeof(p)};
for (BOOL r = Process32FirstW(s, &p); r; r = Process32NextW(s, &p)) {
    HANDLE h = OpenProcess(PROCESS_ALL_ACCESS, 0, p.th32ProcessID);
    if (h) {
        DebugBreakProcess(h);
        CloseHandle(h);
    }
}

I use it almost exclusively from Vim, where I’ve given it a leader mapping. With the editor focused, I can type backslash then d to pause the debuggee.

map <leader>d :call system("debugbreak")<cr>

With the debuggee paused, I’m free to add new breakpoints or watchpoints, or print the call stack to see what the heck it’s busy doing. The mechanism behind DebugBreakProcess is to create a new thread in the target, with that thread raising the breakpoint exception. The debugger will be stopped in this new thread. In GDB you can use the thread command to switch over to the thread that actually matters, usually thr 1.

debugbreak on Linux

On unix-like systems the equivalent of a breakpoint exception is a SIGTRAP. There’s already a standard command for sending signals, kill, so a debugbreak command can be built using nothing more than a few lines of shell script. However, unlike DebugBreakProcess, signaling every process with SIGTRAP will only end in tears. The script will need a way to determine which processes are debuggees.

Linux exposes processes in the file system as virtual files under /proc, where each process appears as a directory. Its status file includes a TracerPid field, which will be non-zero for debuggees. The script inspects this field, and if non-zero sends a SIGTRAP.

#!/bin/sh
set -e
for pid in $(find /proc -maxdepth 1 -printf '%f\n' | grep '^[0-9]\+$'); do
    grep -q '^TracerPid:\s[^0]' /proc/$pid/status 2>/dev/null &&
        kill -TRAP $pid
done

This script, now part of my dotfiles, has worked very well so far, and effectively smoothes over some debugging differences between Windows and Linux, reducing my context switching mental load. There’s probably a better way to express this script, but that’s the best I could do so far. On the BSDs you’d need to parse the output of ps, though each system seems to do its own thing for distinguishing debuggees.
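For comparison, the field check at the heart of the script could be written in C. This is only a sketch of the parsing step, with an invented function name. It takes the text of a status file rather than reading /proc itself.

```c
#include <stdlib.h>
#include <string.h>

// Given the text of a /proc/PID/status file, return the TracerPid
// value, or -1 if the field is missing. Zero means no tracer.
static long parse_tracer_pid(const char *status)
{
    static const char key[] = "TracerPid:";
    for (const char *line = status; line; ) {
        if (!strncmp(line, key, sizeof(key)-1)) {
            return strtol(line + sizeof(key)-1, 0, 10);
        }
        line = strchr(line, '\n');
        if (line) line++;
    }
    return -1;
}
```

A process would receive SIGTRAP only when this returns a positive value.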

A missing feature

I had originally planned for one flag, -k. Rather than breakpoint debuggees, it would terminate all debuggee processes. This is especially important on Windows where debuggee processes block builds due to file locking shenanigans. I’d just run debugbreak -k as part of the build. However, it’s not possible to terminate debuggees paused in the debugger, which is the common situation. I’ve given up on this for now.

Assertions should be more debugger-oriented

2022-06-26T18:51:04Z

Prompted by a 20 minute video, over the past month I’ve improved my debugger skills. I’d shamefully acquired a bad habit: avoiding a debugger until exhausting dumber, insufficient methods. My first choice should be a debugger, but I had allowed a bit of friction to dissuade me. With some thoughtful practice and deliberate effort clearing the path, my bad habit is finally broken — at least when a good debugger is available. It feels like I’ve leveled up and, like touch typing, this was a skill I’d neglected far too long. One friction point was the less-than-optimal assert feature in basically every programming language implementation. It ought to work better with debuggers.

An assertion verifies a program invariant, and so if one fails then there’s undoubtedly a defect in the program. In other words, assertions make programs more sensitive to defects, allowing problems to be caught more quickly and accurately. Counter-intuitively, crashing early and often makes for more robust and reliable software in the long run. For exactly this reason, assertions go especially well with fuzzing.

assert(i >= 0 && i < len);   // bounds check
assert((ssize_t)size >= 0);  // suspicious size_t
assert(cur->next != cur);    // circular reference?

They’re sometimes abused for error handling, which is a reason they’ve also been (wrongfully) discouraged at times. For example, failing to open a file is an error, not a defect, so an assertion is inappropriate.

Normal programs have implicit assertions all over, even if we don’t usually think of them as assertions. In some cases they’re checked by the hardware. Examples of implicit assertion failures:

Out-of-bounds indexing
Dereferencing null/nil/None
Dividing by zero
Certain kinds of integer overflow (e.g. -ftrapv)

Programs are generally not intended to recover from these situations because, had they been anticipated, the invalid operation wouldn’t have been attempted in the first place. The program simply crashes because there’s no better alternative. Sanitizers, including Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan), are in essence additional, implicit assertions, checking invariants that aren’t normally checked.

Ideally a failing assertion should have these two effects:

Execution should immediately stop. The program is in an unknown state, so it’s neither safe to “clean up” nor attempt to recover. Additional execution will only make debugging more difficult, and may obscure the defect.
When run under a debugger — or visited as a core dump — it should break exactly at the failed assertion, ready for inspection. I should not need to dig around the call stack to figure out where the failure occurred. I certainly shouldn’t need to manually set a breakpoint and restart the program hoping to fail the assertion a second time. The whole reason for using a debugger is to save time, so if it’s wasting my time then it’s failing at its primary job.

I examined standard assert features across various language implementations, and none strictly meet the criteria. Fortunately, in some cases, it’s trivial to build a better assertion, and you can substitute your own definition. First, let’s discuss the way assertions disappoint.

A test assertion

My test for C and C++ is minimal but establishes some state and gives me a variable to inspect:

#include <assert.h>

int main(void)
{
    for (int i = 0; i < 10; i++) {
        assert(i < 5);
    }
}

Then I compile and debug in the most straightforward way:

$ cc -g -o test test.c
$ gdb test
(gdb) r
(gdb) bt

The r in GDB stands for run, which immediately breaks because of the assert. The bt prints a backtrace. On a typical Linux distribution that shows this backtrace:

#0  __GI_raise
#1  __GI_abort
#2  __assert_fail_base
#3  __GI___assert_fail
#4  main

Well, actually, it’s much messier than this, but I manually cleaned it up:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linu
x/raise.c:50
#1  0x00007ffff7df4537 in __GI_abort () at abort.c:79
#2  0x00007ffff7df440f in __assert_fail_base (fmt=0x7ffff7f5d
128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x
55555555600b "i < 5", file=0x555555556004 "test.c", line=6, f
unction=) at assert.c:92
#3  0x00007ffff7e03662 in __GI___assert_fail (assertion=0x555
55555600b "i < 5", file=0x555555556004 "test.c", line=6, func
tion=0x555555556011 <__PRETTY_FUNCTION__.0> "main") at assert
.c:101
#4  0x0000555555555178 in main () at test.c:6

That’s a lot to take in at a glance, and about 95% of it is noise that will never contain useful information. Most notably, GDB didn’t stop at the failing assertion. Instead there are four stack frames of libc junk I have to navigate before I can even begin debugging.

(gdb) up
(gdb) up
(gdb) up
(gdb) up

I must wade through this for every assertion failure. This is some of the friction that made me avoid the debugger in the first place. glibc loves indirection, so maybe the other libc implementations do better? How about musl?

#0  setjmp
#1  raise
#2  ??
#3  ??
#4  ??
#5  ??
#6  ??
#7  ??
#8  ??
#9  ??
#10 ??
#11 ??

Oops, without musl debugging symbols I can’t debug assertions at all because GDB can’t read the stack, so it’s lost. If you’re on Alpine you can install musl-dbg, but otherwise you’ll probably need to build your own from source. With debugging symbols, musl is no better than glibc:

#0  __restore_sigs
#1  raise
#2  abort
#3  __assert_fail
#4  main

Same with FreeBSD:

#0  thr_kill
#1  in raise
#2  in abort
#3  __assert
#4  main

OpenBSD has one fewer frame:

#0  thrkill
#1  _libc_abort
#2  _libc___assert2
#3  main

How about on Windows with Mingw-w64?

[Inferior 1 (process 7864) exited with code 03]

Oops, on Windows GDB doesn’t break at all on assert. You must first set a breakpoint on abort:

(gdb) b abort

Besides that, it’s the most straightforward so far:

#0 msvcrt!abort
#1 msvcrt!_assert
#2 main

With MSVC (default CRT) I get something slightly different:

#0 abort
#1 common_assert_to_stderr
#2 _wassert
#3 main
#4 __scrt_common_main_seh

RemedyBG leaves me at the abort like GDB does elsewhere. Visual Studio recognizes that I don’t care about its stack frames and instead puts the focus on the assertion, ready for debugging. The other stack frames are there, but basically invisible. It’s the only case that practically meets all my criteria!

I can’t entirely blame these implementations. The C standard requires that assert print a diagnostic and call abort, and that abort raises SIGABRT. There’s not much implementations can do, and it’s up to the debugger to be smarter about it.

Sanitizers

ASan doesn’t break GDB on assertion failures, which is yet another source of friction. You can work around this with an environment variable:

export ASAN_OPTIONS=abort_on_error=1:print_legend=0

This works, but it’s the worst case of all: I get 7 junk stack frames on top of the failed assertion. It’s also very noisy when it traps, so the print_legend=0 helps to cut it down a bit. I want this variable so often that I set it in my shell’s .profile so that it’s always set.

With UBSan you can use -fsanitize-undefined-trap-on-error, which behaves like the improved assertion. It traps directly on the defect with no junk frames, though it prints no diagnostic. As a bonus, it also means you don’t need to link libubsan. Thanks to the bonus, it fully supplants -ftrapv for me on all platforms.
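
As a concrete sketch of a defect this catches (the function `midpoint` and the build line in the comment are illustrative, not from the original), consider signed overflow in a midpoint calculation. Built with the flag above, the overflowing addition should trap at the offending line with no runtime frames on top.

```c
// Hypothetical build line (GCC or Clang):
//   cc -g -fsanitize=undefined -fsanitize-undefined-trap-on-error example.c

// Classic binary search bug: lo + hi overflows int for large inputs.
// With the flags above, a call such as midpoint(INT_MAX, INT_MAX)
// hits a trap instruction instead of continuing past the overflow.
static int midpoint(int lo, int hi)
{
    return (lo + hi) / 2;
}
```

Small inputs behave normally. Only the overflowing case stops the program, and under a debugger it stops inside `midpoint`.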

Update November 2022: This “stop” hook eliminates ASan friction by popping runtime frames — functions with the reserved __ prefix — from the call stack so that they’re not in the way when GDB takes control. It requires Python support, which is the purpose of the feature-sniff outer condition.

if !$_isvoid($_any_caller_matches)
    define hook-stop
        while $_thread && $_any_caller_matches("^__")
            up-silently
        end
    end
end

This is now part of my .gdbinit.

A better assertion

At least when under a debugger, here’s a much better assertion macro for GCC and Clang:

#define assert(c) if (!(c)) __builtin_trap()

__builtin_trap inserts a trap instruction — a built-in breakpoint. By not calling a function to raise a signal, there are no junk stack frames and no need to breakpoint on abort. It stops exactly where it should as quickly as possible. This definition works reliably with GCC across all platforms, too. On MSVC the equivalent is __debugbreak. If you’re really in a pinch then do whatever it takes to trigger a fault, like dereferencing a null pointer. A more complete definition might be:

#ifdef DEBUG
#  if __GNUC__
#    define assert(c) if (!(c)) __builtin_trap()
#  elif _MSC_VER
#    define assert(c) if (!(c)) __debugbreak()
#  else
#    define assert(c) if (!(c)) *(volatile int *)0 = 0
#  endif
#else
#  define assert(c)
#endif

None of these print a diagnostic, but that’s unnecessary when a debugger is involved.
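
As a usage sketch, here is the trap-based macro guarding an array access. It assumes GCC or Clang. The names `ASSERT` and `get` are hypothetical, and the upper-case spelling is only there to avoid clashing with `<assert.h>`.

```c
#include <stddef.h>

// Assumes GCC or Clang. Upper-case name avoids colliding with <assert.h>.
#define ASSERT(c) if (!(c)) __builtin_trap()

// Return element i of an n-element array, trapping on a bad index.
static int get(const int *a, size_t n, size_t i)
{
    ASSERT(i < n);  // on failure the debugger stops on this line
    return a[i];
}
```

With a failing index under GDB, the program should stop in the function containing the check rather than several frames deep inside libc.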

Other languages

Unfortunately the situation mostly gets worse with other language implementations, and it’s generally not possible to build a better assertion. Assertions typically have exception-like semantics, if not literally just another exception, and so they are far less reliable. If a failed assertion raises an exception, then the program won’t stop until it’s unwound the stack — running destructors and such along the way — all the way to the top level looking for a handler. It only knows there’s a problem when nobody was there to catch it.

Go officially doesn’t have assertions, though panics are a kind of assertion. However, panics have exception-like semantics, and so suffer the problems of exceptions. A Go version of my test:

func main() {
    defer fmt.Println("DEFER")
    for i := 0; i < 10; i++ {
        if i >= 5 {
            panic(i)
        }
    }
}

If I run this under Go’s premier debugger, Delve, the unrecovered panic causes it to break. So far so good. However, I get two junk frames:

#0 runtime.fatalpanic
#1 runtime.gopanic
#2 main.main
#3 runtime.main
#4 runtime.goexit

It only knows to stop because the Go runtime called fatalpanic, but the backtrace is a fiction: The program continued to run after the panic, enough to run all the registered defers (including printing “DEFER”), unwinding the stack to the top level, and only then did it fatalpanic. Fortunately it’s still possible to inspect all those stack frames even if some variables may have changed while unwinding, but it’s more like inspecting a core dump than a paused process.

The situation in Python is similar: assert raises AssertionError, a plain old exception, and pdb won’t break until the stack has unwound, exiting context managers and such. Only once the exception reaches the top level does it enter “post mortem debugging,” like a core dump. At least there are no junk stack frames on top. If you’re using asyncio then your program may continue running for quite a while before the right tasks are scheduled and the exception finally propagates to the top level, if ever.

The worst offender of all is Java. First, jdb never breaks for unhandled exceptions. It’s up to you to set a breakpoint before the exception is thrown. But it gets worse: assertions are disabled under jdb. The Java assert statement is worse than useless.

Addendum: Don’t exit the debugger

The largest friction-reducing change I made is never exiting the debugger. Previously I would enter GDB, run my program, exit, edit/rebuild, repeat. However, there’s no reason to exit GDB! It automatically and reliably reloads symbols and updates breakpoints on symbols. It remembers your run configuration, so re-running is just r rather than interacting with shell history.

My workflow on all platforms (including Windows) is a vertically maximized Vim window and a vertically maximized terminal window. The new part for me: The terminal runs a long-term GDB session exclusively, with file set to the program I’m writing, usually set by the initial command line.

$ gdb myprogram
gdb>

Alternatively use file after starting GDB. Occasionally useful if my project has multiple binaries, and I want to examine a different program.

gdb> file myprogram

I use make and Vim’s :mak command for building from within the editor, so I don’t need to change context to build. The quickfix list takes me straight to warnings/errors. Often I’m writing something that takes input from standard input. So I use the run (r) command to set this up (along with any command line arguments).

gdb> r



You can redirect standard output as well. It remembers these settings for
plain run later, so I can test my program by entering r and nothing
else.

gdb> r


My usual workflow is edit, :mak, r, repeat. If I want to test a
different input or use different options, change the run configuration
using run again:

gdb> r -a -b -c 


On Windows you cannot recompile while the program is running. If GDB is
sitting on a breakpoint but I want to build, use kill (k) to stop it
without exiting GDB.

gdb> k


GDB has an annoying, flow-breaking yes/no prompt for this, so I recommend
set confirm no in your .gdbinit to disable it.

Sometimes a program is stuck in a loop and I need it to break in the
debugger. I try to avoid CTRL-C in the terminal since it can confuse
GDB. A safer option is to signal the process from Vim with pkill, which
GDB will catch (except on Windows):

:!pkill myprogram


I suspect many people don’t know this, but if you’re on Windows and
developing a graphical application, you can press F12 in the
debuggee’s window to immediately break the program in the attached
debugger. This is a general platform feature and works with any native
debugger. I’ve been using it quite a lot.

On that note, you can run commands from GDB with !, which is another way
to avoid having an extra terminal window around:

gdb> !git diff


In any case, GDB will re-read the binary on the next run and update
breakpoints, so it’s mostly seamless. If there’s a function I want to
debug, I set a breakpoint on it, then run.

gdb> b somefunc
gdb> r


Alternatively I’ll use a line number, which I read from Vim. Though GDB,
not being involved in the editing process, cannot track how that line
moves between builds.

An empty command repeats the last command, so once I’m at a breakpoint,
I’ll type next (n) — or step (s) to enter function calls — then
press enter each time I want to advance a line, often with my eye on the
context in Vim in the other window:

gdb> n
gdb>
gdb>


(I wish GDB could print a source listing around the breakpoint as
context, like Delve, but no such feature exists. The woeful list command
is inadequate. Update: GDB’s TUI is a reasonable compromise for GUI
applications or terminal applications running under a separate tty/console
with either tty or set new-console. I can access it everywhere since
w64devkit now supports GDB TUI.)

If I want to advance to the next breakpoint, I use continue (c):

gdb> c


If I’m walking through a loop, I want to see how variables change, but
it’s tedious to keep printing (p) the same variables again and again.
So I use display (disp) to display an expression with each prompt,
much like the “watch” window in Visual Studio. For example, if my loop
variable is i over some string str, this will show me the current
character in character format (/c).

gdb> disp/c str[i]


You can accumulate multiple expressions. Use undisplay to remove them.

Too many breakpoints? Use info breakpoints (i b) to list them, then
delete (d) the unwanted ones by ID.

gdb> i b
gdb> d 3 5 8


GDB has many more features than this, but 10 commands cover 99% of use
cases: r, c, n, s, disp, k, b, i, d, p.



A flexible, lightweight, spin-lock barrier
2022-03-13T23:55:08Z
This article was discussed on Hacker News.

The other day I wanted to try the famous memory reordering experiment
for myself. It’s the double-slit experiment of concurrency, where a
program can observe an “impossible” result on common hardware, as
though a thread had time-traveled. While getting thread timing as tight as
possible, I designed a possibly-novel thread barrier. It’s purely
spin-locked, the entire footprint is a zero-initialized integer, it
automatically resets, it can be used across processes, and the entire
implementation is just three to four lines of code.



Here’s the entire barrier implementation for two threads in C11.

// Spin-lock barrier for two threads. Initialize *barrier to zero.
void barrier_wait(_Atomic uint32_t *barrier)
{
    uint32_t v = ++*barrier;
    if (v & 1) {
        for (v &= 2; (*barrier&2) == v;);
    }
}


Or in Go:

func BarrierWait(barrier *uint32) {
    v := atomic.AddUint32(barrier, 1)
    if v&1 == 1 {
        v &= 2
        for atomic.LoadUint32(barrier)&2 == v {
        }
    }
}


Even more, these two implementations are compatible with each other. C
threads and Go goroutines can synchronize on a common barrier using these
functions. Also note how it only uses two bits.

When I was done with my experiment, I did a quick search online for other
spin-lock barriers to see if anyone came up with the same idea. I found a
couple of subtly-incorrect spin-lock barriers, and some
straightforward barrier constructions using a mutex spin-lock.

Before diving into how this works, and how to generalize it, let’s discuss
the circumstances that led to its design.

Experiment

Here’s the setup for the memory reordering experiment, where w0 and w1
are initialized to zero.

thread#1    thread#2
w0 = 1      w1 = 1
r1 = w1     r0 = w0


Considering all the possible orderings, it would seem that at least one of
r0 or r1 is 1. There seems to be no ordering where r0 and r1 could
both be 0. However, if raced precisely, this is a frequent or possibly
even majority occurrence on common hardware, including x86 and ARM.
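
The "no such ordering" claim can be checked by brute force. This sketch (the function name `count_both_zero` is hypothetical) enumerates every interleaving of the four operations that preserves each thread's program order, which is what sequential consistency would allow, and counts those ending with both reads at zero.

```c
// Count sequentially consistent interleavings ending with r0 == r1 == 0.
static int count_both_zero(void)
{
    int count = 0;
    // Each 4-bit schedule says which thread runs at each of four steps:
    // bit clear = thread#1, bit set = thread#2.
    for (int sched = 0; sched < 16; sched++) {
        int w0 = 0, w1 = 0, r0 = -1, r1 = -1;
        int step[2] = {0, 0};  // next instruction index per thread
        int valid = 1;
        for (int i = 0; i < 4; i++) {
            int t = (sched >> i) & 1;
            if (step[t] == 2) {  // this thread already ran both operations
                valid = 0;
                break;
            }
            if (t == 0) {
                if (step[0] == 0) w0 = 1; else r1 = w1;
            } else {
                if (step[1] == 0) w1 = 1; else r0 = w0;
            }
            step[t]++;
        }
        if (valid && r0 == 0 && r1 == 0) {
            count++;
        }
    }
    return count;
}
```

Of the 16 schedules only 6 are valid interleavings, and none of them produces the "impossible" result. Observing it on real hardware therefore means the hardware is not sequentially consistent for these plain stores and loads.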

How to go about running this experiment? These are concurrent loads and
stores, so it’s tempting to use volatile for w0 and w1. However,
this would constitute a data race — undefined behavior in at least C and
C++ — and so we couldn’t really reason much about the results, at least
not without first verifying the compiler’s assembly. These are variables
in a high-level language, not architecture-level stores/loads, even with
volatile.

So my first idea was to use a bit of inline assembly for all accesses that
would otherwise be data races. x86-64:

static int experiment(int *w0, int *w1)
{
    int r1;
    __asm volatile (
        "movl  $1, %1\n"
        "movl  %2, %0\n"
        : "=r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}


ARM64 (to try on my Raspberry Pi):

static int experiment(int *w0, int *w1)
{
    int r1 = 1;
    __asm volatile (
        "str  %w0, %1\n"
        "ldr  %w0, %2\n"
        : "+r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}


This is from the point-of-view of thread#1, but I can swap the arguments
for thread#2. I’m expecting this to be inlined, and encouraging it with
static.

Alternatively, I could use C11 atomics with a relaxed memory order:

static int experiment(_Atomic int *w0, _Atomic int *w1)
{
    atomic_store_explicit(w0, 1, memory_order_relaxed);
    return atomic_load_explicit(w1, memory_order_relaxed);
}


Since this is a race and I want both threads to run their two experiment
instructions as simultaneously as possible, it would be wise to use some
sort of starting barrier… exactly the purpose of a thread barrier! It
will hold the threads back until they’re both ready.

int w0, w1, r0, r1;

// thread#1                   // thread#2
w0 = w1 = 0;
BARRIER;                      BARRIER;
r1 = experiment(&w0, &w1);    r0 = experiment(&w1, &w0);
BARRIER;                      BARRIER;

if (!r0 && !r1) {
    puts("impossible!");
}


The second thread goes straight into the barrier, but the first thread
does a little more work to initialize the experiment and a little more at
the end to check the result. The second barrier ensures they’re both done
before checking.

Running this only once isn’t so useful, so each thread loops a few million
times, hence the re-initialization in thread#1. The barriers keep them
in lockstep.
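
Putting the pieces together, a minimal harness might look like the sketch below. It is not the code from reorder.c: it assumes pthreads, the C11 relaxed-atomics variant of experiment, and the two-thread spin-lock barrier from the introduction. The names `run_trials`, `thread2`, and `ntrials` are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t barrier;  // zero-initialized
static _Atomic int w0, w1;
static int r0, r1;                // plain ints, ordered by the barrier
static long ntrials;

// Two-thread spin-lock barrier from the introduction.
static void barrier_wait(_Atomic uint32_t *b)
{
    uint32_t v = ++*b;
    if (v & 1) {
        for (v &= 2; (*b & 2) == v;);
    }
}

static int experiment(_Atomic int *store, _Atomic int *load)
{
    atomic_store_explicit(store, 1, memory_order_relaxed);
    return atomic_load_explicit(load, memory_order_relaxed);
}

static void *thread2(void *arg)
{
    (void)arg;
    for (long i = 0; i < ntrials; i++) {
        barrier_wait(&barrier);
        r0 = experiment(&w1, &w0);
        barrier_wait(&barrier);
    }
    return 0;
}

// Run n trials as thread#1. Returns how many observed r0 == r1 == 0,
// or -1 if the second thread could not be started.
static long run_trials(long n)
{
    pthread_t thr;
    long hits = 0;
    ntrials = n;
    if (pthread_create(&thr, 0, thread2, 0)) {
        return -1;
    }
    for (long i = 0; i < n; i++) {
        atomic_store_explicit(&w0, 0, memory_order_relaxed);
        atomic_store_explicit(&w1, 0, memory_order_relaxed);
        barrier_wait(&barrier);
        r1 = experiment(&w0, &w1);
        barrier_wait(&barrier);
        hits += !r0 && !r1;
    }
    pthread_join(thr, 0);
    return hits;
}
```

The count returned depends entirely on the hardware and on how tightly the two threads race, so the only safe expectation is a value between zero and the number of trials.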

Barrier selection

On my first attempt, I made the obvious decision for the barrier: I used
pthread_barrier_t. I was already using pthreads for spawning the
extra thread, including on Windows, so this was convenient.

However, my initial results were disappointing. I only observed an
“impossible” result around one in a million trials. With some debugging I
determined that the pthreads barrier was just too damn slow, throwing off
the timing. This was especially true with winpthreads, bundled with
Mingw-w64, which in addition to the per-barrier mutex, grabs a global
lock twice per wait to manage the barrier’s reference counter.

All pthreads implementations I used were quick to yield to the system
scheduler. The first thread to arrive at the barrier would go to sleep,
the second thread would wake it up, and it was rare they’d actually race
on the experiment. This is perfectly reasonable for a pthreads barrier
designed for the general case, but I really needed a spin-lock barrier.
That is, the first thread to arrive spins in a loop until the second
thread arrives, and it never interacts with the scheduler. This happens so
frequently and quickly that it should only spin for a few iterations.

Barrier design

Spin locking means atomics. By default, atomics have sequentially
consistent ordering and will provide the necessary synchronization for the
non-atomic experiment variables. Stores (e.g. to w0, w1) made before
the barrier will be visible to all other threads upon passing through the
barrier. In other words, the initialization will propagate before either
thread exits the first barrier, and results propagate before either thread
exits the second barrier.

I know statically that there are only two threads, simplifying the
implementation. The plan: When threads arrive, they atomically increment a
shared variable to indicate such. The first to arrive will see an odd
number, telling it to atomically read the variable in a loop until the
other thread changes it to an even number.

At first with just two threads this might seem like a single bit would
suffice. If the bit is set, the other thread hasn’t arrived. If clear,
both threads have arrived.

void broken_wait1(_Atomic unsigned *barrier)
{
    ++*barrier;
    while (*barrier&1);
}

Or to avoid an extra load, use the result directly:

void broken_wait2(_Atomic unsigned *barrier)
{
    if (++*barrier & 1) {
        while (*barrier&1);
    }
}


Neither of these works correctly, and the other mutex-free barriers I found
all have the same defect. Consider the broader picture: Between atomic
loads in the first thread’s spin-lock loop, suppose the second thread
arrives, passes through the barrier, does its work, hits the next barrier,
and increments the counter. Both threads see an odd counter simultaneously
and deadlock. No good.

To fix this, the wait function must also track the phase. The first
barrier is the first phase, the second barrier is the second phase, etc.
Conveniently the rest of the integer acts like a phase counter!
Writing this out more explicitly:

void barrier_wait(_Atomic unsigned *barrier)
{
    unsigned observed = ++*barrier;
    unsigned thread_count = observed & 1;
    if (thread_count != 0) {
        // not last arrival, watch for phase change
        unsigned init_phase = observed >> 1;
        for (;;) {
            unsigned current_phase = *barrier >> 1;
            if (current_phase != init_phase) {
                break;
            }
        }
    }
}


The key: When the last thread arrives, it overflows the thread counter to
zero and increments the phase counter in one operation.
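
To see the two fields at work, here is a single-threaded trace sketch. The helper names `arrived` and `phase` are hypothetical, and a plain unsigned stands in for the atomic counter since no second thread is involved.

```c
// Decode the two fields packed into the two-thread barrier integer.
static unsigned arrived(unsigned v) { return v & 1; }   // thread counter, 1 bit
static unsigned phase(unsigned v)   { return v >> 1; }  // phase counter, the rest

// Counter value after each arrival, starting from zero:
//   v=1: arrived=1 phase=0   first thread spins, watching phase 0
//   v=2: arrived=0 phase=1   second arrival carries into the phase
//   v=3: arrived=1 phase=1   first thread spins, watching phase 1
//   v=4: arrived=0 phase=2   and so on
```

The carry out of the low bit is what makes the increment both "count the last arrival" and "advance the phase" in a single atomic operation.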

By the way, I’m using unsigned since it may eventually overflow, and
even _Atomic int overflow is undefined for the ++ operator. However,
if you use atomic_fetch_add or C++ std::atomic then overflow is
defined and you can use int.
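
A sketch of that signed variant, using the hypothetical name `barrier_wait_int`. One wrinkle: atomic_fetch_add returns the old value, so the +1 needed to get the new value is done in unsigned arithmetic to keep that step well-defined as well.

```c
#include <stdatomic.h>

// Two-thread barrier on a signed counter. Initialize *barrier to zero.
static void barrier_wait_int(_Atomic int *barrier)
{
    // The stored value wraps silently on overflow. Compute the local
    // copy of the new value as unsigned to avoid plain int overflow.
    unsigned v = (unsigned)atomic_fetch_add(barrier, 1) + 1u;
    if (v & 1) {
        for (v &= 2; ((unsigned)atomic_load(barrier) & 2) == v;);
    }
}
```

Only the low two bits matter, so the barrier keeps working as the counter wraps from the largest int to the smallest.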

Threads can never be more than one phase apart by definition, so only one
bit is needed for the phase counter, making this effectively a two-phase,
two-bit barrier. In my final implementation, rather than shift (>>), I
mask (&) the phase bit with 2.

With this spin-lock barrier, the experiment observes r0 = r1 = 0 in ~10%
of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.

Generalizing to more threads

Two threads required two bits. This generalizes to log2(n)+1 bits for
n threads, where n is a power of two. You may have already figured out
how to support more threads: spend more bits on the thread counter.

// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
void barrier_waitn(_Atomic unsigned *barrier, int n)
{
    unsigned v = ++*barrier;
    if (v & (n - 1)) {
        for (v &= n; (*barrier&n) == v;);
    }
}


Note: It never makes sense for n to exceed the logical core count!
If it does, then at least one thread must not be actively running. The
spin-lock ensures it does not get scheduled promptly, and the barrier will
waste lots of resources doing nothing in the meantime.

If the barrier is used little enough that you won’t overflow the overall
barrier integer — maybe just use a uint64_t — an implementation could
support arbitrary thread counts with the same principle using modular
division instead of the & operator. The denominator is ideally a
compile-time constant in order to avoid paying for division in the
spin-lock loop.
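
A minimal sketch of that arbitrary-count variant, assuming a 64-bit counter that never wraps in practice. The name `barrier_wait_any` is hypothetical. Here the remainder plays the role of the thread counter and the quotient plays the role of the phase.

```c
#include <stdatomic.h>
#include <stdint.h>

// Spin-lock barrier for any n >= 1 threads. Initialize *barrier to zero.
// Assumes the 64-bit counter never overflows.
static void barrier_wait_any(_Atomic uint64_t *barrier, uint64_t n)
{
    uint64_t v = atomic_fetch_add(barrier, 1) + 1;
    if (v % n) {
        // Not the last arrival: spin until the phase (quotient) changes.
        uint64_t phase = v / n;
        while (atomic_load(barrier) / n == phase) {}
    }
}
```

In this sketch n is a run-time argument, so the loop pays for a real division on each iteration. Hard-coding n, or making the function inline with a constant argument, gives the compiler the chance to replace it with cheaper arithmetic.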

While C11 _Atomic seems like it would be useful, unsurprisingly it is
not supported by one major, stubborn implementation. If you’re
using C++11 or later, then go ahead and use std::atomic since it’s
well-supported. In real, practical C programs, I will continue using dual
implementations: interlocked functions on MSVC, and GCC built-ins (also
supported by Clang) everywhere else.

#if __GNUC__
#  define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
#  define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
#  define BARRIER_INC(x) _InterlockedIncrement(x)
#  define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif

// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
static void barrier_wait(int *barrier, int n)
{
    int v = BARRIER_INC(barrier);
    if (v & (n - 1)) {
        for (v &= n; (BARRIER_GET(barrier)&n) == v;);
    }
}


This has the nice bonus that the interface does not have the _Atomic
qualifier, nor std::atomic template. It’s just a plain old int, making
the interface simpler and easier to use. It’s something I’ve grown to
appreciate from Go.

If you’d like to try the experiment yourself: reorder.c. If
you’d like to see a test of Go and C sharing a thread barrier:
coop.go.

I’m intentionally not providing the spin-lock barrier as a library. First,
it’s too trivial and small for that, and second, I believe context is
everything. Now that you understand the principle, you can whip up
your own, custom-tailored implementation when the situation calls for it,
just as the one in my experiment is hard-coded for exactly two threads.




Some sanity for C and C++ development on Windows
2021-12-30T23:25:53Z
A hard reality of C and C++ software development on Windows is that there
has never been a good, native C or C++ standard library implementation for
the platform. A standard library should abstract over the underlying host
facilities in order to ease portable software development. On Windows, C
and C++ are so poorly hooked up to operating system interfaces that most
portable or mostly-portable software — programs which work perfectly
elsewhere — are subtly broken on Windows, particularly outside of the
English-speaking world. The reasons are almost certainly political,
originally motivated by vendor lock-in, rather than technical, which adds insult
to injury. This article is about what’s wrong, how it’s wrong, and some
easy techniques to deal with it in portable software.

There are multiple C implementations, so how could they all be
bad, even the early ones? Microsoft’s C runtime has defined how
the standard library should work on the platform, and everyone else
followed along for the sake of compatibility. I’m excluding Cygwin and
its major fork, MSYS2, even though they don’t inherit any of these flaws. They
change so much that they’re effectively whole new platforms, not truly
“native” to Windows.

In practice, C++ standard libraries are implemented on top of a C standard
library, which is why C++ shares the same problems. CPython dodges these
issues: Though written in C, on Windows it bypasses the broken C standard
library and directly calls the proprietary interfaces. Other language
implementations, such as “gc” Go, simply aren’t built on C at all, and
instead do things correctly in the first place — the behaviors the C
runtimes should have had all along.

If you’re just working on one large project, bypassing the C runtime isn’t
such a big deal, and you’re likely already doing so to access important
platform functionality. You don’t really even need a C runtime. However,
if you write many small programs, as I do, writing the same
special Windows support for each one ends up being most of the work, and
honestly makes properly supporting Windows not worth the trouble. I end up
just accepting the broken defaults most of the time.

Before diving into the details, if you’re looking for a quick-and-easy
solution for the Mingw-w64 toolchain, including w64devkit, which
magically makes your C and C++ console programs behave well on Windows,
I’ve put together a “library” named libwinsane. It solves all
problems discussed in this article, except for one. No source changes
required, simply link it into your program.

What exactly is broken?

The Windows API comes in two flavors: narrow with an “A” (“ANSI”) suffix,
and wide (Unicode, UTF-16) with a “W” suffix. The former is the legacy
API, where an active code page maps 256 bytes onto (up to) 256 specific
characters. On typical machines configured for European languages, this
means code page 1252. Roughly speaking, Windows
internally uses UTF-16, and calls through the narrow interface use the
active code page to translate the narrow strings to wide strings. The
result is that calls through the narrow API have limited access to the
system.

The UTF-8 encoding was invented in 1992 and standardized by January 1993.
UTF-8 was adopted by the unix world over the following years due to its
backwards-compatibility with its existing interfaces. Programs
could read and write Unicode data, access Unicode paths, pass Unicode
arguments, and get and set Unicode environment variables without needing
to change anything. Today UTF-8 has become the dominant text encoding
format in the world, in large part due to the world wide web.

In July 1993, Microsoft introduced the wide Windows API with the release
of Windows NT 3.1, placing all their bets on UCS-2 (later UTF-16) rather
than UTF-8. This turned out to be a mistake, since UTF-16 is inferior to
UTF-8 in practically every way, though admittedly some problems
weren’t so obvious at the time.

The major problem: The C and C++ standard libraries only hook up to the
narrow Windows interfaces. The standard library, and therefore typical
portable software on Windows, cannot handle anything but ASCII. The
effective result is that these programs:


  Cannot accept non-ASCII arguments
  Cannot get/set non-ASCII environment variables
  Cannot access non-ASCII paths
  Cannot read and write non-ASCII on a console


Doing any of these requires calling proprietary functions, treating
Windows as a special target. It’s part of what makes correctly porting
software to Windows so painful.

The sensible solution would have been for the C runtime to speak UTF-8 and
connect to the wide API. Alternatively, the narrow API could have been
changed over to UTF-8, phasing out the old code page concept. In theory
this is what the UTF-8 “code page” is about, though it doesn’t always
work. There would have been compatibility problems with abruptly making
such a change, but until very recently, this wasn’t even an option. Why
couldn’t there be a switch I could flip to get sane behavior that works
like every other platform?

How to mostly fix Unicode support

In 2019, Microsoft introduced a feature to allow programs to request
UTF-8 as their active code page on start, along with supporting
UTF-8 on more narrow API functions. This is like the magic switch I
wanted, except that it involves embedding some ugly XML into your binary
in a particular way. At least it’s now an option.

For Mingw-w64, that means writing a resource file like so:

#include <winuser.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "utf8.xml"


Compiling it with windres:

$ windres -o manifest.o manifest.rc


Then linking that into your program. Amazingly it mostly works! Programs
can access Unicode arguments, Unicode environment variables, and Unicode
paths, including with fopen, just as it’s worked on other platforms for
decades. Since the active code page is set at load time, it happens before
argv is constructed (from GetCommandLineA), which is why that works
out.

Alternatively you could create a “side-by-side assembly” placing that XML
in a file with the same name as your EXE but with .manifest suffix
(after the .exe suffix), then placing that next to your EXE. Just be
mindful that there’s a “side-by-side” cache (WinSxS), and so it might not
immediately pick up your changes.

What doesn’t work is console input and output since the console is
external to the process, and so isn’t covered by the process’s active code
page. It must be configured separately using a proprietary call:

SetConsoleOutputCP(CP_UTF8);


Annoying, but at least it’s not that painful. This only covers output,
though, meaning programs can only print UTF-8. Unfortunately UTF-8 input
still doesn’t work, and setting the input code page doesn’t do
anything despite reporting success:

SetConsoleCP(CP_UTF8);  // doesn't work


If you care about reading interactive Unicode input, you’re stuck
bypassing the C runtime since it’s still broken.

Text stream translation

Another long-standing issue is that C and C++ on Windows have distinct
“text” and “binary” streams, which they inherited from DOS. Mainly this
means automatic newline conversion between CRLF and LF. The C standard
explicitly allows for this, though unix-like platforms have never actually
distinguished between text and binary streams.
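
To illustrate why this matters for binary data, here is a rough emulation of what a text-mode read does to incoming bytes. It is an illustration only, not the C runtime's actual code, and the name `crlf_to_lf` is hypothetical.

```c
#include <stddef.h>

// Emulate read-side text-mode translation in place: each CRLF pair
// becomes a single LF. Returns the new length.
static size_t crlf_to_lf(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r' && i + 1 < len && buf[i+1] == '\n') {
            continue;  // drop the CR, keep the LF that follows
        }
        buf[out++] = buf[i];
    }
    return out;
}
```

For text this is harmless. For an image or a video frame, any 0x0D 0x0A byte pair that happens to appear in the data silently loses a byte, corrupting everything after it.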

The standard also specifies that standard input, output, and error are all
open as text streams, and there’s no portable method to change the stream
mode to binary — a serious deficiency with the standard. On unix-likes
this doesn’t matter, but on Windows it means programs can’t read or write
binary data on standard streams without calling a non-standard function.
It also means reading and writing standard streams is slow, frequently a
bottleneck unless I route around it.

Personally, I like writing binary data to standard output,
including video, and sometimes binary filters that also read
binary input. I do it so often that in probably half my C programs I have
this snippet in main just so they work correctly on Windows:

    #ifdef _WIN32
    int _setmode(int, int);
    _setmode(0, 0x8000);
    _setmode(1, 0x8000);
    #endif


That incantation sets standard input and output in the C runtime to binary
mode without the need to include a header, making it compact, simple, and
self-contained.

This built-in newline translation, along with the Windows standard text
editor, Notepad, lagging decades behind, meant that many other
programs, including Git, grew their own, annoying, newline conversion
misfeatures that cause other problems.

libwinsane

I introduced libwinsane at the beginning of the article, which fixes all
this simply by being linked into a program. It includes the magic XML
manifest .rsrc section, configures the console for UTF-8 output, and
sets standard streams to binary before main (via a GCC constructor). I
called it a “library”, but it’s actually a single object file. It can’t be
a static library since it must be linked into the program despite not
actually being referenced by the program.
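
The constructor part can be sketched as follows. This is an outline of the idea rather than libwinsane's actual source: the names `winsane_init` and `init_ran` are hypothetical, the Windows calls are the ones shown earlier in this article, and the manifest resource is not covered here.

```c
#ifdef _WIN32
#  include <windows.h>
#endif

static int init_ran;  // set once the constructor has run

// GCC/Clang extension: run this function before main().
__attribute__((constructor))
static void winsane_init(void)
{
#ifdef _WIN32
    int _setmode(int, int);
    SetConsoleOutputCP(CP_UTF8);  // console output as UTF-8
    _setmode(0, 0x8000);          // standard input to binary mode
    _setmode(1, 0x8000);          // standard output to binary mode
#endif
    init_ran = 1;
}
```

Nothing in the program calls `winsane_init`, which is why the object file has to be named explicitly on the link line. On other platforms the function reduces to setting the flag.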

So normally this program:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char *arg = argv[argc-1];
    size_t len = strlen(arg);
    printf("%zu %s\n", len, arg);
}


Compiled and run:

C:\>cc -o example example.c
C:\>example π
1 p


As usual, the Unicode argument is silently mangled into one byte. Linked
with libwinsane, it just works like everywhere else:

C:\>gcc -o example example.c libwinsane.o
C:\>example π
2 π
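
The 2 is a byte count: π is U+03C0, which UTF-8 encodes as the two bytes 0xCF 0x80, and strlen counts bytes rather than characters. A small sketch with a hypothetical helper, `utf8_count`, shows how to count code points instead.

```c
#include <stddef.h>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which always have the form 10xxxxxx. Assumes valid UTF-8 input.
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        n += ((unsigned char)*s & 0xc0) != 0x80;
    }
    return n;
}
```

For the argument above, strlen reports 2 while `utf8_count` reports 1.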


If you’re maintaining a substantial program, you probably want to copy and
integrate the necessary parts of libwinsane into your project and build,
rather than always link against this loose object file. This is more for
convenience and for succinctly capturing the concept. You may even want to
enable ANSI escape processing in your version.

Update December 2024: Pavel Galkin demonstrates how libwinsane.o
changes the console state, which affects all processes associated
with the terminal. This is mostly unavoidable, and it’s one reason I’ve
since concluded that UTF-8 manifests are a poor solution. Better to solve
the problem using a platform layer.




More DLL fun with w64devkit: Go, assembly, and Python
2021-06-29T21:50:30Z
My previous article explained how to work with dynamic-link libraries
(DLLs) using w64devkit. These techniques also apply to other
circumstances, including with languages and ecosystems outside of C and
C++. In particular, w64devkit is a great complement to Go and reliably
fulfills all the needs of cgo (Go’s C interop) and can even
bootstrap Go itself. As before, this article is in large part an exercise
in capturing practical information I’ve picked up over time.

Go: bootstrap and cgo

The primary Go implementation, confusingly named “gc”, is an
incredible piece of software engineering. This is apparent when
building the Go toolchain itself, a process that is fast, reliable, easy,
and simple. It was originally written in C, but was re-written in Go
starting with Go 1.5. The C compiler in w64devkit can build the original C
implementation which then can be used to bootstrap any more recent
version. It’s so easy that I personally never use official binary releases
and always bootstrap from source.

You will need the Go 1.4 source, go1.4-bootstrap-20171003.tar.gz.
This “bootstrap” tarball is the last Go 1.4 release plus a few additional
bugfixes. You will also need the source of the actual version of Go you
want to use, such as Go 1.16.5 (latest version as of this writing).

Start by building Go 1.4 using w64devkit. On Windows, Go is built using a
batch script and no special build system is needed. Since it shouldn’t be
invoked with the BusyBox ash shell, I use cmd.exe explicitly.

$ tar xf go1.4-bootstrap-20171003.tar.gz
$ mv go/ bootstrap
$ (cd bootstrap/src/ && cmd /c make)


In about 30 seconds you’ll have a fully-working Go 1.4 toolchain. Next use
it to build the desired toolchain. You can move this new toolchain after
it’s built if necessary.

$ export GOROOT_BOOTSTRAP="$PWD/bootstrap"
$ tar xf go1.16.5.src.tar.gz
$ (cd go/src/ && cmd /c make)


At this point you can delete the bootstrap toolchain. You probably also
want to put Go on your PATH.

$ rm -rf bootstrap/
$ printf 'PATH="$PATH;%s/go/bin"\n' "$PWD" >>~/.profile
$ source ~/.profile


Not only is Go now available, so is the full power of cgo. (Including its
costs if used.)

Vim suggestions

Since w64devkit is oriented so much around Vim, here’s my personal Vim
configuration for Go. I don’t need or want fancy plugins, just access to
goimports and a couple of corrections to Vim’s built-in Go support ([[
and ]] navigation). The included ctags understands Go, so tags
navigation works the same as it does with C. \i saves the current
buffer, runs goimports, and populates the quickfix list with any errors.
Similarly :make invokes go build and, as expected, populates the
quickfix list.

autocmd FileType go setlocal makeprg=go\ build
autocmd FileType go map <silent> <buffer> <leader>i
    \ :update \|
    \ :cexpr system("goimports -w " . expand("%")) \|
    \ :silent edit<cr>
autocmd FileType go map <buffer> [[
    \ ?^\(func\\|var\\|type\\|import\\|package\)\><cr>
autocmd FileType go map <buffer> ]]
    \ /^\(func\\|var\\|type\\|import\\|package\)\><cr>


Go only comes with gofmt but goimports is just one command away, so
there’s little excuse not to have it:

$ go install golang.org/x/tools/cmd/goimports@latest


Thanks to GOPROXY, all Go dependencies are accessible without (or before)
installing Git, so this tool installation works with nothing more than
w64devkit and a bootstrapped Go toolchain.

cgo DLLs

The intricacies of cgo are beyond the scope of this article, but the gist
is that a Go source file contains C source in a comment followed by
import "C". The imported C object provides access to C types and
functions. Go functions marked with an //export comment, as well as the
commented C code, are accessible to C. The latter means we can use Go to
implement a C interface in a DLL, and the caller will have no idea they’re
actually talking to Go.

To illustrate, here’s a little C interface. To keep it simple, I’ve
specifically sidestepped some more complicated issues, particularly
involving memory management.

// Which DLL am I running?
int version(void);

// Generate 64 bits from a CSPRNG.
unsigned long long rand64(void);

// Compute the Euclidean norm.
float dist(float x, float y);


Here’s a C implementation which I’m calling “version 1”.

#include <math.h>
#include <windows.h>
#include <ntsecapi.h>

__declspec(dllexport)
int
version(void)
{
    return 1;
}

__declspec(dllexport)
unsigned long long
rand64(void)
{
    unsigned long long x;
    RtlGenRandom(&x, sizeof(x));
    return x;
}

__declspec(dllexport)
float
dist(float x, float y)
{
    return sqrtf(x*x + y*y);
}


As discussed in the previous article, each function is exported using
__declspec so that they’re available for import. As before:

$ cc -shared -Os -s -o hello1.dll hello1.c


Side note: This could be trivially converted into a C++ implementation
just by adding extern "C" to each declaration. It disables C++ features
like name mangling, and follows the C ABI so that the C++ functions appear
as C functions. Compiling the C++ DLL is exactly the same.
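
A common companion idiom, sketched here on the assumption that the interface lives in a header shared by C and C++ callers, is to make the extern "C" conditional so the same declarations work in both languages. Only version is shown. The other two functions would follow the same pattern.

```c
// Hypothetical header section: give the declarations C linkage when a C++
// compiler sees them, and leave them untouched for a C compiler.
#ifdef __cplusplus
extern "C" {
#endif

int version(void);

#ifdef __cplusplus
}
#endif

// The definition. In the actual DLL it would also carry
// __declspec(dllexport), left out here so the sketch builds anywhere.
int version(void)
{
    return 1;
}
```

Compiled as C, the preprocessor drops the extern "C" lines entirely. Compiled as C++, they wrap the declaration, so callers in either language link against the same unmangled symbol.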

Suppose we wanted to implement this in Go instead of C. We already have
all the tools needed to do so. Here’s a Go implementation, “version 2”:

package main

import "C"
import (
	"crypto/rand"
	"encoding/binary"
	"math"
)

//export version
func version() C.int {
	return 2
}

//export rand64
func rand64() C.ulonglong {
	var buf [8]byte
	rand.Read(buf[:])
	r := binary.LittleEndian.Uint64(buf[:])
	return C.ulonglong(r)
}

//export dist
func dist(x, y C.float) C.float {
	return C.float(math.Sqrt(float64(x*x + y*y)))
}

func main() {
}


Note the use of C types for all arguments and return values. The main
function is required since this is the main package, but it will never be
called. The DLL is built like so:

$ go build -buildmode=c-shared -o hello2.dll hello2.go


Without the -o option, the DLL will lack an extension. This works fine
since the extension is mostly just a convention on Windows, but omitting
it may be confusing.

What if we need an import library? This will be required when linking with
the MSVC toolchain. In the previous article we asked Binutils to generate
one using --out-implib. For Go we have to handle this ourselves via
gendef and dlltool.

$ gendef hello2.dll
$ dlltool -l hello2.lib -d hello2.def


The only way anyone upgrading would know version 2 was implemented in Go
is that the DLL is a lot bigger (a few MB vs. a few kB) since it now
contains an entire Go runtime.

NASM assembly DLL

We could also go the other direction and implement the DLL using plain
assembly. It won’t even require linking against a C runtime.

w64devkit includes two assemblers: GAS (Binutils) which is used by GCC,
and NASM which has friendlier syntax. I prefer the latter whenever
possible — exactly why I included NASM in the distribution. So here’s how
I implemented “version 3” in NASM assembly.

bits 64

section .text

global DllMainCRTStartup
export DllMainCRTStartup
DllMainCRTStartup:
	mov eax, 1
	ret

global version
export version
version:
	mov eax, 3
	ret

global rand64
export rand64
rand64:
	rdrand rax
	ret

global dist
export dist
dist:
	mulss  xmm0, xmm0
	mulss  xmm1, xmm1
	addss  xmm0, xmm1
	sqrtss xmm0, xmm0
	ret


The global directive is common in NASM assembly and causes the named
symbol to have the external linkage needed when linking the DLL. The
export directive is Windows-specific and is equivalent to dllexport in
C.

Every DLL must have an entrypoint, usually named DllMainCRTStartup. The
return value indicates if the DLL successfully loaded. So far this has
been handled automatically by the C implementation, but at this low level
we must define it explicitly.
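
For comparison, here is roughly what the same entry point looks like when written in C. This is a sketch rather than part of the original DLLs: it assumes a build without the usual CRT startup code (for instance GCC’s -nostartfiles), and HELLO_EXPORT is a hypothetical macro that keeps the example compilable outside Windows.

```c
#ifdef _WIN32
#include <windows.h>

// C counterpart of the assembly entry point: ignore the arguments and
// report success to the loader, like "mov eax, 1" followed by "ret".
BOOL WINAPI DllMainCRTStartup(HINSTANCE inst, DWORD reason, LPVOID reserved)
{
    (void)inst;
    (void)reason;
    (void)reserved;
    return TRUE;
}

#define HELLO_EXPORT __declspec(dllexport)
#else
// Not targeting Windows: the macro expands to nothing.
#define HELLO_EXPORT
#endif

// Returns the same value as the assembly "version 3", for comparison.
HELLO_EXPORT
int version(void)
{
    return 3;
}
```

The loader passes three arguments to the entry point. The assembly version gets away with ignoring them because it never touches the argument registers.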

Here’s how to assemble and link the DLL:

$ nasm -fwin64 -o hello3.o hello3.s
$ ld -shared -s -o hello3.dll hello3.o


Call the DLLs from Python

Python has a nice, built-in C interop, ctypes, that allows Python to
call arbitrary C functions in shared libraries, including DLLs, without
writing C to glue it together. To tie this all off, here’s a Python
program that loads all of the DLLs above and invokes each of the
functions:

import ctypes

def load(version):
    hello = ctypes.CDLL(f"./hello{version}.dll")
    hello.version.restype = ctypes.c_int
    hello.version.argtypes = ()
    hello.dist.restype = ctypes.c_float
    hello.dist.argtypes = (ctypes.c_float, ctypes.c_float)
    hello.rand64.restype = ctypes.c_ulonglong
    hello.rand64.argtypes = ()
    return hello

for hello in load(1), load(2), load(3):
    print("version", hello.version())
    print("rand   ", f"{hello.rand64():016x}")
    print("dist   ", hello.dist(3, 4))


After loading the DLL with CDLL the program defines each function
prototype so that Python knows how to call it. Unfortunately it’s not
possible to build Python with w64devkit, so you’ll also need to install
the standard CPython distribution in order to run it. Here’s the output:

$ python finale.py
version 1
rand    b011ea9bdbde4bdf
dist    5.0
version 2
rand    f7c86ff06ae3d1a2
dist    5.0
version 3
rand    2a35a05b0482c898
dist    5.0


That output is the result of four different languages interfacing in one
process: C, Go, x86-64 assembly, and Python. Pretty neat if you ask me!




How to build and use DLLs on Windows
2021-05-31T02:13:40Z
I’ve recently been involved with a couple of discussions about Windows’
dynamic linking. One was with Joe Nelson, in considering how to make
libderp accessible on Windows, and the other was about w64devkit,
my Mingw-w64 distribution. I use these techniques so infrequently that I
need to figure it all out again each time I need it. Unfortunately there’s
a whole lot of outdated and incorrect information online which gets in the
way every time this happens. While it’s all fresh in my head, I will now
document what I know works.

In this article, all commands and examples are being run in the context of
w64devkit (1.8.0).

Mingw-w64

If all you care about is the GNU toolchain then DLLs are straightforward,
working mostly like shared objects on other platforms. To illustrate,
let’s build a “square” library with one “exported” function, square,
that returns the square of its input (square.c):

long square(long x)
{
    return x * x;
}


The header file (square.h):

#ifndef SQUARE_H
#define SQUARE_H

long square(long);

#endif


To build a stripped, size-optimized DLL, square.dll:

$ cc -shared -Os -s -o square.dll square.c


Now a test program to link against it (main.c), which “imports” square
from square.dll:

#include <stdio.h>
#include "square.h"

int main(void)
{
    printf("%ld\n", square(2));
}


Linking and testing it:

$ cc -Os -s main.c square.dll
$ ./a
4


It’s that simple. Or more traditionally, using the -l flag:

$ cc -Os -s -L. main.c -lsquare


Given -lxyz GCC will look for xyz.dll in the library path.

Viewing exported symbols

Printing a list of the functions exported by a DLL is not so
straightforward. For ELF shared objects there’s nm -D, but despite what
the internet will tell you, this tool does not support DLLs. objdump
will print the exports as part of the “private” headers (-p). A bit of
awk can cut this down to just a list of exports. Since we’ll need this a
few times, here’s a script, exports.sh, that composes objdump and
awk into the tool I want:

#!/bin/sh
set -e
printf 'LIBRARY %s\nEXPORTS\n' "$1"
objdump -p "$1" | awk '/^$/{t=0} {if(t)print$NF} /^\[O/{t=1}'


Running this on square.dll above:

$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
square


This can be helpful when debugging. It also works outside of Windows, such
as on Linux. By the way, the output format is no accident: This is the
.def file format, which will be particularly useful in a moment.

Mingw-w64 has a gendef tool to produce the above output, and this tool
is now included in w64devkit. To print the exports to standard output:

$ gendef - square.dll
LIBRARY "square.dll"
EXPORTS
square


Alternatively Visual Studio provides dumpbin. It’s not as concise as
exports.sh but it’s a lot less verbose than objdump -p.

$ dumpbin /nologo /exports square.dll
...
          1    0 000012B0 square
...


Mingw-w64 (improved)

You can get by without knowing anything more, which is usually enough for
those looking to support Windows as a secondary platform, even just as a
cross-compilation target. However, with a bit more work we can do better.
Imagine doing the above with a non-trivial program. GCC doesn’t know which
functions are part of the API and which are not. Obviously static
functions should not be exported, but what about non-static functions
visible between translation units (i.e. object files)?

For instance, suppose square.c also has this function which is not part
of its API but may be called by another translation unit.

void internal_func(void) {}


Now when I rebuild and list the exports:

$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
internal_func
square


On the other side, when I build main.c how does it know which functions
are imported from a DLL and which will be found in another translation
unit? GCC makes it work regardless, but it can generate more efficient
code if it knows at compile time (vs. link time).

On Windows both are solved by adding __declspec notation on both sides.
In square.c the exports are marked as dllexport:

__declspec(dllexport)
long square(long x)
{
    return x * x;
}

void internal_func(void) {}


In the header, it’s marked as an import:

__declspec(dllimport)
long square(long);


The mere presence of dllexport tells the linker to only export those
functions marked as exports, and so internal_func disappears from the
exports list. Convenient!

On the import side, during compilation of the original program, GCC
assumed square wasn’t an import and generated a local function call.
When the linker later resolved the symbol to the DLL, it generated a
trampoline to fill in as that local function (like a PLT). With
dllimport, GCC knows it’s an imported function and so doesn’t go through
a trampoline.
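
The difference can be modeled in plain C. This is only an illustration with hypothetical names, not what the toolchain literally emits: imp_square stands in for the import table slot that the loader fills in (in Mingw-w64 the real symbol conventionally carries an __imp_ prefix), and square_stub stands in for the linker-generated trampoline.

```c
// The real function, living in the "DLL".
long square_impl(long x)
{
    return x * x;
}

// Stand-in for the import table slot that the loader fills in with the
// function's address when the DLL is loaded.
long (*imp_square)(long) = square_impl;

// Stand-in for the trampoline the linker supplies when the compiler did
// not know the function was imported: it just jumps through the slot.
long square_stub(long x)
{
    return imp_square(x);
}

// Without dllimport: an ordinary call to a local function (the stub),
// which then goes through the slot. Two hops.
long call_without_dllimport(long x)
{
    return square_stub(x);
}

// With dllimport: the compiler calls through the slot itself. One hop.
long call_with_dllimport(long x)
{
    return imp_square(x);
}
```

Both paths produce the same result. The annotated one just saves a jump on every call.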

While generally unnecessary for the GNU toolchain, it’s good hygiene to
use __declspec. It’s also mandatory when using MSVC, in case you
care about that as well.

MSVC

Mingw-w64-compiled DLLs will work with LoadLibrary out of the box, which
is sufficient in many cases, such as for dynamically-loaded plugins. For
example (loadlib.c):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    HMODULE h = LoadLibrary("square.dll");
    long (*square)(long) = (long (*)(long))GetProcAddress(h, "square");
    printf("%ld\n", square(2));
}


Compiled with MSVC cl (via vcvars.bat):

$ cl /nologo loadlib.c
$ ./loadlib
4


However, the MSVC linker, unlike Binutils ld, cannot link directly with
DLLs. It requires an import library. Conventionally this matches the DLL
name but has a .lib extension — square.lib in this case. The Mingw-w64
ecosystem conventionally uses .dll.a, as in square.dll.a, in order to
distinguish it from a static library, but it’s the same format. The most
convenient way to get an import library is to ask GCC to generate one at
link-time via --out-implib:

$ cc -shared -Wl,--out-implib,square.lib -o square.dll square.c


Back to cl, just add square.lib as another input. You don’t actually
need square.dll present at link time.

$ cl /nologo /Os main.c square.lib
$ ./main
4


What if you already have the DLL and you just need an import library? GNU
Binutils’ dlltool can do this, though not without help. It cannot
generate an import library from a DLL alone since it requires a .def
file enumerating the exports. (Why?) What luck that we have a tool for
this!

$ ./exports.sh square.dll >square.def
$ dlltool --input-def square.def --output-lib square.lib


Reversing directions

Going the other way, building a DLL with MSVC and linking it with
Mingw-w64, is nearly as easy as the pure Mingw-w64 case, though it
requires that all exports are tagged with dllexport. The /LD option (case
sensitive) is just like GCC’s -shared.

$ cl /nologo /LD /Os square.c
$ cc -Os -s main.c square.dll
$ ./a
4


cl outputs three files: square.dll, square.lib, and square.exp.
The last can be discarded, and the second will be needed if linking with
MSVC, but as before, Mingw-w64 requires only the first.

This all demonstrates that Mingw-w64 and MSVC are quite interoperable — at
least for C interfaces that don’t share CRT objects.
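
To make that caveat concrete: “CRT objects” are things like heap allocations and FILE handles, which belong to whichever C runtime created them. The DLL and its caller may each be using a different runtime, so the safe pattern is for the library to free what it allocates. Here is a minimal sketch of that pattern. The square_table functions and the size cap are hypothetical and not part of the square library above.

```c
#include <stddef.h>
#include <stdlib.h>

#define SQUARE_TABLE_MAX 1024  // arbitrary cap, keeps i*i far from overflow

// Hypothetical API: allocate a table of the first n squares using the
// library's own CRT. Returns NULL if n is out of range or on allocation
// failure.
long *square_table_new(int n)
{
    if (n < 1 || n > SQUARE_TABLE_MAX) {
        return NULL;
    }
    long *t = malloc(sizeof(*t) * (size_t)n);
    if (t != NULL) {
        for (int i = 0; i < n; i++) {
            t[i] = (long)i * i;
        }
    }
    return t;
}

// The matching release function lives in the same library, so the table
// is freed by the same CRT that allocated it. The caller never passes
// the pointer to its own free().
void square_table_free(long *t)
{
    free(t);
}
```

In a DLL both functions would be exported, and the caller would treat the returned pointer as borrowed memory to hand back through square_table_free.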

Tying it all together

If your program is designed to be portable, those __declspec will get in
the way. That can be tidied up with some macros, but even better, those
macros can be used to control ELF symbol visibility so that the library
has good hygiene on, say, Linux as well.

The strategy will be to mark all API functions with SQUARE_API and
expand that to whatever is necessary at the time. When building a library,
it will expand to dllexport, or default visibility on unix-likes. When
consuming a library it will expand to dllimport, or nothing outside of
Windows. The new square.h:

#ifndef SQUARE_H
#define SQUARE_H

#if defined(SQUARE_BUILD)
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllexport)
#  elif defined(__ELF__)
#    define SQUARE_API __attribute__ ((visibility ("default")))
#  else
#    define SQUARE_API
#  endif
#else
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllimport)
#  else
#    define SQUARE_API
#  endif
#endif

SQUARE_API
long square(long);

#endif


The new square.c:

#define SQUARE_BUILD
#include "square.h"

SQUARE_API
long square(long x)
{
    return x * x;
}


main.c remains the same. When compiling on unix-like systems, add
-fvisibility=hidden to hide all symbols by default so that this macro
can reveal them.

$ cc -shared -Os -fvisibility=hidden -s -o libsquare.so square.c
$ cc -Os -s main.c ./libsquare.so
$ ./a.out
4


Makefile ideas

While Mingw-w64 hides a lot of the differences between Windows and
unix-like systems, when it comes to dynamic libraries it can only do so
much, especially if you care about import libraries. If I were maintaining
a dynamic library — unlikely since I strongly prefer embedding or static
linking — I’d probably just use different Makefiles per toolchain
and target. Aside from the SQUARE_API type of macros, the source code
can fortunately remain fairly agnostic about it.

Here’s what I might use as NMakefile for MSVC nmake:

CC     = cl /nologo
CFLAGS = /Os

all: main.exe square.dll square.lib

main.exe: main.c square.h square.lib
	$(CC) $(CFLAGS) main.c square.lib

square.dll: square.c square.h
	$(CC) /LD $(CFLAGS) square.c

square.lib: square.dll

clean:
	-del /f main.exe square.dll square.lib square.exp


Usage:

nmake /nologo /f NMakefile


For w64devkit and cross-compiling, Makefile.w64, which includes
import library generation for the sake of MSVC consumers:

CC      = cc
CFLAGS  = -Os
LDFLAGS = -s
LDLIBS  =

all: main.exe square.dll square.lib

main.exe: main.c square.dll square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c square.dll $(LDLIBS)

square.dll: square.c square.h
	$(CC) -shared -Wl,--out-implib,$(@:dll=lib) \
	    $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

square.lib: square.dll

clean:
	rm -f main.exe square.dll square.lib


Usage:

make -f Makefile.w64


And a Makefile for everyone else:

CC      = cc
CFLAGS  = -Os -fvisibility=hidden
LDFLAGS = -s
LDLIBS  =

all: main libsquare.so

main: main.c libsquare.so square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c ./libsquare.so $(LDLIBS)

libsquare.so: square.c square.h
	$(CC) -shared $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

clean:
	rm -f main libsquare.so


Now that I have this article, I’m glad I won’t have to figure this all out
again next time I need it!




A guide to Windows application development using w64devkit
2021-03-11T01:40:31Z
There’s a trend of building services where a monolithic application is
better suited, or using JavaScript and Python then being stumped by their
troublesome deployment story. This leads to solutions like bundling an
entire web browser with an application, or using containers to
circumscribe a sprawling dependency tree made of mystery meat.

My small development distribution for Windows, w64devkit,
is my own little way of pushing back against this trend where it affects
me most. Following in the footsteps of projects like Handmade Hero
and Making a Video Game from Scratch, this is my guide to
no-nonsense software development using my development kit. It’s an
overview of the tooling and development workflow, and I’ve tried not to
assume too much knowledge of the reader. Being a guide rather than a manual,
it is incomplete on its own, and I link to substantial external resources
to fill in the gaps. The guide is capped with a small game I wrote
entirely using my development kit, serving as a demonstration of what
sorts of things are not only possible, but quite reasonably attainable.






Game repository: https://github.com/skeeto/asteroids-demo

Guide to source: Understanding Asteroids

Initial setup

Of course you cannot use the development kit if you don’t have it yet. Go
to the releases section and download the latest release. It will be
a .zip file named w64devkit-x.y.z.zip where x.y.z is the version.

You will need to unzip the development kit before using it. Windows has
built-in support for .zip files, so you can either right-click to access
“Extract All…” or navigate into it as a folder then drag-and-drop the
w64devkit directory somewhere outside the .zip file. It doesn’t care
where it’s unzipped (aka it’s “portable”), so put it wherever is
convenient: your desktop, user profile directory, a thumb drive, etc. You
can move it later if you change your mind just so long as you’re not
actively running it. If you decide you don’t need it anymore then delete
it.

Entering the development environment

There is a w64devkit.exe in the unzipped w64devkit directory. This is
the easiest way to enter the development environment, and will not require
system configuration changes. This program puts the kit’s programs in the
PATH environment variable then runs a Bourne shell — the standard unix
shell. Aside from the text editor, this is the primary interface for
developing software. In time you may even extend this environment with
your own tools.

If you want an additional “terminal” window, run w64devkit.exe again. If
you use it a lot, you may want to create a shortcut and even pin it to
your task bar.

Whether on Windows or unix-like systems, when you type a command into the
system shell it uses the PATH environment variable to locate the actual
program to run for that command. In practice, the PATH variable is a
concatenation of multiple directories, and the shell searches these
directories in order. On unix-like systems, PATH elements are separated
by colons. However, Windows uses colons to delimit drive letters, so its
PATH elements are separated by semicolons.

# Prepending to PATH on unix
PATH="$HOME/bin:$PATH"

# Prepending to PATH on Windows (w64devkit)
PATH="$HOME/bin;$PATH"
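
To make the search order concrete, here is a rough C sketch of the lookup a shell performs. It is a simplification, not how BusyBox actually implements it: the names path_lookup, exists_fn, and only_cc_exists are hypothetical, the existence check is a callback so the example needs no real file system, and details such as empty PATH elements or trying an .exe suffix are ignored.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical callback type: returns nonzero if a file exists at path.
typedef int (*exists_fn)(const char *path);

// Search each directory of a PATH-style list, in order, for cmd. The
// separator is ':' on unix-like systems and ';' on Windows. On success
// the first matching "dir/cmd" is left in out and 1 is returned. Returns
// 0 if nothing matches, in which case the contents of out are unspecified.
int path_lookup(const char *path, char sep, const char *cmd,
                exists_fn exists, char *out, size_t outlen)
{
    const char *p = path;
    for (;;) {
        const char *end = strchr(p, sep);
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len > 0) {  // empty elements are skipped in this sketch
            int n = snprintf(out, outlen, "%.*s/%s", (int)len, p, cmd);
            if (n > 0 && (size_t)n < outlen && exists(out)) {
                return 1;
            }
        }
        if (!end) {
            return 0;  // ran out of directories
        }
        p = end + 1;
    }
}

// Stand-in for a real file system check: pretend only one file exists.
int only_cc_exists(const char *path)
{
    return strcmp(path, "C:/w64devkit/bin/cc") == 0;
}
```

With ';' as the separator, a list like C:/Users/me/bin;C:/w64devkit/bin yields two directories, and cc is found in the second. Splitting the same list on ':' would instead cut each element at its drive letter, which is exactly why Windows uses semicolons.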


For more advanced users: Rather than use w64devkit.exe, you could “Edit
environment variables for your account” and manually add w64devkit’s bin
directory to your PATH, making the tools generally available everywhere
on your system. If you’ve gone this route, you can start a Bourne shell at
any time with sh -l. (The -l option requests a login shell.)

Also borrowed from the unix world is the concept of a home directory,
specified by the HOME environment variable. By default this will be your
user profile directory, typically C:/Users/$USER. Login shells always
start in the home directory. This directory is often indicated by tilde
(~), and many programs automatically expand a leading tilde to the home
directory.
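
If you ever need the same behavior in your own program, tilde expansion is simple enough to do by hand. Here is a minimal sketch with a hypothetical expand_tilde function. It takes the home directory as a parameter (normally the value of HOME) and deliberately leaves ~user forms alone.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: expand a leading "~" or "~/" in path using the
// given home directory. Anything else, including "~user" forms, is
// copied unchanged. Returns 1 if the result fit in out, 0 otherwise.
int expand_tilde(const char *path, const char *home, char *out, size_t outlen)
{
    const char *prefix = "";
    if (path[0] == '~' && (path[1] == '/' || path[1] == '\0')) {
        prefix = home;
        path++;  // skip the tilde itself
    }
    size_t plen = strlen(prefix);
    size_t rlen = strlen(path);
    if (plen + rlen + 1 > outlen) {
        return 0;  // not enough room, out is left untouched
    }
    memcpy(out, prefix, plen);
    memcpy(out + plen, path, rlen + 1);  // includes the terminating null
    return 1;
}
```

Given a home directory of C:/Users/me, the path ~/bin becomes C:/Users/me/bin, while a tilde anywhere other than the first character is left as-is.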

Shell basics

The shell is a command interpreter. It’s named such because it was
originally a shell around the operating system kernel — the user
interface to the kernel. Your system’s graphical interface — Windows
Explorer, or Explorer.exe — is really just a kind of shell, too. That
shell is oriented around the mouse and graphics. This is fine for some
tasks, but a keyboard-oriented command shell is far better suited for
development tasks. It’s more efficient, but more importantly its features
are composable: Complex operations and processes can be constructed
from simple, easy-to-understand tools. Embrace it!

In the shell you can navigate between directories with cd, make
directories with mkdir, remove files with rm, run regular expression text
searches with grep, etc. Run busybox to see a listing of the available
standard commands. Unfortunately there are no manual pages, but you can
access basic usage information for any command with busybox CMD --help.

Windows’ standard command shell is cmd.exe. Unfortunately this shell is
terrible and exists mostly for legacy compatibility. The intended
replacement is PowerShell for users who regularly use a shell. However,
PowerShell is fundamentally broken, does virtually everything incorrectly,
and manages to be even worse than cmd.exe. Besides, sticking to POSIX
shell conventions significantly improves build portability, and unix tool
knowledge is transferable to basically every other operating system.

Unix’s standard shell was the Bourne shell, sh. The shells in use today
are Bourne shell clones with a superset of its features. The most popular
interactive shells are Bash and Zsh. On Linux, dash (Debian Almquist
shell) has become popular for non-interactive use (scripting). The shell
included with w64devkit is the BusyBox fork of the Almquist shell (ash),
closely related to dash. The Almquist shell has almost no non-interactive
features beyond the standard Bourne shell, and so as far as scripts are
concerned can be regarded as a plain Bourne shell clone. That’s why I
typically refer to it by the name sh.

However, BusyBox’s Almquist shell has interactive features much like Bash,
and Bash users should be quite comfortable. It’s not just tab-completion
but a slew of Emacs-like keybindings:


  Ctrl-r: search backwards in history
  Ctrl-s: search forwards in history
  Ctrl-p: previous command (Up)
  Ctrl-n: next command (Down)
  Ctrl-a: cursor to the beginning of line (Home)
  Ctrl-e: cursor to the end of line (End)
  Alt-b: cursor back one word
  Alt-f: cursor forward one word
  Ctrl-l: clear the screen
  Alt-d: delete word after the cursor
  Ctrl-w: delete the word before the cursor
  Ctrl-k: delete to the end of the line
  Ctrl-u: delete to the beginning of the line
  Ctrl-f: cursor forward one character (Right)
  Ctrl-b: cursor backward one character (Left)
  Ctrl-d: delete character under the cursor (Delete)
  Ctrl-h: delete character before the cursor (Backspace)


Take special note of Ctrl-r, which is the most important and powerful
shortcut of the bunch. Frequent use is a good habit. Don’t mash the up
arrow to search through the command history.

Special note for Cygwin and MSYS2 users: the shell is aware of Windows
paths and does not present a virtual unix file system scheme. This has
important consequences for scripting, both good and bad. The shell even
supports backslash as a directory separator, though you should of course
prefer forward slashes.

Shell customization

Login shells (-l) evaluate the contents of ~/.profile on startup. This
is your chance to customize the shell configuration, such as setting
environment variables or defining aliases and functions. For instance, if
you wanted the prompt to show the working directory in green you’d set
PS1 in your ~/.profile:

PS1="$(printf '\x1b[32;1m\\w\x1b[0m$ ')"


If you find yourself using the same command sequences or set of options
again and again, you might consider putting those commands into a script,
and then installing that script somewhere on your PATH so that you can
run it as a new command. First make a directory to hold your scripts, say
in ~/bin:

mkdir ~/bin


In ~/.profile prepend it to your PATH:

PATH="$HOME/bin;$PATH"


If you don’t want to start a fresh shell to try it out, then load the new
configuration in your current shell:

source ~/.profile


Suppose you keep getting the tar switches mixed up and you’d like to
just have an untar command that does the right thing. Create a file
named untar or untar.sh in ~/bin with these contents:

#!/bin/sh
set -e
tar -xaf "$@"


Now a command like untar something.tar.gz will extract the archive
contents.

To learn more about Bourne shell scripting, the POSIX shell command
language specification is a good reference. All of the features
listed in that document are available to your shell scripts.

Text editing

The development kit includes the powerful and popular text editor
Vim. It takes effort to learn, but is well worth the investment.
It’s packed with features, but since you only need a small number of them
on a regular basis it’s not as daunting as it might appear. Using Vim
effectively, you will write and edit text so much more quickly than
before. That includes not just code, but prose: READMEs, documentation,
etc.

(The catch: Non-modal editing will forever feel frustratingly inefficient.
That’s not because you will become unpracticed at it, or even have trouble
code switching between input styles, but because you’ll now be aware how
bad it is. Ignorance is bliss.)

Vim includes its own tutorial for absolute beginners which you can access
with the vimtutor command. It will run in the console window and guide
you through the basics in about half an hour. Do not be afraid to return
to the tutorial at any time since this is the stuff you need to know by
heart.

When it comes time to actually use Vim to write code, you can continue
writing code via the terminal interface (vim), or you can run the
graphical interface (gvim). The latter is recommended since it has some
nice quality-of-life features, but it’s not strictly necessary. When
starting the GUI, put an ampersand (&) on the command so that it runs in
the background. For instance this brings up the editor with two files open
but leaves the shell running in the foreground so you can continue using
it while you edit:

gvim main.c Makefile &


Vim’s defaults are good but imperfect. Before getting started with
actually editing code you should establish at least the following minimal
configuration in ~/_vimrc. (To understand these better, use :help to
jump to the built-in documentation.)

set hidden encoding=utf-8 shellslash
filetype plugin indent on
syntax on


The graphical interface defaults to a white background. Many people prefer
“dark mode” when editing code, so inverting this is simply a matter of
choosing a dark color scheme. Vim comes with a handful of color schemes,
around half of which have dark backgrounds. Use :colorscheme to change
it, and put it in your ~/_vimrc to persist it.

colorscheme slate


The default graphical interface includes a menu bar and tool bar. There
are better ways to accomplish all these operations, none of which require
touching the mouse, so consider removing all that junk:

set guioptions=ac


Finally, since the development kit is oriented around C and C++, here’s my
own entire Vim configuration for C which makes it obey my own style:

set cinoptions+=t0,l1,:0 cinkeys-=0#


Once you’re comfortable with the basics, the best next step is to read
Practical Vim: Edit Text at the Speed of Thought by Drew Neil.
It’s an opinionated guide to Vim that instills good habits. If you want
something cost-free to whet your appetite, check out Seven habits of
effective text editing.

Writing an application

We’ve established a shell and text editor. Next is the development
workflow for writing an actual application. Ultimately you will invoke a
compiler from within Vim, which will parse compiler messages and take you
directly to the parts of your source code that need attention. Before we
get that far, let’s start with the basics.

The classic example is the “hello world” program, which we’ll suppose is
in a file called hello.c:

#include <stdio.h>

int main(void)
{
    puts("Hello, world!");
}


While this development kit provides a version of the GNU compiler, gcc,
this guide mostly speaks of it in terms of the generic unix C compiler
name, cc. Unix-like systems install cc as an alias for the system’s
default C compiler, and w64devkit is no exception.

cc -o hello.exe hello.c


This command creates hello.exe from hello.c. Because it is not (yet?)
on your PATH, you must invoke it via a path name (i.e. the command must
include a slash); otherwise the shell will search for it via the
PATH variable. Typically this means putting ./ in front of the program
name, meaning “run the program in the current directory”. As a convenience
you do not need to include the .exe extension:

./hello


Unlike the untar shell script from before, this hello.exe is entirely
independent of w64devkit. You can share it with anyone running Windows and
they’ll be able to execute it. There’s a little bit of runtime embedded in
the executable, but the bulk of the runtime is in the operating system
itself. I want to highlight this point because most programming languages
don’t work like this, or at least doing so is unnatural with lots of
compromises. The users of your software do not need to install a runtime
or other supporting software. They just run the executable you give them!

That executable is probably pretty small, less than 50kB — basically a
miracle by today’s standards. Sure, it’s hardly doing anything right now,
but you can add a whole lot more functionality without that executable
getting much bigger. In fact, it’s entirely unoptimized right now and
could be even smaller. Passing the -Os flag tells the compiler to
optimize for size and the -s flag tells the linker to strip out unneeded
information.

cc -Os -s -o hello.exe hello.c


That cuts the program down to around a third of its previous size. If
necessary you can still do even better than this, but that’s outside the
scope of this guide.

So far the program could still be valid enough to compile but contain
obvious mistakes. The compiler can warn about many of these mistakes, and
so it’s always worth enabling these warnings. This requires two flags:
-Wall (“all” warnings) and -Wextra (extra warnings).

cc -Wall -Wextra -o hello.exe hello.c
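
As a small illustration of the kind of mistake these flags catch, here is a sketch built around a made-up helper, count_char (not part of the original guide). The comment describes a loop that compiles cleanly without warnings enabled but that GCC reports under -Wextra as a signed/unsigned comparison. The code itself shows the corrected form.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count how many times c occurs in s. */
size_t count_char(const char *s, char c)
{
    assert(s != NULL);

    size_t n = 0;
    size_t len = strlen(s);

    /* Writing "for (int i = 0; i < len; i++)" compiles, but GCC with
     * -Wextra warns about comparing a signed int with an unsigned
     * size_t. Using size_t for the index keeps both sides unsigned. */
    for (size_t i = 0; i < len; i++) {
        if (s[i] == c) {
            n++;
        }
    }
    return n;
}
```

Compiling the int-indexed variant with and without -Wall -Wextra is a quick way to see the difference the flags make.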


When you’re working on a program, you often don’t want optimization
enabled since it makes it more difficult to debug. However, some warnings
aren’t fired unless optimization is enabled. Fortunately there’s an
optimization level to resolve this, -Og (optimize for debugging).
Combine this with -g3 to embed debug information in the program. This
will be handy later.

cc -Wall -Wextra -Og -g3 -o hello.exe hello.c


These are the compiler flags you typically want to enable while developing
your software. When you distribute it, you’d use either -Os -s (optimize
for size) or -O3 -s (optimize for speed).

Makefiles

I mentioned running the compiler from Vim. This isn’t done directly but
via a special build script called a Makefile. You invoke the make program
from Vim, which invokes the compiler as above. The simplest Makefile would
look like this, in a file literally named Makefile:

hello.exe: hello.c
	cc -Wall -Wextra -Og -g3 -o hello.exe hello.c


This tells make that the file named hello.exe is derived from another
file called hello.c, and the tab-indented line is the recipe for doing
so. Running the make command will run the compiler command if and only
if hello.c is newer than hello.exe.

To run make from Vim, use the :make command inside Vim. It will not
only run make but also capture its output in an internal buffer called
the quickfix list. If there is any warning or error, Vim will jump to
it. Use :cn (next) and :cp (prev) to move between issues and correct
them, or :cc to re-display the current issue. When you’re done fixing
the issues, run :make again to start the cycle over.

Try that now by changing the printed message and recompiling from within
Vim. Intentionally create an error (bad syntax, too many arguments, etc.)
and see what happens.

Makefiles are a powerful and conventional way to build C and C++ software.
Since the development kit includes the standard set of unix utilities,
it’s very easy to write portable Makefiles that work across a variety of
operating systems and environments. Your software isn’t necessarily tied
to Windows just because you’re using a Windows-based development
environment. If you want to learn how Makefiles work and how to use them
effectively, read A Tutorial on Portable Makefiles. From here on
I’ll assume you’ve read that tutorial.

Ultimately I’d probably write my “hello world” Makefile like so:

.POSIX:
CC      = cc
CFLAGS  = -Wall -Wextra -Og -g3
LDFLAGS =
LDLIBS  =
EXE     = .exe

hello$(EXE): hello.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c $(LDLIBS)


When building a release, optimize for size or speed:

make CFLAGS=-Os LDFLAGS=-s


This is very much a Windows-first style of Makefile, but still allows it
to be comfortably used on other systems. On Linux this make invocation
strips away the .exe extension:

make EXE=


For a Windows-second Makefile, remove the line with EXE = .exe. This
allows EXE to come from the environment. So, for instance, I already
define the EXE environment variable in my w64devkit ~/.profile:

export EXE=.exe


On Linux running make does the right thing, as does running make on
Windows. No special configuration required.

If my software is truly limited to Windows, I’m likely still interested in
supporting cross-compilation. A common convention for GNU toolchains is a
CROSS Makefile macro. For example:

.POSIX:
CROSS   =
CC      = $(CROSS)gcc
CFLAGS  = -Wall -Wextra -Og -g3
LDFLAGS =
LDLIBS  =

hello.exe: hello.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c $(LDLIBS)


On Windows I just run make, but on Linux I’d set CROSS appropriately.

make CROSS=x86_64-w64-mingw32-


Navigating

What happens if you’re working on a larger program and you need to jump to
the definition of a function, macro, or variable? It would be tedious to
use grep all the time to find definitions. The development kit includes
a solid implementation of ctags for building a tags database that lists the
locations for various kinds of definitions, and Vim knows how to read this
database. Most often you’ll want to run it recursively like so:

ctags -R


You can of course do this from Vim, too: :!ctags -R

With the cursor over an identifier, press CTRL-] to jump to a definition
for that name. Use :tn and :tp to move between different definitions
(e.g. when the name is overloaded). Or if you have a tag in mind rather
than a name listed in the buffer, use the :tag command to jump by name.
Vim maintains a tag stack and jump list for going back and forth, like the
backward and forward buttons in a browser.

Debugging

I had mentioned that the -g3 option embeds extra information in the
executable. This is for debuggers, and the development kit includes the
GNU Debugger, gdb, to help you debug your programs. To use it, invoke
GDB on your executable:

gdb hello.exe


From here you can set breakpoints and such, then run the program with
start or run, then step through it line by line. See Beej’s Quick
Guide to GDB for a guide. During development, always run your
program through GDB, and never exit GDB. See also: Assertions should be
more debugger-oriented.
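
To make the "always run through GDB" advice concrete, here is a minimal sketch of an assertion that stops the program at the failing line when a debugger is attached. It assumes GCC or Clang, which both provide __builtin_trap. The macro name ASSERT and the function average are made up for this example, and this is one common approach, not necessarily exactly what the linked article prescribes.

```c
/* A debugger-friendly assertion: on failure, trap immediately so that
 * GDB stops on the offending line. Assumes GCC or Clang. */
#define ASSERT(c) do { if (!(c)) __builtin_trap(); } while (0)

/* Hypothetical example function: integer average of a non-empty array. */
int average(const int *values, int count)
{
    ASSERT(values != 0);
    ASSERT(count > 0);   /* under GDB, a failure stops right here */

    long sum = 0;
    for (int i = 0; i < count; i++) {
        sum += values[i];
    }
    return (int)(sum / count);
}
```

With the program built using -g3 and running under GDB, a failed ASSERT leaves you in the frame where the condition was violated, with locals available for inspection.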

Learning C and C++

So far this guide hasn’t actually assumed any C knowledge. One of the best
ways to learn C is by reading the highly-regarded The C Programming
Language and doing the exercises. Alternatively, cost-free options
are Beej’s Guide to C Programming and Modern C (more
advanced). You can use the development kit to go through any of these.

I’ve focused on C, but everything above also applies to C++. To learn C++,
A Tour of C++ is a safe bet.

Demonstration

To illustrate how much you can do with nothing beyond this 76MB
development kit, here’s a taste in the form of a weekend project: an
Asteroids Clone for Windows. That’s the game in the video at the
top of this guide.

The development kit doesn’t include Git so you’d need to install it
separately in order to clone the repository, but you could at least skip
that and download a .zip snapshot of the source. It has no third-party
dependencies yet it includes hardware-accelerated graphics, real-time
sound mixing, and gamepad input. Building a larger and more complex game
is much less about tooling and more about time and skill. That’s what I
mean about w64devkit being (almost) everything you need.




Well-behaved alias commands on Windows
2021-02-08T20:32:45Z
Since its inception I’ve faced a dilemma with w64devkit, my
all-in-one Mingw-w64 toolchain and development environment
distribution for Windows. A major goal of the project is no
installation: unzip anywhere and it’s ready to go as-is. However, full
functionality requires alias commands, particularly for BusyBox applets,
and the usual solutions are neither available nor viable. It seemed that
an installer was needed to assemble this last puzzle piece. This past
weekend I finally discovered a tidy and complete solution that solves this
problem for good.

That solution is a small C source file, alias.c. This article is
about why it’s necessary and how it works.

Hard and symbolic links

Some alias commands are for convenience, such as a cc alias for gcc so
that build systems need not assume any particular C compiler. Others are
essential, such as an sh alias for “busybox sh” so that it’s available
as a shell for make. These aliases are usually created with links, hard
or symbolic. A GCC installation might include (roughly) a symbolic link
created like so:

ln -s gcc cc


BusyBox looks at its argv[0] on startup, and if it names an applet
(ls, sh, awk, etc.), it behaves like that applet. Typically BusyBox
aliases are installed as hard links to the original binary, and there’s
even a busybox --install to set these up. Both kinds of aliases are
cheap and effective.

ln busybox sh
ln busybox ls
ln busybox awk
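
The argv[0] dispatch can be sketched in a few lines of C. This is an illustration of the general multi-call technique, not BusyBox's actual code: the function names, the three-entry applet table, and the case-sensitive ".exe" handling are all simplifying assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the applet name implied by argv0 into out: drop any leading
 * directories (either slash style) and a trailing ".exe". Returns out. */
char *applet_name(const char *argv0, char *out, size_t outlen)
{
    assert(argv0 != NULL && out != NULL && outlen > 0);

    /* Find the start of the base name. */
    const char *base = argv0;
    for (const char *p = argv0; *p; p++) {
        if (*p == '/' || *p == '\\') {
            base = p + 1;
        }
    }

    /* Strip a trailing ".exe", if present (case-sensitive for brevity). */
    size_t len = strlen(base);
    if (len > 4 && strcmp(base + len - 4, ".exe") == 0) {
        len -= 4;
    }

    /* Truncate if needed so the result always fits and is terminated. */
    if (len >= outlen) {
        len = outlen - 1;
    }
    memcpy(out, base, len);
    out[len] = '\0';
    return out;
}

/* Hypothetical applet table lookup: 1 if name is a known applet. */
int is_applet(const char *name)
{
    static const char *const applets[] = {"awk", "ls", "sh"};
    for (size_t i = 0; i < sizeof(applets) / sizeof(applets[0]); i++) {
        if (strcmp(name, applets[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
```

A multi-call main would pass argv[0] through applet_name, then jump to the matching applet's entry point when is_applet succeeds.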


Unfortunately links are not supported by .zip files on Windows. They’d
need to be created by a dedicated installer. As a result, I’ve strongly
recommended that users run “busybox --install” at some point to
establish the BusyBox alias commands. While w64devkit works without them,
it works better with them. Still, that’s an installation step!

An alternative option is to simply include a full copy of the BusyBox
binary for each applet — all 150 of them — simulating hard links. BusyBox
is small, around 4kB per applet on average, but it’s not quite that
small. Since the .zip format doesn’t use block compression — files are
compressed individually — this duplication will appear in the .zip itself.
My 573kB BusyBox build duplicated 150 times would double the distribution
size and increase the installation footprint by 25%. It’s not worth the
cost.

Since .zip is so limited, perhaps I should use a different distribution
format that supports links. However, another w64devkit goal is making no
assumptions about what other tools are installed. Windows natively
supports .zip, even if that support isn’t so great (poor performance, low
composability, missing features, etc.). With nothing more than the
w64devkit .zip on a fresh, offline Windows installation, you can begin
efficiently developing professional, native applications in under a
minute.

Scripts as aliases

With links off the table, the next best option is a shell script. On
unix-like systems shell scripts are an effective tool for creating complex
alias commands. Unlike links, they can manipulate the argument list. For
instance, w64devkit includes a c99 alias to invoke the C compiler
configured to use the C99 standard. To do this with a shell script:

#!/bin/sh
exec cc -std=c99 "$@"


This prepends -std=c99 to the argument list and passes through the rest
untouched via the Bourne shell’s special case "$@". Because I used
exec, the shell process becomes the compiler in place. The shell
doesn’t hang around in the background. It’s just gone. This is really quite
elegant and powerful.
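
For comparison, the argument manipulation in that two-line script looks roughly like this in C. It is only a sketch: the function name c99_argv is made up, and the strings in the new vector are borrowed from the original argv.

```c
#include <assert.h>
#include <stdlib.h>

/* Build a new NULL-terminated argument vector for a hypothetical c99
 * wrapper: {"cc", "-std=c99", argv[1], ..., argv[argc-1], NULL}.
 * Returns NULL if allocation fails. The caller frees the array itself;
 * the strings are borrowed from argv. */
char **c99_argv(int argc, char **argv)
{
    assert(argc >= 1 && argv != NULL);

    /* argc - 1 passed-through arguments, plus "cc", "-std=c99", NULL */
    char **out = malloc(((size_t)argc + 2) * sizeof(*out));
    if (out == NULL) {
        return NULL;
    }
    out[0] = "cc";
    out[1] = "-std=c99";
    for (int i = 1; i < argc; i++) {
        out[i + 1] = argv[i];
    }
    out[argc + 1] = NULL;
    return out;
}
```

On a unix-like system, main would then call execvp(out[0], out) from unistd.h, which replaces the process in place just as the shell's exec does.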

The closest available on Windows is a .bat batch file. However, like some
other parts of DOS and Windows, the Batch language was designed as though
its designer once glimpsed someone using a unix shell, perhaps looking
over their shoulder, then copied some of the ideas without understanding
them. As a result, it’s not nearly as useful or powerful. Here’s the Batch
equivalent:

@cc -std=c99 %*


The @ is necessary because Batch prints its commands by default (Bourne
shell’s -x option), and @ disables it. Windows lacks the concept of
exec(3), so the Batch file interpreter, cmd.exe, continues running alongside
the compiler. A little wasteful but that hardly matters. What does matter
though is that cmd.exe doesn’t behave itself! If you, say, Ctrl+C to
cancel compilation, you will get the infamous “Terminate batch job (Y/N)?”
prompt which interferes with other programs running in the same console.
The so-called “batch” script isn’t a batch job at all: It’s interactive.

I tried to use Batch files for BusyBox applets, but this issue came up
constantly and made this approach impractical. Nearly all BusyBox applets
are non-interactive, and lots of things break when they aren’t. Worst of
all, you can easily end up with layers of cmd.exe clobbering each other
to ask if they should terminate. It was frustrating.

The prompt is hardcoded in cmd.exe and cannot be disabled. Since so much
depends on cmd.exe remaining exactly the way it is, Microsoft will never
alter this behavior either. After all, that’s why they made PowerShell a
new, separate tool.

Speaking of PowerShell, could we use that instead? Unfortunately not:


  
(1) It’s installed by default on Windows, but is not necessarily enabled.
One of my own use cases for w64devkit involves systems where PowerShell
is disabled by policy. A common policy is it can be used interactively
but not run scripts (“Running scripts is disabled on this system”).

(2) PowerShell is not a first class citizen on Windows, and will likely
never be. Even under the friendliest policy it’s not normally possible
to put a PowerShell script on the PATH and run it by name. (I’m sure
there are ways to make this work via system-wide configuration, but
that’s off the table.)

(3) Everything in PowerShell is broken. For example, it does not support
input redirection with files, and instead you must use the cat-like
command, Get-Content, to pipe file contents. However, Get-Content
translates its input and quietly damages your data. There is no way to
disable this “feature” in the version of PowerShell that ships with
Windows, meaning it cannot accomplish the simplest of tasks. This is
just one of many ways that PowerShell is broken beyond usefulness.
  


Item (2) also affects w64devkit. It has a Bourne shell, but shell scripts
are still not first class citizens since Windows doesn’t know what to do
with them. Fixing would require system-wide configuration, antithetical to
the philosophy of the project.

Solution: compiled shell “scripts”

My working solution is inspired by an insanely clever hack used by my
favorite media player, mpv. The Windows build is strange at first
glance, containing two binaries, mpv.exe (large) and mpv.com (tiny).
Is that COM as in an old-school 16-bit DOS binary? No, that’s just
a trick that works around a Windows limitation.

The Windows technology is broken up into subsystems. Console programs run
in the Console subsystem. Graphical programs run in the Windows subsystem.
The original WSL was a subsystem. Unfortunately this design means
that a program must statically pick a subsystem, hardcoded into the binary
image. The program cannot select a subsystem dynamically. For example,
this is why Java installations have both java.exe and javaw.exe, and
Emacs has emacs.exe and runemacs.exe. Different binaries for different
subsystems.

On Linux, a program that wants to do graphics just talks to the Xorg
server or Wayland compositor. It can dynamically choose to be a terminal
application or a graphical application. Or even both at once. This is
exactly the behavior of mpv, and it faces a dilemma on Windows: With
subsystems, how can it be both?

The trick is based on the environment variable PATHEXT which tells
Windows how to prioritize executables with the same base name but
different file extensions. If I type mpv and it finds both mpv.exe and
mpv.com, which binary will run? It will be the first listed in
PATHEXT, and by default that starts with:

PATHEXT=.COM;.EXE;.BAT;...


So it will run mpv.com, which is actually a plain old PE+ .exe
in disguise. The Windows subsystem mpv.exe gets the shortcut and file
associations while Console subsystem mpv.com catches command line
invocations and serves as console liaison as it invokes the real
mpv.exe. Ingenious!

I realized I can pull a similar trick to create command aliases — not the
.com trick, but the miniature flagger program. If only I could compile
each of those Batch files to tiny, well-behaved .exe files so that it
wouldn’t rely on the badly-behaved cmd.exe…

Tiny C programs

Years ago I wrote about tiny, freestanding Windows executables.
That research paid off here since that’s exactly what I want. The alias
command program need only manipulate its command line, invoke another
program, then wait for it to finish. This doesn’t require the C library,
just a handful of kernel32.dll calls. My alias command programs can be
so small that it would no longer matter that I have 150 of them, and I get
complete control over their behavior.

To compile, I use -nostdlib and -ffreestanding to disable all system
libraries, -lkernel32 to pull that one back in, -Os (optimize for
size), and -s (strip) all to make the result as small as possible.

I don’t want to write a little program for each alias command. Instead
I’ll use a couple of C defines, EXE and CMD, to inject the target
command at compile time. So this Batch file:

@target arg1 arg2 %*


Is equivalent to this alias compilation:

gcc -DEXE="target.exe" -DCMD="target arg1 arg2" \
    -s -Os -nostdlib -ffreestanding -o alias.exe alias.c -lkernel32


The EXE string is the actual module name, so the .exe extension is
required. The CMD string replaces the first complete token of the
command line string (think argv[0]) and may contain arbitrary additional
arguments (e.g. -std=c99). Both are handled as wide strings (L"...")
since the alias program uses the wide Win32 API in order to be fully
transparent. Though unfortunately at this time it makes no difference: All
currently aliased programs use the “ANSI” API since the underlying C and
C++ standard libraries only use the ANSI API. (As far as I know, nobody
has ever written fully-functional C and C++ standard libraries for
Windows, not even Microsoft.)

You might wonder why the heck I’m gluing strings together for the
arguments. These will need to be parsed (word split, etc.) by someone
else, so shouldn’t I construct an argv array instead? That’s not how it
works on Windows: Programs receive a flat command string and are expected
to parse it themselves following the format specification. When
you write a C program, the C runtime does this for you to provide the
usual argv array.

This is upside down. The caller creating the process already has arguments
split into an argv array — or something like it — but Win32 requires the
caller to encode the argv array as a string following a special format so
that the recipient can immediately decode it. Why marshal rather than
pass structured data in the first place? Why does Win32 only supply a
decoder (CommandLineToArgvW) and not an encoder (e.g. the missing
ArgvToCommandLine)? Hey, I don’t make the rules; I just have to live
with them.
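
To give a sense of what such an encoder involves, here is a sketch that quotes a single argument following the commonly documented Microsoft C runtime parsing rules (backslashes are only special when they precede a double quote). The function name quote_arg is hypothetical, narrow strings are used for brevity, and it makes no attempt to handle cmd.exe metacharacters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode one argument into out so that the usual C runtime parser
 * decodes it back to the same string. Returns the encoded length, or
 * 0 if out is too small for the worst case. */
size_t quote_arg(const char *arg, char *out, size_t outlen)
{
    assert(arg != NULL && out != NULL);

    size_t len = strlen(arg);
    if (outlen < 2 * len + 3) {
        return 0;   /* worst case: every char escaped, two quotes, NUL */
    }

    /* Arguments without whitespace or quotes pass through unchanged. */
    if (len > 0 && strpbrk(arg, " \t\n\v\"") == NULL) {
        memcpy(out, arg, len + 1);
        return len;
    }

    size_t n = 0;
    out[n++] = '"';
    for (size_t i = 0; ; i++) {
        size_t backslashes = 0;
        while (arg[i] == '\\') {
            backslashes++;
            i++;
        }
        if (arg[i] == '\0') {
            /* Double trailing backslashes so they don't escape the
             * closing quote. */
            for (size_t k = 0; k < 2 * backslashes; k++) out[n++] = '\\';
            break;
        } else if (arg[i] == '"') {
            /* Double the backslashes, plus one more to escape the quote. */
            for (size_t k = 0; k < 2 * backslashes + 1; k++) out[n++] = '\\';
            out[n++] = '"';
        } else {
            /* Backslashes not followed by a quote are literal. */
            for (size_t k = 0; k < backslashes; k++) out[n++] = '\\';
            out[n++] = arg[i];
        }
    }
    out[n++] = '"';
    out[n] = '\0';
    return n;
}
```

A full encoder would call this for each element of argv and join the results with single spaces.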

You can look at the original source for the details, but the summary is
that I supply my own xstrlen(), xmemcpy(), and partial Win32 command
line parser — just enough to identify the first token, even if that token
is quoted. It glues the strings together, calls CreateProcessW, waits
for it to exit (WaitForSingleObject), retrieves the exit code
(GetExitCodeProcess), and exits with the same status. (The stuff that
comes for free with exec(3).)
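
The string-gluing step can be sketched in portable C. This is not the code from alias.c: the function names are made up, it uses narrow strings where the real program uses wide ones, and the tokenizer is a simplified reading of how Windows treats the first token (whitespace inside double quotes does not end it).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer just past the first token of a Windows-style command
 * line. Whitespace inside double quotes does not end the token. */
const char *skip_first_token(const char *cmdline)
{
    assert(cmdline != NULL);

    int quoted = 0;
    const char *p = cmdline;
    for (; *p; p++) {
        if (*p == '"') {
            quoted = !quoted;
        } else if (!quoted && (*p == ' ' || *p == '\t')) {
            break;
        }
    }
    return p;
}

/* Write cmd followed by everything after the first token of cmdline
 * into out. Returns 1 on success, 0 if out is too small. */
int replace_first_token(const char *cmd, const char *cmdline,
                        char *out, size_t outlen)
{
    assert(cmd != NULL && out != NULL);

    const char *rest = skip_first_token(cmdline);
    size_t cmdlen = strlen(cmd);
    size_t restlen = strlen(rest);
    if (cmdlen + restlen + 1 > outlen) {
        return 0;
    }
    memcpy(out, cmd, cmdlen);
    memcpy(out + cmdlen, rest, restlen + 1);   /* includes the terminator */
    return 1;
}
```

In the real program, the resulting string is what gets handed to CreateProcessW, followed by the WaitForSingleObject and GetExitCodeProcess calls described above.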

This all compiles to a 4kB executable, mostly padding, which is small
enough for my purposes. These compress to an acceptable 1kB each in the
.zip file. Smaller would be nicer, but this would require at minimum a
custom linker script, and even smaller would require hand-crafted
assembly.

This lingering issue solved, w64devkit now works better than ever. The
alias.c source is included in the kit in case you need to make any of
your own well-behaved alias commands.




w64devkit: (Almost) Everything You Need
2020-09-25T00:04:11Z
This article was discussed on Hacker News.

This past May I put together my own C and C++ development
distribution for Windows called w64devkit. The entire
release weighs under 80MB and requires no installation. Unzip and run it
in-place anywhere. It’s also entirely offline. It will never
automatically update, or even touch the network. In mere seconds any
Windows system can become a reliable development machine. (To further
increase reliability, disconnect it from the internet.) Despite
its simple nature and small packaging, w64devkit is almost everything
you need to develop any professional desktop application, from a
command line utility to a AAA game.



I don’t mean this in some useless Turing-complete sense, but in
a practical, get-stuff-done sense. It’s much more a matter of
know-how than of tools or libraries. So then what is this “almost”
about?


  
The distribution does not have WinAPI documentation. It’s notoriously
difficult to obtain and, besides, unfriendly to redistribution.
It’s essential for interfacing with the operating system and difficult
to work without. Even a dead tree reference book would suffice.

Depending on what you’re building, you may still need specialized
tools. For instance, game development requires tools for editing art
assets.

There is no formal source control system. Git is excluded per the
issues noted in the announcement, and my next option, Quilt,
has similar limitations. However, diff and patch are included,
and are sufficient for a kind of old-school, patch-based source
control. I’ve used it successfully when dogfooding w64devkit in a
fresh Windows installation.
  


Everything else

As I said in my announcement, w64devkit includes a powerful text editor
that fulfills all text editing needs, from code to documentation. The
editor includes a tutorial (vimtutor) and complete, built-in manual
(:help) in case you’re not yet familiar with it.

What about navigation? Use the included ctags to generate a
tags database (ctags -R), then jump instantly to any
definition at any time. No need for that Language Server Protocol
rubbish. This does not mean you must laboriously type identifiers
as you work. Use built-in completion!

Build system? That’s also covered, via a Windows-aware unix-like
environment that includes make. Learning how to use it is a
breeze. Software is by its nature unavoidably complicated, so don’t
make it more complicated than necessary.

What about debugging? Use the debugger, GDB. Performance problems? Use
the profiler, gprof. Inspect compiler output either by asking for it
(-S) or via the disassembler (objdump -d). No need to go online for
the Godbolt Compiler Explorer, as slick as it is. If the compiler
output is insufficient, use SIMD intrinsics. In the worst case
there are two different assemblers available. Real time graphics? Use an
operating system API like OpenGL, DirectX, or Vulkan.

w64devkit really is nearly everything you need in a single, no
nonsense, fully-offline package! It’s difficult to emphasize this
point as much as I’d like. When interacting with the broader software
ecosystem, I often despair that software development has lost its
way. This distribution is my way of carving out an escape from some
of the insanity. As a C and C++ toolchain, w64devkit by default produces
lean, sane, trivially-distributable, offline-friendly artifacts. All
runtime components in the distribution are static link only,
so no need to distribute DLLs with your application either.

Customize the distribution, own the toolchain

While most users would likely stick to my published releases, building
w64devkit is a two-step process with a single build dependency, Docker.
Anyone can easily customize it for their own needs. Don’t care about
C++? Toss it to shave 20% off the distribution. Need to tune the runtime
for a specific microarchitecture? Tweak the compiler flags.

One of the intended strengths of open source is users can modify
software to suit their needs. With w64devkit, you own the toolchain
itself. It is one of your dependencies after all. Unfortunately
the build initially requires an internet connection even when working
from source tarballs, but at least it’s a one-time event.

If you choose to take on dependencies, and you build those
dependencies using w64devkit, all the better! You can tweak them to your
needs and choose precisely how they’re built. You won’t be relying on
the goodwill of internet randos nor the generosity of a free package
registry.

Customization examples

Building existing software using w64devkit is probably easier than
expected, particularly since much of it has already been “ported” to
MinGW and Mingw-w64. Just don’t bother with GNU Autoconf configure
scripts. They never work in w64devkit despite having everything they
technically need. Other than that, here’s a demonstration of building
some popular software.

One of my coworkers uses his own version of PuTTY
patched to play more nicely with Emacs. If you wanted to do the same,
grab the source tarball, unpack it using the provided tools, then in the
unpacked source:

$ make -C windows -f Makefile.mgw


You’ll have a custom-built putty.exe, as well as the other tools. If you
have any patches, apply those first!

Would you like to embed an extension language in your application? Lua
is a solid choice, in part because it’s such a well-behaved dependency.
After unpacking the source tarball:

$ make PLAT=mingw


This produces a complete Lua compiler, runtime, and library. It’s not
even necessary to use the Makefile, as it’s nearly as simple as “cc
*.c” — painless to integrate or embed into any project.

Do you enjoy NetHack? Perhaps you’d like to try a few of the custom
patches. This one is a little more complicated, but I was able to
build NetHack 3.6.6 like so:

$ sys/winnt/nhsetup.bat
$ make -C src -f Makefile.gcc cc="cc -fcommon" link="cc"


NetHack has a bug necessitating -fcommon. If you have any
patches, apply them with patch before the last step. I won’t belabor it
here, but with just a little more effort I was also able to produce a
NetHack binary with curses support via PDCurses — statically-linked
of course.

How about my archive encryption tool, Enchive? The one that
even works with 16-bit DOS compilers. It requires nothing special
at all!

$ make


w64devkit can also host parts of itself: Universal Ctags, Vim, and NASM.
This means you can modify and recompile these tools without going
through the Docker build. Sadly busybox-w32 cannot host itself,
though it’s close. I’d love if w64devkit could fully host itself, and
so Docker — and therefore an internet connection and such — would only
be needed to bootstrap, but unfortunately that’s not realistic given the
state of the GNU components.

Offline and reliable

Software development has increasingly become dependent on a constant
internet connection. Robust, offline tooling and development is
undervalued.

Consider: Does your current project depend on an external service? Do
you pay for this service to ensure that it remains up? If you pull your
dependencies from a repository, how much do you trust those who maintain
the packages? Do you even know their names? What would be your
project’s fate if that service went down permanently? It will someday,
though hopefully only after your project is dead and forgotten. If you
have the ability to work permanently offline, then you already have
happy answers to all these questions.




w64devkit: a Portable C and C++ Development Kit for Windows
2020-05-15T03:43:04Z
This article was discussed on Hacker News.

As a computer engineer, my job is to use computers to solve important
problems. Ideally my solutions will be efficient, and typically that
means making the best use of the resources at hand. Quite often these
resources are machines running Windows and, despite my misgivings about
the platform, there is much to be gained by properly and effectively
leveraging it.

Sometimes targeting Windows while working from another platform
is sufficient, but other times I must work on the platform itself. There
are various options available for C development, and I’ve
finally formalized my own development kit: w64devkit.



For most users, the value is in the 78MiB .zip available in the
“Releases” on GitHub. This (relatively) small package includes a
state-of-the-art C and C++ compiler (latest GCC), a powerful
text editor, a debugger, a complete x86 assembler,
and a miniature unix environment. It’s “portable” in that there’s no
installation. Just unzip it and start using it in place. With w64devkit,
it literally takes a few seconds on any Windows system to get up and running
with a fully-featured, fully-equipped, first-class development
environment.

The development kit is cross-compiled entirely from source using Docker,
though Docker is not needed to actually use it. The repository is just a
Dockerfile and some documentation. The only build dependency is Docker
itself. It’s also easy to customize it for your own personal use, or to
audit and build your own if, for whatever reason, you didn’t trust my
distribution. This is in stark contrast to Windows builds of most open
source software where the build process is typically undocumented,
under-documented, obtuse, or very complicated.

From script to Docker

Publishing this is not necessarily a commitment to always keep w64devkit
up to date, but this Dockerfile is derived from (and replaces) a shell
script I’ve been using continuously for over two years now. In
this period, every time GCC has made a release, I’ve built myself a new
development kit, so I’m already in the habit.

I’ve been using Docker on and off for about 18 months now. It’s an
oddball in that it’s something I learned on the job rather than my own
time. I formed an early impression that still basically holds: The
main purpose of Docker is to contain and isolate misbehaved software to
improve its reliability. Well-behaved, well-designed software benefits
little from containers.

My unusual application of Docker here is no exception. Most software
builds are needlessly complicated and fragile, especially
Autoconf-based builds. Ironically, the worst configure scripts I’ve
dealt with come from GNU projects. They waste time on superfluous checks
(“Does your compiler define size_t?”) then produce a build that
doesn’t work anyway because you’re doing something slightly unusual.
Worst of all, despite my best efforts, the build will be contaminated by
the state of the system doing the build.

My original build script was fragile by extension. It would work on one
system, but not another due to some subtle environment change — a
slightly different system header that reveals a build system bug
(example in GCC), or the system doesn’t have a file at a certain
hard-coded absolute path that shouldn’t be hard-coded. Converting my
script to a Dockerfile locks these problems in place and makes builds
much more reliable and repeatable. The misbehavior is contained and
isolated by Docker.

Unfortunately it’s not completely contained. In each case I use make’s
-j option to parallelize the build since otherwise it would take
hours. Some of the builds have subtle race conditions, and some bad luck
in timing can cause a build to fail. Docker is good about picking up
where it left off, so it’s just a matter of trying again.

In one case a build failed because Bison and flex were not installed
even though they’re not normally needed. Some dependency isn’t expressed
correctly, and unlucky ordering leads to an unused .y file having the
wrong timestamp. Ugh. I’ve had this happen a lot more in Docker than
out, probably because file system operations are slow inside Docker and
it creates greater timing variance.

Other tools

The README explains some of my decisions, but I’ll summarize a few here:


  
Git. Important and useful, so I’d love to have it. But it has a weird
installation (many .zip-unfriendly symlinks) tightly-coupled
with msys2, and its build system does not support cross-compilation.
I’d love to see a clean, straightforward rewrite of Git in a single,
appropriate implementation language. Imagine installing the latest Git
with go get git-scm.com/git. (Update: libgit2 is working on
it!)

Bash. It’s a much nicer interactive shell than BusyBox-w32 ash. But
the build system doesn’t support cross-compilation, and I’m not sure
it supports Windows without some sort of compatibility layer anyway.

Emacs. Another powerful editor. But the build system doesn’t support
cross-compilation. It’s also way too big.

Go. Tempting to toss it in, but Go already does this all correctly
and effectively. It simply doesn’t require a specialized
distribution. It’s trivial to manage a complete Go toolchain with
nothing but Go itself on any system. People may say its language
design comes from the 1970s, but the tooling is decades ahead of
everyone else.
  


Alternatives

For a long, long time Cygwin filled this role for me. However, I never
liked its bulky nature, the complete opposite of portable. Cygwin
processes always felt second-class on Windows, particularly in that it
has its own view of the file system compared to other Windows processes.
They could never fully cooperate. I also don’t like that there’s no
toolchain for cross-compiling with Cygwin as a target — e.g. compile
Cygwin binaries from Linux. Finally it’s been essentially obsoleted by
WSL which matches or surpasses it on every front.

There’s msys and msys2, which are a bit lighter. However, I’m
still in an isolated, second-class environment with weird path
translation issues. These tools do have important uses, and they’re the
only way to compile most open source software natively on Windows. For
those builds that don’t support cross-compilation, they’re the only path
for producing Windows builds. It’s just not what I’m looking for when
developing my own software.

Update: llvm-mingw is an eerily similar project using Docker
the same way, but instead builds LLVM.

Using Docker for other builds

I also converted my GnuPG build script to a Dockerfile. Of
course I don’t plan to actually use GnuPG on Windows. I just need it
for passphrase2pgp, which I test against GnuPG. This tests the
Windows build.

In the future I may extend this idea to a few other tools I don’t intend
to include with w64devkit. If you have something in mind, you could use
my Dockerfiles as a kind of starter template.




Chunking Optimizations: Let the Knife Do the Work
2019-12-09T22:37:55Z
There’s an old saying, let the knife do the work. Whether
preparing food in the kitchen or whittling a piece of wood, don’t push
your weight into the knife. Not only is it tiring, you’re much more
likely to hurt yourself. Use the tool properly and little force will be
required.

The same advice also often applies to compilers.

Suppose you need to XOR two, non-overlapping 64-byte (512-bit) blocks of
data. The simplest approach would be to do it a byte at a time:

/* XOR src into dst */
void
xor512a(void *dst, void *src)
{
    unsigned char *pd = dst;
    unsigned char *ps = src;
    for (int i = 0; i < 64; i++) {
        pd[i] ^= ps[i];
    }
}


Maybe you benchmark it or you look at the assembly output, and the
results are disappointing. Your compiler did exactly what you asked
of it and produced code that performs 64 single-byte XOR operations
(GCC 9.2.0, x86-64, -Os):

xor512a:
        xor    eax, eax
.L0:    mov    cl, [rsi+rax]
        xor    [rdi+rax], cl
        inc    rax
        cmp    rax, 64
        jne    .L0
        ret


The target architecture has wide registers so it could be doing at
least 8 bytes at a time. Since your compiler isn’t doing it, you
decide to chunk the work into 8 byte blocks yourself in an attempt to
manually implement a chunking operation. Here’s some real world
code that does so:

/* WARNING: Broken, do not use! */
void
xor512b(void *dst, void *src)
{
    uint64_t *pd = dst;
    uint64_t *ps = src;
    for (int i = 0; i < 8; i++) {
        pd[i] ^= ps[i];
    }
}


You check the assembly output of this function, and it looks much
better. It’s now processing 8 bytes at a time, so it should be about 8
times faster than before.

xor512b:
        xor    eax, eax
.L0:    mov    rcx, [rsi+rax*8]
        xor    [rdi+rax*8], rcx
        inc    rax
        cmp    rax, 8
        jne    .L0
        ret


Still, this machine has 16-byte wide registers (SSE2 xmm), so there
could be another doubling in speed. Oh well, this is good enough, so you
plug it into your program. But something strange happens: The output
is now wrong!

int
main(void)
{
    uint32_t dst[32] = {
        1, 2, 3, 4, 5, 6, 7, 8,
        9, 10, 11, 12, 13, 14, 15, 16
    };
    uint32_t src[32] = {
        1, 4, 9, 16, 25, 36, 49, 64,
        81, 100, 121, 144, 169, 196, 225, 256,
    };
    xor512b(dst, src);
    for (int i = 0; i < 16; i++) {
        printf("%d\n", (int)dst[i]);
    }
}


Your program prints 1..16 as if xor512b() was never called. You check
over everything a dozen times, and you can’t find anything wrong. Even
crazier, if you disable optimizations then the bug goes away. It must be
some kind of compiler bug!

Investigating a bit more, you learn that the -fno-strict-aliasing
option also fixes the bug. That’s because this program violates C strict
aliasing rules. An array of uint32_t was accessed as a uint64_t. As
an important optimization, compilers are allowed to assume such
variables do not alias and generate code accordingly. Otherwise every
memory store could potentially modify any variable, which limits the
compiler’s ability to produce decent code.

The original version is fine because char *, including both signed
and unsigned, has a special exemption and may alias with anything. For
the same reason, using char * unnecessarily can also make your
programs slower.

What could you do to keep the chunking operation while not running afoul
of strict aliasing? Counter-intuitively, you could use memcpy(). Copy
the chunks into legitimate, local uint64_t variables, do the work, and
copy the result back out.

void
xor512c(void *dst, void *src)
{
    for (int i = 0; i < 8; i++) {
        uint64_t buf[2];
        memcpy(buf + 0, (char *)dst + i*8, 8);
        memcpy(buf + 1, (char *)src + i*8, 8);
        buf[0] ^= buf[1];
        memcpy((char *)dst + i*8, buf, 8);
    }
}


Since memcpy() is a built-in function, your compiler knows its
semantics and can ultimately elide all that copying. The assembly
listing for xor512c is identical to xor512b, but it won’t go haywire
when integrated into a real program.
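
As a sanity check, the memcpy() version can be compared against the byte-at-a-time version on the same inputs. This is only a sketch: the names xor512_bytes, xor512_chunks, and xor512_verify are made up for this illustration, and the two XOR functions are restatements of xor512a and xor512c with const added to the source pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: byte-at-a-time XOR of src into dst (same as xor512a). */
void
xor512_bytes(void *dst, const void *src)
{
    unsigned char *pd = dst;
    const unsigned char *ps = src;
    for (int i = 0; i < 64; i++) {
        pd[i] ^= ps[i];
    }
}

/* Chunked: 8 bytes at a time through memcpy() (same idea as xor512c). */
void
xor512_chunks(void *dst, const void *src)
{
    for (int i = 0; i < 8; i++) {
        uint64_t a, b;
        memcpy(&a, (char *)dst + i*8, 8);
        memcpy(&b, (const char *)src + i*8, 8);
        a ^= b;
        memcpy((char *)dst + i*8, &a, 8);
    }
}

/* Hypothetical helper: run both variants on copies of the same 64-byte
 * inputs and abort if the results disagree. */
void
xor512_verify(const void *dst, const void *src)
{
    unsigned char r0[64], r1[64];
    memcpy(r0, dst, 64);
    memcpy(r1, dst, 64);
    xor512_bytes(r0, src);
    xor512_chunks(r1, src);
    assert(!memcmp(r0, r1, 64));
}
```

Passing the uint32_t arrays from the broken program to xor512_verify() is fine, since every access goes through unsigned char or memcpy() and so never runs into strict aliasing.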

It works and it’s correct, but you can still do much better than this!

Letting your compiler do the work

The problem is you’re forcing the knife and not letting it do the work.
There’s a constraint on your compiler that hasn’t been considered: It
must work correctly for overlapping inputs.

char buf[74] = {...};
xor512a(buf, buf + 10);


In this situation, the byte-by-byte and chunked versions of the function
will have different results. That’s exactly why your compiler can’t do
the chunking operation itself. However, you don’t care about this
situation because the inputs never overlap.
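
Whether the results actually diverge depends on the direction and distance of the overlap relative to the chunk size. The sketch below picks a different overlap than the call above, purely for illustration: the destination starts one byte past the source, which is enough to make 8-byte chunks visibly disagree with the byte-at-a-time loop. The function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time XOR of 64 bytes, as in xor512a. */
void
xor_bytes(unsigned char *dst, const unsigned char *src)
{
    for (int i = 0; i < 64; i++) {
        dst[i] ^= src[i];
    }
}

/* 8-byte chunks through memcpy(), as in xor512c. */
void
xor_chunks(unsigned char *dst, const unsigned char *src)
{
    for (int i = 0; i < 8; i++) {
        uint64_t a, b;
        memcpy(&a, dst + i*8, 8);
        memcpy(&b, src + i*8, 8);
        a ^= b;
        memcpy(dst + i*8, &a, 8);
    }
}

/* XOR buf[0..63] into buf[1..64] with each variant and return 1 if the
 * two results differ. */
int
overlap_differs(void)
{
    unsigned char a[65] = {1};  /* a 1 followed by 64 zeros */
    unsigned char b[65] = {1};
    xor_bytes(a + 1, a);
    xor_chunks(b + 1, b);
    assert(a[64] == 1);  /* byte-wise: each store feeds the next load */
    assert(b[64] == 0);  /* chunked: a chunk loads 8 bytes before storing */
    return memcmp(a, b, sizeof(a)) != 0;
}
```

In the byte-wise loop the leading 1 ripples through the entire buffer, while the chunked loop only carries it one byte forward. A compiler that chunked the loop on its own would have silently changed this result, which is why it must not.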

Let’s revisit the first, simple implementation, but this time being
smarter about it. The restrict keyword indicates that the inputs
will not overlap, freeing your compiler of this unwanted concern.

void
xor512d(void *restrict dst, void *restrict src)
{
    unsigned char *pd = dst;
    unsigned char *ps = src;
    for (int i = 0; i < 64; i++) {
        pd[i] ^= ps[i];
    }
}


(Side note: Adding restrict to the manually chunked function,
xor512b(), will not fix it. Using restrict can never make an
incorrect program correct.)

Compiled with GCC 9.2.0 and -O3, the resulting unrolled code
processes 16-byte chunks at a time (pxor):

xor512d:
        movdqu  xmm0, [rdi+0x00]
        movdqu  xmm1, [rsi+0x00]
        movdqu  xmm2, [rsi+0x10]
        movdqu  xmm3, [rsi+0x20]
        pxor    xmm0, xmm1
        movdqu  xmm4, [rdi+0x30]
        movups  [rdi+0x00], xmm0
        movdqu  xmm0, [rdi+0x10]
        pxor    xmm0, xmm2
        movups  [rdi+0x10], xmm0
        movdqu  xmm0, [rdi+0x20]
        pxor    xmm0, xmm3
        movups  [rdi+0x20], xmm0
        movdqu  xmm0, [rsi+0x30]
        pxor    xmm0, xmm4
        movups  [rdi+0x30], xmm0
        ret


Compiled with Clang 9.0.0 with AVX-512 enabled in the target
(-mavx512bw), it does the entire operation in a single, big chunk!

xor512d:
        vmovdqu64   zmm0, [rdi]
        vpxorq      zmm0, zmm0, [rsi]
        vmovdqu64   [rdi], zmm0
        vzeroupper
        ret


“Letting the knife do the work” means writing a correct program and
lifting unnecessary constraints so that the compiler can use whatever
chunk size is appropriate for the target.




The Day I Fell in Love with Fuzzing
2019-01-25T21:52:45Z
Follow-up: Tips for more effective fuzz testing with AFL++

This article was discussed on Hacker News and on reddit.

In 2007 I wrote a pair of modding tools, binitools, for a space
trading and combat simulation game named Freelancer. The game
stores its non-art assets in the format of “binary INI” files, or “BINI”
files. The motivation for the binary format over traditional INI files
was probably performance: it’s faster to load and read these files than
it is to parse arbitrary text in INI format.



Much of the in-game content can be changed simply by modifying these
files — changing item names, editing commodity prices, tweaking ship
statistics, or even adding new ships to the game. The binary nature
makes them unsuitable for in-place modification, so the natural approach
is to convert them to text INI files, make the desired modifications
using a text editor, then convert back to the BINI format and replace
the file in the game’s installation.

I didn’t reverse engineer the BINI format, nor was I the first person
to create tools to edit them. The existing tools weren’t to my tastes,
and I had my own vision for how they should work — an interface more
closely following the Unix tradition despite the target being a
Windows game.

When I got started, I had just learned how to use yacc (really
Bison) and lex (really flex), as well as
Autoconf, so I went all-in with these newly-discovered tools. It was
exciting to try them out in a real-world situation, though I slavishly
aped the practices of other open source projects without really
understanding why things were the way they were. Due to the use of
yacc/lex and the configure script build, compiling the project required
a full, Unix-like environment. This is all visible in the original
version of the source.

The project was moderately successful in two ways. First, I was able to
use the tools to modify the game. Second, other people were using the
tools, since the binaries I built show up in various collections of
Freelancer modding tools online.

The Rewrite

That’s the way things were until mid-2018 when I revisited the project.
Ever look at your own old code and wonder what the heck you were
thinking? My INI format was far more rigid and strict than necessary, I
was doing questionable things when writing out binary data, and the
build wasn’t even working correctly.

With an additional decade of experience under my belt, I knew I could do
way better if I were to rewrite these tools today. So, over the course
of a few days, I did, from scratch. That’s what’s visible in the master
branch today.

I like to keep things simple which meant no more Autoconf, and
instead a simple, portable Makefile. No more yacc or lex, and
instead a hand-coded parser. Using only conforming, portable C. The
result was so simple that I can build using Visual Studio in a
single, short command, so the Makefile isn’t all that necessary. With
one small tweak (replace stdint.h with a typedef), I can even build
and run binitools in DOS.

The new version is faster, leaner, cleaner, and simpler. It’s far more
flexible about its INI input, so it’s easier to use. But is it more
correct?

Fuzzing

I’ve been interested in fuzzing for years, especially
american fuzzy lop, or afl. However, I wasn’t having success
with it. I’d fuzz some of the tools I use regularly, and it wouldn’t
find anything of note, at least not before I gave up. I fuzzed my
JSON library, and somehow it turned up nothing. Surely my
JSON parser couldn’t be that robust already, could it? Fuzzing just
wasn’t accomplishing anything for me. (As it turns out, my JSON
library is quite robust, thanks in large part to various
contributors!)

So I’ve got this relatively new INI parser, and while it can
successfully parse and correctly re-assemble the game’s original set of
BINI files, it hasn’t really been exercised that much. Surely there’s
something in here for a fuzzer to find. Plus I don’t even have to write
a line of code in order to run afl against it. The tools already read
from standard input by default, which is perfect.

Assuming you’ve got the necessary tools installed (make, gcc, afl),
here’s how easy it is to start fuzzing binitools:

$ make CC=afl-gcc
$ mkdir in out
$ echo '[x]' > in/empty
$ afl-fuzz -i in -o out -- ./bini


The bini utility takes INI as input and produces BINI as output, so
it’s far more interesting to fuzz than its inverse, unbini. Since
unbini parses relatively simple binary data, there are (probably) no
bugs for the fuzzer to find. I did try anyway just in case.



In my example above, I swapped out the default compiler for afl’s GCC
wrapper (CC=afl-gcc). It calls GCC in the background, but in doing so
adds its own instrumentation to the binary. When fuzzing, afl-fuzz
uses that instrumentation to monitor the program’s execution path. The
afl whitepaper explains the technical details.

I also created input and output directories, placing a minimal, working
example into the input directory, which gives afl a starting point. As
afl runs, it mutates a queue of inputs and observes the changes on the
program’s execution. The output directory contains the results and, more
importantly, a corpus of inputs that cause unique execution paths. In
other words, the fuzzer output will be lots of inputs that exercise many
different edge cases.

The most exciting and dreaded result is a crash. The first time I ran it
against binitools, bini had many such crashes. Within minutes, afl
was finding a number of subtle and interesting bugs in my program, which
was incredibly useful. It even discovered an unlikely stale pointer
bug by exercising different orderings for various memory
allocations. This particular bug was the turning point that made me
realize the value of fuzzing.
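
The actual bug isn’t shown here, so the following is only a hypothetical illustration of the general class: a growable buffer whose storage may move on realloc(), leaving earlier pointers into it stale. One common way to avoid the problem is to hand out offsets instead of pointers. The struct and function names below are invented for this sketch and are not from binitools.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string pool. Zero-initialize before first use. */
struct strbuf {
    char  *data;
    size_t len;
    size_t cap;
};

/* Append a NUL-terminated string. Returns its offset in the pool, or
 * (size_t)-1 if out of memory. A pointer taken from an earlier call,
 * such as b->data + offset, may go stale here since realloc() is free
 * to move the buffer. The offset stays valid. */
size_t
strbuf_push(struct strbuf *b, const char *s)
{
    size_t n = strlen(s) + 1;
    if (b->cap - b->len < n) {
        size_t cap = b->cap ? b->cap : 16;
        while (cap - b->len < n) {
            cap *= 2;
        }
        char *p = realloc(b->data, cap);
        if (!p) {
            return (size_t)-1;
        }
        b->data = p;
        b->cap  = cap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
    return b->len - n;
}

/* Resolve an offset to a pointer only at the moment of use. */
const char *
strbuf_get(const struct strbuf *b, size_t off)
{
    assert(off < b->len);
    return b->data + off;
}
```

Whether a stale pointer actually misbehaves depends on whether the allocator happens to move the buffer, which is the kind of ordering-sensitive condition a fuzzer is good at stumbling into.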

Not all the bugs it found led to crashes. I also combed through the
outputs to see what sorts of inputs were succeeding, what was failing,
and observe how my program handled various edge cases. It was rejecting
some inputs I thought should be valid, accepting some I thought should
be invalid, and interpreting some in ways I hadn’t intended. So even
after I fixed the crashing inputs, I still made tweaks to the parser to
fix each of these troublesome inputs.

Building a test suite

Once I combed out all the fuzzer-discovered bugs, and I agreed with the
parser on how all the various edge cases should be handled, I turned the
fuzzer’s corpus into a test suite — though not directly.

I had run the fuzzer in parallel — a process that is explained in the
afl documentation — so I had lots of redundant inputs. By redundant I
mean that the inputs are different but have the same execution path.
Fortunately afl has a tool to deal with this: afl-cmin, the corpus
minimization tool. It eliminates all the redundant inputs.

Second, many of these inputs were longer than necessary in order to
invoke their unique execution path. There’s afl-tmin, the test case
minimizer, which I used to further shrink my test corpus.

I sorted the valid from invalid inputs and checked them into the
repository. Have a look at all the wacky inputs invented by the
fuzzer starting from my single, minimal input:


  valid inputs
  invalid inputs


This essentially locks down the parser, and the test suite ensures a
particular build behaves in a very specific way. This is most useful
for ensuring that builds on other platforms and by other compilers are
indeed behaving identically with respect to their outputs. My test suite
even revealed a bug in diet libc, as binitools doesn’t pass the tests
when linked against it. If I were to make non-trivial changes to the
parser, I’d essentially need to scrap the current test suite and start
over, having afl generate an entire new corpus for the new parser.

Fuzzing has certainly proven itself to be a powerful technique. It found
a number of bugs that I likely wouldn’t have otherwise discovered on my
own. I’ve since gotten more savvy on its use and have used it on other
software — not just software I’ve written myself — and discovered more
bugs. It’s got a permanent slot on my software developer toolbelt.




The Value of Undefined Behavior
2018-07-20T21:31:18Z
In several places, the C and C++ language specifications use a
curious, and fairly controversial, phrase: undefined behavior. For
certain program constructs, the specification prescribes no specific
behavior, instead allowing anything to happen. Such constructs
are considered erroneous, and so the result depends on the particulars
of the platform and implementation. The original purpose of undefined
behavior was for implementation flexibility. In other words, it’s
slack that allows a compiler to produce appropriate and efficient code
for its target platform.

Specifying a particular behavior would have put unnecessary burden on
implementations — especially in the earlier days of computing — making
for inefficient programs on some platforms. For example, if the result
of dereferencing a null pointer was defined to trap — to cause the
program to halt with an error — then platforms that do not have
hardware trapping, such as those without virtual memory, would be
required to instrument, in software, each pointer dereference.

In the 21st century, undefined behavior has taken on a somewhat
different meaning. Optimizers use it — or abuse it depending on your
point of view — to lift constraints that would otherwise
inhibit more aggressive optimizations. It’s not so much a
fundamentally different application of undefined behavior, but it does
take the concept to an extreme.

The reasoning works like this: A program that evaluates a construct
whose behavior is undefined cannot, by definition, have any meaningful
behavior, and so that program would be useless. As a result,
compilers assume programs never invoke undefined behavior and
use those assumptions to justify their optimizations.
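
A small, commonly cited illustration of this reasoning is shown below. It is a sketch with made-up function names, and whether a particular compiler performs the simplification is not guaranteed. The assert() merely documents the assumed contract that the caller never passes INT_MAX.

```c
#include <assert.h>
#include <limits.h>

/* If x + 1 overflowed, the behavior would be undefined, so a compiler
 * may assume it never does and reduce the whole body to "return 1". */
int
succ_is_greater(int x)
{
    assert(x < INT_MAX);  /* contract: x + 1 must not overflow */
    return x + 1 > x;
}

/* Defined for every input, since no arithmetic can overflow. */
int
succ_is_greater_safe(int x)
{
    return x < INT_MAX;
}
```

The two functions agree on every input for which the first one is defined. They only part ways on an input where the first has no defined behavior at all.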

Under this newer interpretation, mistakes involving undefined behavior
are more punishing and surprising than before. Programs
that seem to make some sense when run on a particular architecture may
actually compile into a binary with a security vulnerability due to
conclusions reached from an analysis of its undefined behavior.

This can be frustrating if your programs are intended to run on a very
specific platform. In this situation, all behavior really could be
locked down and specified in a reasonable, predictable way. Such a
language would be like an extended, less portable version of C or C++.
But your toolchain still insists on running your program on the
abstract machine rather than the hardware you actually care about.
However, even in this situation undefined behavior can still be
desirable. I will provide a couple of examples in this article.

Signed integer overflow

To start things off, let’s look at one of my all time favorite examples
of useful undefined behavior, a situation involving signed integer
overflow. The result of a signed integer overflow isn’t just
unspecified, it’s undefined behavior. Full stop.

This goes beyond a simple matter of whether or not the underlying
machine uses a two’s complement representation. From the perspective of
the abstract machine, just the act of a signed integer overflowing is
enough to throw everything out the window, even if the overflowed result
is never actually used in the program.

On the other hand, unsigned integer overflow is defined — or, more
accurately, defined to wrap, not overflow. Both the undefined signed
overflow and defined unsigned overflow are useful in different
situations.
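
As an example of the defined, wrapping side being useful, hash functions routinely depend on it. Here is a sketch of 32-bit FNV-1a, chosen for this illustration and not taken from the text above, where the multiplication is expected to wrap modulo 2^32. It assumes the usual case where uint32_t is unsigned int, so no promotion to a signed type occurs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a over a byte buffer. */
uint32_t
fnv1a32(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t h = 2166136261u;   /* FNV offset basis */
    assert(p || !len);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;         /* FNV prime; the product wraps mod 2^32 */
    }
    return h;
}
```

With signed arithmetic the same multiplication would overflow almost immediately, so unsigned is the right tool when wrapping is part of the algorithm.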

For example, here’s a fairly common situation, much like what actually
happened in bzip2. Consider this function that does substring
comparison:

int
cmp_signed(int i1, int i2, unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}

int
cmp_unsigned(unsigned i1, unsigned i2, unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}


In this function, the indices i1 and i2 will always be some small,
non-negative value. Since it’s non-negative, it should be unsigned,
right? Not necessarily. That puts an extra constraint on code generation
and, at least on x86-64, makes for a less efficient function. Most of
the time you actually don’t want overflow to be defined, and instead
allow the compiler to assume it just doesn’t happen.

The constraint is that the behavior of i1 or i2 overflowing as an
unsigned integer is defined, and the compiler is obligated to implement
that behavior. On x86-64, where int is 32 bits, the result of the
operation must be truncated to 32 bits one way or another, requiring
extra instructions inside the loop.

In the signed case, incrementing the integers cannot overflow since that
would be undefined behavior. This permits the compiler to perform the
increment only in 64-bit precision without truncation if it would be
more efficient, which, in this case, it is.

Here’s the output of Clang 6.0.0 with -Os on x86-64. Pay close
attention to the main loop, which I named .loop:

cmp_signed:
        movsxd rdi, edi             ; use i1 as a 64-bit integer
        mov    al, [rdx + rdi]
        movsxd rsi, esi             ; use i2 as a 64-bit integer
        mov    cl, [rdx + rsi]
        jmp    .check

.loop:  mov    al, [rdx + rdi + 1]
        mov    cl, [rdx + rsi + 1]
        inc    rdx                  ; increment only the base pointer
.check: cmp    al, cl
        je     .loop

        movzx  eax, al
        movzx  ecx, cl
        sub    eax, ecx             ; return c1 - c2
        ret

cmp_unsigned:
        mov    eax, edi
        mov    al, [rdx + rax]
        mov    ecx, esi
        mov    cl, [rdx + rcx]
        cmp    al, cl
        jne    .ret
        inc    edi
        inc    esi

.loop:  mov    eax, edi             ; truncated i1 overflow
        mov    al, [rdx + rax]
        mov    ecx, esi             ; truncated i2 overflow
        mov    cl, [rdx + rcx]
        inc    edi                  ; increment i1
        inc    esi                  ; increment i2
        cmp    al, cl
        je     .loop

.ret:   movzx  eax, al
        movzx  ecx, cl
        sub    eax, ecx
        ret


As unsigned values, i1 and i2 can overflow independently, so they
have to be handled as independent 32-bit unsigned integers. As signed
values they can’t overflow, so they’re treated as if they were 64-bit
integers and, instead, the pointer, buf, is incremented without
concern for overflow. The signed loop is much more efficient (5
instructions versus 8).

The signed integer helps to communicate the narrow contract of the
function — the limited range of i1 and i2 — to the compiler. In a
variant of C where signed integer overflow is defined (i.e. -fwrapv),
this capability is lost. In fact, using -fwrapv deoptimizes the signed
version of this function.

Side note: Using size_t (an unsigned integer) is even better on x86-64
for this example since it’s already 64 bits and the function doesn’t
need the initial sign/zero extension. However, this might simply move
the sign extension out to the caller.
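
For reference, a size_t variant might look like the sketch below. The name cmp_size and the const qualifier are additions for this illustration, and the assert() reflects an assumed contract: with equal indices the loop would never find a difference.

```c
#include <assert.h>
#include <stddef.h>

int
cmp_size(size_t i1, size_t i2, const unsigned char *buf)
{
    assert(i1 != i2);  /* assumed contract: equal indices never differ */
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}
```

The indices are already pointer-sized here, so any wrapping happens at the full register width and needs no extra truncation inside the loop.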

Strict aliasing

Another controversial undefined behavior is strict aliasing.
This particular term doesn’t actually appear anywhere in the C
specification, but it’s the popular name for C’s aliasing rules. In
short, variables with types that aren’t compatible are not allowed to
alias through pointers.

Here’s the classic example:

int
foo(int *a, int *b)
{
    *b = 0;    // store
    *a = 1;    // store
    return *b; // load
}


Naively one might assume the return *b could be optimized to a simple
return 0. However, since a and b have the same type, the compiler
must consider the possibility that they alias — that they point to the
same place in memory — and must generate code that works correctly under
these conditions.

If foo has a narrow contract that forbids a and b to alias, we
have a couple of options for helping our compiler.

First, we could manually resolve the aliasing issue by returning 0
explicitly. In more complicated functions this might mean making local
copies of values, working only with those local copies, then storing the
results back before returning. Then aliasing would no longer matter.

int
foo(int *a, int *b)
{
    *b = 0;
    *a = 1;
    return 0;
}
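
In a slightly larger function the same idea looks like the following sketch, with names invented for illustration. Since out and scale point to the same type, each store to out[i] might modify *scale, so the first version may reload it on every iteration. The second version copies it into a local up front, which is only equivalent under the narrow contract that scale does not point into out.

```c
#include <assert.h>

/* *scale may be reloaded after every store to out[i]. */
void
scale_reload(int *out, const int *in, int n, const int *scale)
{
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * *scale;
    }
}

/* Local copy: stores to out[] can no longer affect the multiplier. */
void
scale_local(int *out, const int *in, int n, const int *scale)
{
    assert(n >= 0);
    int s = *scale;
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * s;
    }
}
```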


Second, C99 introduced a restrict qualifier to communicate to the
compiler that pointers passed to functions cannot alias. For example,
the pointer parameters of memcpy() are qualified with restrict as of C99.
Passing aliasing pointers through restrict parameters is undefined
behavior, i.e. this doesn’t ever happen as far as a compiler is
concerned.

int foo(int *restrict a, int *restrict b);


The third option is to design an interface that uses incompatible
types, exploiting strict aliasing. This happens all the time, usually
by accident. For example, int and long are never compatible even
when they have the same representation.

int foo(int *a, long *b);


If you use an extended or modified version of C without strict
aliasing (-fno-strict-aliasing), then the compiler must assume
everything aliases all the time, generating a lot more precautionary
loads than necessary.

What irritates a lot of people is that compilers will still
apply the strict aliasing rule even when it’s trivial for the compiler
to prove that aliasing is occurring:

/* note: forbidden */
long a;
int *b = (int *)&a;


It’s not just a simple matter of making exceptions for these cases.
The language specification would need to define all the rules about
when and where incompatible types are permitted to alias, and
developers would have to understand all these rules if they wanted to
take advantage of the exceptions. It can’t just come down to trusting
that the compiler is smart enough to see the aliasing when it’s
sufficiently simple. It would need to be carefully defined.

Besides, there are probably conforming, portable solutions
that, with contemporary compilers, will safely compile to the efficient
code you actually want anyway.
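
One such conforming solution is memcpy(). As a sketch, here is how to read the bits of a float without aliasing it through an incompatible pointer. The function name is made up, and the bit patterns in any particular use assume float is IEEE 754 binary32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the object representation of f as a 32-bit integer. */
uint32_t
float_bits(float f)
{
    uint32_t u;
    assert(sizeof(u) == sizeof(f));
    memcpy(&u, &f, sizeof(u));  /* typically compiles to a register move */
    return u;
}
```

Compilers know the semantics of memcpy(), so a small fixed-size copy like this is usually optimized away entirely, though that is a quality-of-implementation matter and not a guarantee.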

There is one special exception for strict aliasing: char * is
allowed to alias with anything. This is important to keep in mind both
when you intentionally want aliasing, but also when you want to avoid
it. Writing through a char * pointer could force the compiler to
generate additional, unnecessary loads.

In fact, there’s a whole dimension to strict aliasing that, even today,
no compiler yet exploits: uint8_t is not necessarily unsigned char.
That’s just one possible typedef definition for it. It could instead
typedef to, say, some internal __byte type.

In other words, technically speaking, uint8_t does not have the strict
aliasing exemption. If you wanted to write bytes to a buffer without
worrying the compiler about aliasing issues with other pointers, this
would be the tool to accomplish it. Unfortunately there’s so much
existing code that violates this part of strict aliasing that no
toolchain is willing to exploit it for optimization purposes.

Other undefined behaviors

Some kinds of undefined behavior don’t have performance or portability
benefits. They’re only there to make the compiler’s job a little
simpler. Today, most of these are caught trivially at compile time as
syntax or semantic issues (e.g. a pointer cast to a float).

Some others are obvious about their performance benefits and don’t
require much explanation. For example, it’s undefined behavior to
index out of bounds (with some special exceptions for one past the
end), meaning compilers are not obligated to generate those checks,
instead relying on the programmer to arrange, by whatever means, that
it doesn’t happen.

Undefined behavior is like nitro, a dangerous, volatile substance that
makes things go really, really fast. You could argue that it’s too
dangerous to use in practice, but the aggressive use of undefined
behavior is not without merit.




Building and Installing Software in $HOME
2017-06-19T02:34:39Z
For more than 5 years now I’ve kept a private “root” filesystem within
my home directory under $HOME/.local/. Within are the standard
/usr directories, such as bin/, include/, lib/, etc.,
containing my own software, libraries, and man pages. These are
first-class citizens, indistinguishable from the system-installed
programs and libraries. With one exception (setuid programs), none of
this requires root privileges.

Installing software in $HOME serves two important purposes, both of
which are indispensable to me on a regular basis.


  No root access: Sometimes I’m using a system administered by
someone else, and I don’t have root access.


This prevents me from installing packaged software myself through the
system’s package manager. Building and installing the software myself in
my home directory, without involvement from the system administrator,
neatly works around this issue. As a software developer, it’s already
perfectly normal for me to build and run custom software, and this is
just an extension of that behavior.

In the most desperate situation, all I need from the sysadmin is a
decent C compiler and at least a minimal POSIX environment. I can
bootstrap anything I might need, both libraries and
programs, including a better C compiler along the way. This is one
major strength of open source software.

I have noticed one alarming trend: Both GCC (since 4.8) and Clang are
written in C++, so it’s becoming less and less reasonable to bootstrap
a C++ compiler from a C compiler, or even from a C++ compiler that’s
more than a few years old. So you may also need your sysadmin to
supply a fairly recent C++ compiler if you want to bootstrap an
environment that includes C++. I’ve had to avoid some C++ software
(such as CMake) for this reason.


  Custom software builds: Even if I am root, I may still want to
install software not available through the package manager, a version
not available in the package manager, or a version with custom
patches.


In theory this is what /usr/local is all about. It’s typically the
location for software not managed by the system’s package manager.
However, I think it’s cleaner to put this in $HOME/.local, so long
as other system users don’t need it.

For example, I have an installation of each version of Emacs from
24.3 (the oldest version worth supporting) through the latest stable
release, each suffixed with its version number, under $HOME/.local.
This is useful for quickly running a test suite under different
releases.

$ git clone https://github.com/skeeto/elfeed
$ cd elfeed/
$ make EMACS=emacs24.3 clean test
...
$ make EMACS=emacs25.2 clean test
...


Another example is NetHack, which I prefer to play with a couple of
custom patches (Menucolors, wchar). The install to
$HOME/.local is also captured as a patch.

$ tar xzf nethack-343-src.tar.gz
$ cd nethack-3.4.3/
$ patch -p1 < ~/nh343-menucolor.diff
$ patch -p1 < ~/nh343-wchar.diff
$ patch -p1 < ~/nh343-home-install.diff
$ sh sys/unix/setup.sh
$ make -j$(nproc) install


Normally NetHack wants to be setuid (e.g. run as the “games” user) in
order to restrict access to high scores, saves, and bones — saved levels
where a player died, to be inserted randomly into other players’ games.
This prevents cheating, but requires root to set up. Fortunately, when I
install NetHack in my home directory, this isn’t a feature I actually
care about, so I can ignore it.

Mutt is in a similar situation, since it wants to install a
special setgid program (mutt_dotlock) that synchronizes mailbox
access. All MUAs need something like this.

Everything described below is relevant to basically any modern
unix-like system: Linux, BSD, etc. I personally install software in
$HOME across a variety of systems and, fortunately, it mostly works
the same way everywhere. This is probably in large part due to
everyone standardizing around the GCC and GNU binutils interfaces,
even if the system compiler is actually LLVM/Clang.

Configuring for $HOME installs

Out of the box, installing things in $HOME/.local won’t do anything
useful. You need to set up some environment variables in your shell
configuration (e.g. .profile, .bashrc) to tell various
programs, such as your shell, about it. The most obvious variable is
$PATH:

export PATH=$HOME/.local/bin:$PATH


Notice I put it in the front of the list. This is because I want my
home directory programs to override system programs with the same
name. For what other reason would I install a program with the same
name if not to override the system program?

In the simplest situation this is good enough, but in practice you’ll
probably need to set a few more things. If you install libraries in
your home directory and expect to use them just as if they were
installed on the system, you’ll need to tell the compiler where else
to look for those headers and libraries, both for C and C++.

export C_INCLUDE_PATH=$HOME/.local/include
export CPLUS_INCLUDE_PATH=$HOME/.local/include
export LIBRARY_PATH=$HOME/.local/lib


The first two are like the -I compiler option and the third is like
the -L linker option, except you usually won’t need to use them
explicitly. Unfortunately LIBRARY_PATH doesn’t override the system
library paths, so in some cases, you will need to explicitly set
-L. Otherwise you will still end up linking against the system library
rather than the custom packaged version. I really wish GCC and Clang
didn’t behave this way.

Some software uses pkg-config to determine its compiler and linker
flags, and your home directory will contain some of the needed
information. So set that up too:

export PKG_CONFIG_PATH=$HOME/.local/lib/pkgconfig


Run-time linker

Finally, when you install libraries in your home directory, the run-time
dynamic linker will need to know where to find them. There are three
ways to deal with this:


  The crude, easy way: LD_LIBRARY_PATH.
  The elegant, difficult way: ELF runpath.
  Screw it, just statically link the bugger. (Not always possible.)


For the crude way, point the run-time linker at your lib/ and you’re
done:

export LD_LIBRARY_PATH=$HOME/.local/lib


However, this is like using a shotgun to kill a fly. If you install a
library in your home directory that is also installed on the system,
and then run a system program, it may be linked against your library
rather than the library installed on the system as was originally
intended. This could have detrimental effects.

The precision method is to set the ELF “runpath” value. It’s like a
per-binary LD_LIBRARY_PATH. The run-time linker uses this path first
in its search for libraries, and it will only have an effect on that
particular program/library. This also applies to dlopen().

Some software will configure the runpath by default in their build
system, but often you need to configure this yourself. The simplest way
is to set the LD_RUN_PATH environment variable when building software.
Another option is to manually pass -rpath options to the linker via
LDFLAGS. It’s used directly like this:

$ gcc -Wl,-rpath=$HOME/.local/lib -o foo bar.o baz.o -lquux


Verify with readelf:

$ readelf -d foo | grep runpath
Library runpath: [/home/username/.local/lib]


ELF supports a special $ORIGIN “variable” set to the binary’s
location. This allows the program and associated libraries to be
installed anywhere without changes, so long as they have the same
relative position to each other. (Note the quotes to prevent shell
interpolation.)

$ gcc -Wl,-rpath='$ORIGIN/../lib' -o foo bar.o baz.o -lquux


There is one situation where runpath won’t work: when you want a
system-installed program to find a home directory library with
dlopen() — e.g. as an extension to that program. You either need to
ensure it uses a relative or absolute path (i.e. the argument to
dlopen() contains a slash) or you must use LD_LIBRARY_PATH.
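
As a rough sketch of the first option, the program (or its extension
loader) can build an absolute path under $HOME/.local/lib and hand
that to dlopen(). The function names here are hypothetical, and the
fixed directory layout is an assumption. On older glibc releases this
needs to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Build "$HOME/.local/lib/<name>" in buf. Returns 0 on success, -1 if
 * HOME is unset or the result would not fit. (Hypothetical helper.) */
int
home_library_path(char *buf, size_t size, const char *name)
{
    const char *home = getenv("HOME");
    if (!home)
        return -1;
    int n = snprintf(buf, size, "%s/.local/lib/%s", home, name);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Load a home directory library by absolute path. Because the path
 * contains a slash, dlopen() treats it as a pathname and opens that
 * file instead of searching the runpath or LD_LIBRARY_PATH. Returns
 * NULL on failure. */
void *
open_home_library(const char *name)
{
    char path[4096];
    if (home_library_path(path, sizeof(path), name) != 0)
        return NULL;
    return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}
```

A caller would use something like open_home_library("libquux.so") and
check for NULL, consulting dlerror() for the reason.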

Personally, I always use the Worse is Better LD_LIBRARY_PATH
shotgun. Occasionally it’s caused some annoying issues, but the vast
majority of the time it gets the job done with little fuss. This is
just my personal development environment, after all, not a production
server.

Manual pages

Another potentially tricky issue is man pages. When a program or
library installs a man page in your home directory, it would certainly
be nice to access it with man just as if it were installed on
the system. Fortunately, Debian and Debian-derived systems, using a
mechanism I haven’t yet figured out, discover home directory man pages
automatically without any assistance. No configuration needed.

It’s more complicated on other systems, such as the BSDs. You’ll need to
set the MANPATH variable to include $HOME/.local/share/man. It’s
unset by default and it overrides the system settings, which means you
need to manually include the system paths. The manpath program can
help with this … if it’s available.

export MANPATH=$HOME/.local/share/man:$(manpath)


I haven’t figured out a portable way to deal with this issue, so I
mostly ignore it.

How to install software in $HOME

While I’ve pooh-poohed autoconf in the past, the standard
configure script usually makes it trivial to build and install
software in $HOME. The key ingredient is the --prefix option:

$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install


Most of the time it’s that simple! If you’re linking against your own
libraries and want to use runpath, it’s a little more complicated:

$ ./configure --prefix=$HOME/.local \
              LDFLAGS="-Wl,-rpath=$HOME/.local/lib"


For CMake, there’s CMAKE_INSTALL_PREFIX:

$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..


The CMake builds I’ve seen use ELF runpath by default, and no further
configuration may be required to make that work. I’m sure that’s not
always the case, though.

Some software is just a single, static, standalone binary with
everything baked in. It doesn’t need to be given a prefix, and
installation is as simple as copying the binary into place. For example,
Enchive works like this:

$ git clone https://github.com/skeeto/enchive
$ cd enchive/
$ make
$ cp enchive ~/.local/bin


Some software uses its own unique configuration interface. I can respect
that, but it does add some friction for users who now have something
additional and non-transferable to learn. I demonstrated a NetHack build
above, which has a configuration much more involved than it really
should be. Another example is LuaJIT, which uses make variables that
must be provided consistently on every invocation:

$ tar xzf LuaJIT-2.0.5.tar.gz
$ cd LuaJIT-2.0.5/
$ make -j$(nproc) PREFIX=$HOME/.local
$ make PREFIX=$HOME/.local install


(You can use the “install” target to both build and install, but I
wanted to illustrate the repetition of PREFIX.)

Some libraries aren’t so smart about pkg-config and need some
handholding — for example, ncurses. I mention it because
it’s required for both Vim and Emacs, among many others, so I’m often
building it myself. It ignores --prefix and needs to be told a
second time where to install things:

$ ./configure --prefix=$HOME/.local \
              --enable-pc-files \
              --with-pkg-config-libdir=$PKG_CONFIG_PATH


Another issue is that a whole lot of software has been hardcoded for
ncurses 5.x (i.e. ncurses5-config), and it requires hacks/patching
to make it behave properly with ncurses 6.x. I’ve avoided ncurses 6.x
for this reason.

Learning through experience

I could go on and on like this, discussing the quirks for the various
libraries and programs that I use. Over the years I’ve gotten used to
many of these issues, committing the solutions to memory.
Unfortunately, even within the same version of a piece of software,
the quirks can change between major operating system
releases, so I’m continuously learning my way around new
issues. It’s really given me an appreciation for all the hard work
that package maintainers put into customizing and maintaining software
builds to fit properly into a larger ecosystem.




The Vulgarness of Abbreviated Function Templates
2016-10-02T23:59:59Z
The auto keyword has been a part of C and C++ since the very
beginning, originally as one of the four storage class specifiers:
auto, register, static, and extern. An auto variable has
“automatic storage duration,” meaning it is automatically allocated at
the beginning of its scope and deallocated at the end. It’s the
default storage class for any variable without external linkage or
without static storage, so the vast majority of variables in a
typical C program are automatic.

In C and C++ prior to C++11, the following definitions are
equivalent because the auto is implied.

int
square(int x)
{
    int x2 = x * x;
    return x2;
}

int
square(int x)
{
    auto int x2 = x * x;
    return x2;
}


As a holdover from really old school C, unspecified types in C are
implicitly int, and even today you can get away with weird stuff
like this:

/* C only */
square(x)
{
    auto x2 = x * x;
    return x2;
}


By “get away with” I mean in terms of the compiler accepting this as
valid input. Your co-workers, on the other hand, may become violent.

Like register, as a storage class auto is an historical artifact
without direct practical use in modern code. However, as a concept
it’s indispensable for the specification. In practice, automatic
storage means the variable lives on “the” stack (or one of the
stacks), but the specifications make no mention of a
stack. In fact, the word “stack” doesn’t appear even once. Instead
it’s all described in terms of “automatic storage,” rightfully leaving
the details to the implementations. A stack is the most sensible
approach the vast majority of the time, particularly because it’s both
thread-safe and re-entrant.
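
To make the re-entrancy point concrete, here is a small sketch
contrasting the two storage durations. The function names are made up
for illustration.

```c
/* Each call has its own 'below' object, so nested (or concurrent)
 * calls cannot clobber one another. */
int
depth_auto(int n)
{
    int below = 0;              /* automatic storage, one per call */
    if (n > 0)
        below = depth_auto(n - 1);
    return below + 1;
}

/* By contrast, every call shares the single 'calls' object, which
 * persists between calls and makes the function non-re-entrant. */
int
count_calls(void)
{
    static int calls = 0;       /* static storage, one per program */
    return ++calls;
}
```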

C++11 Type Inference

One of the major changes in C++11 was repurposing the auto keyword,
moving it from a storage class specifier to a type specifier. In
C++11, the compiler infers the type of an auto variable from its
initializer. In C++14, it’s also permitted for a function’s return
type, inferred from the return statement.

This new specifier is very useful in idiomatic C++ with its
ridiculously complex types. Transient variables, such as variables
bound to iterators in a loop, don’t need a redundant type
specification. It keeps code DRY (“Don’t Repeat Yourself”). It also
makes templates easier to write, since the compiler does more of the
work. The necessary type information is already semantically present,
and the compiler is a lot better at dealing with it.

With this change, the following is valid in both C and C++11, and, by
sheer coincidence, has the same meaning, but for entirely different
reasons.

int
square(int x)
{
    auto x2 = x * x;
    return x2;
}


In C the type is implied as int, and in C++11 the type is inferred
from the type of x * x, which, in this case, is int. The prior
example with auto int x2, valid in C++98 and C++03, is no longer
valid in C++11 since auto and int are redundant type specifiers.

Occasionally I wish I had something like auto in C. If I’m writing a
for loop from 0 to n, I’d like the loop variable to be the same
type as n, even if I decide to change the type of n in the future.
For example,

struct foo *foo = foo_create();
for (int i = 0; i < foo->n; i++)
    /* ... */;


The loop variable i should be the same type as foo->n. If I decide
to change the type of foo->n in the struct definition, I’d have to
find and update every loop. The idiomatic C solution is to typedef
the integer, using the new type both in the struct and in loops, but I
don’t think that’s much better.
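
For reference, the typedef approach looks roughly like this. The
foo_count name and the values field are hypothetical, since the real
struct isn’t shown here.

```c
/* One name for the count type. Changing this line updates the struct
 * field and every loop that uses foo_count. */
typedef int foo_count;

struct foo {
    foo_count n;
    const double *values;
};

/* Sum all values. The loop variable always has the type of foo->n. */
double
foo_sum(const struct foo *foo)
{
    double sum = 0.0;
    for (foo_count i = 0; i < foo->n; i++)
        sum += foo->values[i];
    return sum;
}
```

The cost is one more name to remember, and nothing stops a loop
elsewhere from quietly using plain int anyway.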

Abbreviated Function Templates

Why is all this important? Well, I was recently reviewing some C++ and
came across this odd specimen. I’d never seen anything like it before.
Notice the use of auto for the parameter types.

void
set_odd(auto first, auto last, const auto &x)
{
    bool toggle = false;
    for (; first != last; first++, toggle = !toggle)
        if (toggle)
            *first = x;
}


Given the other uses of auto as a type specifier, this kind of makes
sense, right? The compiler infers the type from the input argument.
But, as you should often do, put yourself in the compiler’s shoes for
a moment. Given this function definition in isolation, can you
generate any code? Nope. The compiler needs to see the call site
before it can infer the type. Even more, different call sites may use
different types. That sounds an awful lot like a template, eh?

template<typename T, typename V>
void
set_odd(T first, T last, const V &x)
{
    bool toggle = false;
    for (; first != last; first++, toggle = !toggle)
        if (toggle)
            *first = x;
}


This is a proposed feature called abbreviated function
templates, part of C++ Extensions for Concepts. It’s
intended to be shorthand for the template version of the function. GCC
4.9 implements it as an extension, which is why the author was unaware
of its unofficial status. In March 2016 it was established that
abbreviated function templates would not be part of
C++17, but may still appear in a future revision.

Personally, I find this use of auto to be vulgar. It overloads the
keyword with a third definition. This isn’t unheard of — static also
serves a number of unrelated purposes — but while similar to the
second form of auto (type inference), this proposed third form is
very different in its semantics (far more complex) and overhead
(potentially very costly). I’m glad it’s been rejected so far.
Templates better reflect the nature of this sort of code.




Automatic Deletion of Incomplete Output Files
2016-08-07T02:00:37Z
Conventionally, a program that creates an output file will delete its
incomplete output should an error occur while writing the file. It’s
risky to leave behind a file that the user may rightfully confuse for
a valid file. They might not have noticed the error.

For example, compression programs such as gzip, bzip2, and xz when
given a compressed file as an argument will create a new file with the
compression extension removed. They write to this file as the
compressed input is being processed. If the compressed stream contains
an error in the middle, the partially-completed output is removed.

There are exceptions of course, such as programs that download files
over a network. The partial result has value, especially if the
transfer can be continued from where it left off. The
convention is to append another extension, such as “.part”, to
indicate a partial output.

The straightforward solution is to always delete the file as part of
error handling. A non-interactive program would report the error on
standard error, delete the file, and exit with an error code. However,
there are at least two situations where error handling would be unable
to operate: unhandled signals (usually including a segmentation fault)
and power failures. A partial or corrupted output file will be left
behind, possibly looking like a valid file.

A common, more complex approach is to name the file differently from
its final name while being written. If written successfully, the
completed file is renamed into place. This is already required for
durable replacement, so it’s basically free for many
applications. In the worst case, where the program is unable to clean
up, the obviously incomplete file is left behind only wasting space.
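
A minimal sketch of this approach using only the standard C library
follows. The function name and the ".tmp" suffix are assumptions for
illustration. It doesn’t call fsync(), so it isn’t durable on its own,
and as far as I know the Windows C runtime’s rename() refuses to
replace an existing file, so that case needs separate handling.

```c
#include <stdio.h>

/* Write len bytes to "<path>.tmp", then rename it to path. Returns 0
 * on success. On failure the temporary file is removed and any
 * existing file at path is left alone. */
int
save_via_rename(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;

    int ok = fwrite(data, 1, len, f) == len;
    ok = fflush(f) == 0 && ok;
    ok = fclose(f) == 0 && ok;  /* always close, even after an error */

    if (ok && rename(tmp, path) == 0)
        return 0;
    remove(tmp);  /* error path: don't leave the partial file behind */
    return -1;
}
```

If the process is killed mid-write, only the ".tmp" file is left
behind, and the final name never refers to a partial file.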

Looking to be more robust, I had the following misguided idea: Rely
completely on the operating system to perform cleanup in the case of a
failure. Initially the file would be configured to be automatically
deleted when the final handle is closed. This takes care of all
abnormal exits, and possibly even power failures. The program can just
exit on error without deleting the file. Once written successfully,
the automatic-delete indicator is cleared so that the file survives.

The target application for this technique supports both Linux and
Windows, so I would need to figure it out for both systems. On
Windows, there’s the flag FILE_FLAG_DELETE_ON_CLOSE. I’d just need
to find a way to clear it. On POSIX, the file would be unlinked while
being written, and linked into the filesystem on success. The latter
turns out to be a lot harder than I expected.

Solution for Windows

I’ll start with Windows since the technique actually works fairly well
here — ignoring the usual, dumb Win32 filesystem caveats. This is a
little surprising, since it’s usually Win32 that makes these things
far more difficult than they should be.

The primary Win32 function for opening and creating files is
CreateFile. There are many options, but the key is
FILE_FLAG_DELETE_ON_CLOSE. Here’s how an application might typically
open a file for output.

DWORD access = GENERIC_WRITE;
DWORD create = CREATE_ALWAYS;
DWORD flags = FILE_FLAG_DELETE_ON_CLOSE;
HANDLE f = CreateFile("out.tmp", access, 0, 0, create, flags, 0);


This special flag asks Windows to delete the file as soon as the last
handle to the file object is closed. Notice I said file object, not
file, since these are different things. The catch: This flag
is a property of the file object, not the file, and cannot be removed.

However, the solution is simple. Create a new link to the file so that
it survives deletion. This even works for files residing on network
shares.

CreateHardLink("out", "out.tmp", 0);
CloseHandle(f);  // deletes out.tmp file


The gotcha is that the underlying filesystem must be NTFS. FAT32
doesn’t support hard links. Unfortunately, since FAT32 remains the
least common denominator and is still widely used for removable media,
depending on the application, your users may expect support for saving
files to FAT32. A workaround is probably required.

Solution for Linux

This is where things really fall apart. It’s just barely possible on
Linux, it’s messy, and it’s not portable anywhere else. There’s no way
to do this for POSIX in general.

My initial thought was to create a file then unlink it. Unlike the
situation on Windows, files can be unlinked while they’re currently
open by a process. These files are finally deleted when the last file
descriptor (the last reference) is closed. Unfortunately, using
unlink(2) to remove the last link to a file prevents that file from
being linked again.

Instead, the solution is to use the relatively new (since Linux 3.11),
Linux-specific O_TMPFILE flag when creating the file. Instead of a
filename, this variation of open(2) takes a directory and creates an
unnamed, temporary file in it. These files are special in that they’re
permitted to be given a name in the filesystem at some future point.

For this example, I’ll assume the output is relative to the current
working directory. If it’s not, you’ll need to open an additional file
descriptor for the parent directory, and also use openat(2) to avoid
possible race conditions (since paths can change from under you). The
number of ways this can fail is already rapidly multiplying.

int fd = open(".", O_TMPFILE|O_WRONLY, 0600);


The catch is that only a handful of filesystems support O_TMPFILE.
It’s like the FAT32 problem above, but worse. You could easily end up
in a situation where it’s not supported, and will almost certainly
require a workaround.

Linking a file from a file descriptor is where things get messier. The
file descriptor must be linked with linkat(2) from its name on the
/proc virtual filesystem, constructed as a string. The following
snippet comes straight from the Linux open(2) manpage.

char buf[64];
sprintf(buf, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, buf, AT_FDCWD, "out", AT_SYMLINK_FOLLOW);


Even on Linux, /proc isn’t always available, such as within a chroot
or a container, so this part can fail as well. In theory there’s a way
to do this with the Linux-specific AT_EMPTY_PATH and avoid /proc,
but I couldn’t get it to work.

// Note: this doesn't actually work for me.
linkat(fd, "", AT_FDCWD, "out", AT_EMPTY_PATH);


Given the poor portability (even within Linux), the number of ways
this can go wrong, and that a workaround is definitely needed anyway,
I’d say this technique is worthless. I’m going to stick with the
tried-and-true approach for this one.




Four Ways to Compile C for Windows
2016-06-13T04:13:25Z
Update 2020: If you’re on Windows, just use w64devkit.
It’s my own toolchain distribution, and it’s the best option
available. Everything you need is in one package.

I primarily work on and develop for unix-like operating systems —
Linux in particular. However, when it comes to desktop applications,
most potential users are on Windows. Rather than develop on Windows,
which I’d rather avoid, I’ll continue developing, testing, and
debugging on Linux while keeping portability in mind. Unfortunately
every option I’ve found for building Windows C programs has some
significant limitations. These limitations inform my approach to
portability and restrict the C language features used by the program
for all platforms.

As of this writing I’ve identified four different practical ways to
build C applications for Windows. This information will definitely
become further and further out of date as this article ages, so if
you’re visiting from the future take a moment to look at the date.
Except for LLVM shaking things up recently, development tooling on
unix-like systems has had the same basic form for the past 15 years
(i.e. dominated by GCC). While Visual C++ has been around for more
than two decades, the tooling on Windows has seen more churn by
comparison.

Before I get into the specifics, let me point out a glaring problem
common to all four: Unicode arguments and filenames. Microsoft jumped
the gun and adopted UTF-16 early. UTF-16 is a kludge, a worst of all
worlds, being a variable length encoding (surrogate pairs), backwards
incompatible (unlike UTF-8), and having byte-order issues (BOM).
Most Win32 functions that accept strings generally come in two flavors,
ANSI and UTF-16. The standard, portable C library functions wrap the
ANSI-flavored functions. This means portable C programs can’t interact
with Unicode filenames. (Update 2021: Now they can.) They must
call the non-portable, Windows-specific versions. This includes main
itself, which is only handed ANSI-truncated arguments.

Compare this to unix-like systems, which generally adopted UTF-8, but
as a convention rather than a hard rule. The operating system
doesn’t know or care about Unicode. Program arguments and filenames
are just zero-terminated bytestrings. Implicitly decoding these as
UTF-8 would be a mistake anyway. What happens when the
encoding isn’t valid?

This doesn’t have to be a problem on Windows. A Windows standard C
library could connect to Windows’ Unicode-flavored functions and
encode to/from UTF-8 as needed, allowing portable programs to maintain
the bytestring illusion. It’s only that none of the existing standard
C libraries do it this way.

Mingw-w64

Of course my first natural choice is MinGW, specifically the
Mingw-w64 fork. It’s GCC ported to Windows. You can
continue relying on GCC-specific features when you need them. It’s got
all the core language features up through C11, plus the common
extensions. It’s probably packaged by your Linux distribution of
choice, making it trivial to cross-compile programs and libraries from
Linux — and with Wine you can even execute them on x86. Like regular
GCC, it outputs GDB-friendly DWARF debugging information, so you can
debug applications with GDB.

If I’m using Mingw-w64 on Windows, I prefer to do so from inside
Cygwin. Since it provides a complete POSIX environment, it maximizes
portability for the whole tool chain. This isn’t strictly required.

However, it has one big flaw. Unlike unix-like systems, Windows doesn’t
supply a system standard C library. That’s the compiler’s job. But
Mingw-w64 doesn’t have one. Instead it links against msvcrt.dll,
which isn’t officially supported by Microsoft. It just
happens to exist on modern Windows installations. Since it’s not
supported, it’s way out of date and doesn’t support much of C99. A lot
of these problems are patched over by the compiler, but if you’re
relying on Mingw-w64, you still have to stick to some C89 library
features, such as limiting yourself to the C89 printf specifiers.
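
As an illustration of what that restriction means in practice, here is
a sketch of printing a size_t without C99’s "%zu", which, as I
understand it, the old msvcrt.dll printf doesn’t recognize. The
function name is hypothetical.

```c
#include <stdio.h>

/* Format a size_t using only C89 printf specifiers: convert to
 * unsigned long and use "%lu". buf needs room for at least 21 bytes.
 * Caveat: unsigned long is 32 bits on 64-bit Windows, so values above
 * ULONG_MAX are truncated by the cast. */
int
format_size_c89(char *buf, size_t n)
{
    return sprintf(buf, "%lu", (unsigned long)n);
}
```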

Update: Mārtiņš Možeiko has pointed out __USE_MINGW_ANSI_STDIO, an
undocumented feature that fixes the printf family. I now use this by
default in all of my Mingw-w64 builds. It fixes most of the formatted
output issues, except that it’s incompatible with the format function
attribute. (Update 2021: Mingw-w64 now does the right thing
out of the box.)

Another problem is that position-independent code generation is
broken, and so ASLR is not an option. This means binaries produced
by Mingw-w64 are less secure than they should be. There are also a
number of subtle code generation bugs that might arise if you’re
doing something unusual. (Update 2021: Mingw-w64 makes PIE mandatory.)

Visual C++

The behemoth usually considered in this situation is Visual Studio and
the Visual C++ build tools. I strongly prefer open source development
tools, and Visual Studio is obviously the least open source option, but
at least it’s cost-free these days. Now, I have absolutely no interest
in Visual Studio, but fortunately the Visual C++ compiler and
associated build tools can be used standalone, supporting both C and
C++.

Included is a “vcvars” batch file — vcvars64.bat for x64. Execute that
batch file in a cmd.exe console and the Visual C++ command line build
tools will be made available in that console and in any programs
executed from it (your editor). It includes the compiler (cl.exe),
linker (link.exe), assembler (ml64.exe), disassembler (dumpbin.exe),
and more. It also includes a mostly POSIX-complete make called
nmake.exe. All these tools are noisy and print a copyright banner on
every invocation, so get used to passing -nologo every time, which
suppresses some of it.

When I said behemoth, I meant it. In my experience it literally takes
hours (unattended) to install Visual Studio 2015. The good news is you
don’t actually need it all anymore. The build tools are available
standalone. While it’s still a larger and slower installation
process than it really should be, it’s much more reasonable to
install. It’s good enough that I’d even say I’m comfortable relying on
it for Windows builds. (Update: The build tools are unfortunately no
longer standalone.)

That being said, it’s not without its flaws. Microsoft has never
announced any plans to support C99. They only care about C++, with C as
a second class citizen. Since C++11 incorporated most of C99 and
Microsoft supports C++11, Visual Studio 2015 supports most of C99. The
only things missing as far as I can tell are variable length arrays
(VLAs), complex numbers, and C99’s array parameter declarators, since
none of these were adopted by C++. Some C99 features are considered
extensions (as they would be for C89), so you’ll also get warnings about
them, which can be disabled.

The command line interface (option flags, intermediates, etc.) isn’t
quite reconcilable with the unix-like ecosystem (i.e. GCC, Clang), so
you’ll need separate Makefiles, or you’ll need to use a build
system that generates Visual C++ Makefiles.

Debugging is a major problem. (Update 2022: It’s actually quite good
once you know how to do it.) Visual C++ outputs separate .pdb
program database files, which aren’t usable from GDB. Visual
Studio has a built-in debugger, though it’s not included in the
standalone Visual C++ build tools. I’m still searching for a decent
debugging solution for this scenario. I tried WinDbg, but I can’t stand
it. (Update 2022: RemedyBG is amazing.)

In general the output code performance is on par with GCC and Clang,
so you’re not really gaining or losing performance with Visual C++.

Clang

Unsurprisingly, Clang has been ported to Windows. It’s like
Mingw-w64 in that you get the same features and interface across
platforms.

Unlike Mingw-w64, it doesn’t link against msvcrt.dll. Instead it
relies directly on the official Windows SDK. You’ll basically need
to install the Visual C++ build tools as if you were going to build with
Visual C++. This means no practical cross-platform builds and you’re
still relying on the proprietary Microsoft toolchain. In the past you
even had to use Microsoft’s linker, but LLVM now provides its own.

It generates GDB-friendly DWARF debug information (in addition to
CodeView) so in theory you can debug with GDB again. I haven’t
given this a thorough evaluation yet.

Pelles C

Finally there’s Pelles C. It’s cost-free but not open
source. It’s a reasonable, small install that includes a full IDE with
an integrated debugger and command line tools. It has its own C
library and Win32 SDK with the most complete C11 support around. It
also supports OpenMP 3.1. All in all it’s pretty nice and is something
I wouldn’t be afraid to rely upon for Windows builds.

Like Visual C++, it has a couple of “povars” batch files to set up the
right environment, which includes a C compiler, linker, assembler,
etc. The compiler interface mostly mimics cl.exe, though there are far
fewer code generation options. The make program, pomake.exe, mimics
nmake.exe, but is even less POSIX-complete. The compiler’s output
code performance is also noticeably poorer than GCC, Clang, and Visual
C++. It’s definitely a less mature compiler.

It outputs CodeView debugging information, so GDB is of no use.
The best solution is to simply use the debugger built into the IDE,
which can be invoked directly from the command line. You don’t
normally need to code from within the IDE just to use the debugger.

Like Visual C++, it’s Windows only, so cross-compilation isn’t really
in the picture.

If performance isn’t of high importance, and you don’t require
specific code generation options, then Pelles C is a nice choice for
Windows builds.

Other Options

I’m sure there are a few other options out there, and I’d like to hear
about them so I can try them out. I focused on these since they’re all
cost free and easy to download. If I have to register or pay, then
it’s not going to beat these options.




You Can't Always Hash Pointers in C
2016-05-30T23:59:46Z
Occasionally I’ve needed to key a hash table with C pointers. I don’t
care about the contents of the object itself — especially if it might
change — just its pointer identity. For example, suppose I’m using
null-terminated strings as keys and I know these strings will always
be interned in a common table. These strings can be compared directly
by their pointer values (str_a == str_b) rather than, more slowly,
by their contents (strcmp(str_a, str_b) == 0). The intern table
ensures that these expressions both have the same result.

As a key in a hash table, or other efficient map/dictionary data
structure, I’ll need to turn pointers into numerical values. However,
C pointers aren’t integers. Following certain rules it’s permitted
to cast pointers to integers and back, but doing so will reduce the
program’s portability. The most important consideration is that the
integer form isn’t guaranteed to have any meaningful or stable
value. In other words, even in a conforming implementation, the same
pointer might cast to two different integer values. This would break
any algorithm that isn’t keenly aware of the implementation details.
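
For concreteness, this is the kind of code in question: a sketch of a
pointer hash that goes through uintptr_t. The function name and the
mixing steps are illustrative assumptions, not a recommended hash. It
behaves as expected on mainstream platforms, but it leans on exactly
the implementation-defined conversion discussed here, and uintptr_t
itself is an optional type.

```c
#include <stdint.h>

/* Hash a pointer by its integer representation. The pointer to
 * integer conversion is implementation-defined, so the result is only
 * as stable as the implementation makes it. */
uint32_t
hash_ptr(const void *p)
{
    uint64_t x = (uint64_t)(uintptr_t)p;
    /* Illustrative mixer: xor-shift, then multiply by an odd constant. */
    x ^= x >> 32;
    x *= UINT64_C(0x9e3779b97f4a7c15);
    x ^= x >> 29;
    return (uint32_t)(x >> 32);
}
```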

To show why this is, I’m going to be citing the relevant parts of the
C99 standard (ISO/IEC 9899:1999). The draft for C99 is freely
available (and what I use myself since I’m a cheapass). My purpose is
not to discourage you from casting pointers to integers and using
the result. The vast majority of the time this works fine and as you
would expect. I just think it’s an interesting part of the language,
and C/C++ programmers should be aware of the potential trade-offs.

Integer to pointer casts

What does the standard have to say about casting integers to pointers?
§6.3.2.3¶5:


  An integer may be converted to any pointer type. Except as
previously specified, the result is implementation-defined, might
not be correctly aligned, might not point to an entity of the
referenced type, and might be a trap representation.


It also includes a footnote:


  The mapping functions for converting a pointer to an integer or an
integer to a pointer are intended to be consistent with the
addressing structure of the execution environment.


Casting an integer to a pointer depends entirely on the
implementation. This is intended for things like memory mapped
hardware. The programmer may need to access memory as a specific
physical address, which would be encoded in the source as an integer
constant and cast to a pointer of the appropriate type.

int
read_sensor_voltage(void)
{
    return *(volatile int *)0x1ffc;  // volatile: hardware may change it
}


It may also be used by a loader and dynamic linker to compute the
virtual address of various functions and variables, then cast to a
pointer before use.

Both cases are already dependent on implementation defined behavior,
so there’s nothing lost in relying on these casts.

An integer constant expression of 0 is a special case. It casts to a
NULL pointer in all implementations (§6.3.2.3¶3). However, a NULL
pointer doesn’t necessarily point to address zero, nor is it
necessarily a zero bit pattern (i.e. beware memset and calloc on
memory with pointers). It’s just guaranteed never to compare equally
with a valid object, and it is undefined behavior to dereference.
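
To make the memset and calloc warning concrete, here is a minimal sketch with a hypothetical struct node. It gets its null pointers through assignment or an initializer, both of which the language guarantees to produce real null pointers, rather than through all-zero bytes. The type and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct holding a couple of pointers. */
struct node {
    struct node *next;
    const char *name;
};

/* Assign NULL to each pointer member instead of zeroing the bytes
 * with memset(), which only works when a null pointer happens to be
 * all zero bits. */
static void
node_init(struct node *n)
{
    assert(n != NULL);
    n->next = NULL;
    n->name = NULL;
}

/* A {0} initializer also yields null pointers for pointer members,
 * whatever their bit pattern is on the implementation. */
static struct node
node_zero(void)
{
    struct node n = {0};
    return n;
}
```

On mainstream platforms memset would produce the same result, so this only matters for code that wants to avoid depending on the representation.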

Pointer to integer casts

What about the other way around? §6.3.2.3¶6:


  Any pointer type may be converted to an integer type. Except as
previously specified, the result is implementation-defined. If the
result cannot be represented in the integer type, the behavior is
undefined. The result need not be in the range of values of any
integer type.


Like before, it’s implementation defined. However, the negatives are a
little stronger: the cast itself may be undefined behavior. I
speculate this is tied to integer overflow. The last part makes
pointer to integer casts optional for an implementation. This is one
way that the hash table above would be less portable.

When the cast is always possible, an implementation can provide an
integer type wide enough to hold any pointer value. §7.18.1.4¶1:


  The following type designates a signed integer type with the
property that any valid pointer to void can be converted to this
type, then converted back to pointer to void, and the result will
compare equal to the original pointer:

  intptr_t

  The following type designates an unsigned integer type with the
property that any valid pointer to void can be converted to this
type, then converted back to pointer to void, and the result will
compare equal to the original pointer:

  uintptr_t

  These types are optional.


The take-away is that the integer has no meaningful value. The only
guarantee is that the integer can be cast back into a void pointer
that will compare equally. It would be perfectly legal for an
implementation to pass these assertions (and still sometimes fail).

void
example(void *ptr_a, void *ptr_b)
{
    if (ptr_a == ptr_b) {
        uintptr_t int_a = (uintptr_t)ptr_a;
        uintptr_t int_b = (uintptr_t)ptr_b;
        assert(int_a != int_b);
        assert((void *)int_a == (void *)int_b);
    }
}


Since the bits don’t have any particular meaning, arithmetic
operations involving them will also have no meaning. When a pointer
might map to two different integers, the hash values might not match
up, breaking hash tables that rely on them. Even with uintptr_t
provided, casting pointers to integers isn’t useful without also
relying on implementation defined properties of the result.
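
As a rough sketch of what such a hash table typically does, the hypothetical ptr_bucket() below casts the pointer to uintptr_t and mixes the bits before reducing to a bucket index. It assumes equal pointers always convert to equal integers, which is exactly the implementation defined property in question, and the mixing constants are arbitrary choices for illustration. Since uintptr_t is optional, it checks for UINTPTR_MAX, which stdint.h defines only when the type exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef UINTPTR_MAX
#  error "uintptr_t is optional in C99 and this implementation lacks it"
#endif

/* Hypothetical helper: map a pointer to a slot in a table of nbuckets
 * entries. ASSUMPTION: equal pointers convert to equal integers. The
 * standard does not promise this. */
static size_t
ptr_bucket(const void *ptr, size_t nbuckets)
{
    assert(nbuckets > 0);
    uint64_t h = (uint64_t)(uintptr_t)ptr;
    /* Alignment tends to leave the low bits zero, so mix before
     * reducing. The constant and shifts are arbitrary. */
    h ^= h >> 32;
    h *= UINT64_C(0x9e3779b97f4a7c15);
    h ^= h >> 29;
    return (size_t)(h % nbuckets);
}
```

On an implementation that annotates pointers, two equal pointers could land in different buckets here, and the table would quietly lose entries.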

Reasons for this pointer insanity

What purpose could such strange pointer-to-integer casts serve?

A security-conscious implementation may choose to annotate pointers
with additional information by setting unused bits. It might be for
baggy bounds checks or, someday, in an undefined behavior
sanitizer. Before dereferencing annotated pointers, the
metadata bits would be checked for validity, and cleared/set before
use as an address. Or it may map the same object at multiple virtual
addresses to avoid setting/clearing the metadata bits,
providing interoperability with code unaware of the annotations. When
pointers are compared, these bits would be ignored.

When these annotated pointers are cast to integers, the metadata bits
will be present, but a program using the integer wouldn’t know their
meaning without tying itself closely to that implementation.
Completely unused bits may even be filled with random garbage when
cast. It’s allowed.

You may have been thinking before about using a union or char * to
bypass the cast and access the raw pointer bytes, but you’d run into
the same problems on the same implementations.

Conforming programs

The standard makes a distinction between strictly conforming
programs (§4¶5) and conforming programs (§4¶7). A strictly
conforming program must not produce output depending on implementation
defined behavior nor exceed minimum implementation limits. Very few
programs fit in this category, and any program using uintptr_t is
excluded because the type is optional. Here are more examples of code that isn’t
strictly conforming:

    printf("%zu", sizeof(int)); // §6.5.3.4
    printf("%d", -1 >> 1);      // §6.5¶4
    printf("%d", INT_MAX);      // §5.2.4.2.1


On the other hand, a conforming program is allowed to depend on
implementation defined behavior. Relying on meaningful, stable values
for pointers cast to uintptr_t/intptr_t is conforming even if your
program may exhibit bugs on some implementations.




Counting Processor Cores in Emacs
2015-10-14T03:17:16Z
One of the great advantages of dependency analysis is parallelization.
Modern processors reorder instructions whose results don’t affect each
other. Compilers reorder expressions and statements to improve
throughput. Build systems know which outputs are inputs for other
targets and can choose any arbitrary build order within that
constraint. This article involves the last case.

The build system I use most often is GNU Make, either directly or
indirectly (Autoconf, CMake). It’s far from perfect, but it does what
I need. I almost always invoke it from within Emacs rather than in a
terminal. In fact, I do it so often that I’ve wrapped Emacs’ compile
command for rapid invocation.

I recently helped a co-worker set this up for himself, so it had
me thinking about the problem again. The situation in my
config is much more complicated than it needs to be, so I’ll
share a simplified version instead.

First bring in the usual goodies (we’re going to be making closures):

;;; -*- lexical-binding: t; -*-
(require 'cl-lib)


We need a couple of configuration variables.

(defvar quick-compile-command "make -k ")
(defvar quick-compile-build-file "Makefile")


Then a couple of interactive functions to set these on the fly. It’s
not strictly necessary, but I like giving each a key binding. I also
like having a history available via read-string, so I can switch
between a couple of different options with ease.

(defun quick-compile-set-command (command)
  (interactive
   (list (read-string "Command: " quick-compile-command)))
  (setf quick-compile-command command))

(defun quick-compile-set-build-file (build-file)
  (interactive
   (list (read-string "Build file: " quick-compile-build-file)))
  (setf quick-compile-build-file build-file))


Now finally to the good part. Below, quick-compile is a
non-interactive function that returns an interactive closure ready to
be bound to any key I desire. It takes an optional target. This means
I don’t use the above quick-compile-set-command to choose a target,
only for setting other options. That will make more sense in a moment.

(cl-defun quick-compile (&optional (target ""))
  "Return an interactive function that runs `compile' for TARGET."
  (lambda ()
    (interactive)
    (save-buffer)  ; so I don't get asked
    (let ((default-directory
            (locate-dominating-file
             default-directory quick-compile-build-file)))
      (if default-directory
          (compile (concat quick-compile-command " " target))
        (error "Cannot find %s" quick-compile-build-file)))))


It traverses up (down?) the directory hierarchy towards root looking
for a Makefile — or whatever is set for quick-compile-build-file
— then invokes the build system there. I don’t believe in recursive
make.

So how do I put this to use? I clobber some key bindings I don’t
otherwise care about. A better choice might be the F-keys, but my
muscle memory is already committed elsewhere.

(global-set-key (kbd "C-x c") (quick-compile)) ; default target
(global-set-key (kbd "C-x C") (quick-compile "clean"))
(global-set-key (kbd "C-x t") (quick-compile "test"))
(global-set-key (kbd "C-x r") (quick-compile "run"))


Each of those invokes a different target without second guessing me.
Let me tell you, having “clean” at the tip of my fingers is wonderful.

Parallel Builds

An extension common to many different make programs is -j, which
asks make to build targets in parallel where possible. These days
where multi-core machines are the norm, you nearly always want to use
this option, ideally set to the number of logical processor cores on
your system. It’s a huge time-saver.

My recent revelation was that my default build command could be
better: make -k is minimal. It should at least include -j, but
choosing an argument (number of processor cores) is a problem. Today I
use different machines with 2, 4, or 8 cores, so most of the time any
given number will be wrong. I could use a per-system configuration,
but I’d rather not. Unfortunately GNU Make will not automatically
detect the number of cores. That leaves the matter up to Emacs Lisp.

Emacs doesn’t currently have a built-in function that returns the
number of processor cores. I’ll need to reach into the operating
system to figure it out. My usual development environments are Linux,
Windows, and OpenBSD, so my solution should work on each. I’ve ranked
them by order of importance.

Number of cores on Linux

Linux has the /proc virtual filesystem in the fashion of Plan 9,
allowing different aspects of the system to be explored through the
standard filesystem API. The relevant file here is /proc/cpuinfo,
listing useful information about each of the system’s processors. To
get the number of processors, count the number of processor entries in
this file. I’ve wrapped it in a file-exists-p check so that it returns
nil on other operating systems instead of throwing an error.

(when (file-exists-p "/proc/cpuinfo")
  (with-temp-buffer
    (insert-file-contents "/proc/cpuinfo")
    (how-many "^processor[[:space:]]+:")))


Number of cores on Windows

When I was first researching how to do this on Windows, I thought I
would need to invoke the wmic command line program and hope the
output could be parsed the same way on different versions of the
operating system and tool. However, it turns out the solution for
Windows is trivial. The environment variable NUMBER_OF_PROCESSORS
gives every process the answer for free. Being an environment
variable, it will need to be parsed.

(let ((number-of-processors (getenv "NUMBER_OF_PROCESSORS")))
  (when number-of-processors
    (string-to-number number-of-processors)))


Number of cores on BSD

This seems to work the same across all the BSDs, including OS X,
though I haven’t yet tested it exhaustively. Invoke sysctl, which
returns an undecorated number to be parsed.

(with-temp-buffer
  (ignore-errors
    (when (zerop (call-process "sysctl" nil t nil "-n" "hw.ncpu"))
      (string-to-number (buffer-string)))))


Also not complicated, but it’s the heaviest solution of the three.

Putting it all together

Join all these together with or, call it numcores, and ta-da.

(setf quick-compile-command (format "make -kj%d" (numcores)))


Now make is invoked correctly on any system by default.




Recovering Live Data with GDB
2015-09-15T14:53:44Z
I recently ran into a problem where long-running program
output was trapped in a C FILE buffer. The program had been running
for two days straight printing its results, but the last few kilobytes
of output were missing. It wouldn’t output these last bytes until the
program completed its day-long (or worse!) cleanup operation and
exited. This is easy to fix — and, honestly, the cleanup step was
unnecessary anyway — but I didn’t want to start all over and wait
two more days to recompute the result.

Here’s a minimal example of the situation. The first loop represents
the long-running computation and the infinite loop represents a
cleanup job that will never complete.

#include <stdio.h>

int
main(void)
{
    /* Compute output. */
    for (int i = 0; i < 10; i++)
        printf("%d/%d ", i, i * i);
    putchar('\n');

    /* "Slow" cleanup operation ... */
    for (;;)
        ;
    return 0;
}


Buffered Output Review

Both printf and putchar are C library functions and are usually
buffered in some way. That is, each call to these functions doesn’t
necessarily send data out of the program. This is in contrast to the
POSIX functions read and write, which are unbuffered system calls.
Since system calls are relatively expensive, buffered input and output
is used to change a large number of system calls on small buffers into
a single system call on a single large buffer.

Typically, stdout is line-buffered if connected to a terminal. When
the program completes a line of output, the user probably wants to see
it immediately. So, if you compile the example program and run it at
your terminal you will probably see the output before the program
hangs on the infinite loop.

$ cc -std=c99 example.c
$ ./a.out
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81


However, when stdout is connected to a file or pipe, it’s generally
buffered to something like 4kB. For this program, the output will
remain empty no matter how long you wait. It’s trapped in a FILE
buffer in process memory.

$ ./a.out > output.txt


The primary way to fix this is to use the fflush function, to force
the buffer empty before starting a long, non-output operation.
Unfortunately for me I didn’t think of this two days earlier.
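
For reference, a minimal sketch of that fix, using a hypothetical print_results() that takes the stream as a parameter so it works for stdout or any other FILE. It writes the same output as the example program and flushes before returning, so nothing is left in the buffer while later work runs.

```c
#include <assert.h>
#include <stdio.h>

/* Print the results, then flush so that nothing stays in the FILE
 * buffer during whatever slow work follows. Returns 0 on success or
 * EOF on a write error, the same as fflush(). */
static int
print_results(FILE *out, int n)
{
    assert(out != NULL);
    for (int i = 0; i < n; i++)
        fprintf(out, "%d/%d ", i, i * i);
    putc('\n', out);
    return fflush(out);
}
```

Calling print_results(stdout, 10) just before the cleanup loop would get the output into output.txt even though the process never exits.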

Debugger to the Rescue

Fortunately there is a way to interrupt a running program and
manipulate its state: a debugger. First, find the process ID of the
running program (the one writing to output.txt above).

$ pgrep a.out
12934


Now attach GDB, which will pause the program’s execution.

$ gdb ./a.out
Reading symbols from ./a.out...(no debugging symbols found)...done.
gdb> attach 12934
Attaching to program: /tmp/a.out, process 12934
... snip ...
0x0000000000400598 in main ()
gdb>


From here I could examine the stdout FILE struct and try to extract
the buffer contents by hand. However, the easiest thing to do is
perform the call I forgot in the first place: fflush(stdout).

gdb> call fflush(stdout)
$1 = 0
gdb> quit
Detaching from program: /tmp/a.out, process 12934


The program is still running, but the output has been recovered.

$ cat output.txt
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81


Why Cleanup?

As I said, in my case the cleanup operation was entirely unnecessary,
so it would be safe to just kill the program at this point. It was
taking a really long time to tear down a humongous data structure (on
the order of 50GB) one little node at a time with free. Obviously,
the memory would be freed much more quickly by the OS when the program
exited.

Freeing memory in the program was only to satisfy Valgrind,
since it’s so incredibly useful for debugging. Not freeing the data
structure would hide actual memory leaks in Valgrind’s final report.
For the real “production” run, I should have disabled cleanup.




C Object Oriented Programming
2014-10-21T03:52:43Z
Object oriented programming, polymorphism in particular, is
essential to nearly any large, complex software system. Without it,
decoupling different system components is difficult. (Update in
2017: I no longer agree with this statement.) C doesn’t come with
object oriented capabilities, so large C programs tend to grow their
own out of C’s primitives. This includes huge C projects like the
Linux kernel, BSD kernels, and SQLite.

Starting Simple

Suppose you’re writing a function pass_match() that takes an input
stream, an output stream, and a pattern. It works sort of like grep.
It passes to the output each line of input that matches the pattern.
The pattern string contains a shell glob pattern to be handled by
POSIX fnmatch(). Here’s what the interface looks like.

void pass_match(FILE *in, FILE *out, const char *pattern);


Glob patterns are simple enough that pre-compilation, as would be done
for a regular expression, is unnecessary. The bare string is enough.

Some time later the customer wants the program to support regular
expressions in addition to shell-style glob patterns. For efficiency’s
sake, regular expressions need to be pre-compiled and so will not be
passed to the function as a string. It will instead be a POSIX
regex_t object. A quick-and-dirty approach might be to
accept both and match whichever one isn’t NULL.

void pass_match(FILE *in, FILE *out, const char *pattern, regex_t *re);


Bleh. This is ugly and won’t scale well. What happens when more kinds
of filters are needed? It would be much better to accept a single
object that covers both cases, and possibly even another kind of
filter in the future.

A Generalized Filter

One of the most common ways to customize the behavior of a
function in C is to pass a function pointer. For example, the final
argument to qsort() is a comparator that determines how
objects get sorted.

For pass_match(), this function would accept a string and return a
boolean value deciding if the string should be passed to the output
stream. It gets called once on each line of input.

void pass_match(FILE *in, FILE *out, bool (*match)(const char *));


However, this has one of the same problems as qsort():
the passed function lacks context. It needs a pattern string or
regex_t object to operate on. In other languages these would be
attached to the function as a closure, but C doesn’t have closures. It
would need to be smuggled in via a global variable, which is not
good.

static regex_t regex;  // BAD!!!

bool regex_match(const char *string)
{
    return regexec(&regex, string, 0, NULL, 0) == 0;
}


Because of the global variable, in practice pass_match() would be
neither reentrant nor thread-safe. We could take a lesson from GNU’s
qsort_r() and accept a context to be passed to the filter function.
This simulates a closure.

void pass_match(FILE *in, FILE *out,
                bool (*match)(const char *, void *), void *context);


The provided context pointer would be passed to the filter function as
the second argument, and no global variables are needed. This would
probably be good enough for most purposes and it’s about as simple as
possible. The interface to pass_match() would cover any kind of
filter.
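
Here is one way the context-pointer interface might be implemented and used. This is only a sketch: the 4096-byte line limit and the glob_match() callback are assumptions for illustration, not part of the original program. The glob callback treats its context as the pattern string.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Copy each line of `in` accepted by match() to `out`. ASSUMPTION:
 * lines fit in 4096 bytes. Longer lines get split by fgets(). */
void
pass_match(FILE *in, FILE *out,
           bool (*match)(const char *, void *), void *context)
{
    char line[4096];
    assert(in && out && match);
    while (fgets(line, sizeof(line), in)) {
        line[strcspn(line, "\n")] = 0;  /* strip the newline */
        if (match(line, context))
            fprintf(out, "%s\n", line);
    }
}

/* Hypothetical filter callback: the context is the glob pattern. */
bool
glob_match(const char *string, void *context)
{
    const char *pattern = context;
    return fnmatch(pattern, string, 0) == 0;
}
```

A call like pass_match(stdin, stdout, glob_match, "*.c") then needs no global state. A regex callback would receive a regex_t pointer through the same void pointer.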

But wouldn’t it be nice to package the function and context together
as one object?

More Abstraction

How about putting the context on a struct and making an interface out
of that? Here’s a tagged union that behaves as one or the other.

enum filter_type { GLOB, REGEX };

struct filter {
    enum filter_type type;
    union {
        const char *pattern;
        regex_t regex;
    } context;
};


There’s one function for interacting with this struct:
filter_match(). It checks the type member and calls the correct
function with the correct context.

bool filter_match(struct filter *filter, const char *string)
{
    switch (filter->type) {
    case GLOB:
        return fnmatch(filter->context.pattern, string, 0) == 0;
    case REGEX:
        return regexec(&filter->context.regex, string, 0, NULL, 0) == 0;
    }
    abort(); // programmer error
}


And the pass_match() API now looks like this. This will be the final
change to pass_match(), both in implementation and interface.

void pass_match(FILE *input, FILE *output, struct filter *filter);


It still doesn’t care how the filter works, so it’s good enough to
cover all future cases. It just calls filter_match() on the pointer
it was given. However, the switch and tagged union aren’t friendly
to extension. Really, it’s outright hostile. We finally have some
degree of polymorphism, but it’s crude. It’s like building duct tape
into a design. Adding new behavior means adding another switch case.
This is a step backwards. We can do better.

Methods

With the switch we’re no longer taking advantage of function
pointers. So what about putting a function pointer on the struct?

struct filter {
    bool (*match)(struct filter *, const char *);
};


The filter itself is passed as the first argument, providing context.
In object oriented languages, that’s the implicit this argument. To
avoid requiring the caller to worry about this detail, we’ll hide it
in a new switch-free version of filter_match().

bool filter_match(struct filter *filter, const char *string)
{
    return filter->match(filter, string);
}


Notice we’re still lacking the actual context, the pattern string or
the regex object. Those will be different structs that embed the
filter struct.

struct filter_regex {
    struct filter filter;
    regex_t regex;
};

struct filter_glob {
    struct filter filter;
    const char *pattern;
};


For both, the original filter struct is the first member. This is
critical. We’re going to be using a trick called type punning. The
first member is guaranteed to be positioned at the beginning of the
struct, so a pointer to a struct filter_glob is also a pointer to a
struct filter. Notice any resemblance to inheritance?

Each type, glob and regex, needs its own match method.

static bool
method_match_regex(struct filter *filter, const char *string)
{
    struct filter_regex *regex = (struct filter_regex *) filter;
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static bool
method_match_glob(struct filter *filter, const char *string)
{
    struct filter_glob *glob = (struct filter_glob *) filter;
    return fnmatch(glob->pattern, string, 0) == 0;
}


I’ve prefixed them with method_ to indicate their intended usage. I
declared these static because they’re completely private. Other
parts of the program will only be accessing them through a function
pointer on the struct. This means we need some constructors in order
to set up those function pointers. (For simplicity, I’m not error
checking.)

struct filter *filter_regex_create(const char *pattern)
{
    struct filter_regex *regex = malloc(sizeof(*regex));
    regcomp(&regex->regex, pattern, REG_EXTENDED);
    regex->filter.match = method_match_regex;
    return &regex->filter;
}

struct filter *filter_glob_create(const char *pattern)
{
    struct filter_glob *glob = malloc(sizeof(*glob));
    glob->pattern = pattern;
    glob->filter.match = method_match_glob;
    return &glob->filter;
}


Now this is real polymorphism. It’s really simple from the user’s
perspective. They call the correct constructor and get a filter object
that has the desired behavior. This object can be passed around
trivially, and no other part of the program worries about how it’s
implemented. Best of all, since each method is a separate function
rather than a switch case, new kinds of filter subtypes can be
defined independently. Users can create their own filter types that
work just as well as the two “built-in” filters.

Cleaning Up

Oops, the regex filter needs to be cleaned up when it’s done, but the
user, by design, won’t know how to do it. Let’s add a free() method.

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

void filter_free(struct filter *filter)
{
    filter->free(filter);
}


And the methods for each. These would also be assigned in the
constructor.

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = (struct filter_regex *) f;
    regfree(&regex->regex);
    free(f);
}

static void
method_free_glob(struct filter *f)
{
    free(f);
}


The glob constructor should perhaps strdup() its pattern as a
private copy, in which case it would be freed here.
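
A sketch of that variation follows, with the definitions repeated so that it compiles on its own. It assumes POSIX strdup() is available, and unlike the constructors above it checks for allocation failure. The pattern member is no longer const because the filter now owns it.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() */
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

struct filter_glob {
    struct filter filter;
    char *pattern;  /* private copy owned by the filter */
};

static bool
method_match_glob(struct filter *f, const char *string)
{
    struct filter_glob *glob = (struct filter_glob *)f;
    return fnmatch(glob->pattern, string, 0) == 0;
}

static void
method_free_glob(struct filter *f)
{
    struct filter_glob *glob = (struct filter_glob *)f;
    free(glob->pattern);  /* release the private copy first */
    free(glob);
}

/* Returns NULL if either allocation fails. */
struct filter *
filter_glob_create(const char *pattern)
{
    assert(pattern != NULL);
    struct filter_glob *glob = malloc(sizeof(*glob));
    if (!glob)
        return NULL;
    glob->pattern = strdup(pattern);
    if (!glob->pattern) {
        free(glob);
        return NULL;
    }
    glob->filter.match = method_match_glob;
    glob->filter.free = method_free_glob;
    return &glob->filter;
}
```

With the copy in place, the caller is free to reuse or modify its own pattern buffer after constructing the filter.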

Object Composition

A good rule of thumb is to prefer composition over inheritance. Having
tidy filter objects opens up some interesting possibilities for
composition. Here’s an AND filter that composes two arbitrary filter
objects. It only matches when both its subfilters match. It supports
short circuiting, so put the faster, or most discriminating, filter
first in the constructor (user’s responsibility).

struct filter_and {
    struct filter filter;
    struct filter *sub[2];
};

static bool
method_match_and(struct filter *f, const char *s)
{
    struct filter_and *and = (struct filter_and *) f;
    return filter_match(and->sub[0], s) && filter_match(and->sub[1], s);
}

static void
method_free_and(struct filter *f)
{
    struct filter_and *and = (struct filter_and *) f;
    filter_free(and->sub[0]);
    filter_free(and->sub[1]);
    free(f);
}

struct filter *filter_and(struct filter *a, struct filter *b)
{
    struct filter_and *and = malloc(sizeof(*and));
    and->sub[0] = a;
    and->sub[1] = b;
    and->filter.match = method_match_and;
    and->filter.free = method_free_and;
    return &and->filter;
}


It can combine a regex filter and a glob filter, or two regex filters,
or two glob filters, or even other AND filters. It doesn’t care what
the subfilters are. Also, the free() method here frees its
subfilters. This means that the user doesn’t need to keep hold of
every filter created, just the “top” one in the composition.

To make composition filters easier to use, here are two “constant”
filters. These are statically allocated, shared, and are never
actually freed.

static bool
method_match_any(struct filter *f, const char *string)
{
    return true;
}

static bool
method_match_none(struct filter *f, const char *string)
{
    return false;
}

static void
method_free_noop(struct filter *f)
{
}

struct filter FILTER_ANY  = { method_match_any,  method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };


The FILTER_NONE filter will generally be used with a (theoretical)
filter_or() and FILTER_ANY will generally be used with the
previously defined filter_and().
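
For completeness, here is what that theoretical filter_or() might look like. It's a sketch that mirrors filter_and(), with the supporting definitions repeated so it compiles on its own. The allocation check is an addition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

bool filter_match(struct filter *f, const char *s) { return f->match(f, s); }
void filter_free(struct filter *f) { f->free(f); }

/* The constant filters, repeated from the text above. */
static bool method_match_any(struct filter *f, const char *s)
{
    (void)f; (void)s;
    return true;
}

static bool method_match_none(struct filter *f, const char *s)
{
    (void)f; (void)s;
    return false;
}

static void method_free_noop(struct filter *f) { (void)f; }

struct filter FILTER_ANY  = { method_match_any,  method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };

struct filter_or {
    struct filter filter;
    struct filter *sub[2];
};

static bool
method_match_or(struct filter *f, const char *s)
{
    struct filter_or *self = (struct filter_or *)f;
    /* || short-circuits, so sub[1] only runs when sub[0] rejects. */
    return filter_match(self->sub[0], s) || filter_match(self->sub[1], s);
}

static void
method_free_or(struct filter *f)
{
    struct filter_or *self = (struct filter_or *)f;
    filter_free(self->sub[0]);
    filter_free(self->sub[1]);
    free(self);
}

/* Takes ownership of both subfilters. Returns NULL if malloc() fails. */
struct filter *
filter_or(struct filter *a, struct filter *b)
{
    assert(a && b);
    struct filter_or *self = malloc(sizeof(*self));
    if (!self)
        return NULL;
    self->sub[0] = a;
    self->sub[1] = b;
    self->filter.match = method_match_or;
    self->filter.free = method_free_or;
    return &self->filter;
}
```

FILTER_NONE is the identity element for OR in the same way that FILTER_ANY is for AND, which is why it makes a convenient starting value when folding a list of filters together.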

Here’s a simple program that composes multiple glob filters into a
single filter, one for each program argument.

int main(int argc, char **argv)
{
    struct filter *filter = &FILTER_ANY;
    for (char **p = argv + 1; *p; p++)
        filter = filter_and(filter_glob_create(*p), filter);
    pass_match(stdin, stdout, filter);
    filter_free(filter);
    return 0;
}


Notice only one call to filter_free() is needed to clean up the
entire filter.

Multiple Inheritance

As I mentioned before, the filter struct must be the first member of
filter subtype structs in order for type punning to work. If we want
to “inherit” from two different types like this, they would both need
to be in this position: a contradiction.

Fortunately type punning can be generalized such that the
first-member constraint isn’t necessary. This is commonly done through
a container_of() macro. Here’s a C99-conforming definition.

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))


Given a pointer to a member of a struct, the container_of() macro
allows us to back out to the containing struct. Suppose the regex
struct was defined differently, so that the regex_t member came
first.

struct filter_regex {
    regex_t regex;
    struct filter filter;
};


The constructor remains unchanged. The casts in the methods change to
the macro.

static bool
method_match_regex(struct filter *f, const char *string)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    regfree(&regex->regex);
    free(regex);  // f now points into the middle of the allocation
}


It’s a constant, compile-time computed offset, so there should be no
practical performance impact. The filter can now participate freely in
other intrusive data structures, like linked lists and such. It’s
analogous to multiple inheritance.
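
To illustrate, here is a sketch of a struct that embeds two "base" structs at once, neither of them in the first position. The list_node and listed_filter types and the two helper functions are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct filter {
    bool (*match)(struct filter *, const char *);
};

/* Hypothetical intrusive singly linked list node. */
struct list_node {
    struct list_node *next;
};

/* "Inherits" from both list_node and filter. Neither comes first. */
struct listed_filter {
    int id;
    struct list_node node;
    struct filter filter;
};

/* Recover the container from a pointer to its list node. */
static struct listed_filter *
from_node(struct list_node *n)
{
    return container_of(n, struct listed_filter, node);
}

/* Recover the same container from a pointer to its filter. */
static struct listed_filter *
from_filter(struct filter *f)
{
    return container_of(f, struct listed_filter, filter);
}
```

List code only ever sees struct list_node pointers and filter code only sees struct filter pointers, yet each can get back to the same containing object.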

Vtables

Say we want to add a third method, clone(), to the filter API, to
make an independent copy of a filter, one that will need to be
separately freed. It will be like the copy constructor in C++.
Each kind of filter will need to define an appropriate “method” for
it. As long as new methods like this are added at the end, this
doesn’t break the API, but it does break the ABI regardless.

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};


The filter object is starting to get big. It’s got three pointers —
24 bytes on modern systems — and these pointers are the same between
all instances of the same type. That’s a lot of redundancy. Instead,
these pointers could be shared between instances in a common table
called a virtual method table, commonly known as a vtable.

Here’s a vtable version of the filter API. The overhead is now only
one pointer regardless of the number of methods in the interface.

struct filter {
    struct filter_vtable *vtable;
};

struct filter_vtable {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};


Each type creates its own vtable and links to it in the constructor.
Here’s the regex filter re-written for the new vtable API and clone
method. This is all the tricks in one basket for a big object oriented
C finale!

struct filter *filter_regex_create(const char *pattern);

struct filter_regex {
    regex_t regex;
    const char *pattern;
    struct filter filter;
};

static bool
method_match_regex(struct filter *f, const char *string)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    regfree(&regex->regex);
    free(regex);
}

static struct filter *
method_clone_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return filter_regex_create(regex->pattern);
}

/* vtable */
struct filter_vtable filter_regex_vtable = {
    method_match_regex, method_free_regex, method_clone_regex
};

/* constructor */
struct filter *filter_regex_create(const char *pattern)
{
    struct filter_regex *regex = malloc(sizeof(*regex));
    regex->pattern = pattern;
    regcomp(&regex->regex, pattern, REG_EXTENDED);
    regex->filter.vtable = &filter_regex_vtable;
    return &regex->filter;
}


This is almost exactly what’s going on behind the scenes in C++. When
a method/function is declared virtual, and therefore dispatches
based on the run-time type of its left-most argument, it’s listed in
the vtables for classes that implement it. Otherwise it’s just a
normal function. This is why functions need to be declared virtual
ahead of time in C++.
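
The regex listing leaves out the dispatch wrappers for the vtable API. Here is a sketch of how they might look, along with a tiny hypothetical filter type, filter_first, that matches strings by their first character. It exists only to exercise clone() without dragging in regex.h, and the allocation check is an addition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct filter {
    struct filter_vtable *vtable;
};

struct filter_vtable {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};

/* Dispatch wrappers: each call goes through the shared vtable. */
bool filter_match(struct filter *f, const char *s) { return f->vtable->match(f, s); }
void filter_free(struct filter *f) { f->vtable->free(f); }
struct filter *filter_clone(struct filter *f) { return f->vtable->clone(f); }

/* Hypothetical filter: matches strings whose first character is c. */
struct filter_first {
    char c;
    struct filter filter;
};

struct filter *filter_first_create(char c);

static bool
method_match_first(struct filter *f, const char *s)
{
    struct filter_first *self = container_of(f, struct filter_first, filter);
    return s[0] == self->c;
}

static void
method_free_first(struct filter *f)
{
    /* Free the container, not the embedded member. */
    free(container_of(f, struct filter_first, filter));
}

static struct filter *
method_clone_first(struct filter *f)
{
    struct filter_first *self = container_of(f, struct filter_first, filter);
    return filter_first_create(self->c);
}

static struct filter_vtable filter_first_vtable = {
    method_match_first, method_free_first, method_clone_first
};

/* Returns NULL if malloc() fails. */
struct filter *
filter_first_create(char c)
{
    struct filter_first *self = malloc(sizeof(*self));
    if (!self)
        return NULL;
    self->c = c;
    self->filter.vtable = &filter_first_vtable;
    return &self->filter;
}
```

Every filter_first instance, including clones, points at the same filter_first_vtable, so the per-object cost stays at one pointer no matter how many methods the interface grows.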

In conclusion, it’s relatively easy to get the core benefits of object
oriented programming in plain old C. It doesn’t require heavy use of
macros, nor do users of these systems need to know that underneath
it’s an object system, unless they want to extend it for themselves.

Here’s the whole example program in one place if you’re interested in poking at it:


  https://gist.github.com/skeeto/5faa131b19673549d8ca





Duck Typing vs. Type Erasure
2014-04-01T21:07:31Z
Consider the following C++ class.

#include <iostream>

template <typename T>
struct Caller {
  const T callee_;
  Caller(const T callee) : callee_(callee) {}
  void go() { callee_.call(); }
};


Caller can be parameterized to any type so long as it has a call()
method. For example, introduce two types, Foo and Bar.

struct Foo {
  void call() const { std::cout << "Foo"; }
};

struct Bar {
  void call() const { std::cout << "Bar"; }
};

int main() {
  Caller<Foo> foo{Foo()};
  Caller<Bar> bar{Bar()};
  foo.go();
  bar.go();
  std::cout << std::endl;
  return 0;
}


This code compiles cleanly and, when run, emits “FooBar”. This is an
example of duck typing — i.e., “If it looks like a duck, swims like
a duck, and quacks like a duck, then it probably is a duck.” Foo and
Bar are unrelated types. They have no common inheritance, but by
providing the expected interface, they both work with Caller.
This is a special case of polymorphism.

Duck typing is normally only found in dynamically typed languages.
Thanks to templates, a statically, strongly typed language like C++
can have duck typing without sacrificing any type safety.
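
The type safety claim can be checked at compile time. The compiler
already rejects a Caller whose go() is used with a type lacking call(),
but the sketch below makes that check explicit. The has_call trait and
the Duck and Rock types are hypothetical names for this illustration,
and this is just one way to write such a check.

```cpp
#include <type_traits>
#include <utility>

// Chosen when `t.call()` is a valid expression for a const T.
template <typename T>
auto test_call(int)
    -> decltype(std::declval<const T &>().call(), std::true_type{});

// Fallback for every other type.
template <typename T>
std::false_type test_call(...);

// has_call<T>::value is true only if T provides call() const.
template <typename T>
using has_call = decltype(test_call<T>(0));

struct Duck { void call() const {} };  // quacks
struct Rock {};                        // doesn't

// Both checks are evaluated by the compiler, not at run time.
static_assert(has_call<Duck>::value, "Duck provides call()");
static_assert(!has_call<Rock>::value, "Rock provides no call()");
```

Nothing here runs at run time. A type that doesn’t fit the expected
interface is caught before the program ever starts.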

Java Duck Typing

Let’s try the same thing in Java using generics.

class Caller<T> {
    final T callee;
    Caller(T callee) {
        this.callee = callee;
    }
    public void go() {
        callee.call();  // compiler error: cannot find symbol call
    }
}

class Foo {
    public void call() { System.out.print("Foo"); }
}

class Bar {
    public void call() { System.out.print("Bar"); }
}

public class Main {
    public static void main(String args[]) {
        Caller<Foo> f = new Caller<>(new Foo());
        Caller<Bar> b = new Caller<>(new Bar());
        f.go();
        b.go();
        System.out.println();
    }
}


The program is practically identical, but this will fail with a
compile-time error. This is the result of type erasure. Unlike C++’s
templates, there will only ever be one compiled version of Caller, and
T will become Object. Since Object has no call() method, compilation
fails. The generic type is only for enabling additional compiler
checks later on.

C++ templates behave like macros, expanded by the compiler once for
each different type of applied parameter. The call symbol is looked
up later, after the type has been fully realized, not when the
template is defined.
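
Here’s a sketch of that late lookup, using a variant of Caller whose
go() returns a string so the result is easy to check. Mute is a
hypothetical type with no call() method at all. Since member functions
of a class template are only instantiated when used, Caller<Mute>
compiles fine so long as nobody calls its go().

```cpp
#include <cassert>
#include <string>
#include <type_traits>

template <typename T>
struct Caller {
    const T callee_;
    Caller(const T callee) : callee_(callee) {}
    // `call` is looked up only when go() is instantiated for some T.
    std::string go() const { return callee_.call(); }
};

struct Foo { std::string call() const { return "Foo"; } };
struct Mute {};  // hypothetical: no call() method

// Each parameterization is a distinct type with its own expansion.
static_assert(!std::is_same<Caller<Foo>, Caller<Mute>>::value,
              "Caller<Foo> and Caller<Mute> are different types");

std::string demo_instantiation()
{
    Caller<Foo> foo{Foo()};
    Caller<Mute> mute{Mute()};  // OK: go() is never instantiated for Mute
    (void)mute;                 // silence the unused-variable warning
    // mute.go() here would be a compile-time error: Mute has no call().
    std::string out = foo.go();
    assert(out == "Foo");
    return out;
}
```

Java has no equivalent of this. With a single compiled Caller, the
lookup of call() has to succeed when the generic class itself is
compiled.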

To fix this, Foo and Bar need a common ancestry. Let’s make this
Callee.

interface Callee {
    void call();
}


Caller needs to be redefined such that T is a subtype of Callee.

class Caller<T extends Callee> {
    // ...
}


This now compiles cleanly because call() will be found in Callee.
Finally, implement Callee.

class Foo implements Callee {
    // ...
}

class Bar implements Callee {
    // ...
}


This is no longer duck typing, just plain old polymorphism. Type
erasure prohibits duck typing in Java (outside of dirty reflection
hacks).

Signals and Slots and Events! Oh My!

Duck typing is useful for implementing the observer pattern without as
much boilerplate. A class can participate in the observer pattern
without inheriting from some specialized class or interface.
For example, see the various signal and slots systems for C++.
In contrast, Java has an EventListener type for everything:


  KeyListener
  MouseListener
  MouseMotionListener
  FocusListener
  ActionListener, etc.


A class concerned with many different kinds of events, such as an
event logger, would need to implement a large number of interfaces.
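
For comparison, here’s a minimal sketch of a duck-typed signal in C++.
The Signal class, the Logger type, and the event names are all
hypothetical and far simpler than any real signals and slots library.
The point is that connect() is a template, so it accepts anything
callable with the right arguments. std::function is used only as
convenient storage for the slots.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal signal: any callable taking Args... is a slot.
template <typename... Args>
class Signal {
    std::vector<std::function<void(Args...)>> slots_;
public:
    template <typename Slot>
    void connect(Slot slot) { slots_.emplace_back(std::move(slot)); }

    // Invoke every connected slot in connection order.
    void emit(Args... args) const
    {
        for (const auto &slot : slots_)
            slot(args...);
    }
};

// An event logger that inherits from nothing.
struct Logger {
    std::string log;
    void on_key(int code) { log += "key:" + std::to_string(code) + ";"; }
    void on_click(int x, int y)
    {
        log += "click:" + std::to_string(x) + "," + std::to_string(y) + ";";
    }
};

std::string demo_signals()
{
    Signal<int> key_pressed;
    Signal<int, int> mouse_clicked;
    Logger logger;
    key_pressed.connect([&logger](int code) { logger.on_key(code); });
    mouse_clicked.connect([&logger](int x, int y) { logger.on_click(x, y); });
    key_pressed.emit(65);
    mouse_clicked.emit(3, 4);
    return logger.log;  // "key:65;click:3,4;"
}
```

Logger listens to two unrelated kinds of events without naming a
single listener interface. Adding a third event means adding one more
method and one more connect() call.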