Two Games with Monte Carlo Tree Search

Monte Carlo tree search (MCTS) is the most impressive game artificial intelligence I’ve ever used. At its core it simulates a large number of games (playouts), starting from the current game state, using random moves for each player. Then it simply picks the move where it won most often. This description is sufficient to spot one of its most valuable features: MCTS requires no knowledge of strategy or effective play. The game’s rules — enough to simulate the game — are all that’s needed to allow the AI to make decent moves. Expert knowledge still makes for a stronger AI, but, more many games, it’s unnecessary to construct a decent opponent.

A second valuable feature is that it’s easy to parallelize. Unlike alpha-beta pruning, which doesn’t mix well with parallel searches of a Minimax tree, Monte Carlo simulations are practically independent and can be run in parallel.

Finally, the third valuable feature is that the search can be stopped at any time. The completion of any single simulation is as good a stopping point as any. It could be due to a time limit, a memory limit, or both. In general, the algorithm converges to a best move rather than suddenly discovering it. The good moves are identified quickly, and further simulations work to choose among them. More simulations make for better moves, with exponentially diminishing returns. Contrasted with Minimax, stopping early has the risk that the good moves were never explored at all.

To try out MCTS myself, I wrote two games employing it:

They’re both written in C, for both unix-like and Windows, and should be easy to build. I challenge you to beat them both. The Yavalath AI is easier to beat due to having blind spots, which I’ll discuss below. The Connect Four AI is more difficult and will likely take a number of tries.

Connect Four

MCTS works very well with Connect Four, and only requires modest resources: 32MB of memory to store the results of random playouts, and 500,000 game simulations. With a few tweaks, it can even be run in DOSBox. It stops when it hits either of those limits. In theory, increasing both would make for stronger moves, but in practice I can’t detect any difference. It’s like computing pi with Monte Carlo, where eventually it just runs out of precision to make any more progress.

Based on my simplified description above, you might wonder why it needs all that memory. Not only does MCTS need to track its win/loss ratio for each available move from the current state, it tracks the win/loss ratio for moves in the states behind those moves. A large chunk of the game tree is kept in memory to track all of the playout results. This is why MCTS needs a lot more memory than Minimax, which can discard branches that have been searched.

A convenient property of this tree is that the branch taken in the actual game can be re-used in a future search. The root of the tree becomes the node representing the taken game state, which has already seen a number of playouts. Even better, MCTS is weighted towards exploring good moves over bad moves, and good moves are more likely to be taken in the real game. In general, a significant portion of the tree gets to be reused in a future search.

I’m going to skip most of the details of the algorithm itself and focus on my implementation. Other articles do a better job at detailing the algorithm than I could.

My Connect Four engine doesn’t use dynamic allocation for this tree (or at all). Instead it manages a static buffer — an array of tree nodes, each representing a game state. All nodes are initially chained together into a linked list of free nodes. As the tree is built, nodes are pulled off the free list and linked together into a tree. When the game advances to the next state, nodes on unreachable branches are added back to the free list.

If at any point the free list is empty when a new node is needed, the current search aborts. This is the out-of-memory condition, and no more searching can be performed.

/* Connect Four is normally a 7 by 6 grid. */
#define CONNECT4_WIDTH  7

struct connect4_node {
    uint32_t next[CONNECT4_WIDTH];      // "pointer" to next node
    uint32_t playouts[CONNECT4_WIDTH];  // number of playouts
    float    score[CONNECT4_WIDTH];     // pseudo win/loss ratio

Rather than native C pointers, the structure uses 32-bit indexes into the master array. This saves a lot of memory on 64-bit systems, and the structure is the same size no matter the pointer size of the host. The next field points to the next state for the nth move. Since 0 is a valid index, -1 represents null (CONNECT4_NULL).

Each column is a potential move, so there are CONNECT4_WIDTH possible moves at any given state. Each move has a floating point score and a total number of playouts through that move. In my implementation, the search can also halt due to an overflow in a playout counter. The search can no longer be tracked in this representation, so it has to stop. This generally only happens when the game is nearly over and it’s grinding away on a small number of possibilities.

Note that the actual game state (piece positions) is not tracked in the node structure. That’s because it’s implicit. We know the state of the game at the root, and simulating the moves while descending the tree will keep track of the board state at the current node. That’s more memory savings.

The state itself is a pair of bitboards, one for each player. Each position on the grid gets a bit on each bitboard. The bitboard is very fast to manipulate, and win states are checked with just a handful of bit operations. My intention was to make playouts as fast as possible.

struct connect4_ai {
    uint64_t state[2];         // game state at root (bitboard)
    uint64_t rng[2];           // random number generator state
    uint32_t nodes_available;  // total number of nodes available
    uint32_t nodes_allocated;  // number of nodes in the tree
    uint32_t root;             // "pointer" to root node
    uint32_t free;             // "pointer" to free list
    int turn;                  // whose turn (0 or 1) at the root?

The nodes_available and nodes_allocated are not necessary for correctness nor speed. They’re useful for diagnostics and debugging.

All the functions that operate on these two structures are straightforward, except for connect4_playout, a recursive function which implements the bulk of MCTS. Depending on the state of the node it’s at, it does one of two things:

That’s pretty much all there is to it.


Yavalath is a board game invented by a computer program. It’s a pretty fascinating story. The depth and strategy are disproportionately deep relative to its dead simple rules: Get four in a row without first getting three in a row. The game revolves around forces moves.

The engine is structured almost identically to the Connect Four engine. It uses 32-bit indexes instead of pointers. The game state is a pair of bitboards, with end-game masks computed at compile time via metaprogramming. The AI allocates the tree from a single, massive buffer — multiple GBs in this case, dynamically scaled to the available physical memory. And the core MCTS function is nearly identical.

One important difference is that identical game states — states where the pieces on the board are the same, but the node was reached through a different series of moves — are coalesced into a single state in the tree. This state deduplication is done through a hash table. This saves on memory and allows multiple different paths through the game tree to share playouts. It comes at a cost of including the game state in the node (so it can be identified in the hash table) and reference counting the nodes (since they might have more than one parent).

Unfortunately the AI has blind spots, and once you learn to spot them it becomes easy to beat consistently. It can’t spot certain kinds of forced moves, so it always falls for the same tricks. The official Yavalath AI is slightly stronger than mine, but has a similar blindness. I think MCTS just isn’t quite a good fit for Yavalath.

The AI’s blindness is caused by shallow traps, a common problem for MCTS. It’s what makes MCTS a poor fit for Chess. A shallow trap is a branch in the game tree where the game will abruptly end in a small number of turns. If the random tree search doesn’t luckily stumble upon a trap during its random traversal, it can’t take it into account in its final decision. A skilled player will lead the game towards one of these traps, and the AI will blunder along, not realizing what’s happened until its too late.

I almost feel bad for it when this happens. If you watch the memory usage and number of playouts, once it falls into a trap, you’ll see it using almost no memory while performing a ton of playouts. It’s desperately, frantically searching for a way out of the trap. But it’s too late, little AI.

Another Tool in the Toolbelt

I’m really happy to have sunk a couple weekends into playing with MCTS. It’s not always a great fit, as seen with Yavalath, but it’s a really neat algorithm. Now that I’ve wrapped my head around it, I’ll be ready to use it should I run into an appropriate problem in the future.

My Journey with Touch Typing and Vim

Given the title, the publication date of this article is probably really confusing. This was deliberate.

Three weeks ago I made a conscious decision to improve my typing habits. You see, I had a dirty habit. Despite spending literally decades typing on a daily basis, I’ve been a weak typist. It wasn’t exactly finger pecking, nor did it require looking down at the keyboard as I typed, but rather a six-finger dance I developed organically over the years. My technique was optimized towards Emacs’ frequent use of CTRL and ALT combinations, avoiding most of the hand scrunching. It was fast enough to keep up with my thinking most of the time, but was ultimately limiting due to its poor accuracy. I was hitting the wrong keys far too often.

My prime motivation was to learn Vim — or, more specifically, to learn modal editing. Lots of people swear by it, including people whose opinions I hold in high regard. The modal editing community is without a doubt larger than the Emacs community, especially since, thanks to Viper and Evil, a subset of the Emacs community is also part of the modal editing community. There’s obviously something significantly valuable about it, and I wanted to understand what that was.

But I was a lousy typist who couldn’t hit the right keys often enough to make effective use of modal editing. I would need to learn touch typing first.

Touch typing

How would I learn? Well, the first search result for “online touch typing course” was Typing Club, so that’s what I went with. By the way, here’s my official review: “Good enough not to bother checking out the competition.” For a website it’s pretty much the ultimate compliment, but it’s not exactly the sort of thing you’d want to hear from your long-term partner.

My hard rule was that I would immediately abandon my old habits cold turkey. Poor typing is a bad habit just like smoking, minus the cancer and weakened sense of smell. It was vital that I unlearn all that old muscle memory. That included not just my six-finger dance, but also my NetHack muscle memory. NetHack uses “hjkl” for navigation just like Vim. The problem was that I’d spent a couple hundred hours in NetHack over the past decade with my index finger on “h”, not the proper home row location. It was disorienting to navigate around Vim initally, like riding a bicycle with inverted controls.

Based on reading other people’s accounts, I determined I’d need several days of introductory practice where I’d be utterly unproductive. I took a three-day weekend, starting my touch typing lessons on a Thursday evening. Boy, they weren’t kidding about it being slow going. It was a rough weekend. When checking in on my practice, my wife literally said she pitied me. Ouch.

By Monday I was at a level resembling a very slow touch typist. For the rest of the first week I followed all the lessons up through the number keys, never progressing past an exercise until I had exceeded the target speed with at least 90% accuracy. This was now enough to get me back on my feet for programming at a glacial, frustrating pace. Programming involves a lot more numbers and symbols than other kinds of typing, making that top row so important. For a programmer, it would probably be better for these lessons to be earlier in the series.

For that first week I mostly used Emacs while I was finding my feet (or finding my fingers?). That’s when I experienced first hand what all these non-Emacs people — people who I, until recently, considered to be unenlightened simpletons — had been complaining about all these years: Pressing CTRL and ALT key combinations from the home row is a real pain in in the ass! These complaints were suddenly making sense. I was already seeing the value of modal editing before I even started really learning Vim. It made me look forward to it even more.

During the second week of touch typing I went though Derek Wyatt’s Vim videos and learned my way around the :help system enough to bootstrap my Vim education. I then read through the user manual, practicing along the way. I’ll definitely have to pass through it a few more times to pick up all sorts of things that didn’t stick. This is one way that Emacs and Vim are a lot alike.

Update: Practical Vim: Edit Text at the Speed of Thought was recommended in the comments, and it’s certainly a better place to start than the Vim user manual. Unlike the manual, it’s opinionated and focuses on good habits, which is exactly what a newbie needs.

One of my rules when learning Vim was to resist the urge to remap keys. I’ve done it a lot with Emacs: “Hmm, that’s not very convenient. I’ll change it.” It means my Emacs configuration is fairly non-standard, and using Emacs without my configuration is like using an unfamiliar editor. This is both good and bad. The good is that I’ve truly changed Emacs to be my editor, suited just for me. The bad is that I’m extremely dependent on my configuration. What if there was a text editing emergency?

With Vim as a sort of secondary editor, I want to be able to fire it up unconfigured and continue to be nearly as productive. A pile of remappings would prohibit this. In my mind this is like a form of emergency preparedness. Other people stock up food and supplies. I’m preparing myself to sit at a strange machine without any of my configuration so that I can start the rewrite of the software lost in the disaster, so long as that machine has vi, cc, and make. If I can’t code in C, then what’s the point in surviving anyway?

The other reason is that I’m just learning. A different mapping might seem more appropriate, but what do I know at this point? It’s better to follow the beaten path at first, lest I form a bunch of bad habits again. Trust in the knowledge of the ancients.

Future directions

I am absolutely sticking with modal editing for the long term. I’m really enjoying it so far. At three weeks of touch typing and two weeks of modal editing, I’m around 80% caught back up with my old productivity speed, but this time I’ve got a lot more potential for improvement.

For now, Vim will continue taking over more and more of my text editing work. My last three articles were written in Vim. It’s really important to keep building proficiency. I still rely on Emacs for email and for syndication feeds, and that’s not changing any time soon. I also really like Magit as a Git interface. Plus I don’t want to abandon years of accumulated knowledge and leave the users of my various Emacs packages out to dry. Ultimately I believe will end up using Evil, to get what seems to be the best of both worlds: modal editing and Emacs’ rich extensibility.

How to Write Portable C Without Complicating Your Build

Suppose you’re writing a non-GUI C application intended to run on a number of operating systems: Linux, the various BSDs, macOS, classical unix, and perhaps even something as exotic as Windows. It might sound like a rather complicated problem. These operating systems have slightly different interfaces (or very different in one case), and they run different variants of the standard unix tools — a problem for portable builds.

With some up-front attention to detail, this is actually not terribly difficult. Unix-like systems are probably the least diverse and least buggy they’ve ever been. Writing portable code is really just a matter of coding to the standards and ignoring extensions unless absolutely necessary. Knowing what’s standard and what’s extension is the tricky part, but I’ll explain how to find this information.

You might be tempted to reach for an overly complicated solution such as GNU Autoconf. Sure, it creates a configure script with the familiar, conventional interface. This has real value. But do you really need to run a single-threaded gauntlet of hundreds of feature/bug tests for things that sometimes worked incorrectly in some weird unix variant back in the 1990s? On a machine with many cores (parallel build, -j), this may very well be the slowest part of the whole build process.

For example, the configure script for Emacs checks that the compiler supplies stdlib.h, string.h, and getenv — things that were standardized nearly 30 years ago. It also checks for a slew of POSIX functions that have been standard since 2001.

There’s a much easier solution: Document that the application requires, say, C99 and POSIX.1-2001. It’s the responsibility of the person building the application to supply these implementations, so there’s no reason to waste time testing for it.

How to code to the standards

Suppose there’s some function you want to use, but you’re not sure if it’s standard or an extension. Or maybe you don’t know what standard it comes from. Luckily the man pages document this stuff very well, especially on Linux. Check the friendly “CONFORMING TO” section. For example, look at getenv(3). Here’s what that section has to say:

    getenv(): SVr4, POSIX.1-2001, 4.3BSD, C89, C99.

    secure_getenv() is a GNU extension.

This says this function comes from the original C standard. It’s always available on anything that claims to be a C implementation. The man page also documents secure_getenv(), which is a GNU extension: to be avoided in anything intended to be portable.

What about sleep(3)?


This function isn’t part of standard C, but it’s available on any system claiming to implement POSIX.1-2001 (the POSIX standard from 2001). If the program needs to run on an operating system not implementing this POSIX standard (i.e. Windows), you’ll need to call an alternative function, probably inside a different #if .. #endif branch. More on this in a moment.

If you’re coding to POSIX, you must define the _POSIX_C_SOURCE feature test macro to the standard you intend to use prior to any system header includes:

A POSIX-conforming application should ensure that the feature test macro _POSIX_C_SOURCE is defined before inclusion of any header.

For example, to properly access POSIX.1-2001 functions in your application, define _POSIX_C_SOURCE to 200112L. With this defined, it’s safe to assume access to all of C and everything from that standard of POSIX. You can do this at the top of your sources, but I personally like the tidiness of a global config.h that gets included before everything.

How to create a portable build

So you’ve written clean, portable C to the standards. How do you build this application? The natural choice is make. It’s available everywhere and it’s part of POSIX.

Again, the tricky part is teasing apart the standard from the extension. I’m a long-time sinner in this regard, having far too often written Makefiles that depend on GNU Make extensions. This is a real pain when building programs on systems without the GNU utilities. I’ve been making amends (and finding some bugs as a result).

No implementation makes the division clear in its documentation, and especially don’t bother looking at the GNU Make manual. Your best resource is the standard itself. If you’re already familiar with make, coding to the standard is largely a matter of unlearning the various extensions you know.

Outside of some hacks, this means you don’t get conditionals (if, else, etc.). With some practice, both with sticking to portable code and writing portable Makefiles, you’ll find that you don’t really need them. Following the macro conventions will cover most situations. For example:

You don’t need to do anything weird with the assignments. The user invoking make can override them easily. For example, here’s part of a Makefile:

CC     = c99
CFLAGS = -Wall -Wextra -Os

But the user wants to use clang, and their system needs to explicitly link -lsocket (e.g. Solaris). The user can override the macro definitions on the command line:

$ make CC=clang LDLIBS=-lsocket

The same rules apply to the programs you invoke from the Makefile. Read the standards documents and ignore your system’s man pages as to avoid accidentally using an extension. It’s especially valuable to learn the Bourne shell language and avoid any accidental bashisms in your Makefiles and scripts. The dash shell is good for testing your scripts.

Makefiles conforming to the standard will, unfortunately, be more verbose than those taking advantage of a particular implementation. If you know how to code Bourne shell — which is not terribly difficult to learn — then you might even consider hand-writing a configure script to generate the Makefile (a la metaprogramming). This gives you a more flexible language with conditionals, and, being generated, redundancy in the Makefile no longer matters.

As someone who frequently dabbles with BSD systems, my life has gotten a lot easier since learning to write portable Makefiles and scripts.

But what about Windows

It’s the elephant in the room and I’ve avoided talking about it so far. If you want to build with Visual Studio’s command line tools — something I do on occasion — build portability goes out the window. Visual Studio has nmake.exe, which nearly conforms to POSIX make. However, without the standard unix utilities and with the completely foreign compiler interface for cl.exe, there’s absolutely no hope of writing a Makefile portable to this situation.

The nice alternative is MinGW(-w64) with MSYS or Cygwin supplying the unix utilities, though it has the problem of linking against msvcrt.dll. Another option is a separate Makefile dedicated to nmake.exe and the Visual Studio toolchain. Good luck defining a correctly working “clean” target with del.exe.

My preferred approach lately is an amalgamation build (as seen in Enchive): Carefully concatenate all the application’s sources into one giant source file. First concatenate all the headers in the right order, followed by all the C files. Use sed to remove and local includes. You can do this all on a unix system with the nice utilities, then point cl.exe at the amalgamation for the Visual Studio build. It’s not very useful for actual development (i.e. you don’t want to edit the amalgamation), but that’s what MinGW-w64 resolves.

What about all those POSIX functions? You’ll need to find Win32 replacements on MSDN. I prefer to do this is by abstracting those operating system calls. For example, compare POSIX sleep(3) and Win32 Sleep().

#if defined(_WIN32)
#include <windows.h>

my_sleep(int s)
    Sleep(s * 1000);  // TODO: handle overflow, maybe

#else /* __unix__ */
#include <unistd.h>

my_sleep(int s)
    sleep(s);  // TODO: fix signal interruption

Then the rest of the program calls my_sleep(). There’s another example in the OpenMP article with pwrite(2) and WriteFile(). This demonstrates that supporting a bunch of different unix-like systems is really easily, but introducing Windows portability adds a disproportionate amount of complexity.

Caveat: paths and filenames

There’s one major complication with filenames for applications portable to Windows. In the unix world, filenames are null-terminated bytestrings. Typically these are Unicode strings encoded as UTF-8, but it’s not necessarily so. The kernel just sees bytestrings. A bytestring doesn’t necessarily have a formal Unicode representation, which can be a problem for languages that want filenames to be Unicode strings (also).

On Windows, filenames are somewhere between UCS-2 and UTF-16, but end up being neither. They’re really null-terminated unsigned 16-bit integer arrays. It’s almost UTF-16 except that Windows allows unpaired surrogates. This means Windows filenames also don’t have a formal Unicode representation, but in a completely different way than unix. Some heroic efforts have gone into working around this issue.

As a result, it’s highly non-trivial to correctly support all possible filenames on both systems in the same program, especially when they’re passed as command line arguments.


The key points are:

  1. Document the standards your application requires and strictly stick to them.
  2. Ignore the vendor documentation if it doesn’t clearly delineate extensions.

This was all a discussion of non-GUI applications, and I didn’t really touch on libraries. Many libraries are simple to access in the build (just add it to LDLIBS), but some libraries — GUIs in particular — are particularly complicated to manage portably and will require a more complex solution (pkg-config, CMake, Autoconf, etc.).

Why I've Retired My PGP Keys and What's Replaced It

tl;dr: Enchive (rhymes with “archive”) has replaced my use of GnuPG.

Two weeks ago I tried to encrypt a tax document for archival and noticed my PGP keys had just expired. GnuPG had (correctly) forbidden the action, requiring that I first edit the key and extend the expiration date. Rather than do so, I decided to take this opportunity to retire my PGP keys for good. Over time I’ve come to view PGP as largely a failure — it never reached the critical mass, the tooling has always been problematic, and it’s now a dead end. The only thing it’s been successful at is signing Linux packages, and even there it could be replaced with something simpler and better.

I still have a use for PGP: encrypting sensitive files to myself for long term storage. I’ve also been using it to consistently to sign Git tags for software releases. However, very recently this lost its value, though I doubt anyone was verifying these signatures anyway. It’s never been useful for secure email, especially when most people use it incorrectly. I only need to find a replacement for archival encryption.

I could use an encrypted filesystem, but which do I use? I use LUKS to protect my laptop’s entire hard drive in the event of a theft, but for archival I want something a little more universal. Basically I want the following properties:

I couldn’t find anything that fit the bill, so I did exactly what you’re not supposed to do and rolled my own: Enchive. It was loosely inspired by OpenBSD’s signify. It has the tiny subset of PGP features that I need — using modern algorithms — plus one more feature I’ve always wanted: the ability to generate a keypair from a passphrase. This means I can reliably access my archive keypair anywhere without doing something strange like uploading my private keys onto the internet.

On Enchive

Here’s where I’d put the usual disclaimer about not using it for anything serious, blah blah blah. But really, I don’t care if anyone else uses Enchive. It exists just to scratch my own personal itch. If you have any doubts, don’t use it. I’m putting it out there in case anyone else is in the same boat. It would also be nice if any glaring flaws I may have missed were pointed out.

Not expecting it to be available as a nice package, I wanted to make it trivial to build Enchive anywhere I’d need it. Except for including stdint.h in exactly one place to get the correct integers for crypto, it’s written in straight C89. All the crypto libraries are embedded, and there are no external dependencies. There’s even an “amalgamation” build, so make isn’t required: just point your system’s cc at it and you’re done.


For encryption, Enchive uses Curve25519, ChaCha20, and HMAC-SHA256.

Rather than the prime-number-oriented RSA as used in classical PGP (yes, GPG 2 can do better), Curve25519 is used for the asymmetric cryptography role, using the relatively new elliptic curve cryptography. It’s stronger cryptography and the keys are much smaller. It’s a Diffie-Hellman function — an algorithm used to exchange cryptographic keys over a public channel — so files are encrypted by generating an ephemeral keypair and using this ephemeral keypair to perform a key exchange with the master keys. The ephemeral public key is included with the encrypted file and the ephemeral private key is discarded.

I used the “donna” implementation in Enchive. Despite being the hardest to understand (mathematically), this is the easiest to use. It’s literally just one function of two arguments to do everything.

Curve25519 only establishes the shared key, so next is the stream cipher ChaCha20. It’s keyed by the shared key to actually encrypt the data. This algorithm has the same author as Curve25519 (djb), so it’s natural to use these together. It’s really straightforward, so there’s not much to say about it.

For the Message Authentication Code (MAC), I chose HMAC-SHA256. It prevents anyone from modifying the message. Note: This doesn’t prevent anyone who knows the master public key from replacing the file wholesale. That would be solved with a digital signature, but this conflicts with my goal of encrypting files without the need of my secret key. The MAC goes at the end of the file, allowing arbitrarily large files to be encrypted single-pass as a stream.

There’s a little more to it (IV, etc.) and is described in detail in the README.


The first thing you’d do is generate a keypair. By default this is done from /dev/urandom, in which case you should immediately back them up. But if you’re like me, you’ll be using Enchive’s --derive (-d) feature to create it from a passphrase. In that case, the keys are backed up in your brain!

$ enchive keygen --derive
secret key passphrase:
secret key passphrase (repeat):
passphrase (empty for none):
passphrase (repeat):

The first prompt is for the secret key passphrase. This is converted into a Curve25519 keypair using an scrypt-like key derivation algorithm. The process requires 512MB of memory (to foil hardware-based attacks) and takes around 20 seconds.

The second passphrase (or the only one when --derive isn’t used), is the protection key passphrase. The secret key is encrypted with this passphrase to protect it at rest. You’ll need to enter it any time you decrypt a file. The key derivation step is less aggressive for this key, but you could also crank it up if you like.

At the end of this process you’ll have two new files under $XDG_CONFIG_DIR/enchive: (32 bytes) and enchive.sec (64 bytes). The first you can distribute anywhere you’d like to encrypt files; it’s not particularly sensitive. The second is needed to decrypt files.

To encrypt a file for archival:

$ enchive archive

No prompt for passphrase. This will create

To decrypt later:

$ enchive extract

If you’ve got many files to decrypt, entering your passphrase over and over would get tiresome, so Enchive includes a key agent that keeps the protection key in memory for a period of time (15 minutes by default). Enable it with the --agent flag (it may be enabled by default someday).

$ enchive --agent extract

Unlike ssh-agent and gpg-agent, there’s no need to start the agent ahead of time. It’s started on demand as needed and terminates after the timeout. It’s completely painless.

Both archive and extract operate stdin to stdout when no file is given.

Feature complete

As far as I’m concerned, Enchive is feature complete. It does everything I need, I don’t want it to do anything more, and at least two of us have already started putting it to use. The interface and file formats won’t change unless someone finds a rather significant flaw. There is some wiggle room to replace the algorithms in the future should Enchive have that sort of longevity.

OpenMP and pwrite()

The most common way I introduce multi-threading to small C programs is with OpenMP (Open Multi-Processing). It’s typically used as compiler pragmas to parallelize computationally expensive loops — iterations are processed by different threads in some arbitrary order.

Here’s an example that computes the frames of a video in parallel. Despite being computed out of order, each frame is written in order to a large buffer, then written to standard output all at once at the end.

size_t size = sizeof(struct frame) * num_frames;
struct frame *output = malloc(size);
float beta = DEFAULT_BETA;

/* schedule(dynamic, 1): treat the loop like a work queue */
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    float theta = compute_theta(i);
    compute_frame(&output[i], theta, beta);

write(STDOUT_FILENO, output, size);

Adding OpenMP to this program is much simpler than introducing low-level threading semantics with, say, Pthreads. With care, there’s often no need for explicit thread synchronization. It’s also fairly well supported by many vendors, even Microsoft (up to OpenMP 2.0), so a multi-threaded OpenMP program is quite portable without #ifdef.

There’s real value this pragma API: The above example would still compile and run correctly even when OpenMP isn’t available. The pragma is ignored and the program just uses a single core like it normally would. It’s a slick fallback.

When a program really does require synchronization there’s omp_lock_t (mutex lock) and the expected set of functions to operate on them. This doesn’t have the nice fallback, so I don’t like to use it. Instead, I prefer #pragma omp critical. It nicely maintains the OpenMP-unsupported fallback.

/* schedule(dynamic, 1): treat the loop like a work queue */
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    struct frame *frame = malloc(sizeof(*frame));
    float theta = compute_theta(i);
    compute_frame(frame, theta, beta);
    #pragma omp critical
        write(STDOUT_FILENO, frame, sizeof(*frame));

This would append the output to some output file in an arbitrary order. The critical section prevents interleaving of outputs.

There are a couple of problems with this example:

  1. Only one thread can write at a time. If the write takes too long, other threads will queue up behind the critical section and wait.

  2. The output frames will be out of order, which is probably inconvenient for consumers. If the output is seekable this can be solved with lseek(), but that only makes the critical section even more important.

There’s an easy fix for both, and eliminates the need for a critical section: POSIX pwrite().

ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

It’s like write() but has an offset parameter. Unlike lseek() followed by a write(), multiple threads and processes can, in parallel, safely write to the same file descriptor at different file offsets. The catch is that the output must be a file, not a pipe.

#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    size_t size = sizeof(struct frame);
    struct frame *frame = malloc(size);
    float theta = compute_theta(i);
    compute_frame(frame, theta, beta);
    pwrite(STDOUT_FILENO, frame, size, size * i);

There’s no critical section, the writes can interleave, and the output is in order.

If you’re concerned about standard output not being seekable (it often isn’t), keep in mind that it will work just fine when invoked like so:

$ ./compute_frames > frames.ppm


I talked about OpenMP being really portable, then used POSIX functions. Fortunately the Win32 WriteFile() function has an “overlapped” parameter that works just like pwrite(). Typically rather than call either directly, I’d wrap the write like so:

#ifdef _WIN32
#include <windows.h>

static int
write_frame(struct frame *f, int i)
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written;
    OVERLAPPED offset = {.Offset = sizeof(*f) * i};
    return WriteFile(out, f, sizeof(*f), &written, &offset);

#else /* POSIX */
#include <unistd.h>

static int
write_frame(struct frame *f, int i)
    size_t count = sizeof(*f);
    size_t offset = sizeof(*f) * i;
    return pwrite(STDOUT_FILENO, buf, count, offset) == count;

Except for switching to write_frame(), the OpenMP part remains untouched.

Real World Example

Here’s an example in a real program:


Notice because of pwrite() there’s no piping directly into ppmtoy4m:

$ ./julia > output.ppm
$ ppmtoy4m -F 60:1 < output.ppm > output.y4m
$ x264 -o output.mp4 output.y4m


Asynchronous Requests from Emacs Dynamic Modules

A few months ago I had a discussion with Vladimir Kazanov about his Orgfuse project: a Python script that exposes an Emacs Org-mode document as a FUSE filesystem. It permits other programs to navigate the structure of an Org-mode document through the standard filesystem APIs. I suggested that, with the new dynamic modules in Emacs 25, Emacs itself could serve a FUSE filesystem. In fact, support for FUSE services in general could be an package of his own.

So that’s what he did: Elfuse. It’s an old joke that Emacs is an operating system, and here it is handling system calls.

However, there’s a tricky problem to solve, an issue also present my joystick module. Both modules handle asynchronous events — filesystem requests or joystick events — but Emacs runs the event loop and owns the main thread. The external events somehow need to feed into the main event loop. It’s even more difficult with FUSE because FUSE also wants control of its own thread for its own event loop. This requires Elfuse to spawn a dedicated FUSE thread and negotiate a request/response hand-off.

When a filesystem request or joystick event arrives, how does Emacs know to handle it? The simple and obvious solution is to poll the module from a timer.

struct queue requests;

Frequest_next(emacs_env *env, ptrdiff_t n, emacs_value *args, void *p)
    emacs_value next = Qnil;
    if (queue_length(requests) > 0) {
        void *request = queue_pop(requests, env);
        next = env->make_user_ptr(env, fin_empty, request);
    return next;

And then ask Emacs to check the module every, say, 10ms:

(defun request--poll ()
  (let ((next (request-next)))
    (when next
      (request-handle next))))

(run-at-time 0 0.01 #'request--poll)

Blocking directly on the module’s event pump with Emacs’ thread would prevent Emacs from doing important things like, you know, being a text editor. The timer allows it to handle its own events uninterrupted. It gets the job done, but it’s far from perfect:

  1. It imposes an arbitrary latency to handling requests. Up to the poll period could pass before a request is handled.

  2. Polling the module 100 times per second is inefficient. Unless you really enjoy recharging your laptop, that’s no good.

The poll period is a sliding trade-off between latency and battery life. If only there was some mechanism to, ahem, signal the Emacs thread, informing it that a request is waiting…


Emacs Lisp programs can handle the POSIX SIGUSR1 and SIGUSR2 signals, which is exactly the mechanism we need. The interface is a “key” binding on special-event-map, the keymap that handles these kinds of events. When the signal arrives, Emacs queues it up for the main event loop.

(define-key special-event-map [sigusr1]
  (lambda ()
    (request-handle (request-next))))

The module blocks on its own thread on its own event pump. When a request arrives, it queues the request, rings the bell for Emacs to come handle it (raise()), and waits on a semaphore. For illustration purposes, assume the module reads requests from and writes responses to a file descriptor, like a socket.

int event_fd = /* ... */;
struct request request;
sem_init(&request.sem, 0, 0);

for (;;) {
    /* Blocking read for request event */
    read(event_fd, &request.event, sizeof(request.event));

    /* Put request on the queue */
    queue_push(requests, &request);
    raise(SIGUSR1);  // TODO: Should raise() go inside the lock?

    /* Wait for Emacs */
    while (sem_wait(&request.sem))

    /* Reply with Emacs' response */
    write(event_fd, &request.response, sizeof(request.response));

The sem_wait() is in a loop because signals will wake it up prematurely. In fact, it may even wake up due to its own signal on the line before. This is the only way this particular use of sem_wait() might fail, so there’s no need to check errno.

If there are multiple module threads making requests to the same global queue, the lock is necessary to protect the queue. The semaphore is only for blocking the thread until Emacs has finished writing its particular response. Each thread has its own semaphore.

When Emacs is done writing the response, it releases the module thread by incrementing the semaphore. It might look something like this:

Frequest_complete(emacs_env *env, ptrdiff_t n, emacs_value *args, void *p)
    struct request *request = env->get_user_ptr(env, args[0]);
    if (request)
    return Qnil;

The top-level handler dispatches to the specific request handler, calling request-complete above when it’s done.

(defun request-handle (next)
  (condition-case e
      (cl-ecase (request-type next)
        (:open  (request-handle-open  next))
        (:close (request-handle-close next))
        (:read  (request-handle-read  next)))
    (error (request-respond-as-error next e)))

This SIGUSR1+semaphore mechanism is roughly how Elfuse currently processes requests.


Windows doesn’t have signals. This isn’t a problem for Elfuse since Windows doesn’t have FUSE either. Nor does it matter for Joymacs since XInput isn’t event-driven and always requires polling. But someday someone will need this mechanism for a dynamic module on Windows.

Fortunately there’s a solution: input language change events, WM_INPUTLANGCHANGE. It’s also on special-event-map:

(define-key special-event-map [language-change]
  (lambda ()
    (request-process (request-next))))

Instead of raise() (or pthread_kill()), broadcast the window event with PostMessage(). Outside of invoking the language-change key binding, Emacs will ignore the event because WPARAM is 0 — it doesn’t belong to any particular window. We don’t really want to change the input language, after all.


Naturally you’ll also need to replace the POSIX threading primitives with the Windows versions (CreateThread(), CreateSemaphore(), etc.). With a bit of abstraction in the right places, it should be pretty easy to support both POSIX and Windows in these asynchronous dynamic module events.

How to Write Fast(er) Emacs Lisp

Not everything written in Emacs Lisp needs to be fast. Most of Emacs itself — around 82% — is written in Emacs Lisp because those parts are generally not performance-critical. Otherwise these functions would be built-ins written in C. Extensions to Emacs don’t have a choice and — outside of a few exceptions like dynamic modules and inferior processes — must be written in Emacs Lisp, including their performance-critical bits. Common performance hot spots are automatic indentation, AST parsing, and interactive completion.

Here are 5 guidelines, each very specific to Emacs Lisp, that will result in faster code. The non-intrusive guidelines could be applied at all times as a matter of style — choosing one equally expressive and maintainable form over another just because it performs better.

There’s one caveat: These guidelines are focused on Emacs 25.1 and “nearby” versions. Emacs is constantly evolving. Changes to the virtual machine and byte-code compiler may transform currently-slow expressions into fast code, obsoleting some of these guidelines. In the future I’ll add notes to this article for anything that changes.

(1) Use lexical scope

This guideline refers to the following being the first line of every Emacs Lisp source file you write:

;;; -*- lexical-binding: t; -*-

This point is worth mentioning again and again. Not only will your code be more correct, it will be measurably faster. Dynamic scope is still opt-in through the explicit use of special variables, so there’s absolutely no reason not to be using lexical scope. If you’ve written clean, dynamic scope code, then switching to lexical scope won’t have any effect on its behavior.

Along similar lines, special variables are a lot slower than local, lexical variables. Only use them when necessary.

(2) Prefer built-in functions

Built-in functions are written in C and are, as expected, significantly faster than the equivalent written in Emacs Lisp. Complete as much work as possible inside built-in functions, even if it might mean taking more conceptual steps overall.

For example, what’s the fastest way to accumulate a list of items? That is, new items go on the tail but, for algorithm reasons, the list must be constructed from the head.

You might be tempted to keep track of the tail of the list, appending new elements directly to the tail with setcdr (via setf below).

(defun fib-track-tail (n)
  (let* ((a 0)
         (b 1)
         (head (list 1))
         (tail head))
    (dotimes (_ n head)
      (psetf a b
             b (+ a b))
      (setf (cdr tail) (list b)
            tail (cdr tail)))))

(fib-track-tail 8)
;; => (1 1 2 3 5 8 13 21 34)

Actually, it’s much faster to construct the list in reverse, then destructively reverse it at the end.

(defun fib-nreverse (n)
  (let* ((a 0)
         (b 1)
         (list (list 1)))
    (dotimes (_ n (nreverse list))
      (psetf a b
             b (+ a b))
      (push b list))))

It might not look it, but nreverse is very fast. Not only is it a built-in, it’s got its own opcode. Using push in a loop, then finishing with nreverse is the canonical and fastest way to accumulate a list of items.

In fib-track-tail, the added complexity of tracking the tail in Emacs Lisp is much slower than zipping over the entire list a second time in C.

(3) Avoid unnecessary lambda functions

I’m talking about mapcar and friends.

;; Slower
(defun expt-list (list e)
  (mapcar (lambda (x) (expt x e)) list))

Listen, I know you love dash.el and higher order functions, but this habit ain’t cheap. The byte-code compiler does not know how to inline these lambdas, so there’s an additional per-element function call overhead.

Worse, if you’re using lexical scope like I told you, the above example forms a closure over e. This means a new function object is created (e.g. make-byte-code) each time expt-list is called. To be clear, I don’t mean that the lambda is recompiled each time — the same byte-code string is shared between all instances of the same lambda. A unique function vector (#[...]) and constants vector are allocated and initialized each time expt-list is invoked.

Related mini-guideline: Don’t create any more garbage than strictly necessary in performance-critical code.

Compare to an implementation with an explicit loop, using the nreverse list-accumulation technique.

(defun expt-list-fast (list e)
  (let ((result ()))
    (dolist (x list (nreverse result))
      (push (expt x e) result))))

This is the fastest possible definition for this function, and it’s what you need to use in performance-critical code.

Personally I prefer the list comprehension approach, using cl-loop from cl-lib.

(defun expt-list-fast (list e)
  (cl-loop for x in list
           collect (expt x e)))

The cl-loop macro will expand into essentially the previous definition, making them practically equivalent. It takes some getting used to, but writing efficient loops is a whole lot less tedious with cl-loop.

In Emacs 24.4 and earlier, catch/throw is implemented by converting the body of the catch into a lambda function and calling it. If code inside the catch accesses a variable outside the catch (very likely), then, in lexical scope, it turns into a closure, resulting in the garbage function object like before.

In Emacs 24.5 and later, the byte-code compiler uses a new opcode, pushcatch. It’s a whole lot more efficient, and there’s no longer a reason to shy away from catch/throw in performance-critical code. This is important because it’s often the only way to perform an early bailout.

(4) Prefer using functions with dedicated opcodes

When following the guideline about using built-in functions, you might have several to pick from. Some built-in functions have dedicated virtual machine opcodes, making them much faster to invoke. Prefer these functions when possible.

How can you tell when a function has an assigned opcode? Take a peek at the byte-defop listings in bytecomp.el. Optimization often involves getting into the weeds, so don’t be shy.

For example, the assq and assoc functions search for a matching key in an association list (alist). Both are built-in functions, and the only difference is that the former compares keys with eq (e.g. symbol or integer keys) and the latter with equal (typically string keys). The difference in performance between eq and equal isn’t as important as another factor: assq has its own opcode (158).

This means in performance-critical code you should prefer assq, perhaps even going as far as restructuring your alists specifically to have eq keys. That last step is probably a trade-off, which means you’ll want to make some benchmarks to help with that decision.

Another example is eq, =, eql, and equal. Some macros and functions use eql, especially cl-lib which inherits eql as a default from Common Lisp. Take cl-case, which is like switch from the C family of languages. It compares elements with eql.

(defun op-apply (op a b)
  (cl-case op
    (:norm (+ (* a a) (* b b)))
    (:disp (abs (- a b)))
    (:isin (/ b (sin a)))))

The cl-case expands into a cond. Since Emacs byte-code lacks support for jump tables, there’s not much room for cleverness.

(defun op-apply (op a b)
   ((eql op :norm) (+ (* a a) (* b b)))
   ((eql op :disp) (abs (- a b)))
   ((eql op :isin) (/ b (sin a)))))

It turns out eql is pretty much always the worst choice for cl-case. Of the four equality functions I listed, the only one lacking an opcode is eql. A faster definition would use eq. (In theory, cl-case could have done this itself because it knows all the keys are symbols.)

(defun op-apply (op a b)
   ((eq op :norm) (+ (* a a) (* b b)))
   ((eq op :disp) (abs (- a b)))
   ((eq op :isin) (/ b (sin a)))))

Fortunately eq can safely compare integers in Emacs Lisp. You only need eql when comparing symbols, integers, and floats all at once, which is unusual.

(5) Unroll loops using and/or

Consider the following function which checks its argument against a list of numbers, bailing out on the first match. I used % instead of mod since the former has an opcode (166) and the latter does not.

(defun detect (x)
  (catch 'found
    (dolist (f '(2 3 5 7 11 13 17 19 23 29 31))
      (when (= 0 (% x f))
        (throw 'found f)))))

The byte-code compiler doesn’t know how to unroll loops. Fortunately that’s something we can do for ourselves using and and or. The compiler will turn this into clean, efficient jumps in the byte-code.

(defun detect-unrolled (x)
  (or (and (= 0 (% x 2)) 2)
      (and (= 0 (% x 3)) 3)
      (and (= 0 (% x 5)) 5)
      (and (= 0 (% x 7)) 7)
      (and (= 0 (% x 11)) 11)
      (and (= 0 (% x 13)) 13)
      (and (= 0 (% x 17)) 17)
      (and (= 0 (% x 19)) 19)
      (and (= 0 (% x 23)) 23)
      (and (= 0 (% x 29)) 29)
      (and (= 0 (% x 31)) 31)))

In Emacs 24.4 and earlier with the old-fashioned lambda-based catch, the unrolled definition is seven times faster. With the faster pushcatch-based catch it’s about twice as fast. This means the loop overhead accounts for about half the work of the first definition of this function.

Update: It was pointed out in the comments that this particular example is equivalent to a cond. That’s literally true all the way down to the byte-code, and it would be a clearer way to express the unrolled code. In real code it’s often not quite equivalent.

Unlike some of the other guidelines, this is certainly something you’d only want to do in code you know for sure is performance-critical. Maintaining unrolled code is tedious and error-prone.

I’ve had the most success with this approach by not by unrolling these loops myself, but by using a macro, or similar, to generate the unrolled form.

(defmacro with-detect (var list)
  (cl-loop for e in list
           collect `(and (= 0 (% ,var ,e)) ,e) into conditions
           finally return `(or ,@conditions)))

(defun detect-unrolled (x)
  (with-detect x (2 3 5 7 11 13 17 19 23 29 31)))

How can I find more optimization opportunities myself?

Use M-x disassemble to inspect the byte-code for your own hot spots. Observe how the byte-code changes in response to changes in your functions. Take note of the sorts of forms that allow the byte-code compiler to produce the best code, and then exploit it where you can.

Manual Control Flow Guard in C

Recent versions of Windows have a new exploit mitigation feature called Control Flow Guard (CFG). Before an indirect function call — e.g. function pointers and virtual functions — the target address checked against a table of valid call addresses. If the address isn’t the entry point of a known function, then the program is aborted.

If an application has a buffer overflow vulnerability, an attacker may use it to overwrite a function pointer and, by the call through that pointer, control the execution flow of the program. This is one way to initiate a Return Oriented Programming (ROP) attack, where the attacker constructs a chain of gadget addresses — a gadget being a couple of instructions followed by a return instruction, all in the original program — using the indirect call as the starting point. The execution then flows from gadget to gadget so that the program does what the attacker wants it to do, all without the attacker supplying any code.

The two most widely practiced ROP attack mitigation techniques today are Address Space Layout Randomization (ASLR) and stack protectors. The former randomizes the base address of executable images (programs, shared libraries) so that process memory layout is unpredictable to the attacker. The addresses in the ROP attack chain depend on the run-time memory layout, so the attacker must also find and exploit an information leak to bypass ASLR.

For stack protectors, the compiler allocates a canary on the stack above other stack allocations and sets the canary to a per-thread random value. If a buffer overflows to overwrite the function return pointer, the canary value will also be overwritten. Before the function returns by the return pointer, it checks the canary. If the canary doesn’t match the known value, the program is aborted.

CFG works similarly — performing a check prior to passing control to the address in a pointer — except that instead of checking a canary, it checks the target address itself. This is a lot more sophisticated, and, unlike a stack canary, essentially requires coordination by the platform. The check must be informed on all valid call targets, whether from the main program or from shared libraries.

While not (yet?) widely deployed, a worthy mention is Clang’s SafeStack. Each thread gets two stacks: a “safe stack” for return pointers and other safely-accessed values, and an “unsafe stack” for buffers and such. Buffer overflows will corrupt other buffers but will not overwrite return pointers, limiting the effect of their damage.

An exploit example

Consider this trivial C program, demo.c:

    char name[8];
    printf("Hello, %s.\n", name);
    return 0;

It reads a name into a buffer and prints it back out with a greeting. While trivial, it’s far from innocent. That naive call to gets() doesn’t check the bounds of the buffer, introducing an exploitable buffer overflow. It’s so obvious that both the compiler and linker will yell about it.

For simplicity, suppose the program also contains a dangerous function.

    puts("**** GO BOOM! ****");

The attacker can use the buffer overflow to call this dangerous function.

To make this attack simpler for the sake of the article, assume the program isn’t using ASLR (e.g. without -fpie/-pie, or with -fno-pie/-no-pie). For this particular example, I’ll also explicitly disable buffer overflow protections (e.g. _FORTIFY_SOURCE and stack protectors).

$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fno-stack-protector \
      -o demo demo.c

First, find the address of self_destruct().

$ readelf -a demo | grep self_destruct
46: 00000000004005c5  10 FUNC  GLOBAL DEFAULT 13 self_destruct

This is on x86-64, so it’s a 64-bit address. The size of the name buffer is 8 bytes, and peeking at the assembly I see an extra 8 bytes allocated above, so there’s 16 bytes to fill, then 8 bytes to overwrite the return pointer with the address of self_destruct.

$ echo -ne 'xxxxxxxxyyyyyyyy\xc5\x05\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
Hello, xxxxxxxxyyyyyyyy?@.
**** GO BOOM! ****
Segmentation fault

With this input I’ve successfully exploited the buffer overflow to divert control to self_destruct(). When main tries to return into libc, it instead jumps to the dangerous function, and then crashes when that function tries to return — though, presumably, the system would have self-destructed already. Turning on the stack protector stops this exploit.

$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fstack-protector \
      -o demo demo.c
$ ./demo < boom
Hello, xxxxxxxxaaaaaaaa?@.
*** stack smashing detected ***: ./demo terminated
======= Backtrace: =========
... lots of backtrace stuff ...

The stack protector successfully blocks the exploit. To get around this, I’d have to either guess the canary value or discover an information leak that reveals it.

The stack protector transformed the program into something that looks like the following:

    long __canary = __get_thread_canary();
    char name[8];
    printf("Hello, %s.\n", name);
    if (__canary != __get_thread_canary())
    return 0;

However, it’s not actually possible to implement the stack protector within C. Buffer overflows are undefined behavior, and a canary is only affected by a buffer overflow, allowing the compiler to optimize it away.

Function pointers and virtual functions

After the attacker successfully self-destructed the last computer, upper management has mandated password checks before all self-destruction procedures. Here’s what it looks like now:

self_destruct(char *password)
    if (strcmp(password, "12345") == 0)
        puts("**** GO BOOM! ****");

The password is hardcoded, and it’s the kind of thing an idiot would have on his luggage, but assume it’s actually unknown to the attacker. Especially since, as I’ll show shortly, it won’t matter. Upper management has also mandated stack protectors, so assume that’s enabled from here on.

Additionally, the program has evolved a bit, and now uses a function pointer for polymorphism.

struct greeter {
    char name[8];
    void (*greet)(struct greeter *);

greet_hello(struct greeter *g)
    printf("Hello, %s.\n", g->name);

greet_aloha(struct greeter *g)
    printf("Aloha, %s.\n", g->name);

There’s now a greeter object and the function pointer makes its behavior polymorphic. Think of it as a hand-coded virtual function for C. Here’s the new (contrived) main:

    struct greeter greeter = {.greet = greet_hello};
    return 0;

(In a real program, something else provides greeter and picks its own function pointer for greet.)

Rather than overwriting the return pointer, the attacker has the opportunity to overwrite the function pointer on the struct. Let’s reconstruct the exploit like before.

$ readelf -a demo | grep self_destruct
54: 00000000004006a5  10 FUNC  GLOBAL DEFAULT  13 self_destruct

We don’t know the password, but we do know (from peeking at the disassembly) that the password check is 16 bytes. The attack should instead jump 16 bytes into the function, skipping over the check (0x4006a5 + 16 = 0x4006b5).

$ echo -ne 'xxxxxxxx\xb5\x06\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
**** GO BOOM! ****

Neither the stack protector nor the password were of any help. The stack protector only protects the return pointer, not the function pointer on the struct.

This is where the Control Flow Guard comes into play. With CFG enabled, the compiler inserts a check before calling the greet() function pointer. It must point to the beginning of a known function, otherwise it will abort just like the stack protector. Since the middle of self_destruct() isn’t the beginning of a function, it would abort if this exploit is attempted.

However, I’m on Linux and there’s no CFG on Linux (yet?). So I’ll implement it myself, with manual checks.

Function address bitmap

As described in the PDF linked at the top of this article, CFG on Windows is implemented using a bitmap. Each bit in the bitmap represents 8 bytes of memory. If those 8 bytes contains the beginning of a function, the bit will be set to one. Checking a pointer means checking its associated bit in the bitmap.

For my CFG, I’ve decided to keep the same 8-byte resolution: the bottom three bits of the target address will be dropped. The next 24 bits will be used to index into the bitmap. All other bits in the pointer will be ignored. A 24-bit bit index means the bitmap will only be 2MB.

These 24 bits is perfectly sufficient for 32-bit systems, but it means on 64-bit systems there may be false positives: some addresses will not represent the start of a function, but will have their bit set to 1. This is acceptable, especially because only functions known to be targets of indirect calls will be registered in the table, reducing the false positive rate.

Note: Relying on the bits of a pointer cast to an integer is unspecified and isn’t portable, but this implementation will work fine anywhere I would care to use it.

Here are the CFG parameters. I’ve made them macros so that they can easily be tuned at compile-time. The cfg_bits is the integer type backing the bitmap array. The CFG_RESOLUTION is the number of bits dropped, so “3” is a granularity of 8 bytes.

typedef unsigned long cfg_bits;
#define CFG_BITS        24

Given a function pointer f, this macro extracts the bitmap index.

#define CFG_INDEX(f) \
    (((uintptr_t)f >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))

The CFG bitmap is just an array of integers. Zero it to initialize.

struct cfg {
    cfg_bits bitmap[(1UL << CFG_BITS) / (sizeof(cfg_bits) * CHAR_BIT)];

Functions are manually registered in the bitmap using cfg_register().

cfg_register(struct cfg *cfg, void *f)
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    cfg->bitmap[i / z] |= 1UL << (i % z);

Because functions are registered at run-time, it’s fully compatible with ASLR. If ASLR is enabled, the bitmap will be a little different each run. On the same note, it may be worth XORing each bitmap element with a random, run-time value — along the same lines as the stack canary value — to make it harder for an attacker to manipulate the bitmap should he get the ability to overwrite it by a vulnerability. Alternatively the bitmap could be switched to read-only (e.g. mprotect()) once everything is registered.

And finally, the check function, used immediately before indirect calls. It ensures f was previously passed to cfg_register() (except for false positives, as discussed). Since it will be invoked often, it needs to be fast and simple.

cfg_check(struct cfg *cfg, void *f)
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    if (!((cfg->bitmap[i / z] >> (i % z)) & 1))

And that’s it! Now augment main to make use of it:

struct cfg cfg;

    cfg_register(&cfg, self_destruct);  // to prove this works
    cfg_register(&cfg, greet_hello);
    cfg_register(&cfg, greet_aloha);

    struct greeter greeter = {.greet = greet_hello};
    cfg_check(&cfg, greeter.greet);
    return 0;

And now attempting the exploit:

$ ./demo < boom

Normally self_destruct() wouldn’t be registered since it’s not a legitimate target of an indirect call, but the exploit still didn’t work because it called into the middle of self_destruct(), which isn’t a valid address in the bitmap. The check aborts the program before it can be exploited.

In a real application I would have a global cfg bitmap for the whole program, and define cfg_check() in a header as an inline function.

Despite being possible implement in straight C without the help of the toolchain, it would be far less cumbersome and error-prone to let the compiler and platform handle Control Flow Guard. That’s the right place to implement it.

Update: Ted Unangst pointed out OpenBSD performing a similar check in its mbuf library. Instead of a bitmap, the function pointer is replaced with an index into an array of registered function pointers. That approach is cleaner, more efficient, completely portable, and has no false positives.

null program

Chris Wellons