Stack Clashing for Fun and Profit

Stack clashing has been in the news lately due to some recently discovered vulnerabilities along with proof-of-concept exploits. As the announcement itself notes, this is not a new issue, though this appears to be the first time it’s been given this particular name. I do know of one “good” use of stack clashing, where it’s used for something productive rather than as part of an attack. In this article I’ll explain how it works.

You can find the complete code for this article here, ready to run:

But first, what is a stack clash? Here’s a rough picture of the typical way process memory is laid out. The stack starts at a high memory address and grows downwards. Code and static data sit at low memory, with a brk pointer growing upward to make small allocations. In the middle is the heap, where large allocations and memory mappings take place.

Below the stack is a slim guard page that divides the stack and the region of memory reserved for the heap. Reading or writing to that memory will trap, causing the program to crash or some special action to be taken. The goal is to prevent the stack from growing into the heap, which could cause all sorts of trouble, like security issues.

The problem is that this thin guard page isn’t enough. It’s possible to put a large allocation on the stack, never read or write to it, and completely skip over the guard page, such that the heap and stack overlap without detection.

Once this happens, writes into the heap will change memory on the stack and vice versa. If an attacker can cause the program to make such a large allocation on the stack, then legitimate writes into memory on the heap can manipulate local variables or return pointers, changing the program’s control flow. This can bypass buffer overflow protections, such as stack canaries.
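
To make that concrete, here’s a minimal sketch of the mechanism. It’s my own illustration, not one of the published proofs of concept, and it assumes n is attacker-influenced and larger than the guard page.

#include <stddef.h>

/* If n spans the guard page, the unprobed VLA leaps right over it. */
void
vulnerable(size_t n)
{
    char buf[n];  /* stack pointer jumps past the guard page */
    buf[0] = 1;   /* lowest address of buf: possibly heap memory now */
}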

Binary trees and coroutines

Now, I’m going to abruptly change topics to discuss binary search trees. We’ll get back to stack clash in a bit. Suppose we have a binary tree which we would like to iterate depth-first. For this demonstration, here’s the C interface to the binary tree.

struct tree {
    struct tree *left;
    struct tree *right;
    char *key;
    char *value;
};

void  tree_insert(struct tree **, char *k, char *v);
char *tree_find(struct tree *, char *k);
void  tree_visit(struct tree *, void (*f)(char *, char *));
void  tree_destroy(struct tree *);

An empty tree is the NULL pointer, hence the double-pointer for insert. In the demonstration it’s an unbalanced search tree, but this could very well be a balanced search tree with the addition of another field on the structure.
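
As an aside, a sketch of the unbalanced insert might look like the following. It isn’t necessarily the demonstration’s exact code (duplicate keys and allocation failure are ignored), but it shows why the double pointer is needed: an empty (NULL) tree gets overwritten in place.

#include <stdlib.h>
#include <string.h>

void
tree_insert(struct tree **t, char *k, char *v)
{
    if (!*t) {
        /* empty subtree: overwrite the NULL pointer in place */
        struct tree *n = malloc(sizeof(*n));
        n->left = n->right = 0;
        n->key = k;
        n->value = v;
        *t = n;
    } else if (strcmp(k, (*t)->key) < 0) {
        tree_insert(&(*t)->left, k, v);
    } else {
        tree_insert(&(*t)->right, k, v);
    }
}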

For the traversal, first visit the root node, then traverse its left tree, and finally traverse its right tree. It makes for a simple, recursive definition — the sort of thing you’d teach a beginner. Here’s a definition that accepts a callback, which the caller will use to visit each key/value in the tree. This really is as simple as it gets.

void
tree_visit(struct tree *t, void (*f)(char *, char *))
{
    if (t) {
        f(t->key, t->value);
        tree_visit(t->left, f);
        tree_visit(t->right, f);
    }
}

Unfortunately this isn’t so convenient for the caller, who has to split off a callback function that lacks context, then hand over control to the traversal function.

void
printer(char *k, char *v)
{
    printf("%s = %s\n", k, v);
}

void
print_tree(struct tree *tree)
{
    tree_visit(tree, printer);
}

Usually it’s much nicer for the caller if instead it’s provided an iterator, which the caller can invoke at will. Here’s an interface for it, just two functions.

struct tree_it *tree_iterator(struct tree *);
int             tree_next(struct tree_it *, char **k, char **v);

The first constructs an iterator object, and the second one visits a key/value pair each time it’s called. It returns 0 when traversal is complete, automatically freeing any resources associated with the iterator.

The caller now looks like this:

    char *k, *v;
    struct tree_it *it = tree_iterator(tree);
    while (tree_next(it, &k, &v))
        printf("%s = %s\n", k, v);

Notice I haven’t defined struct tree_it. That’s because I’ve got four different implementations, each taking a different approach. The last one will use stack clashing.

Manual State Tracking

With just the standard facilities provided by C, there’s some manual bookkeeping that has to take place in order to convert the recursive definition into an iterator. Depth-first traversal is a stack-oriented process, and with recursion the stack is implicit in the call stack. As an iterator, the traversal stack needs to be managed explicitly. The iterator needs to keep track of the path it took so that it can backtrack, which means keeping track of parent nodes as well as which branch was taken.

Here’s my little implementation, which, to keep things simple, has a hard depth limit of 32. Its structure definition includes a stack of node pointers, and 2 bits of information per visited node, stored across a 64-bit integer.

struct tree_it {
    struct tree *stack[32];
    unsigned long long state;
    int nstack;
};

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->stack[0] = t;
    it->state = 0;
    it->nstack = 1;
    return it;
}

The 2 bits track three different states for each visited node:

  1. Visit the current node
  2. Traverse the left tree
  3. Traverse the right tree

It works out to the following. Don’t worry too much about trying to understand how this works. My point is to demonstrate that converting the recursive definition into an iterator complicates the implementation.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    while (it->nstack) {
        int shift = (it->nstack - 1) * 2;
        int state = 3u & (it->state >> shift);
        struct tree *t = it->stack[it->nstack - 1];
        it->state += 1ull << shift;
        switch (state) {
            case 0:
                *k = t->key;
                *v = t->value;
                if (t->left) {
                    it->stack[it->nstack++] = t->left;
                    it->state &= ~(3ull << (shift + 2));
                }
                return 1;
            case 1:
                if (t->right) {
                    it->stack[it->nstack++] = t->right;
                    it->state &= ~(3ull << (shift + 2));
                }
                break;
            case 2:
                it->nstack--;
                break;
        }
    }
    free(it);
    return 0;
}

Wouldn’t it be nice to keep both the recursive definition while also getting an iterator? There’s an exact solution to that: coroutines.

Coroutines

C doesn’t come with coroutines, but there are a number of libraries available. We can also build our own coroutines. One way to do that is with user contexts (<ucontext.h>) provided by the X/Open System Interfaces Extension (XSI), an extension to POSIX. This set of functions allows programs to create their own call stacks and switch between them. That’s the key ingredient for coroutines. Caveat: These functions aren’t widely available, and probably shouldn’t be used in new code.

Here’s my iterator structure definition.

#define _XOPEN_SOURCE 600
#include <ucontext.h>

struct tree_it {
    char *k;
    char *v;
    ucontext_t coroutine;
    ucontext_t yield;
};

It needs one context for the original stack and one context for the iterator’s stack. Each time the iterator is invoked, the program will switch to the other stack, find the next value, then switch back. This process is called yielding. Values are passed between contexts using the k (key) and v (value) fields on the iterator.

Before I get into initialization, here’s the actual traversal coroutine. It’s nearly the same as the original recursive definition except for the swapcontext(). This is the yield, pausing execution and sending control back to the caller. The current context is saved in the first argument, and the second argument becomes the current context.

static void
coroutine(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        swapcontext(&it->coroutine, &it->yield);
        coroutine(t->left, it);
        coroutine(t->right, it);
    }
}

While the actual traversal is simple again, initialization is more complicated. The first problem is that there’s no way to pass pointer arguments to the coroutine. Technically only int arguments are permitted. (All the online tutorials get this wrong.) To work around this problem, I smuggle the arguments in as global variables. This would cause problems should two different threads try to create iterators at the same time, even on different trees.

static struct tree *tree_arg;
static struct tree_it *tree_it_arg;

static void
coroutine_init(void)
{
    coroutine(tree_arg, tree_it_arg);
}

The stack has to be allocated manually, which I do with a call to malloc(). Nothing fancy is needed, though this means the new stack won’t have a guard page. For the stack size, I use the suggested value of SIGSTKSZ. The makecontext() function is what creates the new context from scratch, but the new context must first be initialized with getcontext(), even though that particular snapshot won’t actually be used.

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->coroutine.uc_stack.ss_sp = malloc(SIGSTKSZ);
    it->coroutine.uc_stack.ss_size = SIGSTKSZ;
    it->coroutine.uc_link = &it->yield;
    getcontext(&it->coroutine);
    makecontext(&it->coroutine, coroutine_init, 0);
    tree_arg = t;
    tree_it_arg = it;
    return it;
}

Notice I gave it a function pointer, a lot like I’m starting a new thread. This is no coincidence. There’s a lot of similarity between coroutines and multiple threads, as you’ll soon see.

Finally the iterator function itself. Since NULL isn’t a valid key, it initializes the key to NULL before yielding to the iterator context. If the iterator has no more nodes to visit, it doesn’t set the key, which can be detected when control returns.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    swapcontext(&it->yield, &it->coroutine);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        free(it->coroutine.uc_stack.ss_sp);
        free(it);
        return 0;
    }
}

That’s all it takes to create and operate a coroutine in C, provided you’re on a system with these XSI extensions.

Semaphores

Instead of a coroutine, we could just use actual threads and a couple of semaphores to synchronize them. This is a heavy implementation and also probably shouldn’t be used in practice, but at least it’s fully portable.

Here’s the structure definition:

struct tree_it {
    struct tree *t;
    char *k;
    char *v;
    sem_t visitor;
    sem_t main;
    pthread_t thread;
};

The main thread will wait on one semaphore and the iterator thread will wait on the other. This should sound very familiar.

The actual traversal function looks the same, but with sem_post() and sem_wait() as the yield.

static void
visit(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        sem_post(&it->main);
        sem_wait(&it->visitor);
        visit(t->left, it);
        visit(t->right, it);
    }
}

There’s a separate function to initialize the iterator context again.

static void *
thread_entrance(void *arg)
{
    struct tree_it *it = arg;
    sem_wait(&it->visitor);
    visit(it->t, it);
    sem_post(&it->main);
    return 0;
}

Creating the iterator only requires initializing the semaphores and creating the thread:

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->t = t;
    sem_init(&it->visitor, 0, 0);
    sem_init(&it->main, 0, 0);
    pthread_create(&it->thread, 0, thread_entrance, it);
    return it;
}

The iterator function looks just like the coroutine version.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    sem_post(&it->visitor);
    sem_wait(&it->main);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        pthread_join(it->thread, 0);
        sem_destroy(&it->main);
        sem_destroy(&it->visitor);
        free(it);
        return 0;
    }
}

Overall, this is almost identical to the coroutine version.

Coroutines using stack clashing

Finally I can tie this back into the topic at hand. Without either XSI extensions or Pthreads, we can (usually) create coroutines by abusing setjmp() and longjmp(). Technically this violates two of C’s rules and relies on undefined behavior, but it generally works. This is not my own invention, and it dates back to at least 2010.

From the very beginning, C has provided a crude “exception” mechanism that allows the stack to be abruptly unwound back to a previous state. It’s a sort of non-local goto. Call setjmp() to capture an opaque jmp_buf object to be used in the future. This function returns 0 the first time. Hand that jmp_buf to longjmp() later, even in a different function, and setjmp() will return again, this time with a non-zero value.
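
Here’s a tiny, standalone illustration of that round trip, showing just the mechanism by itself rather than any of the iterator code:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void
fail(void)
{
    longjmp(env, 1);  /* unwind back to the setjmp() in main() */
}

int
main(void)
{
    if (!setjmp(env)) {
        puts("setjmp() returned 0 (direct call)");
        fail();
    } else {
        puts("setjmp() returned non-zero (via longjmp())");
    }
    return 0;
}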

It’s technically unsuitable for coroutines because the jump is a one-way trip. The unwound stack invalidates any jmp_buf that was created after the target of the jump. In practice, though, you can still use these jumps, which is one rule being broken.

That’s where stack clashing comes into play. In order for it to be a proper coroutine, it needs to have its own stack. But how can we do that with these primitive C utilities? Extend the stack to overlap the heap, call setjmp() to capture a coroutine on it, then return. Generally we can get away with using longjmp() to return to this heap-allocated stack.

Here’s my iterator definition for this one. Like the XSI context struct, this has two jmp_buf “contexts.” The stack holds the iterator’s stack buffer so that it can be freed, and the gap field will be used to prevent the optimizer from spoiling our plans.

struct tree_it {
    char *k;
    char *v;
    char *stack;
    volatile char *gap;
    jmp_buf coroutine;
    jmp_buf yield;
};

The coroutine looks familiar again. This time the yield is performed with setjmp() and longjmp(), just like swapcontext(). Remember that setjmp() returns twice, hence the branch. The longjmp() never returns.

static void
coroutine(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        if (!setjmp(it->coroutine))
            longjmp(it->yield, 1);
        coroutine(t->left, it);
        coroutine(t->right, it);
    }
}

Next is the tricky part to cause the stack clash. First, allocate the new stack with malloc() so that we can get its address. Then use a local variable on the stack to determine how much the stack needs to grow in order to overlap with the allocation. Taking the difference between these pointers is illegal as far as the language is concerned, making this the second rule I’m breaking. I can imagine an implementation where the stack and heap are in two separate kinds of memory, and it would be meaningless to take the difference. I don’t have to imagine very hard, because this is how it actually worked on the 8086 with its segmented memory architecture.

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->stack = malloc(STACK_SIZE);
    char marker;
    char gap[&marker - it->stack - STACK_SIZE];
    it->gap = gap; // prevent optimization
    if (!setjmp(it->yield))
        coroutine_init(t, it);
    return it;
}

I’m using a variable-length array (VLA) named gap to indirectly control the stack pointer, moving it over the heap. I’m assuming the stack grows downward, since otherwise the sign would be wrong.

The compiler is smart and will notice I’m not actually using gap, and it’s happy to throw it away. In fact, it’s vitally important that I don’t touch it since the guard page, along with a bunch of unmapped memory, is actually somewhere in the middle of that array. I only want the array for its side effect, but that side effect isn’t officially supported, which means the optimizer doesn’t need to consider it in its decisions. To inhibit the optimizer, I store the array’s address where someone might potentially look at it, meaning the array has to exist.
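
One piece not shown above is coroutine_init(). Here’s a sketch of what it presumably looks like; this is my reconstruction rather than the original code. It captures the coroutine’s entry point on the clashed stack, immediately yields back to tree_iterator(), and, once the traversal eventually runs to completion, yields one last time with it->k still null so that tree_next() knows to clean up.

static void
coroutine_init(struct tree *t, struct tree_it *it)
{
    if (!setjmp(it->coroutine))
        longjmp(it->yield, 1);   /* hand control back to tree_iterator() */
    coroutine(t, it);            /* resumed later by tree_next() */
    longjmp(it->yield, 1);       /* traversal done; it->k is still 0 */
}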

Finally, the iterator function looks just like the others, again.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    if (!setjmp(it->yield))
        longjmp(it->coroutine, 1);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        free(it->stack);
        free(it);
        return 0;
    }
}

And that’s it: a nasty hack using a stack clash to create a context for a setjmp()+longjmp() coroutine.

Building and Installing Software in $HOME

For more than 5 years now I’ve kept a private “root” filesystem within my home directory under $HOME/.local/. Within are the standard /usr directories, such as bin/, include/, lib/, etc., containing my own software, libraries, and man pages. These are first-class citizens, indistinguishable from the system-installed programs and libraries. With one exception (setuid programs), none of this requires root privileges.

Installing software in $HOME serves two important purposes, both of which are indispensable to me on a regular basis.

Sometimes I’m on a system administered by someone else, where I have no root access. This prevents me from installing packaged software myself through the system’s package manager. Building and installing the software myself in my home directory, without involvement from the system administrator, neatly works around this issue. As a software developer, it’s already perfectly normal for me to build and run custom software, and this is just an extension of that behavior.

In the most desperate situation, all I need from the sysadmin is a decent C compiler and at least a minimal POSIX environment. I can bootstrap anything I might need, both libraries and programs, including a better C compiler along the way. This is one major strength of open source software.

I have noticed one alarming trend: Both GCC (since 4.8) and Clang are written in C++, so it’s becoming less and less reasonable to bootstrap a C++ compiler from a C compiler, or even from a C++ compiler that’s more than a few years old. So you may also need your sysadmin to supply a fairly recent C++ compiler if you want to bootstrap an environment that includes C++. I’ve had to avoid some C++ software (such as CMake) for this reason.

In theory this is what /usr/local is all about. It’s typically the location for software not managed by the system’s package manager. However, I think it’s cleaner to put this in $HOME/.local, so long as other system users don’t need it.

For example, I have an installation of each version of Emacs from 24.3 (the oldest version worth supporting) through the latest stable release, each suffixed with its version number, under $HOME/.local. This is useful for quickly running a test suite under different releases.

$ git clone https://github.com/skeeto/elfeed
$ cd elfeed/
$ make EMACS=emacs24.3 clean test
...
$ make EMACS=emacs25.2 clean test
...

Another example is NetHack, which I prefer to play with a couple of custom patches (Menucolors, wchar). The install to $HOME/.local is also captured as a patch.

$ tar xzf nethack-343-src.tar.gz
$ cd nethack-3.4.3/
$ patch -p1 < ~/nh343-menucolor.diff
$ patch -p1 < ~/nh343-wchar.diff
$ patch -p1 < ~/nh343-home-install.diff
$ sh sys/unix/setup.sh
$ make -j$(nproc) install

Normally NetHack wants to be setuid (e.g. run as the “games” user) in order to restrict access to high scores, saves, and bones — saved levels where a player died, to be inserted randomly into other players’ games. This prevents cheating, but requires root to set up. Fortunately, when I install NetHack in my home directory, this isn’t a feature I actually care about, so I can ignore it.

Mutt is in a similar situation, since it wants to install a special setgid program (mutt_dotlock) that synchronizes mailbox access. All MUAs need something like this.

Everything described below is relevant to basically any modern unix-like system: Linux, BSD, etc. I personally install software in $HOME across a variety of systems and, fortunately, it mostly works the same way everywhere. This is probably in large part due to everyone standardizing around the GCC and GNU binutils interfaces, even if the system compiler is actually LLVM/Clang.

Configuring for $HOME installs

Out of the box, installing things in $HOME/.local won’t do anything useful. You need to set up some environment variables in your shell configuration (i.e. .profile, .bashrc, etc.) to tell various programs, such as your shell, about it. The most obvious variable is $PATH:

export PATH=$HOME/.local/bin:$PATH

Notice I put it in the front of the list. This is because I want my home directory programs to override system programs with the same name. For what other reason would I install a program with the same name if not to override the system program?

In the simplest situation this is good enough, but in practice you’ll probably need to set a few more things. If you install libraries in your home directory and expect to use them just as if they were installed on the system, you’ll need to tell the compiler where else to look for those headers and libraries, both for C and C++.

export C_INCLUDE_PATH=$HOME/.local/include
export CPLUS_INCLUDE_PATH=$HOME/.local/include
export LIBRARY_PATH=$HOME/.local/lib

This is like the -I compiler option and the -L linker option, except you won’t need to use them explicitly. Some software uses pkg-config to determine its compiler and linker flags, and your home directory will contain some of the needed information. So set that up too:

export PKG_CONFIG_PATH=$HOME/.local/lib/pkgconfig

Run-time linker

Finally, when you install libraries in your home directory, the run-time dynamic linker will need to know where to find them. There are three ways to deal with this:

  1. The crude, easy way: LD_LIBRARY_PATH.
  2. The elegant, difficult way: ELF runpath.
  3. Screw it, just statically link the bugger. (Not always possible.)

For the crude way, point the run-time linker at your lib/ and you’re done:

export LD_LIBRARY_PATH=$HOME/.local/lib

However, this is like using a shotgun to kill a fly. If you install a library in your home directory that is also installed on the system, and then run a system program, it may be linked against your library rather than the library installed on the system as was originally intended. This could have detrimental effects.

The precision method is to set the ELF “runpath” value. It’s like a per-binary LD_LIBRARY_PATH. The run-time linker uses this path first in its search for libraries, and it will only have an effect on that particular program/library. This also applies to dlopen().

Some software will configure the runpath by default, but usually you need to configure this yourself with the linker -rpath option in LDFLAGS. It’s used directly like this:

$ gcc -Wl,-rpath=$HOME/.local/lib -o foo bar.o baz.o -lquux

Verify with readelf:

$ readelf -d foo | grep runpath
Library runpath: [/home/username/.local/lib]

ELF supports a special $ORIGIN “variable” set to the binary’s location. This allows the program and associated libraries to be installed anywhere without changes, so long as they have the same relative position to each other. (Note the quotes to prevent shell interpolation.)

$ gcc -Wl,-rpath='$ORIGIN/../lib' -o foo bar.o baz.o -lquux

There is one situation where runpath won’t work: when you want a system-installed program to find a home directory library with dlopen() — e.g. as an extension to that program. You either need to ensure it uses a relative or absolute path (i.e. the argument to dlopen() contains a slash) or you must use LD_LIBRARY_PATH.
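
For instance, a plugin loader could hardcode or compute an absolute path. The library name and location below are hypothetical:

#include <dlfcn.h>

/* The slash in the path means the dynamic linker uses it directly,
 * never consulting runpath or LD_LIBRARY_PATH. */
void *
load_extension(void)
{
    return dlopen("/home/username/.local/lib/libfoo.so", RTLD_NOW);
}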

Personally, I always use the Worse is Better LD_LIBRARY_PATH shotgun. Occasionally it’s caused some annoying issues, but the vast majority of the time it gets the job done with little fuss. This is just my personal development environment, after all, not a production server.

Manual pages

Another potentially tricky issue is man pages. When a program or library installs a man page in your home directory, it would certainly be nice to access it with man <topic> just like it was installed on the system. Fortunately, Debian and Debian-derived systems, using a mechanism I haven’t yet figured out, discover home directory man pages automatically without any assistance. No configuration needed.

It’s more complicated on other systems, such as the BSDs. You’ll need to set the MANPATH variable to include $HOME/.local/share/man. It’s unset by default and it overrides the system settings, which means you need to manually include the system paths. The manpath program can help with this … if it’s available.

export MANPATH=$HOME/.local/share/man:$(manpath)

I haven’t figured out a portable way to deal with this issue, so I mostly ignore it.

How to install software in $HOME

While I’ve pooh-poohed autoconf in the past, the standard configure script usually makes it trivial to build and install software in $HOME. The key ingredient is the --prefix option:

$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install

Most of the time it’s that simple! If you’re linking against your own libraries and want to use runpath, it’s a little more complicated:

$ ./configure --prefix=$HOME/.local \
              LDFLAGS="-Wl,-rpath=$HOME/.local/lib"

For CMake, there’s CMAKE_INSTALL_PREFIX:

$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..

The CMake builds I’ve seen use ELF runpath by default, and no further configuration may be required to make that work. I’m sure that’s not always the case, though.

Some software is just a single, static, standalone binary with everything baked in. It doesn’t need to be given a prefix, and installation is as simple as copying the binary into place. For example, Enchive works like this:

$ git clone https://github.com/skeeto/enchive
$ cd enchive/
$ make
$ cp enchive ~/.local/bin

Some software uses its own unique configuration interface. I can respect that, but it does add some friction for users who now have something additional and non-transferable to learn. I demonstrated a NetHack build above, which has a configuration much more involved than it really should be. Another example is LuaJIT, which uses make variables that must be provided consistently on every invocation:

$ tar xzf LuaJIT-2.0.5.tar.gz
$ cd LuaJIT-2.0.5/
$ make -j$(nproc) PREFIX=$HOME/.local
$ make PREFIX=$HOME/.local install

(You can use the “install” target to both build and install, but I wanted to illustrate the repetition of PREFIX.)

Some libraries aren’t so smart about pkg-config and need some handholding — for example, ncurses. I mention it because it’s required for both Vim and Emacs, among many others, so I’m often building it myself. It ignores --prefix and needs to be told a second time where to install things:

$ ./configure --prefix=$HOME/.local \
              --enable-pc-files \
              --with-pkg-config-libdir=$PKG_CONFIG_PATH

Another issue is that a whole lot of software has been hardcoded for ncurses 5.x (i.e. ncurses5-config), and it requires hacks/patching to make it behave properly with ncurses 6.x. I’ve avoided ncurses 6.x for this reason.

Learning through experience

I could go on and on like this, discussing the quirks for the various libraries and programs that I use. Over the years I’ve gotten used to many of these issues, committing the solutions to memory. Unfortunately, even within the same version of a piece of software, the quirks can change between major operating system releases, so I’m continuously learning my way around new issues. It’s really given me an appreciation for all the hard work that package maintainers put into customizing and maintaining software builds to fit properly into a larger ecosystem.

Switching to the Mutt Email Client

Note: The way I manage my email wouldn’t really work for most people, so don’t read this as a recommendation. This is just a discussion of how I prefer to use email.

It was almost four years ago I switched from webmail to a customized email configuration based on Notmuch and Emacs. Notmuch served both as a native back-end that provided indexing and tagging and as a front-end, written in Emacs Lisp. It dramatically improved my email experience, and I wished I had done it earlier. I’ve really enjoyed having so much direct control over my email.

However, I’m always fiddling with things — fiddling feels a lot more productive than it actually is — and last month I re-invented my email situation, this time switching to a combination of Mutt, Vim, mu, and tmux. The entirety of my email interface now resides inside a terminal, and I’m enjoying it even more. I feel I’ve “leveled up” again in my email habits.

On the server-side I also switched from Exim to Postfix and procmail, making the server configuration a whole lot simpler. Including SpamAssassin, it’s just three lines added to the default Debian configuration. It leaves a lot less room for error, and I could rebuild it from scratch with little trouble if there was an emergency. My previous configuration required quite a bit of system configuration, such as relying on incron to sort incoming mail, particularly spam, but procmail now does this job more cleanly.

Towards Robustness

Over the years I’ve gotten less patient when it comes to dealing with breaking changes in software, and I’ve gotten more conservative about system stability. Continuously updating my configurations and habits to the latest software changes was an interesting challenge earlier in my career, but today there are much better uses of my time. Debian Stable, my preferred operating system, runs at pretty much the perfect pace for me.

Following these changing preferences, one of the biggest motivations for my recent email change was to make my email setup more robust and stable. Until now, email was tied tightly to Emacs, with a configuration drawing directly from MELPA, pulling in the bleeding edge version of every package I use. Breaking changes arrive at unexpected times, and occasionally the current version of a package temporarily doesn’t work. Usually it’s because the developer pushed a bad commit right before the latest MELPA build, and so the package is broken for a few hours or days. I’ve been guilty of this myself. MELPA Stable is intended to address these issues, but it seems to break more often than normal MELPA. For example, at the time of this writing, Evil is not installable via MELPA Stable due to an unmet dependency.

Tying something as vital as email to this Rube Goldberg machine made me nervous. Access to my email depended on a number of independent systems of various levels of stability to mostly work correctly. My switch to Mutt cut this down to just a couple of very stable systems.

format=flowed

I’ve long believed HTML email is an abomination that should never have been invented. Text is the ideal format for email, and there are a number of specifications to make it work well across different systems. One of those standards is RFC 3676, colloquially named format=flowed, or just f=f.

Messages encoded with f=f allow mail clients to safely reflow the paragraphs to nicely fit the user’s display, whether that display be thinner or wider than the sender’s original message. It’s also completely compatible with mail clients that don’t understand format=flowed, which will display the message as the sender originally wrapped it.

The gist of f=f is that messages can have both “soft” and “hard” line breaks. If a line ends with a space, then it’s a soft line break. The mail client can safely reflow lines separated by a soft line break. Without the trailing space, it’s a hard line break, which prohibits flowing with the next line. The last line of a paragraph ends with a hard line break. It’s also used for text that shouldn’t reflow, such as code samples.

I’ll illustrate using an underscore in place of a space, so that you can see it:

This is a message in the format=flowed style, allowing_
mail clients to flow this message nicely in displays of_
different widths.

> This is an example of a quote block in a message,_
> which is supported by the format=flowed specification.
>> It also supports nested quote blocks, which means_
>> this paragraph won't flow into the previous.

The RFC covers edge cases that require special “space-stuffing” rules, but, when editing a text email in an editor, you only need to think about soft and hard line breaks. In my case, Mutt takes care of the rest of the details.

Unfortunately Emacs lacks decent support for f=f, though I’m sure a minor mode could be written to make it work well. On the other hand, Vim has been playing an increasing role in my day-to-day editing, and it has excellent built-in support for f=f. Since I’m now using Vim to compose all of my email, I get it for free.

First, I tell Mutt that I want to use f=f in my .muttrc:

set text_flowed

Then in Vim, I add the w flag to formatoptions, which tells it to wrap paragraphs using soft line breaks.

set fo+=w

If I want to inspect my f=f formatting, I temporarily enable the list option, which displays a $ for all newlines.

set list

Although few people would notice a difference, I feel a little bad for not using f=f all these years! A few people may have endured some ugly, non-flowing emails from me. My only consolation is that at least it wasn’t HTML.

It’s not all roses, though. When I reply to a message, Mutt doesn’t insert the quoted text as f=f into my reply, so I have to massage it into f=f myself. Also, just as GitHub doesn’t support Markdown in email responses, neither does it support f=f. When I reply to issues by email, GitHub won’t nicely reflow my carefully crafted f=f message, needlessly making email responses an inferior option.

Features unneeded

One reason I didn’t choose this particular email arrangement 4 years ago was that PGP support was one of my prime requirements. Mutt has solid PGP support, but, with a Maildir setup (i.e. not IMAP), I’d have to use the key on the server, which was out of the question. Since I no longer care about PGP, my email requirements are more relaxed.

Over the years I wasn’t making much use of Notmuch’s tagging system. I only used two tags: “unread” and “inbox” (e.g. read, but still needs attention). Otherwise I’d use Notmuch’s powerful search to find what I wanted. I still needed to keep track of the tags I was using, so the Notmuch index, nearly as large as the email messages themselves, became part of my mail backup.

The Maildir format itself supports some flags: passed (P), replied (R), seen (S), trashed (T), draft (D), and flagged (F). These are stored in the message’s filename. In my new configuration, the “seen” tag (inversely) takes the place of Notmuch’s “unread” tag. The “flagged” tag takes the place of the “inbox” tag. Normally in Mutt you’d use mailboxes — i.e. Maildir subdirectories — for something like this, but I prefer all my mail to sit in one big bucket. Search, don’t sort.
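
For example, a message that has been both seen and flagged ends up with a filename along these lines (the unique portion before the colon varies by delivery agent, but the flags always follow the “:2,” marker in sorted order):

1499011234.12345_0.hostname:2,FS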

Since the two flags are part of the filename, I no longer need to include a tag database (i.e. the entire Notmuch index) in the backup, and my mail backups are much smaller. I could continue to use Notmuch for searching, but I’ve settled on mu instead. When I perform a search, mu writes the results to a temporary Maildir using symbolic links, which I visit with Mutt. The mu index is transient and doesn’t need to be backed up.

Mu also manages my contacts alias list. It can produce a Mutt-style alias file based on the contents of my Maildir:

mu cfind --format=mutt-alias > aliases

It’s been really nice to have all my email sitting around as nothing more than a big pile of files like this. I’ve begun writing little scripts to harvest data from it, too.

Configuration files

As with all my personal configuration files, you can see my .muttrc online. The first few weeks I was tweaking this file hourly, but I’ve now got it basically the way I want.

Web Scraping into an E-book with BeautifulSoup and Pandoc

I recently learned how to use BeautifulSoup, a Python library for manipulating HTML and XML parse trees, and it’s been a fantastic addition to my virtual toolbelt. In the past when I’ve needed to process raw HTML, I’ve tried nasty hacks with Unix pipes, or routing the content through a web browser so that I could manipulate it via the DOM API. None of that worked very well, but now I finally have BeautifulSoup to fill that gap. It’s got a selector interface and, except for rendering, it’s basically as comfortable with HTML as JavaScript.

Today’s problem was that I wanted to read a recommended online book called Interviewing Leather, a story set “in a world where caped heroes fight dastardly villains on an everyday basis.” I say “online book” because the 39,403 word story is distributed as a series of 14 blog posts. I’d rather not read it on the website in a browser, instead preferring it in e-book form where it’s more comfortable. The last time I did this, I manually scraped the entire book into Markdown, spent a couple of weeks editing it for mistakes, and finally sent the Markdown to Pandoc to convert into an e-book.

For this book, I just want a quick-and-dirty scrape in order to shift formats. I’ve never read it and I may not even like it, so I definitely don’t want to spend much time on the conversion. Despite having fun with typing lately, I’d also prefer to keep all the formatting — italics, etc. — without re-entering it all manually.

Fortunately Pandoc can consume HTML as input, so, in theory, I can feed it the original HTML and preserve all of the original markup. The challenge is that the HTML is spread across 14 pages surrounded by all the expected blog cruft. I need some way to extract the book content from each page, concatenate it together along with chapter headings, and send the result to Pandoc. Enter BeautifulSoup.

First, I need to construct the skeleton HTML document. Rather than code my own HTML, I’m going to build it with BeautifulSoup. I start by creating a completely empty document and adding a doctype to it.

import urllib.request  # used later to fetch each chapter

from bs4 import BeautifulSoup, Doctype

doc = BeautifulSoup()
doc.append(Doctype('html'))

Next I create the html root element, then add the head and body elements. I also add a title element. The original content has fancy Unicode markup — left and right quotation marks, em dash, etc. — so it’s important to declare the page as UTF-8, since otherwise these characters are likely to be interpreted incorrectly. It always feels odd declaring the encoding within the content being encoded, but that’s just the way things are.

html = doc.new_tag('html', lang='en-US')
doc.append(html)
head = doc.new_tag('head')
html.append(head)
meta = doc.new_tag('meta', charset='utf-8')
head.append(meta)
title = doc.new_tag('title')
title.string = 'Interviewing Leather'
head.append(title)
body = doc.new_tag('body')
html.append(body)

If I print(doc.prettify()) then I see the skeleton I want:

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <title>
   Interviewing Leather
  </title>
 </head>
 <body>
 </body>
</html>

Next, I assemble a list of the individual blog posts. When I was actually writing the script, I first downloaded them locally with my favorite download tool, curl, and ran the script against local copies. I didn’t want to hit the web server each time I tested. (Note: I’ve truncated these URLs to fit in this article.)

chapters = [
    "https://banter-latte.com/2007/06/26/...",
    "https://banter-latte.com/2007/07/03/...",
    "https://banter-latte.com/2007/07/10/...",
    "https://banter-latte.com/2007/07/17/...",
    "https://banter-latte.com/2007/07/24/...",
    "https://banter-latte.com/2007/07/31/...",
    "https://banter-latte.com/2007/08/07/...",
    "https://banter-latte.com/2007/08/14/...",
    "https://banter-latte.com/2007/08/21/...",
    "https://banter-latte.com/2007/08/28/...",
    "https://banter-latte.com/2007/09/04/...",
    "https://banter-latte.com/2007/09/20/...",
    "https://banter-latte.com/2007/09/25/...",
    "https://banter-latte.com/2007/10/02/..."
]

I visit a few of these pages in my browser to determine which part of the page I want to extract. I want to look closely enough to see what I’m doing, but not so closely as to spoil the story for myself! Right clicking the content in the browser and selecting “Inspect Element” (Firefox) or “Inspect” (Chrome) pops up a pane to structurally navigate the page. “View Page Source” would work, too, especially since this is static content, but I find the developer pane easier to read. Plus it hides most of the content, revealing only the structure.

The content is contained in a div with the class entry-content. I can use a selector to isolate this element and extract its child p elements. However, it’s not quite so simple. Each chapter starts with a bit of commentary that’s not part of the book, which I don’t want to include in my extract. It’s separated from the real content by an hr element. There’s also a footer below another hr element, likely put there by someone who wasn’t paying attention to the page structure. It’s not quite the shining example of semantic markup, but it’s regular enough that I can manage.

<body>
  <main class="site-main">
    <div class="entry-body">
      <div class="entry-content">
        <p>A little intro.</p>
        <p>Some more intro.</p>
        <hr/>
        <p>Actual book content.</p>
        <p>More content.</p>
        <hr/>
        <p>Footer navigation junk.</p>
      </div>
    </div>
  </main>
</body>

The next step is visiting each of these pages. I use enumerate since I want the chapter numbers when inserting h1 chapter elements. Pandoc will use these to build the table of contents.

for i, chapter in enumerate(chapters):
    # Construct h1 for the chapter
    header = doc.new_tag('h1')
    header.string = 'Chapter %d' % (i + 1,)
    body.append(header)

Next grab the page content using urllib and parse it with BeautifulSoup. I’m using a selector to locate the div with the book content.

    # Load chapter content
    with urllib.request.urlopen(chapter) as url:
        page = BeautifulSoup(url)
    content = page.select('.entry-content')[0]

Finally I iterate over the child elements of the div.entry-content element. I keep a running count of the hr elements and only extract content when we’ve seen exactly one hr element.

    # Append content between hr elements
    hr_count = 0
    for child in content.children:
        if (child.name == 'hr'):
            hr_count += 1
        elif (child.name == 'p' and hr_count == 1):
            child.attrs = {}
            if (child.string == '#'):
                body.append(doc.new_tag('hr'))
            else:
                body.append(child)

If it’s a p element, I copy it into the output document, taking a moment to strip away any attributes present on the p tag, since, for some reason, some of these elements have old-fashioned alignment attributes in the original content.

The original content also uses the text “#” by itself in a p to separate sections rather than using the appropriate markup. Despite being semantically incorrect, I’m thankful for this since more hr elements would have complicated matters further. I convert these to the correct markup for the final document.

Finally I pretty print the result:

print(doc.prettify())

Alternatively I could pipe it through tidy.

$ python3 extract.py | tidy -indent -utf8 > output.html

A brief inspection with a browser indicates that everything seems to have come out correctly. I won’t know for sure, though, until I actually read through the whole book. Finally I have Pandoc perform the conversion.

$ pandoc -t epub3 -o output.epub output.html 

And that’s it! It’s ready to read offline in my e-book reader of choice. The crude version of my script took around 15–20 minutes to write and test, so I had an e-book conversion in under 30 minutes. That’s about as long as I was willing to spend to get it. Tidying the script up for this article took a lot longer.

I don’t have permission to share the resulting e-book, but I can share my script so that you can generate your own, at least as long as it’s hosted at the same place with the same structure.

The Adversarial Implementation

When coding against a standard, whether it’s a programming language specification or an open API with multiple vendors, a common concern is the conformity of a particular construct to the standard. This cannot be determined simply by experimentation, since a piece of code may work correctly due only to the specifics of a particular implementation. It works today with this implementation, but it may not work tomorrow or with a different implementation. Sometimes an implementation will warn about the use of non-standard behavior, but this isn’t always the case.

When I’m reasoning about whether or not something is allowed, I like to imagine an adversarial implementation. If the standard allows some freedom, this implementation takes an imaginative or unique approach. It chooses non-obvious interpretations with possibly unexpected, but valid, results. This is nearly the opposite of djb’s hypothetical boringcc, though some of the ideas are similar.

Many argue that this is already the case with modern C and C++ optimizing compilers. Compiler writers are already creative with the standard in order to squeeze out more performance, even if it’s at odds with the programmer’s actual intentions. The most prominent example in C and C++ is strict aliasing, where the optimizer is deliberately blinded to certain kinds of aliasing because the standard allows it to be, eliminating some (possibly important) loads. This happens despite the compiler’s ability to trivially prove that two particular objects really do alias.
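
The textbook illustration of that blindness looks something like this (a generic sketch, not tied to any particular compiler):

/* Under strict aliasing, an int * and a float * are assumed never to
 * point at the same object, so the compiler may keep *i in a register
 * and return the constant 1, even when i and f actually alias. */
int
alias(int *i, float *f)
{
    *i = 1;
    *f = 0.0f;   /* assumed not to modify *i */
    return *i;
}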

I want to be clear that I’m not talking about the nasal demons kind of creativity. That’s not a helpful thought experiment. What I mean is this: Can I imagine a conforming implementation that breaks any assumptions made by the code?

In practice, compilers typically have to bridge multiple specifications: the language standard, the platform ABI, and the operating system interface (process startup, syscalls, etc.). This really ties the compiler’s hands on how creative it can be with any one of the specifications. Depending on the situation, the imaginary adversarial implementation isn’t necessarily running on any particular platform. If our program is expected to have a long life, useful for many years to come, we should avoid making too many assumptions about future computers and imagine an adversarial compiler with few limitations.

C example

Take this bit of C:

printf("%d", sizeof(foo));

The printf function is variadic, and it relies entirely on the format string in order to correctly handle all its arguments. The %d specifier means that its matching argument is of type int. The result of the sizeof operator is an integer of type size_t, which has a different sign and may even be a different size.

Typically this code will work just fine. An int and size_t are generally passed the same way, the actual value probably fits in an int, and two’s complement means the signedness isn’t an issue due to the value being positive. From the printf point of view, it typically can’t detect that the type is wrong, so everything works by chance. In fact, it’s hard to imagine a real situation where this wouldn’t work fine.

However, this is still undefined behavior — a scenario where a creative adversarial implementation can break things. In this case there are a few options for an adversarial implementation:

  1. Arguments of type int and size_t are passed differently, so printf will load the argument from the wrong place.
  2. The implementation doesn’t use two’s complement and even small positive values have different bit representations.
  3. The type of foo is given crazy padding for arbitrary reasons, making it so large that its size doesn’t fit in an int.

What’s interesting about #1 is that this has actually happened. For example, here’s a C source file.

float foo(float x, int y);

float
bar(int y)
{
    return foo(0.0f, y);
}

And in another source file:

float
foo(int x, int y)
{
    (void)x;  // ignore x
    return y * 2.0f;
}

The type of argument x differs between the prototype and the definition, which is undefined behavior. However, since this argument is ignored, this code will still work correctly on many different real-world computers, particularly where float and int arguments are passed the same way (i.e. on the stack).

However, in 2003 the x86-64 CPU arrived with its new System V ABI. Floating point and integer arguments were now passed differently, and the types of preceding arguments mattered when deciding which register to use. Some constructs that worked fine, by chance, prior to 2003 would soon stop working due to what may have seemed like an adversarial implementation years before.
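
Incidentally, the earlier printf example has a trivial conforming fix that sidesteps the whole thought experiment: C99 provides a conversion specifier that exactly matches size_t.

printf("%zu", sizeof(foo));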

Python example

Let’s look at some Python. This snippet opens a file a million times without closing any handles.

for i in range(1, 1000000):
    f = open("/dev/null", "r")

Assuming you have a /dev/null, this code will work fine without throwing any exceptions on CPython, the most widely used Python implementation. CPython uses a deterministic reference counting scheme, and the handle is automatically closed as soon as its variable falls out of scope. It’s like having an invisible f.close() at the end of the block.

However, this code is incorrect. The deterministic handle closing is an implementation detail, not part of the specification. The operating system limits the number of files a process can have open at once, and there’s a risk that this resource will run out even though none of those handles are reachable. Imagine an adversarial Python implementation trying to break this code. It could sufficiently delay garbage collection, or even have infinite memory, omitting garbage collection altogether.

Like before, such an implementation eventually did come about: PyPy, a Python implementation written in Python with a JIT compiler. It uses (by default) something closer to mark-and-sweep, not reference counting, and those handles are left open until the next collection.

>>>> for i in range(1, 1000000):
....     f = open("/dev/null", "r")
.... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IOError: [Errno 24] Too many open files: '/dev/null'

A tool for understanding specifications

This fits right in with a broader method of self-improvement: Occasionally put yourself in the implementor’s shoes. Think about what it would take to correctly implement the code that you write, either as a language or the APIs that you call. On reflection, you may find that some of those things that seem cheap may not be. Your assumptions may be reasonable, but not guaranteed. (Though it may be that “reasonable” is perfectly sufficient for your situation.)

An adversarial implementation is one that challenges an assumption you’ve taken for granted by turning it on its head.

Two Games with Monte Carlo Tree Search

Monte Carlo tree search (MCTS) is the most impressive game artificial intelligence I’ve ever used. At its core it simulates a large number of games (playouts), starting from the current game state, using random moves for each player. Then it simply picks the move where it won most often. This description is sufficient to spot one of its most valuable features: MCTS requires no knowledge of strategy or effective play. The game’s rules — enough to simulate the game — are all that’s needed to allow the AI to make decent moves. Expert knowledge still makes for a stronger AI, but, for many games, it’s unnecessary for building a decent opponent.

A second valuable feature is that it’s easy to parallelize. Unlike alpha-beta pruning, which doesn’t mix well with parallel searches of a Minimax tree, Monte Carlo simulations are practically independent and can be run in parallel.

Finally, the third valuable feature is that the search can be stopped at any time. The completion of any single simulation is as good a stopping point as any. It could be due to a time limit, a memory limit, or both. In general, the algorithm converges to a best move rather than suddenly discovering it. The good moves are identified quickly, and further simulations work to choose among them. More simulations make for better moves, with exponentially diminishing returns. Contrast this with Minimax, where stopping early carries the risk that the good moves were never explored at all.

To try out MCTS myself, I wrote two games employing it: Connect Four and Yavalath.

They’re both written in C, for both unix-like and Windows, and should be easy to build. I challenge you to beat them both. The Yavalath AI is easier to beat due to having blind spots, which I’ll discuss below. The Connect Four AI is more difficult and will likely take a number of tries.

Connect Four

MCTS works very well with Connect Four, and only requires modest resources: 32MB of memory to store the results of random playouts, and 500,000 game simulations. With a few tweaks, it can even be run in DOSBox. It stops when it hits either of those limits. In theory, increasing both would make for stronger moves, but in practice I can’t detect any difference. It’s like computing pi with Monte Carlo, where eventually it just runs out of precision to make any more progress.

Based on my simplified description above, you might wonder why it needs all that memory. Not only does MCTS need to track its win/loss ratio for each available move from the current state, it tracks the win/loss ratio for moves in the states behind those moves. A large chunk of the game tree is kept in memory to track all of the playout results. This is why MCTS needs a lot more memory than Minimax, which can discard branches that have been searched.

A convenient property of this tree is that the branch taken in the actual game can be re-used in a future search. The root of the tree becomes the node representing the taken game state, which has already seen a number of playouts. Even better, MCTS is weighted towards exploring good moves over bad moves, and good moves are more likely to be taken in the real game. In general, a significant portion of the tree gets to be reused in a future search.

I’m going to skip most of the details of the algorithm itself and focus on my implementation. Other articles do a better job at detailing the algorithm than I could.

My Connect Four engine doesn’t use dynamic allocation for this tree (or at all). Instead it manages a static buffer — an array of tree nodes, each representing a game state. All nodes are initially chained together into a linked list of free nodes. As the tree is built, nodes are pulled off the free list and linked together into a tree. When the game advances to the next state, nodes on unreachable branches are added back to the free list.

If at any point the free list is empty when a new node is needed, the current search aborts. This is the out-of-memory condition, and no more searching can be performed.

/* Connect Four is normally a 7 by 6 grid. */
#define CONNECT4_WIDTH  7
#define CONNECT4_HEIGHT 6

struct connect4_node {
    uint32_t next[CONNECT4_WIDTH];      // "pointer" to next node
    uint32_t playouts[CONNECT4_WIDTH];  // number of playouts
    float    score[CONNECT4_WIDTH];     // pseudo win/loss ratio
};

Rather than native C pointers, the structure uses 32-bit indexes into the master array. This saves a lot of memory on 64-bit systems, and the structure is the same size no matter the pointer size of the host. The next field points to the next state for the nth move. Since 0 is a valid index, -1 represents null (CONNECT4_NULL).

Each column is a potential move, so there are CONNECT4_WIDTH possible moves at any given state. Each move has a floating point score and a total number of playouts through that move. In my implementation, the search can also halt due to an overflow in a playout counter. The search can no longer be tracked in this representation, so it has to stop. This generally only happens when the game is nearly over and it’s grinding away on a small number of possibilities.

Note that the actual game state (piece positions) is not tracked in the node structure. That’s because it’s implicit. We know the state of the game at the root, and simulating the moves while descending the tree will keep track of the board state at the current node. That’s more memory savings.

The state itself is a pair of bitboards, one for each player. Each position on the grid gets a bit on each bitboard. The bitboard is very fast to manipulate, and win states are checked with just a handful of bit operations. My intention was to make playouts as fast as possible.
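
For a sense of why this is so cheap, here’s the textbook four-in-a-row test for one player’s bitboard, assuming a common layout where each column occupies 7 bits (6 playable rows plus an empty sentinel bit to prevent wraparound). This is the standard trick, not necessarily the engine’s exact encoding.

#include <stdint.h>

/* Detect four in a row on one player's bitboard. The shift distances
 * assume bit index = column * 7 + row. */
static int
connect4_win(uint64_t bb)
{
    static const int dirs[] = {1, 7, 6, 8};  /* vertical, horizontal, diagonals */
    for (int i = 0; i < 4; i++) {
        uint64_t m = bb & (bb >> dirs[i]);
        if (m & (m >> (2 * dirs[i])))
            return 1;
    }
    return 0;
}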

struct connect4_ai {
    uint64_t state[2];         // game state at root (bitboard)
    uint64_t rng[2];           // random number generator state
    uint32_t nodes_available;  // total number of nodes available
    uint32_t nodes_allocated;  // number of nodes in the tree
    uint32_t root;             // "pointer" to root node
    uint32_t free;             // "pointer" to free list
    int turn;                  // whose turn (0 or 1) at the root?
};

The nodes_available and nodes_allocated fields aren’t necessary for correctness or speed; they’re just useful for diagnostics and debugging.

All the functions that operate on these two structures are straightforward, except for connect4_playout, a recursive function which implements the bulk of MCTS. Depending on the state of the node it’s at, it does one of two things:

  1. If the node has moves that haven’t seen any playouts yet, it picks one, simulates a random playout from the resulting position, and records the result.
  2. Otherwise it selects the most promising move (balancing its score so far against how little it’s been explored), descends recursively into that subtree, and propagates the playout result back up.

That’s pretty much all there is to it.
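
To make “most promising move” concrete, a common choice is the UCB1 upper confidence bound computed from the per-move statistics in the node. Here’s a rough sketch against the structure above; the exploration constant, the handling of unvisited columns, and the function name are my own choices, and legality checks (full columns) are omitted.

#include <math.h>
#include <stdint.h>

/* Sketch: choose the column maximizing the UCB1 bound, balancing a
 * move's average score against how rarely it has been explored. */
static int
select_move(const struct connect4_node *n, uint32_t total_playouts)
{
    int best = 0;
    float best_ucb = -INFINITY;
    for (int i = 0; i < CONNECT4_WIDTH; i++) {
        float ucb;
        if (n->playouts[i] == 0) {
            ucb = INFINITY;  /* always try unexplored moves first */
        } else {
            float mean = n->score[i] / n->playouts[i];
            ucb = mean + 1.414f * sqrtf(logf(total_playouts) / n->playouts[i]);
        }
        if (ucb > best_ucb) {
            best_ucb = ucb;
            best = i;
        }
    }
    return best;
}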

Yavalath

Yavalath is a board game invented by a computer program. It’s a pretty fascinating story. The strategy is disproportionately deep relative to its dead-simple rules: get four in a row without first getting three in a row. The game revolves around forced moves.

The engine is structured almost identically to the Connect Four engine. It uses 32-bit indexes instead of pointers. The game state is a pair of bitboards, with end-game masks computed at compile time via metaprogramming. The AI allocates the tree from a single, massive buffer — multiple GBs in this case, dynamically scaled to the available physical memory. And the core MCTS function is nearly identical.

One important difference is that identical game states — states where the pieces on the board are the same, but the node was reached through a different series of moves — are coalesced into a single state in the tree. This state deduplication is done through a hash table. This saves on memory and allows multiple different paths through the game tree to share playouts. It comes at a cost of including the game state in the node (so it can be identified in the hash table) and reference counting the nodes (since they might have more than one parent).
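
As an illustration of the deduplication step, the two bitboards can be mixed down into a single hash that indexes the table. The mixer below is a splitmix64-style finalizer chosen for the example; the engine’s actual hash function may be entirely different.

#include <stdint.h>

/* Mix a pair of bitboards into a table slot. */
static uint64_t
hash_state(uint64_t a, uint64_t b, uint64_t table_size)
{
    uint64_t x = a ^ (b + UINT64_C(0x9e3779b97f4a7c15));
    x ^= x >> 30; x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27; x *= UINT64_C(0x94d049bb133111eb);
    x ^= x >> 31;
    return x % table_size;
}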

Unfortunately the AI has blind spots, and once you learn to spot them it becomes easy to beat consistently. It can’t spot certain kinds of forced moves, so it always falls for the same tricks. The official Yavalath AI is slightly stronger than mine, but has a similar blindness. I think MCTS just isn’t quite a good fit for Yavalath.

The AI’s blindness is caused by shallow traps, a common problem for MCTS. It’s what makes MCTS a poor fit for Chess. A shallow trap is a branch in the game tree where the game will abruptly end in a small number of turns. If the random tree search doesn’t happen to stumble upon a trap during its traversal, it can’t take it into account in its final decision. A skilled player will lead the game towards one of these traps, and the AI will blunder along, not realizing what’s happened until it’s too late.

I almost feel bad for it when this happens. If you watch the memory usage and number of playouts, once it falls into a trap, you’ll see it using almost no memory while performing a ton of playouts. It’s desperately, frantically searching for a way out of the trap. But it’s too late, little AI.

Another Tool in the Toolbelt

I’m really happy to have sunk a couple weekends into playing with MCTS. It’s not always a great fit, as seen with Yavalath, but it’s a really neat algorithm. Now that I’ve wrapped my head around it, I’ll be ready to use it should I run into an appropriate problem in the future.

My Journey with Touch Typing and Vim

Given the title, the publication date of this article is probably really confusing. This was deliberate.

Three weeks ago I made a conscious decision to improve my typing habits. You see, I had a dirty habit. Despite spending literally decades typing on a daily basis, I’ve been a weak typist. It wasn’t exactly finger pecking, nor did it require looking down at the keyboard as I typed, but rather a six-finger dance I developed organically over the years. My technique was optimized towards Emacs’ frequent use of CTRL and ALT combinations, avoiding most of the hand scrunching. It was fast enough to keep up with my thinking most of the time, but was ultimately limiting due to its poor accuracy. I was hitting the wrong keys far too often.

My prime motivation was to learn Vim — or, more specifically, to learn modal editing. Lots of people swear by it, including people whose opinions I hold in high regard. The modal editing community is without a doubt larger than the Emacs community, especially since, thanks to Viper and Evil, a subset of the Emacs community is also part of the modal editing community. There’s obviously something significantly valuable about it, and I wanted to understand what that was.

But I was a lousy typist who couldn’t hit the right keys often enough to make effective use of modal editing. I would need to learn touch typing first.

Touch typing

How would I learn? Well, the first search result for “online touch typing course” was Typing Club, so that’s what I went with. By the way, here’s my official review: “Good enough not to bother checking out the competition.” For a website it’s pretty much the ultimate compliment, but it’s not exactly the sort of thing you’d want to hear from your long-term partner.

My hard rule was that I would immediately abandon my old habits cold turkey. Poor typing is a bad habit just like smoking, minus the cancer and weakened sense of smell. It was vital that I unlearn all that old muscle memory. That included not just my six-finger dance, but also my NetHack muscle memory. NetHack uses “hjkl” for navigation just like Vim. The problem was that I’d spent a couple hundred hours in NetHack over the past decade with my index finger on “h”, not the proper home row location. It was disorienting to navigate around Vim initially, like riding a bicycle with inverted controls.

Based on reading other people’s accounts, I determined I’d need several days of introductory practice where I’d be utterly unproductive. I took a three-day weekend, starting my touch typing lessons on a Thursday evening. Boy, they weren’t kidding about it being slow going. It was a rough weekend. When checking in on my practice, my wife literally said she pitied me. Ouch.

By Monday I was at a level resembling a very slow touch typist. For the rest of the first week I followed all the lessons up through the number keys, never progressing past an exercise until I had exceeded the target speed with at least 90% accuracy. This was now enough to get me back on my feet for programming at a glacial, frustrating pace. Programming involves a lot more numbers and symbols than other kinds of typing, making that top row so important. For a programmer, it would probably be better for these lessons to be earlier in the series.

For that first week I mostly used Emacs while I was finding my feet (or finding my fingers?). That’s when I experienced firsthand what all these non-Emacs people — people who I, until recently, considered to be unenlightened simpletons — had been complaining about all these years: Pressing CTRL and ALT key combinations from the home row is a real pain in the ass! These complaints were suddenly making sense. I was already seeing the value of modal editing before I even started really learning Vim. It made me look forward to it even more.

During the second week of touch typing I went through Derek Wyatt’s Vim videos and learned my way around the :help system enough to bootstrap my Vim education. I then read through the user manual, practicing along the way. I’ll definitely have to pass through it a few more times to pick up all sorts of things that didn’t stick. This is one way that Emacs and Vim are a lot alike.

Update: Practical Vim: Edit Text at the Speed of Thought was recommended in the comments, and it’s certainly a better place to start than the Vim user manual. Unlike the manual, it’s opinionated and focuses on good habits, which is exactly what a newbie needs.

One of my rules when learning Vim was to resist the urge to remap keys. I’ve done it a lot with Emacs: “Hmm, that’s not very convenient. I’ll change it.” It means my Emacs configuration is fairly non-standard, and using Emacs without my configuration is like using an unfamiliar editor. This is both good and bad. The good is that I’ve truly changed Emacs to be my editor, suited just for me. The bad is that I’m extremely dependent on my configuration. What if there was a text editing emergency?

With Vim as a sort of secondary editor, I want to be able to fire it up unconfigured and continue to be nearly as productive. A pile of remappings would prohibit this. In my mind this is like a form of emergency preparedness. Other people stock up food and supplies. I’m preparing myself to sit at a strange machine without any of my configuration so that I can start the rewrite of the software lost in the disaster, so long as that machine has vi, cc, and make. If I can’t code in C, then what’s the point in surviving anyway?

The other reason is that I’m just learning. A different mapping might seem more appropriate, but what do I know at this point? It’s better to follow the beaten path at first, lest I form a bunch of bad habits again. Trust in the knowledge of the ancients.

Future directions

I am absolutely sticking with modal editing for the long term. I’m really enjoying it so far. At three weeks of touch typing and two weeks of modal editing, I’m around 80% caught back up with my old productivity speed, but this time I’ve got a lot more potential for improvement.

For now, Vim will continue taking over more and more of my text editing work. My last three articles were written in Vim. It’s really important to keep building proficiency. I still rely on Emacs for email and for syndication feeds, and that’s not changing any time soon. I also really like Magit as a Git interface. Plus I don’t want to abandon years of accumulated knowledge and leave the users of my various Emacs packages out to dry. Ultimately I believe I will end up using Evil, to get what seems to be the best of both worlds: modal editing and Emacs’ rich extensibility.

How to Write Portable C Without Complicating Your Build

Suppose you’re writing a non-GUI C application intended to run on a number of operating systems: Linux, the various BSDs, macOS, classical unix, and perhaps even something as exotic as Windows. It might sound like a rather complicated problem. These operating systems have slightly different interfaces (or very different in one case), and they run different variants of the standard unix tools — a problem for portable builds.

With some up-front attention to detail, this is actually not terribly difficult. Unix-like systems are probably the least diverse and least buggy they’ve ever been. Writing portable code is really just a matter of coding to the standards and ignoring extensions unless absolutely necessary. Knowing what’s standard and what’s extension is the tricky part, but I’ll explain how to find this information.

You might be tempted to reach for an overly complicated solution such as GNU Autoconf. Sure, it creates a configure script with the familiar, conventional interface. This has real value. But do you really need to run a single-threaded gauntlet of hundreds of feature/bug tests for things that sometimes worked incorrectly in some weird unix variant back in the 1990s? On a machine with many cores (parallel build, -j), this may very well be the slowest part of the whole build process.

For example, the configure script for Emacs checks that the compiler supplies stdlib.h, string.h, and getenv — things that were standardized nearly 30 years ago. It also checks for a slew of POSIX functions that have been standard since 2001.

There’s a much easier solution: Document that the application requires, say, C99 and POSIX.1-2001. It’s the responsibility of the person building the application to supply these implementations, so there’s no reason to waste time testing for them.

How to code to the standards

Suppose there’s some function you want to use, but you’re not sure if it’s standard or an extension. Or maybe you don’t know what standard it comes from. Luckily the man pages document this stuff very well, especially on Linux. Check the friendly “CONFORMING TO” section. For example, look at getenv(3). Here’s what that section has to say:

CONFORMING TO
    getenv(): SVr4, POSIX.1-2001, 4.3BSD, C89, C99.

    secure_getenv() is a GNU extension.

This says that getenv() goes all the way back to the original C standard, so it’s always available on anything that claims to be a C implementation. The man page also documents secure_getenv(), which is a GNU extension: to be avoided in anything intended to be portable.

What about sleep(3)?

CONFORMING TO
    POSIX.1-2001.

This function isn’t part of standard C, but it’s available on any system claiming to implement POSIX.1-2001 (the POSIX standard from 2001). If the program needs to run on an operating system not implementing this POSIX standard (i.e. Windows), you’ll need to call an alternative function, probably inside a different #if .. #endif branch. More on this in a moment.

If you’re coding to POSIX, you must define the _POSIX_C_SOURCE feature test macro to the standard you intend to use prior to any system header includes:

A POSIX-conforming application should ensure that the feature test macro _POSIX_C_SOURCE is defined before inclusion of any header.

For example, to properly access POSIX.1-2001 functions in your application, define _POSIX_C_SOURCE to 200112L. With this defined, it’s safe to assume access to all of C and everything from that standard of POSIX. You can do this at the top of your sources, but I personally like the tidiness of a global config.h that gets included before everything.
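
For example, a config.h for a program targeting C99 and POSIX.1-2001 might contain nothing more than this:

/* config.h */
#define _POSIX_C_SOURCE 200112L

Each source file then includes it ahead of everything else. Here’s a trivial example; the file names are just for illustration.

/* example.c */
#include "config.h"   /* must come before any system header */
#include <unistd.h>

int
main(void)
{
    sleep(1);  /* POSIX.1-2001, properly declared thanks to config.h */
    return 0;
}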

How to create a portable build

So you’ve written clean, portable C to the standards. How do you build this application? The natural choice is make. It’s available everywhere and it’s part of POSIX.

Again, the tricky part is teasing apart the standard from the extension. I’m a long-time sinner in this regard, having far too often written Makefiles that depend on GNU Make extensions. This is a real pain when building programs on systems without the GNU utilities. I’ve been making amends (and finding some bugs as a result).

No implementation makes the division clear in its documentation, and especially don’t bother looking at the GNU Make manual. Your best resource is the standard itself. If you’re already familiar with make, coding to the standard is largely a matter of unlearning the various extensions you know.

Outside of some hacks, this means you don’t get conditionals (if, else, etc.). With some practice, both in sticking to portable code and in writing portable Makefiles, you’ll find that you don’t really need them. Following the conventional macros (CC, CFLAGS, LDFLAGS, LDLIBS, and so on) will cover most situations.

You don’t need to do anything weird with the assignments. The user invoking make can override them easily. For example, here’s part of a Makefile:

CC     = c99
CFLAGS = -Wall -Wextra -Os

But the user wants to use clang, and their system needs to explicitly link -lsocket (e.g. Solaris). The user can override the macro definitions on the command line:

$ make CC=clang LDLIBS=-lsocket

The same rules apply to the programs you invoke from the Makefile. Read the standards documents and ignore your system’s man pages so as to avoid accidentally using an extension. It’s especially valuable to learn the Bourne shell language and avoid any accidental bashisms in your Makefiles and scripts. The dash shell is good for testing your scripts.

Makefiles conforming to the standard will, unfortunately, be more verbose than those taking advantage of a particular implementation. If you know how to code Bourne shell — which is not terribly difficult to learn — then you might even consider hand-writing a configure script to generate the Makefile (a la metaprogramming). This gives you a more flexible language with conditionals, and, being generated, redundancy in the Makefile no longer matters.

As someone who frequently dabbles with BSD systems, my life has gotten a lot easier since learning to write portable Makefiles and scripts.

But what about Windows?

It’s the elephant in the room and I’ve avoided talking about it so far. If you want to build with Visual Studio’s command line tools — something I do on occasion — build portability goes out the window. Visual Studio has nmake.exe, which nearly conforms to POSIX make. However, without the standard unix utilities and with the completely foreign compiler interface for cl.exe, there’s absolutely no hope of writing a Makefile portable to this situation.

The nice alternative is MinGW(-w64) with MSYS or Cygwin supplying the unix utilities, though it has the problem of linking against msvcrt.dll. Another option is a separate Makefile dedicated to nmake.exe and the Visual Studio toolchain. Good luck defining a correctly working “clean” target with del.exe.

My preferred approach lately is an amalgamation build (as seen in Enchive): carefully concatenate all the application’s sources into one giant source file. First concatenate all the headers in the right order, followed by all the C files. Use sed to remove any local includes. You can do this all on a unix system with the nice utilities, then point cl.exe at the amalgamation for the Visual Studio build. It’s not very useful for actual development (i.e. you don’t want to edit the amalgamation), but development can happen under MinGW-w64, where the amalgamation isn’t needed.

What about all those POSIX functions? You’ll need to find Win32 replacements on MSDN. I prefer to do this by abstracting those operating system calls. For example, compare POSIX sleep(3) and Win32 Sleep().

#if defined(_WIN32)
#include <windows.h>

void
my_sleep(int s)
{
    Sleep(s * 1000);  // TODO: handle overflow, maybe
}

#else /* __unix__ */
#include <unistd.h>

void
my_sleep(int s)
{
    sleep(s);  // TODO: fix signal interruption
}
#endif

Then the rest of the program calls my_sleep(). There’s another example in the OpenMP article with pwrite(2) and WriteFile(). This demonstrates that supporting a bunch of different unix-like systems is really easy, but introducing Windows portability adds a disproportionate amount of complexity.

Caveat: paths and filenames

There’s one major complication with filenames for applications portable to Windows. In the unix world, filenames are null-terminated bytestrings. Typically these are Unicode strings encoded as UTF-8, but that’s not necessarily so. The kernel just sees bytestrings. A bytestring doesn’t necessarily have a formal Unicode representation, which can be a problem for languages that want filenames to be Unicode strings.

On Windows, filenames are somewhere between UCS-2 and UTF-16, but end up being neither. They’re really null-terminated unsigned 16-bit integer arrays. It’s almost UTF-16 except that Windows allows unpaired surrogates. This means Windows filenames also don’t have a formal Unicode representation, but in a completely different way than unix. Some heroic efforts have gone into working around this issue.

As a result, it’s highly non-trivial to correctly support all possible filenames on both systems in the same program, especially when they’re passed as command line arguments.
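
To make the mismatch concrete, here’s a sketch of the type-level difference an abstraction layer has to paper over; the typedef name is mine, not from any real library.

#include <stddef.h>

/* The platform's native filename unit. Converting between the two
 * without losing information is the hard part. */
#if defined(_WIN32)
typedef wchar_t os_char;  /* 16-bit units; may contain unpaired surrogates */
#else
typedef char os_char;     /* opaque bytes; usually, but not always, UTF-8 */
#endif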

Summary

The key points are:

  1. Document the standards your application requires and strictly stick to them.
  2. Ignore the vendor documentation if it doesn’t clearly delineate extensions.

This was all a discussion of non-GUI applications, and I didn’t really touch on libraries. Many libraries are simple to access in the build (just add it to LDLIBS), but some libraries — GUIs in particular — are particularly complicated to manage portably and will require a more complex solution (pkg-config, CMake, Autoconf, etc.).
