Fibers: the Most Elegant Windows API

This article was discussed on Hacker News.

The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly, and lacking good taste. Microsoft has done a pretty commendable job with backwards compatibility, but the trade-off is that the API is filled to the brim with historical cruft. Every hasty, poor design over the decades is carried forward forever, and, in many cases, even built upon, which essentially doubles down on past mistakes. POSIX certainly has its own ugly corners, but those are the exceptions. In the Windows API, elegance is the exception.

That’s why, when I recently revisited the Fibers API, I was pleasantly surprised. It’s one of the exceptions — much cleaner than the optional, deprecated, and now obsolete POSIX equivalent. It’s not quite an apples-to-apples comparison since the POSIX version is slightly more powerful, and more complicated as a result. I’ll cover the difference in this article.

For the last part of this article, I’ll walk through an async/await framework built on top of fibers. The framework allows coroutines in C programs to await on arbitrary kernel objects.

Fiber Async/await Demo

Fibers

Windows fibers are really just stackful, symmetric coroutines. From a different point of view, they’re cooperatively scheduled threads, which is the source of the analogous name, fibers. They’re symmetric because all fibers are equal, and no fiber is the “main” fiber. If any fiber returns from its start routine, the program exits. (Older versions of Wine will crash when this happens, but it was recently fixed.) It’s equivalent to the process’ main thread returning from main(). The initial fiber is free to create a second fiber, yield to it, then the second fiber destroys the first.

For now I’m going to focus on the core set of fiber functions. There are some additional capabilities I’m going to ignore, including support for fiber local storage. The important functions are just these five:

void *CreateFiber(size_t stack_size, void (*proc)(void *), void *arg);
void  SwitchToFiber(void *fiber);
bool  ConvertFiberToThread(void);
void *ConvertThreadToFiber(void *arg);
void  DeleteFiber(void *fiber);

To emphasize its simplicity, I’ve shown them here with more standard prototypes than seen in their formal documentation. That documentation uses the clunky Windows API typedefs still burdened with its 16-bit heritage — e.g. LPVOID being a “long pointer” from the segmented memory of the 8086:
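
/* Paraphrased from the SDK documentation: */
LPVOID CreateFiber(SIZE_T dwStackSize, LPFIBER_START_ROUTINE lpStartAddress, LPVOID lpParameter);
VOID   SwitchToFiber(LPVOID lpFiber);
BOOL   ConvertFiberToThread(void);
LPVOID ConvertThreadToFiber(LPVOID lpParameter);
VOID   DeleteFiber(LPVOID lpFiber);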

Fibers are represented using opaque, void pointers. Maybe that’s a little too simple since it’s easy to misuse in C, but I like it. The return values for CreateFiber() and ConvertThreadToFiber() are void pointers since these both create fibers.

The fiber start routine returns nothing and takes a void * “user pointer”. That’s nearly what I’d expect, except that it would probably make more sense for a fiber to return int, more in line with main / WinMain / mainCRTStartup / WinMainCRTStartup. After all, when any fiber returns from its start routine, it’s like returning from the main function, so an integer return value would have been the natural choice.

A fiber may delete itself, which is the same as exiting the thread. However, a fiber cannot yield (i.e. SwitchToFiber()) to itself. That’s undefined behavior.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

void
coup(void *king)
{
    puts("Long live the king!");
    DeleteFiber(king);
    ConvertFiberToThread(); /* seize the main thread */
    /* ... */
}

int
main(void)
{
    void *king = ConvertThreadToFiber(0);
    void *pretender = CreateFiber(0, coup, king);
    SwitchToFiber(pretender);
    abort(); /* unreachable */
}

Only fibers can yield to fibers, but when the program starts up, there are no fibers. At least one thread must first convert itself into a fiber using ConvertThreadToFiber(), which returns the fiber object that represents itself. It takes one argument analogous to the last argument of CreateFiber(), except that there’s no start routine to accept it. The process is reversed with ConvertFiberToThread().

Fibers don’t belong to any particular thread and can be scheduled on any thread if properly synchronized. Obviously one should never yield to the same fiber in two different threads at the same time.

Contrast with POSIX

The POSIX equivalent was context switching via the <ucontext.h> API. It’s also stackful and symmetric, but it has just three important functions: getcontext(3), makecontext(3), and swapcontext(3).

int  getcontext(ucontext_t *ucp);
void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
int  swapcontext(ucontext_t *oucp, const ucontext_t *ucp);

These are roughly equivalent to GetCurrentFiber(), CreateFiber(), and SwitchToFiber(). There is no need for ConvertFiberToThread() since threads can context switch without preparation. There’s also no DeleteFiber() because the resources are managed by the program itself. That’s where POSIX contexts are a little bit more powerful.

The first argument to CreateFiber() is the desired stack size, with zero indicating the default stack size. The stack is allocated and freed by the operating system. The downside is that the caller doesn’t have a choice in managing the lifetime of this stack and how it’s allocated. If you’re frequently creating and destroying coroutines, those stacks are constantly being allocated and freed.

In makecontext(3), the caller allocates and supplies the stack. Freeing that stack is equivalent to destroying the context. A program that frequently creates and destroys contexts can maintain a stack pool or otherwise more efficiently manage their allocation. This makes it more powerful, but it also makes it a little more complicated. It would be hard to remember how to do all this without a careful reading of the documentation:

/* Create a context */
ucontext_t ctx;
ctx.uc_stack.ss_sp = malloc(SIGSTKSZ);
ctx.uc_stack.ss_size = SIGSTKSZ;
ctx.uc_link = 0;
getcontext(&ctx);
makecontext(&ctx, proc, 0);

/* Destroy a context */
free(ctx.uc_stack.ss_sp);

Note how makecontext(3) is variadic (...), passing its arguments on to the start routine of the context. This seems like it might be better than a user pointer. Unfortunately it’s not, since those arguments are strictly limited to integers.

Ultimately I like the fiber API better. The first time I tried it out, I could guess my way through it without looking closely at the documentation.

Async / await with fibers

Why was I looking at the Fiber API? I’ve known about coroutines for years but I didn’t understand how they could be useful. Sure, the function can yield, but what other coroutine should it yield to? It wasn’t until I was recently bit by the async/await bug that I finally saw a “killer feature” that justified their use. Generators come pretty close, though.

Windows fibers are a coroutine primitive suitable for building async/await in C programs, where it can be just as useful. To prove that it’s possible, I built async/await on top of fibers in 95 lines of code.

The alternatives are to use a third-party coroutine library or to do it myself with some assembly programming. However, having it built into the operating system is quite convenient! It’s unfortunate that it’s limited to Windows. Ironically, though, everything I wrote for this article, including the async/await demonstration, was originally written on Linux using Mingw-w64 and tested using Wine. Only after I was done did I even try it on Windows.

Before diving into how it works, there’s a general concept about the Windows API that must be understood: All kernel objects can be in either a signaled or unsignaled state. The API provides functions that block on a kernel object until it is signaled. The two important ones are WaitForSingleObject() and WaitForMultipleObjects(). The latter behaves very much like poll(2) in POSIX.
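
For reference, their prototypes look like this (parameter names are mine):

DWORD WaitForSingleObject(HANDLE handle, DWORD timeout_ms);
DWORD WaitForMultipleObjects(DWORD count, const HANDLE *handles,
                             BOOL wait_all, DWORD timeout_ms);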

Usually the signal is tied to some useful event, like a process or thread exiting, the completion of an I/O operation (i.e. asynchronous overlapped I/O), a semaphore being incremented, etc. It’s a generic way to wait for some event. However, instead of blocking the thread, wouldn’t it be nice to await on the kernel object? In my aio library for Emacs, the fundamental “wait” object was a promise. For this API it’s a kernel object handle.

So, the await function will take a kernel object, register it with the scheduler, then yield to the scheduler. The scheduler — which is a global variable, so there’s only one scheduler per process — looks like this:

struct {
    void *main_fiber;
    HANDLE handles[MAXIMUM_WAIT_OBJECTS];
    void *fibers[MAXIMUM_WAIT_OBJECTS];
    void *dead_fiber;
    int count;
} async_loop;

While fibers are symmetric, coroutines in my async/await implementation are not. One fiber is the scheduler, main_fiber, and the other fibers always yield to it.

There is an array of kernel object handles, handles, and an array of fibers. The elements in these arrays are paired with each other, but it’s convenient to store them separately, as I’ll show soon. fibers[0] is waiting on handles[0], and so on.

The array is a fixed size, MAXIMUM_WAIT_OBJECTS (64), because there’s a hard limit on the number of fibers that can wait at once. This pathetically small limitation is an unfortunate, hard-coded restriction of the Windows API. It kills most practical uses of my little library. Fortunately there’s no limit on the number of handles we might want to wait on, just the number of co-existing fibers.

When a fiber is about to return from its start routine, it yields one last time and registers itself on the dead_fiber member. The scheduler will delete this fiber as soon as it’s given control. Fibers never truly return since that would terminate the program.

With this, the await function, async_await(), is pretty simple. It registers the handle with the scheduler, then yields to the scheduler fiber.

void
async_await(HANDLE h)
{
    async_loop.handles[async_loop.count] = h;
    async_loop.fibers[async_loop.count] = GetCurrentFiber();
    async_loop.count++;
    SwitchToFiber(async_loop.main_fiber);
}

Caveat: The scheduler destroys this handle with CloseHandle() after it signals, so don’t try to reuse it. This made my demonstration simpler, but it might be better to not do this.

A fiber can exit at any time. Such an exit is inserted implicitly before a fiber actually returns:

void
async_exit(void)
{
    async_loop.dead_fiber = GetCurrentFiber();
    SwitchToFiber(async_loop.main_fiber);
}

The start routine given to async_start() is actually wrapped in the real start routine. This is how async_exit() is injected:

struct fiber_wrapper {
    void (*func)(void *);
    void *arg;
};

static void
fiber_wrapper(void *arg)
{
    struct fiber_wrapper *fw = arg;
    fw->func(fw->arg);
    async_exit();
}

int
async_start(void (*func)(void *), void *arg)
{
    if (async_loop.count == MAXIMUM_WAIT_OBJECTS) {
        return 0;
    } else {
        struct fiber_wrapper fw = {func, arg};
        SwitchToFiber(CreateFiber(0, fiber_wrapper, &fw));
        return 1;
    }
}

The library provides a single awaitable function, async_sleep(). It creates a “waitable timer” object, starts the countdown, and returns it. (Notice how SetWaitableTimer() is a typically-ugly Win32 function with excessive parameters.)

HANDLE
async_sleep(double seconds)
{
    HANDLE promise = CreateWaitableTimer(0, 0, 0);
    LARGE_INTEGER t;
    t.QuadPart = (long long)(seconds * -10000000.0);
    SetWaitableTimer(promise, &t, 0, 0, 0, 0);
    return promise;
}

A more realistic example would be overlapped I/O. For example, you’d open a file (CreateFile()) in overlapped mode, then when you, say, read from that file (ReadFile()) you create an event object (CreateEvent()), populate an overlapped I/O structure with the event, offset, and length, then finally await on the event object. The fiber will be resumed when the operation is complete.
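
Here’s a rough sketch of that flow using the async_await() above. It’s illustrative only: the file name is made up, error handling is omitted, and, per the side note below, real file I/O has caveats.

HANDLE file = CreateFile("data.bin", GENERIC_READ, FILE_SHARE_READ, 0,
                         OPEN_EXISTING, FILE_FLAG_OVERLAPPED, 0);
char buf[4096];
OVERLAPPED ov = {0};
ov.Offset = 0;                        /* read from the beginning */
ov.hEvent = CreateEvent(0, 0, 0, 0);  /* signaled when the read completes */
ReadFile(file, buf, sizeof(buf), 0, &ov);  /* returns immediately (ERROR_IO_PENDING) */
async_await(ov.hEvent);               /* fiber resumes once the read is done */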

Side note: Unfortunately overlapped I/O doesn’t work correctly for files, and many operations can’t be done asynchronously, like opening files. When it comes to files, you’re better off using dedicated threads as libuv does instead of overlapped I/O. You can still await on these operations. You’d just await on the signal from the thread doing synchronous I/O, not from overlapped I/O.

The most complex part is the scheduler, and it’s really not complex at all:

void
async_run(void)
{
    while (async_loop.count) {
        /* Wait for next event */
        DWORD nhandles = async_loop.count;
        HANDLE *handles = async_loop.handles;
        DWORD r = WaitForMultipleObjects(nhandles, handles, 0, INFINITE);

        /* Remove event and fiber from waiting array */
        void *fiber = async_loop.fibers[r];
        CloseHandle(async_loop.handles[r]);
        async_loop.handles[r] = async_loop.handles[nhandles - 1];
        async_loop.fibers[r] = async_loop.fibers[nhandles - 1];
        async_loop.count--;

        /* Run the fiber */
        SwitchToFiber(fiber);

        /* Destroy the fiber if it exited */
        if (async_loop.dead_fiber) {
            DeleteFiber(async_loop.dead_fiber);
            async_loop.dead_fiber = 0;
        }
    }
}

This is why the handles are in their own array. The array can be passed directly to WaitForMultipleObjects(). The return value indicates which handle was signaled. The handle is closed, the entry removed from the scheduler, and then the fiber is resumed.

That WaitForMultipleObjects() is what limits the number of fibers. It’s not possible to wait on more than 64 handles at once! This is hard-coded into the API. How? A return value of 64 is an error code, and changing this would break the API. Remember what I said about being locked into bad design decisions of the past?

To be fair, WaitForMultipleObjects() was a doomed API anyway, just like select(2) and poll(2) in POSIX. It scales very poorly since the entire array of objects being waited on must be traversed on each call. That’s terribly inefficient when waiting on large numbers of objects. This sort of problem is solved by interfaces like kqueue (BSD), epoll (Linux), and IOCP (Windows). Unfortunately IOCP doesn’t really fit this particular problem well — awaiting on kernel objects — so I couldn’t use it.

When the awaiting fiber count is zero and the scheduler has control, all fibers must have completed and there’s nothing left to do. However, the caller can schedule more fibers and then restart the scheduler if desired.
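
Putting the pieces together, usage might look something like the following sketch. The async_init() helper is hypothetical: it would call ConvertThreadToFiber() and store the result in the scheduler’s main_fiber. The real demo.c differs.

static void
ticker(void *arg)
{
    const char *name = arg;
    for (int i = 0; i < 3; i++) {
        async_await(async_sleep(1.0));
        printf("%s: tick %d\n", name, i);
    }
}

int
main(void)
{
    async_init();              /* hypothetical setup: ConvertThreadToFiber() */
    async_start(ticker, "a");
    async_start(ticker, "b");
    async_run();               /* returns once every fiber has finished */
}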

That’s all there is to it. Have a look at demo.c to see how the API looks in some trivial examples. On Linux you can see it in action with make check. On Windows, you just need to compile it, then run it like a normal program. If there were a better function than WaitForMultipleObjects() in the Windows API, I would have considered turning this demonstration into a real library.

Endlessh: an SSH Tarpit

This article was discussed on Hacker News and on reddit (also).

I’m a big fan of tarpits: a network service that intentionally inserts delays in its protocol, slowing down clients by forcing them to wait. This arrests the speed at which a bad actor can attack or probe the host system, and it ties up some of the attacker’s resources that might otherwise be spent attacking another host. When done well, a tarpit imposes more cost on the attacker than the defender.

The Internet is a very hostile place, and anyone who’s ever stood up an Internet-facing IPv4 host has witnessed the immediate and continuous attacks against their server. I’ve maintained such a server for nearly six years now, and more than 99% of my incoming traffic has ill intent. One part of my defenses has been tarpits in various forms. The latest addition is an SSH tarpit I wrote a couple of months ago:

Endlessh: an SSH tarpit

This program opens a socket and pretends to be an SSH server. However, it actually just ties up SSH clients with false promises indefinitely — or at least until the client eventually gives up. After cloning the repository, here’s how you can try it out for yourself (default port 2222):

$ make
$ ./endlessh &
$ ssh -p2222 localhost

Your SSH client will hang there and wait for at least several days before finally giving up. Like a mammoth in the La Brea Tar Pits, it got itself stuck and can’t get itself out. As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.

My Internet-facing Endlessh server listens on port 22, which is the standard SSH port. I long ago moved my real SSH server off to another port where it sees a whole lot less SSH traffic — essentially none. This makes the logs a whole lot more manageable. And (hopefully) Endlessh convinces attackers not to look around for an SSH server on another port.

How does it work? Endlessh exploits a little paragraph in RFC 4253, the SSH protocol specification. Immediately after the TCP connection is established, and before negotiating the cryptography, both ends send an identification string:

SSH-protoversion-softwareversion SP comments CR LF

The RFC also notes:

The server MAY send other lines of data before sending the version string.

There is no limit on the number of lines, just that these lines must not begin with “SSH-” since that would be ambiguous with the identification string, and lines must not be longer than 255 characters including CRLF. So Endlessh sends an endless stream of randomly-generated “other lines of data” without ever intending to send a version string. By default it waits 10 seconds between each line. This slows down the protocol, but prevents it from actually timing out.

This means Endlessh need not know anything about cryptography or the vast majority of the SSH protocol. It’s dead simple.

Implementation strategies

Ideally the tarpit’s resource footprint should be as small as possible. It’s just a security tool, and the server does have an actual purpose that doesn’t include being a tarpit. It should tie up the attacker’s resources, not the server’s, and should generally be unnoticeable. (Take note all those who write the awful “security” products I have to tolerate at my day job.)

Even when many clients have been trapped, Endlessh spends more than 99.999% of its time waiting around, doing nothing. It wouldn’t even be accurate to call it I/O-bound. If anything, it’s timer-bound, waiting around before sending off the next line of data. The most precious resource to conserve is memory.

Processes

The most straightforward way to implement something like Endlessh is a fork server: accept a connection, fork, and the child simply alternates between sleep(3) and write(2):

for (;;) {
    ssize_t r;
    char line[256];

    sleep(DELAY);
    generate_line(line);
    r = write(fd, line, strlen(line));
    if (r == -1 && errno != EINTR) {
        exit(0);
    }
}

A process per connection is a lot of overhead when connections are expected to be up hours or even weeks at a time. An attacker who knows about this could exhaust the server’s resources with little effort by opening up lots of connections.

Threads

A better option is, instead of processes, to create a thread per connection. On Linux this is practically the same thing, but it’s still better. However, you still have to allocate a stack for the thread and the kernel will have to spend some resources managing the thread.
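
A sketch of that approach (not Endlessh’s actual code), reusing DELAY and generate_line() from the fork example and omitting the usual includes:

static void *
client_thread(void *arg)
{
    int fd = (int)(intptr_t)arg;
    for (;;) {
        char line[256];
        sleep(DELAY);
        generate_line(line);
        if (write(fd, line, strlen(line)) == -1 && errno != EINTR) {
            close(fd);
            return 0;
        }
    }
}

/* in the accept loop */
pthread_t thread;
pthread_create(&thread, 0, client_thread, (void *)(intptr_t)fd);
pthread_detach(thread);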

Poll

For Endlessh I went for an even more lightweight version: a single-threaded poll(2) server, analogous to stackless green threads. The overhead per connection is about as low as it gets.

Clients that are being delayed are not registered in poll(2). Their only overhead is the socket object in the kernel, and another 78 bytes to track them in Endlessh. Most of those bytes are used only for accurate logging. Only those clients that are overdue for a new line are registered for poll(2).

When clients are waiting, but no clients are overdue, poll(2) is essentially used in place of sleep(3). Though since it still needs to manage the accept server socket, it (almost) never actually waits on nothing.
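
In rough, sketch-level C (not the actual implementation; MAX_CLIENTS, server_fd, now(), next_due_ms(), and the client list are stand-ins):

struct pollfd fds[1 + MAX_CLIENTS];
nfds_t nfds = 0;
fds[nfds].fd = server_fd;
fds[nfds].events = POLLIN;           /* new connections */
nfds++;
for (struct client *c = clients; c; c = c->next) {
    if (c->send_next <= now()) {
        fds[nfds].fd = c->fd;
        fds[nfds].events = POLLOUT;  /* overdue: ready for its next line */
        nfds++;
    }
}
poll(fds, nfds, next_due_ms());      /* doubles as the sleep timer */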

There’s an option to limit the total number of client connections so that it doesn’t get out of hand. In this case it will stop polling the accept socket until a client disconnects. I probably shouldn’t have bothered with this option and instead relied on ulimit, a feature already provided by the operating system.

I could have used epoll (Linux) or kqueue (BSD), which would be much more efficient than poll(2). The problem with poll(2) is that the entire set of overdue sockets must be registered and unregistered with the kernel each time around the main loop. This is by far the most CPU-intensive part of Endlessh, and it’s all inflicted on the kernel. Most of the time, even with thousands of clients trapped in the tarpit, only a small number of them are polled at once, so I opted for better portability instead.

One consequence of not polling connections that are waiting is that disconnections aren’t noticed in a timely fashion. This makes the logs less accurate than I’d like, but otherwise it’s pretty harmless. Unfortunately, even if I wanted to fix this, the poll(2) interface isn’t quite equipped for it anyway.

Raw sockets

With a poll(2) server, the biggest overhead remaining is in the kernel, where it allocates send and receive buffers for each client and manages the proper TCP state. The next step to reducing this overhead is Endlessh opening a raw socket and speaking TCP itself, bypassing most of the operating system’s TCP/IP stack.

Much of the TCP connection state doesn’t matter to Endlessh and doesn’t need to be tracked. For example, it doesn’t care about any data sent by the client, so no receive buffer is needed, and any data that arrives could be dropped on the floor.

Even more, raw sockets would allow for some even nastier tarpit tricks. Despite the long delays between data lines, the kernel itself responds very quickly on the TCP layer and below. ACKs are sent back quickly and so on. An astute attacker could detect that the delay is artificial, imposed above the TCP layer by an application.

If Endlessh worked at the TCP layer, it could tarpit the TCP protocol itself. It could introduce artificial “noise” to the connection that requires packet retransmissions, delay ACKs, etc. It would look a lot more like network problems than a tarpit.

I haven’t taken Endlessh this far, nor do I plan to do so. At the moment attackers either have a hard timeout, so this wouldn’t matter, or they’re pretty dumb and Endlessh already works well enough.

asyncio and other tarpits

Since writing Endlessh I’ve learned about Python’s asyncio, and it’s actually a near-perfect fit for this problem. I should have just used it in the first place. The hard part is already implemented within asyncio, and the problem isn’t CPU-bound, so being written in Python doesn’t matter.

Here’s a simplified (no logging, no configuration, etc.) version of Endlessh implemented in about 20 lines of Python 3.7:

import asyncio
import random

async def handler(_reader, writer):
    try:
        while True:
            await asyncio.sleep(10)
            writer.write(b'%x\r\n' % random.randint(0, 2**32))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 2222)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Since Python coroutines are stackless, the per-connection memory overhead is comparable to the C version. So it seems asyncio is perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up attackers trying to exploit HTTP servers. It slowly sends a random, endless HTTP header:

import asyncio
import random

async def handler(_reader, writer):
    writer.write(b'HTTP/1.1 200 OK\r\n')
    try:
        while True:
            await asyncio.sleep(5)
            header = random.randint(0, 2**32)
            value = random.randint(0, 2**32)
            writer.write(b'X-%x: %x\r\n' % (header, value))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Try it out for yourself. Firefox and Chrome will spin on that server for hours before giving up. I have yet to see curl actually time out on its own with the default settings (--max-time/-m does work correctly, though).
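
For example, with the script above running on port 8080, this watches the headers trickle in and caps the wait at 30 seconds:

$ curl -v -m 30 http://localhost:8080/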

Parting exercise for the reader: Using the examples above as a starting point, implement an SMTP tarpit using asyncio. Bonus points for using TLS connections and testing it against real spammers.

An Async / Await Library for Emacs Lisp

As part of building my Python proficiency, I’ve learned how to use asyncio. This new language feature first appeared in Python 3.5 (PEP 492, September 2015). JavaScript grew a nearly identical feature in ES2017 (June 2017). An async function can pause to await on an asynchronously computed result, much like a generator pausing when it yields a value.

In fact, both Python and JavaScript async functions are essentially just fancy generator functions with some specialized syntax and semantics. That is, they’re stackless coroutines. Both languages already had generators, so their generator-like async functions are a natural extension that — unlike stackful coroutines — do not require significant, new runtime plumbing.

Emacs officially got generators in 25.1 (September 2016), though, unlike Python and JavaScript, it didn’t require any additional support from the compiler or runtime. It’s implemented entirely using Lisp macros. In other words, it’s just another library, not a core language feature. In theory, the generator library could be easily backported to the first Emacs release to properly support lexical closures, Emacs 24.1 (June 2012).

For the same reason, stackless async/await coroutines can also be implemented as a library. So that’s what I did, letting Emacs’ generator library do most of the heavy lifting. The package is called aio:

It’s modeled more closely on JavaScript’s async functions than Python’s asyncio, with the core representation being promises rather than coroutine objects. I just have an easier time reasoning about promises than coroutines.

I’m definitely not the first person to realize this was possible, and was beaten to the punch by two years. Wanting to avoid fragmentation, I set aside all formality in my first iteration on the idea, not even bothering with namespacing my identifiers. It was to be only an educational exercise. However, I got quite attached to my little toy. Once I got my head wrapped around the problem, everything just sort of clicked into place so nicely.

In this article I will show step-by-step one way to build async/await on top of generators, laying out one concept at a time and then building upon each. But first, some examples to illustrate the desired final result.

aio example

Ignoring all its problems for a moment, suppose you want to use url-retrieve to fetch some content from a URL and return it. To keep this simple, I’m going to omit error handling. Also assume that lexical-binding is t for all examples; lexical scope is required by the generator library, and therefore also by aio.

The most naive approach is to fetch the content synchronously:

(defun fetch-fortune-1 (url)
  (let ((buffer (url-retrieve-synchronously url)))
    (with-current-buffer buffer
      (prog1 (buffer-string)
        (kill-buffer)))))

The result is returned directly, and errors are communicated by an error signal (i.e. Emacs’ version of exceptions). This is convenient, but the function will block the main thread, locking up Emacs until the result has arrived. This is obviously very undesirable, so, in practice, everyone nearly always uses the asynchronous version:

(defun fetch-fortune-2 (url callback)
  (url-retrieve url (lambda (_status)
                      (funcall callback (buffer-string)))))

The main thread no longer blocks, but it’s a whole lot less convenient. The result isn’t returned to the caller, and instead the caller supplies a callback function. The result, whether success or failure, will be delivered via that callback, so the caller must split itself into two pieces: the part before the callback and the callback itself. Errors cannot be delivered using an error signal because of the inverted flow control.

The situation gets worse if, say, you need to fetch results from two different URLs. You either fetch results one at a time (inefficient), or you manage two different callbacks that could be invoked in any order, and therefore have to coordinate.

Wouldn’t it be nice for the function to work like the first example, but be asynchronous like the second example? Enter async/await:

(aio-defun fetch-fortune-3 (url)
  (let ((buffer (aio-await (aio-url-retrieve url))))
    (with-current-buffer buffer
      (prog1 (buffer-string)
        (kill-buffer)))))

A function defined with aio-defun is just like defun except that it can use aio-await to pause and wait on any other function defined with aio-defun — or, more specifically, any function that returns a promise. Borrowing Python parlance: returning a promise makes a function awaitable. If there’s an error, it’s delivered as an error signal from aio-url-retrieve, just like the first example. When called, this function returns immediately with a promise object that represents a future result. The caller might look like this:

(defcustom fortune-url ...)

(aio-defun display-fortune ()
  (interactive)
  (message "%s" (aio-await (fetch-fortune-3 fortune-url))))

How wonderfully clean that looks! And, yes, it even works with interactive like that. I can M-x display-fortune and a fortune is printed in the minibuffer as soon as the result arrives from the server. In the meantime Emacs doesn’t block and I can continue my work.

You can’t do anything you couldn’t already do before. It’s just a nicer way to organize the same callbacks: implicit rather than explicit.

Promises, simplified

The core object at play is the promise. Promises are already a rather simple concept, but aio promises have been distilled to their essence, as they’re only needed for this singular purpose. More on this later.

As I said, a promise represents a future result. In practical terms, a promise is just an object to which one can subscribe with a callback. When the result is ready, the callbacks are invoked. Another way to put it is that promises reify the concept of callbacks. A callback is no longer just the idea of an extra argument to a function. It’s a first-class thing that itself can be passed around as a value.

Promises have two slots: the final promise result and a list of subscribers. A nil result means the result hasn’t been computed yet. It’s so simple I’m not even bothering with cl-struct.

(defun aio-promise ()
  "Create a new promise object."
  (record 'aio-promise nil ()))

(defsubst aio-promise-p (object)
  (and (eq 'aio-promise (type-of object))
       (= 3 (length object))))

(defsubst aio-result (promise)
  (aref promise 1))

To subscribe to a promise, use aio-listen:

(defun aio-listen (promise callback)
  (let ((result (aio-result promise)))
    (if result
        (run-at-time 0 nil callback result)
      (push callback (aref promise 2)))))

If the result isn’t ready yet, add the callback to the list of subscribers. If the result is ready, call the callback on the next event loop turn using run-at-time. This is important because it keeps all the asynchronous components isolated from one another. They won’t see each other’s frames on the call stack, nor frames from aio. This is so important that the Promises/A+ specification is explicit about it.

The other half of the equation is resolving a promise, which is done with aio-resolve. Unlike other promises, aio promises don’t care whether the promise is being fulfilled (success) or rejected (error). Instead a promise is resolved using a value function — or, usually, a value closure. Subscribers receive this value function and extract the value by invoking it with no arguments.

Why? This lets the promise’s resolver decide the semantics of the result. Instead of returning a value, this function can instead signal an error, propagating an error signal that terminated an async function. Because of this, the promise doesn’t need to know how it’s being resolved.

When a promise is resolved, subscribers are each scheduled in their own event loop turns in the same order that they subscribed. If a promise has already been resolved, nothing happens. (Thought: Perhaps this should be an error in order to catch API misuse?)

(defun aio-resolve (promise value-function)
  (unless (aio-result promise)
    (let ((callbacks (nreverse (aref promise 2))))
      (setf (aref promise 1) value-function
            (aref promise 2) ())
      (dolist (callback callbacks)
        (run-at-time 0 nil callback value-function)))))

If you’re not an async function, you might subscribe to a promise like so:

(aio-listen promise (lambda (v)
                      (message "%s" (funcall v))))

The simplest example of a non-async function that creates and delivers on a promise is a “sleep” function:

(defun aio-sleep (seconds &optional result)
  (let ((promise (aio-promise))
        (value-function (lambda () result)))
    (prog1 promise
      (run-at-time seconds nil
                   #'aio-resolve promise value-function))))

Similarly, here’s a “timeout” promise that delivers a special timeout error signal at a given time in the future.

(defun aio-timeout (seconds)
  (let ((promise (aio-promise))
        (value-function (lambda () (signal 'aio-timeout nil))))
    (prog1 promise
      (run-at-time seconds nil
                   #'aio-resolve promise value-function))))

That’s all there is to promises.

Evaluate in the context of a promise

Before we get into pausing functions, let’s deal with the slightly simpler matter of delivering their return values using a promise. What we need is a way to evaluate a “body” and capture its result in a promise. If the body exits due to a signal, we want to capture that as well.

Here’s a macro that does just this:

(defmacro aio-with-promise (promise &rest body)
  `(aio-resolve ,promise
                (condition-case error
                    (let ((result (progn ,@body)))
                      (lambda () result))
                  (error (lambda ()
                           (signal (car error) ; rethrow
                                   (cdr error)))))))

The body result is captured in a closure and delivered to the promise. If there’s an error signal, it’s “rethrown” into subscribers by the promise’s value function.

This is where Emacs Lisp has a serious weak spot. There’s not really a concept of rethrowing a signal. Unlike a language with explicit exception objects that can capture a snapshot of the backtrace, the original backtrace is completely lost where the signal is caught. There’s no way to “reattach” it to the signal when it’s rethrown. This is unfortunate because it would greatly help debugging if you got to see the full backtrace on the other side of the promise.

Async functions

So we have promises and we want to pause a function on a promise. Generators have iter-yield for pausing an iterator’s execution. To tackle this problem:

  1. Yield the promise to pause the iterator.
  2. Subscribe a callback on the promise that continues the generator (iter-next) with the promise’s result as the yield result.

All the hard work is done in either side of the yield, so aio-await is just a simple wrapper around iter-yield:

(defmacro aio-await (expr)
  `(funcall (iter-yield ,expr)))

Remember, that funcall is here to extract the promise value from the value function. If it signals an error, this propagates directly into the iterator just as if it had been a direct call — minus an accurate backtrace.

So aio-lambda / aio-defun needs to wrap the body in a generator function (iter-lambda), invoke it to produce an iterator, then drive the iterator using callbacks. Here’s a simplified, unhygienic definition of aio-lambda:

(defmacro aio-lambda (arglist &rest body)
  `(lambda (&rest args)
     (let ((promise (aio-promise))
           (iter (apply (iter-lambda ,arglist
                          (aio-with-promise promise
                            ,@body))
                        args)))
       (prog1 promise
         (aio--step iter promise nil)))))

The body is evaluated inside aio-with-promise with the result delivered to the promise returned directly by the async function.

Before returning, the iterator is handed to aio--step, which drives the iterator forward until it delivers its first promise. When the iterator yields a promise, aio--step attaches a callback back to itself on the promise as described above. Immediately driving the iterator up to the first yielded promise “primes” it, which is important for getting the ball rolling on any asynchronous operations.

If the iterator ever yields something other than a promise, it’s delivered right back into the iterator.

(defun aio--step (iter promise yield-result)
  (condition-case _
      (cl-loop for result = (iter-next iter yield-result)
               then (iter-next iter (lambda () result))
               until (aio-promise-p result)
               finally (aio-listen result
                                   (lambda (value)
                                     (aio--step iter promise value))))
    (iter-end-of-sequence)))

When the iterator is done, nothing more needs to happen since the iterator resolves its own return value promise.

The definition of aio-defun just uses aio-lambda with defalias. There’s nothing to it.
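
A minimal sketch of the idea (the real definition is slightly more elaborate):

(defmacro aio-defun (name arglist &rest body)
  `(defalias ',name (aio-lambda ,arglist ,@body)))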

That’s everything you need! Everything else in the package is merely useful, awaitable functions like aio-sleep and aio-timeout.

Composing promises

Unfortunately url-retrieve doesn’t support timeouts. We can work around this by composing two promises: a url-retrieve promise and an aio-timeout promise. First define a promise-returning function, aio-select, that takes a list of promises and returns (as another promise) the first promise to resolve:

(defun aio-select (promises)
  (let ((result (aio-promise)))
    (prog1 result
      (dolist (promise promises)
        (aio-listen promise (lambda (_)
                              (aio-resolve
                               result
                               (lambda () promise))))))))

We give aio-select both our url-retrieve and timeout promises, and it tells us which resolved first:

(aio-defun fetch-fortune-4 (url timeout)
  (let* ((promises (list (aio-url-retrieve url)
                         (aio-timeout timeout)))
         (fastest (aio-await (aio-select promises)))
         (buffer (aio-await fastest)))
    (with-current-buffer buffer
      (prog1 (buffer-string)
        (kill-buffer)))))

Cool! Note: This will not actually cancel the URL request; it just lets the async function move on earlier, never seeing the result.

Threads

Despite aio being entirely about managing concurrent, asynchronous operations, it has nothing at all to do with threads — as in Emacs 26’s support for kernel threads. All async functions and promise callbacks are expected to run only on the main thread. That’s not to say an async function can’t await on a result from another thread. It just must be done very carefully.

Processes

The package also includes two functions for realizing promises on processes, whether they be subprocesses or network sockets.

For example, this function loops over each chunk of output (typically 4kB) from the process, as delivered to a filter function:

(aio-defun process-chunks (process)
  (cl-loop for chunk = (aio-await (aio-process-filter process))
           while chunk
           do (... process chunk ...)))

Exercise for the reader: Write an awaitable function that returns a line at a time rather than a chunk at a time. You can build it on top of aio-process-filter.

I considered wrapping functions like start-process so that their aio versions would return a promise representing some kind of result from the process. However there are so many different ways to create and configure processes that I would have ended up duplicating all the process functions. Focusing on the filter and sentinel, and letting the caller create and configure the process is much cleaner.

Unfortunately Emacs has no asynchronous API for writing output to a process. Both process-send-string and process-send-region will block if the pipe or socket is full. There is no callback, so you cannot await on writing output. Maybe there’s a way to do it with a dedicated thread?

Another issue is that the process-send-* functions are preemptible, made necessary because they block. The aio-process-* functions leave a gap (i.e. between filter awaits) where no filter or sentinel function is attached. It’s a consequence of promises being single-fire. The gap is harmless so long as the async function doesn’t await something else or get preempted. This needs some more thought.

Update: These process functions no longer exist and have been replaced by a small framework for building chains of promises. See aio-make-callback.

Testing aio

The test suite for aio is a bit unusual. Emacs’ built-in test suite, ERT, doesn’t support asynchronous tests. Furthermore, tests are generally run in batch mode, where Emacs invokes a single function and then exits rather than pump an event loop. Batch mode can only handle asynchronous process I/O, not the async functions of aio. So it’s not possible to run the tests in batch mode.

Instead I hacked together a really crude callback-based test suite. It runs in non-batch mode and writes the test results into a buffer (run with make check). Not ideal, but it works.

One of the tests is a sleep sort (with reasonable tolerances). It’s a pretty neat demonstration of what you can do with aio:

(aio-defun sleep-sort (values)
  (let ((promises (mapcar (lambda (v) (aio-sleep v v)) values)))
    (cl-loop while promises
             for next = (aio-await (aio-select promises))
             do (setf promises (delq next promises))
             collect (aio-await next))))

To see it in action (M-x sleep-sort-demo):

(aio-defun sleep-sort-demo ()
  (interactive)
  (let ((values '(0.1 0.4 1.1 0.2 0.8 0.6)))
    (message "%S" (aio-await (sleep-sort values)))))

Async/await is pretty awesome

I’m quite happy with how this all came together. Once I had the concepts straight — particularly resolving to value functions — everything made sense and all the parts fit together well, and mostly by accident. That feels good.

Python Decorators: Syntactic Artificial Sweetener

Python has a feature called function decorators. With a little bit of syntax, the behavior of a function or class can be modified in useful ways. Python comes with a few decorators, but most of the useful ones are found in third-party libraries.

PEP 318 suggests a very simple, but practical decorator called synchronized, though it doesn’t provide a concrete example. Consider this function that increments a global counter:

counter = 0

def increment():
    global counter
    counter = counter + 1

If this function is called from multiple threads, there’s a race condition — though, at least for CPython, it’s not a data race thanks to the Global Interpreter Lock (GIL). Incrementing the counter is not an atomic operation, as illustrated by its byte code:

 0 LOAD_GLOBAL              0 (counter)
 3 LOAD_CONST               1 (1)
 6 BINARY_ADD
 7 STORE_GLOBAL             0 (counter)
10 LOAD_CONST               0 (None)
13 RETURN_VALUE

The variable is loaded, operated upon, and stored. Another thread could be scheduled between any of these instructions and cause an undesired result. It’s easy to see that in practice:

from threading import Thread

def worker():
    for i in range(200000):
        increment()

threads = [Thread(target=worker) for _ in range(8)];
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)

The increment function is called exactly 1.6 million times, but on my system I get different results on each run:

$ python3 example.py 
1306205
$ python3 example.py 
1162418
$ python3 example.py 
1076801

I could change the definition of increment() to use synchronization, but wouldn’t it be nice if I could just tell Python to synchronize this function? This is where a function decorator shines:

from threading import Lock

def synchronized(f):
    lock = Lock()
    def wrapper():
        with lock:
            return f()
    return wrapper

The synchronized function is a higher order function that accepts a function and returns a function — or, more specifically, a callable. The purpose is to wrap and decorate the function it’s given. In this case the function is wrapped in a mutual exclusion lock. Note: This implementation is very simple and only works for functions that accept no arguments.

To use it, I just add a single line to increment:

@synchronized
def increment():
    global counter
    counter = counter + 1

With this change my program now always prints 1600000.
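
As noted earlier, that synchronized only works for functions that take no arguments. A more general variant, my own sketch rather than anything from PEP 318, forwards arbitrary arguments and preserves the wrapped function’s metadata with functools.wraps:

import functools
from threading import Lock

def synchronized(f):
    lock = Lock()
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        with lock:
            return f(*args, **kwargs)
    return wrapper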

Syntactic “sugar”

Everyone is quick to point out that this is just syntactic sugar, and that you can accomplish this without the @ syntax. For example, the last definition of increment is equivalent to:

def increment():
    ...

increment = synchronized(increment)

Decorators can also be parameterized. For example, Python’s functools module has an lru_cache decorator for memoizing a function:

@lru_cache(maxsize=32)
def expensive(n):
    ...

Which is equivalent to this very direct source transformation:

def expensive(n):
    ...

expensive = lru_cache(maxsize=32)(expensive)

So what comes after the @ isn’t just a name. In fact, it looks like it can be any kind of expression that evaluates to a function decorator. Or is it?

Syntactic artificial sweetener

Reality is often disappointing. Let’s try using an “identity” decorator defined using lambda. This decorator will accomplish nothing, but it will test if we can decorate a function using a lambda expression.

@lambda f: f
def foo():
    pass

But Python complains:

    @lambda f: f
          ^
SyntaxError: invalid syntax

Maybe Python is absolutely literal about the syntax sugar thing, and it’s more like a kind of macro replacement. Let’s try wrapping it in parentheses:

@(lambda f: f)
def foo(n):
    pass

Nope, same error, but now pointing at the opening parenthesis. Getting desperate now:

@[synchronized][0]
def foo():
    pass

Again, syntax error. What’s going on?

Pattern matching

The problem is that the Python language reference doesn’t parse an expression after @. It matches a very specific pattern that just so happens to look like a Python expression. It’s not syntactic sugar, it’s syntactic artificial sweetener!

decorator ::= "@" dotted_name ["(" [argument_list [","]] ")"] NEWLINE

In a way, this puts Python in the ranks of PHP 5 and Matlab: two languages with completely screwed up grammars that can only parse specific constructions that the developers had anticipated. For example, in PHP 5 (fixed in PHP 7):

function foo() {
    return function() {
        return 0;
    };
}

foo()();

That is a syntax error:

PHP Parse error:  syntax error, unexpected '(', expecting ',' or ';'

Or in any version of Matlab:

    magic(4)(:)

That is a syntax error:

Unbalanced or unexpected parenthesis or bracket

In Python’s defense, this strange, limited syntax is only in a single place rather than everywhere, but I still wonder why it was defined that way.

Update: Clément Pit-Claudel pointed out the explanation in the PEP, which references a 2004 email by Guido van Rossum:

I have a gut feeling about this one. I’m not sure where it comes from, but I have it. It may be that I want the compiler to be able to recognize certain decorators.

So while it would be quite easy to change the syntax to @test in the future, I’d like to stick with the more restricted form unless a real use case is presented where allowing @test would increase readability. (@foo().bar() doesn’t count because I don’t expect you’ll ever need that).

The CPython Bytecode Compiler is Dumb

This article was discussed on Hacker News.

Due to sheer coincidence of several unrelated tasks converging on Python at work, I recently needed to brush up on my Python skills. So far for me, Python has been little more than a fancy extension language for BeautifulSoup, though I also used it to participate in the recent tradition of writing one’s own static site generator, in this case for my wife’s photo blog. I’ve been reading through Fluent Python by Luciano Ramalho, and it’s been quite effective at getting me up to speed.

As I write Python, like with Emacs Lisp, I can’t help but consider what exactly is happening inside the interpreter. I wonder if the code I’m writing is putting undue constraints on the bytecode compiler and limiting its options. Ultimately I’d like the code I write to drive the interpreter efficiently and effectively. The Zen of Python says there should be “only one obvious way to do it,” but in practice there’s a lot of room for expression. Given multiple ways to express the same algorithm or idea, I tend to prefer the one that compiles to the more efficient bytecode.

Fortunately CPython, the main and most widely used implementation of Python, is very transparent about its bytecode, which is easy to inspect and reason about. The disassembly listing is easy to read and understand, and I can always follow it without consulting the documentation. This contrasts sharply with modern JavaScript engines and their opaque use of JIT compilation, where performance is guided by obeying certain patterns (hidden classes, etc.), helping the compiler understand my program’s types, and being careful not to unnecessarily constrain the compiler.

So, besides just catching up with Python the language, I’ve been studying the bytecode disassembly of the functions that I write. One fact has become quite apparent: the CPython bytecode compiler is pretty dumb. With a few exceptions, it’s a very literal translation of a Python program, and there is almost no optimization. Below I’ll demonstrate a case where it’s possible to detect one of the missed optimizations without inspecting the bytecode disassembly thanks to a small abstraction leak in the optimizer.

To be clear: This isn’t to say CPython is bad, or even that it should necessarily change. In fact, as I’ll show, dumb bytecode compilers are par for the course. In the past I’ve lamented how the Emacs Lisp compiler could do a better job, but CPython and Lua are operating at the same level. There are benefits to a dumb and straightforward bytecode compiler: the compiler itself is simpler, easier to maintain, and more amenable to modification (e.g. as Python continues to evolve). It’s also easier to debug Python (pdb) because it’s such a close match to the source listing.

Update: Darius Bacon points out that Guido van Rossum himself said, “Python is about having the simplest, dumbest compiler imaginable.” So this is all very much by design.

The consensus seems to be that if you want or need better performance, use something other than Python. (And if you can’t do that, at least use PyPy.) That’s a fairly reasonable and healthy goal. Still, if I’m writing Python, I’d like to do the best I can, which means exploiting the optimizations that are available when possible.

Disassembly examples

I’m going to compare three bytecode compilers in this article: CPython 3.7, Lua 5.3, and Emacs 26.1. Each of these languages is dynamically typed, is primarily executed on a bytecode virtual machine, and makes it easy to access its disassembly listing. One caveat: CPython and Emacs use a stack-based virtual machine while Lua uses a register-based virtual machine.

For CPython I’ll be using the dis module. For Emacs Lisp I’ll use M-x disassemble, and all code will use lexical scoping. In Lua I’ll use lua -l on the command line.
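
For example, getting a disassembly listing out of CPython is a one-liner:

import dis

def foo():
    return 1 + 2 * 3 / 4

dis.dis(foo)  # prints a listing like the ones below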

Local variable elimination

Will the bytecode compiler eliminate local variables? Keeping the variable around potentially involves allocating memory for it, assigning to it, and accessing it. Take this example:

def foo():
    x = 0
    y = 1
    return x

This function is equivalent to:

def foo():
    return 0

Despite this, CPython completely misses this optimization for both x and y:

  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               0 (x)
  3           4 LOAD_CONST               2 (1)
              6 STORE_FAST               1 (y)
  4           8 LOAD_FAST                0 (x)
             10 RETURN_VALUE

It assigns both variables, and even loads again from x for the return. Missed optimizations, but, as I said, by keeping these variables around, debugging is more straightforward. Users can always inspect variables.

How about Lua?

function foo()
    local x = 0
    local y = 1
    return x
end

It also misses this optimization, though it matters a little less due to its architecture (the return instruction references a register regardless of whether or not that register is allocated to a local variable):

        1       [2]     LOADK           0 -1    ; 0
        2       [3]     LOADK           1 -2    ; 1
        3       [4]     RETURN          0 2
        4       [5]     RETURN          0 1

Emacs Lisp also misses it:

(defun foo ()
  (let ((x 0)
        (y 1))
    x))

Disassembly:

0	constant  0
1	constant  1
2	stack-ref 1
3	return

All three are on the same page.

Constant folding

Does the bytecode compiler evaluate simple constant expressions at compile time? This is simple and everyone does it.

def foo():
    return 1 + 2 * 3 / 4

Disassembly:

  2           0 LOAD_CONST               1 (2.5)
              2 RETURN_VALUE

Lua:

function foo()
    return 1 + 2 * 3 / 4
end

Disassembly:

        1       [2]     LOADK           0 -1    ; 2.5
        2       [2]     RETURN          0 2
        3       [3]     RETURN          0 1

Emacs Lisp:

(defun foo ()
  (+ 1 (/ (* 2 3) 4.0)))

Disassembly:

0	constant  2.5
1	return

That’s something we can count on so long as the operands are all numeric literals (or also, for Python, string literals) that are visible to the compiler. Don’t count on your operator overloads to work here, though.

Allocation optimization

Optimizers often perform escape analysis, to determine if objects allocated in a function ever become visible outside of that function. If they don’t then these objects could potentially be stack-allocated (instead of heap-allocated) or even be eliminated entirely.

None of the bytecode compilers are this sophisticated. However CPython does have a trick up its sleeve: tuple optimization. Since tuples are immutable, in certain circumstances CPython will reuse them and avoid both the constructor and the allocation.

def foo():
    return (1, 2, 3)

Check it out, the tuple is used as a constant:

  2           0 LOAD_CONST               1 ((1, 2, 3))
              2 RETURN_VALUE

We can detect this by evaluating foo() is foo(), which is True. Deviate from this too much, though, and the optimization is disabled. Remember how CPython can’t optimize away local variables, and how they break constant folding? They break this, too:

def foo():
    x = 1
    return (x, 2, 3)

Disassembly:

  2           0 LOAD_CONST               1 (1)
              2 STORE_FAST               0 (x)
  3           4 LOAD_FAST                0 (x)
              6 LOAD_CONST               2 (2)
              8 LOAD_CONST               3 (3)
             10 BUILD_TUPLE              3
             12 RETURN_VALUE

This function might document that it always returns a simple tuple, but we can tell whether it’s being optimized by using is like before: foo() is foo() is now False! In some future version of Python with a cleverer bytecode compiler, that expression might evaluate to True. (Unless the Python language specification is specific about this case, which I didn’t check.)

Note: Curiously PyPy replicates this exact behavior when examined with is. Was that deliberate? I’m impressed that PyPy matches CPython’s semantics so closely here.

Putting a mutable value, such as a list, in the tuple will also break this optimization. But that’s not the compiler being dumb. That’s a hard constraint on the compiler: the caller might change the mutable component of the tuple, so it must always return a fresh copy.
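
For example, this version always builds a fresh tuple (and list), so foo() is foo() is False:

def foo():
    return (1, 2, [3])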

Neither Lua nor Emacs Lisp has a language-level equivalent of an immutable tuple, so there’s nothing to compare.

Other than CPython’s tuple trick, none of the bytecode compilers eliminate unnecessary intermediate objects.

def foo():
    return [1024][0]

Disassembly:

  2           0 LOAD_CONST               1 (1024)
              2 BUILD_LIST               1
              4 LOAD_CONST               2 (0)
              6 BINARY_SUBSCR
              8 RETURN_VALUE

Lua:

function foo()
    return ({1024})[1]
end

Disassembly:

        1       [2]     NEWTABLE        0 1 0
        2       [2]     LOADK           1 -1    ; 1024
        3       [2]     SETLIST         0 1 1   ; 1
        4       [2]     GETTABLE        0 0 -2  ; 1
        5       [2]     RETURN          0 2
        6       [3]     RETURN          0 1

Emacs Lisp:

(defun foo ()
  (car (list 1024)))

Disassembly:

0	constant  1024
1	list1
2	car
3	return

Don’t expect too much

I could go on with lots of examples, looking at loop optimizations and so on, and each case is almost certainly unoptimized. The general rule of thumb is to simply not expect much from these bytecode compilers. They’re very literal in their translation.

Working so much in C has put me in the habit of expecting all obvious optimizations from the compiler. This frees me to be more expressive in my code. Lots of things are cost-free thanks to these optimizations, such as breaking a complex expression up into several variables, naming my constants, or not using a local variable to manually cache memory accesses. I’m confident the compiler will optimize away my expressiveness. The catch is that clever compilers can take things too far, so I’ve got to be mindful of how it might undermine my intentions — i.e. when I’m doing something unusual or not strictly permitted.

These bytecode compilers will never truly surprise me. The cost is that extra expressiveness in Python, Lua, or Emacs Lisp shows up directly in the bytecode and may reduce performance at run time. Usually this doesn’t matter, but sometimes it does.


The Day I Fell in Love with Fuzzing

This article was discussed on Hacker News and on reddit.

In 2007 I wrote a pair of modding tools, binitools, for a space trading and combat simulation game named Freelancer. The game stores its non-art assets in the format of “binary INI” files, or “BINI” files. The motivation for the binary format over traditional INI files was probably performance: it’s faster to load and read these files than it is to parse arbitrary text in INI format.

Much of the in-game content can be changed simply by modifying these files — changing item names, editing commodity prices, tweaking ship statistics, or even adding new ships to the game. The binary nature makes them unsuitable for in-place modification, so the natural approach is to convert them to text INI files, make the desired modifications using a text editor, then convert back to the BINI format and replace the file in the game’s installation.

I didn’t reverse engineer the BINI format, nor was I the first person to create tools to edit them. The existing tools weren’t to my tastes, and I had my own vision for how they should work — an interface more closely following the Unix tradition despite the target being a Windows game.

When I got started, I had just learned how to use yacc (really Bison) and lex (really flex), as well as Autoconf, so I went all-in with these newly-discovered tools. It was exciting to try them out in a real-world situation, though I slavishly aped the practices of other open source projects without really understanding why things were the way they were. Due to the use of yacc/lex and the configure script build, compiling the project required a full, Unix-like environment. This is all visible in the original version of the source.

The project was moderately successful in two ways. First, I was able to use the tools to modify the game. Second, other people were using the tools, since the binaries I built show up in various collections of Freelancer modding tools online.

The Rewrite

That’s the way things were until mid-2018 when I revisited the project. Ever look at your own old code and wonder what the heck you were thinking? My INI format was far more rigid and strict than necessary, I was doing questionable things when writing out binary data, and the build wasn’t even working correctly.

With an additional decade of experience under my belt, I knew I could do way better if I were to rewrite these tools today. So, over the course of a few days, I did, from scratch. That’s what’s visible in the master branch today.

I like to keep things simple, which meant no more Autoconf and instead a simple, portable Makefile. No more yacc or lex, and instead a hand-coded parser. Using only conforming, portable C. The result was so simple that I can build using Visual Studio in a single, short command, so the Makefile isn’t all that necessary. With one small tweak (replace stdint.h with a typedef), I can even build and run binitools in DOS.

The new version is faster, leaner, cleaner, and simpler. It’s far more flexible about its INI input, so it’s easier to use. But is it more correct?

Fuzzing

I’ve been interested in fuzzing for years, especially american fuzzy lop, or afl. However, I wasn’t having success with it. I’d fuzz some of the tools I use regularly, and it wouldn’t find anything of note, at least not before I gave up. I fuzzed my JSON library, and somehow it turned up nothing. Surely my JSON parser couldn’t be that robust already, could it? Fuzzing just wasn’t accomplishing anything for me. (As it turns out, my JSON library is quite robust, thanks in large part to various contributors!)

So I’ve got this relatively new INI parser, and while it can successfully parse and correctly re-assemble the game’s original set of BINI files, it hasn’t really been exercised that much. Surely there’s something in here for a fuzzer to find. Plus I don’t even have to write a line of code in order to run afl against it. The tools already read from standard input by default, which is perfect.

Assuming you’ve got the necessary tools installed (make, gcc, afl), here’s how easy it is to start fuzzing binitools:

$ make CC=afl-gcc
$ mkdir in out
$ echo '[x]' > in/empty
$ afl-fuzz -i in -o out -- ./bini

The bini utility takes INI as input and produces BINI as output, so it’s far more interesting to fuzz than its inverse, unbini. Since unbini parses relatively simple binary data, there are (probably) no bugs for the fuzzer to find. I did try anyway just in case.

In my example above, I swapped out the default compiler for afl’s GCC wrapper (CC=afl-gcc). It calls GCC in the background, but in doing so adds its own instrumentation to the binary. When fuzzing, afl-fuzz uses that instrumentation to monitor the program’s execution path. The afl whitepaper explains the technical details.

I also created input and output directories, placing a minimal, working example into the input directory, which gives afl a starting point. As afl runs, it mutates a queue of inputs and observes the changes on the program’s execution. The output directory contains the results and, more importantly, a corpus of inputs that cause unique execution paths. In other words, the fuzzer output will be lots of inputs that exercise many different edge cases.

The most exciting and dreaded result is a crash. The first time I ran it against binitools, bini had many such crashes. Within minutes, afl was finding a number of subtle and interesting bugs in my program, which was incredibly useful. It even discovered an unlikely stale pointer bug by exercising different orderings for various memory allocations. This particular bug was the turning point that made me realize the value of fuzzing.

Not all the bugs it found led to crashes. I also combed through the outputs to see what sorts of inputs were succeeding, what was failing, and observe how my program handled various edge cases. It was rejecting some inputs I thought should be valid, accepting some I thought should be invalid, and interpreting some in ways I hadn’t intended. So even after I fixed the crashing inputs, I still made tweaks to the parser to fix each of these troublesome inputs.

Building a test suite

Once I combed out all the fuzzer-discovered bugs, and I agreed with the parser on how all the various edge cases should be handled, I turned the fuzzer’s corpus into a test suite — though not directly.

I had run the fuzzer in parallel — a process that is explained in the afl documentation — so I had lots of redundant inputs. By redundant I mean that the inputs are different but have the same execution path. Fortunately afl has a tool to deal with this: afl-cmin, the corpus minimization tool. It eliminates all the redundant inputs.

Second, many of these inputs were longer than necessary in order to invoke their unique execution path. There’s afl-tmin, the test case minimizer, which I used to further shrink my test corpus.
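For reference, the invocations look something like this (the directory names are mine, and this sketch assumes a single fuzzer instance rather than the parallel setup):

$ afl-cmin -i out/queue -o corpus -- ./bini
$ mkdir min
$ for t in corpus/*; do afl-tmin -i "$t" -o "min/${t##*/}" -- ./bini; done

As with afl-fuzz, both tools feed each input to the target on standard input since no @@ placeholder is given.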

I sorted the valid from the invalid inputs and checked them into the repository. Have a look at all the wacky inputs the fuzzer invented starting from my single, minimal input.

This essentially locks down the parser, and the test suite ensures a particular build behaves in a very specific way. This is most useful for ensuring that builds on other platforms and by other compilers are indeed behaving identically with respect to their outputs. My test suite even revealed a bug in diet libc, as binitools doesn’t pass the tests when linked against it. If I were to make non-trivial changes to the parser, I’d essentially need to scrap the current test suite and start over, having afl generate an entire new corpus for the new parser.

Fuzzing has certainly proven itself to be a powerful technique. It found a number of bugs that I likely wouldn’t have otherwise discovered on my own. I’ve since gotten more savvy on its use and have used it on other software — not just software I’ve written myself — and discovered more bugs. It’s got a permanent slot on my software developer toolbelt.

A JavaScript Typed Array Gotcha

JavaScript’s prefix increment and decrement operators can be surprising when applied to typed arrays. It caught me by surprise when I was porting some C code over to JavaScript. Executing this code in your head, what do you believe is the value of r?

let array = new Uint8Array([255]);
let r = ++array[0];

The increment and decrement operators originated in the B programming language. Its closest living relative today is C, and, as far as these operators are concerned, C can be considered an ancestor of JavaScript. So what is the value of r in this similar C code?

uint8_t array[] = {255};
int r = ++array[0];

Of course, if they were the same then there would be nothing to write about, so that should make it easier to guess if you aren’t sure. The answer: In JavaScript, r is 256. In C, r is 0.

What happened to me was that I wrote an 80-bit integer increment routine in C like this:

uint8_t array[10];
/* ... */
for (int i = 9; i >= 0; i--)
    if (++array[i])
        break;

But I was getting the wrong result over in JavaScript from essentially the same code:

let array = new Uint8Array(10);
/* ... */
for (let i = 9; i >= 0; i--)
    if (++array[i])
        break;

So what’s going on here?

JavaScript specification

The ES5 specification says this about the prefix increment operator:

  1. Let expr be the result of evaluating UnaryExpression.

  2. Throw a SyntaxError exception if the following conditions are all true: [omitted]

  3. Let oldValue be ToNumber(GetValue(expr)).

  4. Let newValue be the result of adding the value 1 to oldValue, using the same rules as for the + operator (see 11.6.3).

  5. Call PutValue(expr, newValue).

  6. Return newValue.

So, oldValue is 255. This is a double precision float because all numbers in JavaScript (outside of the bitwise operations) are double precision floating point. Add 1 to this value to get 256, which is newValue. When newValue is stored in the array via PutValue(), it’s converted to an unsigned 8-bit integer, which truncates it to 0.

However, newValue is returned, not the value that was actually stored in the array!

Since JavaScript is dynamically typed, this difference didn’t actually matter until typed arrays were introduced. I suspect that if typed arrays had been in JavaScript from the beginning, the specified behavior would be more in line with C.

This behavior isn’t limited to the prefix operators. Consider assignment:

let array = new Uint8Array([255]);
let r = (array[0] = array[0] + 1);
let s = (array[0] += 1);

Both r and s will still be 256. The result of the assignment operators is a similar story:

LeftHandSideExpression = AssignmentExpression is evaluated as follows:

  1. Let lref be the result of evaluating LeftHandSideExpression.

  2. Let rref be the result of evaluating AssignmentExpression.

  3. Let rval be GetValue(rref).

  4. Throw a SyntaxError exception if the following conditions are all true: [omitted]

  5. Call PutValue(lref, rval).

  6. Return rval.

Again, the result of the expression is independent of how it was stored with PutValue().
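One way to make the JavaScript carry loop behave like the C version is to let the store happen first and then test the element itself, since reading it back yields the truncated value. Here’s a sketch of that approach (not the code from the original port):

let array = new Uint8Array(10);
/* ... */
for (let i = 9; i >= 0; i--) {
    array[i]++;       // the store into the Uint8Array truncates to 8 bits
    if (array[i])     // reading the element back gives the truncated value
        break;
}

For the single-element example at the top, let r = (array[0] = (array[0] + 1) & 0xff) similarly yields the truncated 0.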

C specification

I’ll be referencing the original C89/C90 standard. The C specification requires a little more work to get to the bottom of the issue. Starting with 3.3.3.1 (Prefix increment and decrement operators):

The value of the operand of the prefix ++ operator is incremented. The result is the new value of the operand after incrementation. The expression ++E is equivalent to (E+=1).

Later in 3.3.16.2 (Compound assignment):

A compound assignment of the form E1 op = E2 differs from the simple assignment expression E1 = E1 op (E2) only in that the lvalue E1 is evaluated only once.

Then finally in 3.3.16 (Assignment operators):

An assignment operator stores a value in the object designated by the left operand. An assignment expression has the value of the left operand after the assignment, but is not an lvalue.

So the result is explicitly the value after assignment. Let’s look at this step by step after rewriting the expression.

int r = (array[0] = array[0] + 1);

In C, all integer operations are performed with at least int precision. Smaller integers are implicitly promoted to int before the operation. The value of array[0] is 255, and, since uint8_t is smaller than int, it gets promoted to int. Additionally, the literal constant 1 is also an int, so there are actually two reasons for this promotion.

So since these are int values, the result of the addition is 256, like in JavaScript. To store the result, this value is then demoted to uint8_t and truncated to 0. Finally, this post-assignment 0 is the result of the expression, not the right-hand result as in JavaScript.

Specifications are useful

These situations are why I prefer programming languages that have a formal and approachable specification. If there’s no specification and I’m observing undocumented, idiosyncratic behavior, is this just some subtle quirk of the current implementation — e.g. something that might change without notice in the future — or is it intended behavior that I can rely upon for correctness?
