Articles tagged posix at null program

Conventions for Command Line Options

2020-08-01T00:34:23Z

This article was discussed on Hacker News and critiqued on Wandering Thoughts (2, 3).

Command line interfaces have varied throughout their brief history but have largely converged to some common, sound conventions. The core originates from unix, and the Linux ecosystem extended it, particularly via the GNU project. Unfortunately some tools initially appear to follow the conventions, but subtly get them wrong, usually for no practical benefit. I believe in many cases the authors simply didn’t know any better, so I’d like to review the conventions.

Short Options

The simplest case is the short option flag. An option is a hyphen — specifically HYPHEN-MINUS U+002D — followed by one alphanumeric character. Capital letters are acceptable. The letters themselves have conventional meanings and are worth following if possible.

program -a -b -c

Flags can be grouped together into one program argument. This is both convenient and unambiguous. It’s also one of those often missed details when programs use hand-coded argument parsers, and the lack of support irritates me.

program -abc
program -acb

The next simplest case is a short option that takes an argument. The argument follows the option.

program -i input.txt -o output.txt

The space is optional, so the option and argument can be packed together into one program argument. Since the argument is required, this is still unambiguous. This is another often-missed feature in hand-coded parsers.

program -iinput.txt -ooutput.txt

This does not prohibit grouping. When grouped, the option accepting an argument must be last.

program -abco output.txt
program -abcooutput.txt

This technique is used to create another category, optional option arguments. The option’s argument can be optional but still unambiguous so long as the space is always omitted when the argument is present.

program -c       # omitted
program -cblue   # provided
program -c blue  # omitted (blue is a new argument)

program -c -x   # two separate flags
program -c-x    # -c with argument "-x"
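The rules so far (grouping, required arguments with or without the space, and optional arguments that must be attached) fit in a small loop. The following is a minimal sketch in Python, not the parser of any particular tool. The function name parse_short and its three parameters (strings of option letters) are made up for illustration.

```python
def parse_short(argv, flags, required, optional):
    """Parse short options without permutation.

    flags, required, and optional are strings of option letters
    (hypothetical interface). Returns (options, rest): a list of
    (letter, argument) pairs and the remaining non-option arguments.
    """
    options = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == '--':                  # explicit end of options
            i += 1
            break
        if len(arg) < 2 or arg[0] != '-':
            break                        # first non-option ends parsing
        i += 1
        j = 1
        while j < len(arg):
            letter = arg[j]
            tail = arg[j + 1:]           # rest of this program argument
            j += 1
            if letter in flags:
                options.append((letter, None))
            elif letter in optional:
                # Only an attached argument counts: -cblue, never "-c blue".
                options.append((letter, tail or None))
                break
            elif letter in required:
                if tail:                 # -ooutput.txt
                    options.append((letter, tail))
                elif i < len(argv):      # -o output.txt
                    options.append((letter, argv[i]))
                    i += 1
                else:
                    raise ValueError(f'option -{letter} requires an argument')
                break
            else:
                raise ValueError(f'unknown option -{letter}')
    return options, argv[i:]
```

Because an option that takes an argument consumes the rest of its program argument, it ends the group, which is why it must come last when grouped.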

Optional option arguments should be used judiciously since they can be surprising, but they have their uses.

Options can typically appear in any order — something parsers often achieve via permutation — but non-options typically follow options.

program -a -b foo bar
program -b -a foo bar

GNU-style programs usually allow options and non-options to be mixed, though I don’t consider this to be essential.

program -a foo -b bar
program foo -a -b bar
program foo bar -a -b

If a non-option looks like an option because it starts with a hyphen, use -- to demarcate options from non-options.

program -a -b -- -x foo bar

An advantage of requiring that non-options follow options is that the first non-option demarcates the two groups, so -- is less often needed.

# note: without argument permutation
program -a -b foo -x bar  # 2 options, 3 non-options
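Both behaviors can be seen side by side in Python's standard getopt module, which offers a non-permuting getopt() and a permuting gnu_getopt(). This is only an illustration using the example command lines above. Note that gnu_getopt() stops permuting when the POSIXLY_CORRECT environment variable is set, so the comments assume it is unset.

```python
import getopt

argv = ['-a', 'foo', '-b', 'bar']

# Without permutation: parsing stops at the first non-option.
opts, rest = getopt.getopt(argv, 'ab')
# opts is [('-a', '')] and rest is ['foo', '-b', 'bar']

# GNU style: options are collected from anywhere on the command line.
gnu_opts, gnu_rest = getopt.gnu_getopt(argv, 'ab')
# gnu_opts is [('-a', ''), ('-b', '')] and gnu_rest is ['foo', 'bar']

# "--" ends option parsing even when permuting.
dd_opts, dd_rest = getopt.gnu_getopt(['-a', '--', '-b', 'bar'], 'ab')
# dd_opts is [('-a', '')] and dd_rest is ['-b', 'bar']
```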

Long options

Since short options can be cryptic, and there are such a limited number of them, more complex programs support long options. A long option starts with two hyphens followed by one or more alphanumeric, lowercase words. Hyphens separate words. Using two hyphens prevents long options from being confused for grouped short options.

program --reverse --ignore-backups

Occasionally flags are paired with a mutually exclusive inverse flag that begins with --no-. This avoids a future flag day where the default is changed in the release that also adds the flag implementing the original behavior.

program --sort
program --no-sort

Long options can similarly accept arguments.

program --output output.txt --block-size 1024

These may optionally be connected to the argument with an equals sign =, much like omitting the space for a short option argument.

program --output=output.txt --block-size=1024

Like before, this opens up the doors for optional option arguments. Due to the required = this is still unambiguous.

program --color --reverse
program --color=never --reverse

The -- retains its original behavior of disambiguating option-like non-option arguments:

program --reverse -- --foo bar
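The long option rules can be sketched the same way as the short ones. This is a minimal illustration in Python under assumed names: parse_long and its three sets of option names (flags, required, optional) are hypothetical, and abbreviations are deliberately not supported.

```python
def parse_long(argv, flags, required, optional):
    """Parse long options without permutation.

    flags, required, and optional are sets of option names without
    the leading "--" (hypothetical interface). Returns (options, rest).
    """
    options = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == '--':                  # explicit end of options
            i += 1
            break
        if not arg.startswith('--'):
            break                        # first non-option ends parsing
        i += 1
        name, eq, value = arg[2:].partition('=')
        if name in flags:
            if eq:
                raise ValueError(f'option --{name} takes no argument')
            options.append((name, None))
        elif name in optional:
            # The argument is present only when connected with "=".
            options.append((name, value if eq else None))
        elif name in required:
            if eq:                       # --output=output.txt
                options.append((name, value))
            elif i < len(argv):          # --output output.txt
                options.append((name, argv[i]))
                i += 1
            else:
                raise ValueError(f'option --{name} requires an argument')
        else:
            raise ValueError(f'unknown option --{name}')
    return options, argv[i:]
```

With this, `--color --reverse` yields color with no argument followed by the reverse flag, while `--color=never` attaches the argument, with no guessing involved.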

Subcommands

Some programs, such as Git, have subcommands each with their own options. The main program itself may still have its own options distinct from subcommand options. The program’s options come before the subcommand and subcommand options follow the subcommand. Options are never permuted around the subcommand.

program -a -b -c subcommand -x -y -z
program -abc subcommand -xyz

Above, the -a, -b, and -c options are for program, and the others are for subcommand. So, really, the subcommand is another command line of its own.
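Treating the subcommand as the first non-option makes the split straightforward. Here is a rough sketch in Python that handles only simple program flags for brevity. The name split_subcommand and the flags parameter (a string of accepted flag letters) are invented for this example.

```python
def split_subcommand(argv, flags):
    """Split argv into (program flags, subcommand, subcommand argv).

    Only simple, possibly grouped, program flags are handled here.
    Nothing is permuted around the subcommand.
    """
    program_flags = []
    i = 0
    while i < len(argv) and argv[i].startswith('-') and len(argv[i]) > 1:
        if argv[i] == '--':              # explicit end of program options
            i += 1
            break
        for letter in argv[i][1:]:
            if letter not in flags:
                raise ValueError(f'unknown option -{letter}')
            program_flags.append(letter)
        i += 1
    if i == len(argv):
        raise ValueError('missing subcommand')
    # Everything after the subcommand is its own command line.
    return program_flags, argv[i], argv[i + 1:]
```

The third element of the result can then be handed to a separate parser that knows the subcommand's options.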

Option parsing libraries

There’s little excuse for not getting these conventions right assuming you’re interested in following the conventions. Short options can be parsed correctly in just ~60 lines of C code. Long options are just slightly more complex.

GNU’s getopt_long() supports long option abbreviation — with no way to disable it (!) — but this should be avoided.
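To see why abbreviation is hazardous, here is a small demonstration using Python's standard getopt module as a stand-in, since it performs the same kind of unique-prefix matching on long options. The option names reverse and revision are made up. The point is that an abbreviation that works today can break when a later release adds an option sharing the prefix.

```python
import getopt

# With only --reverse defined, the abbreviation --rev is accepted.
opts, _ = getopt.getopt(['--rev'], '', ['reverse'])
# opts is [('--reverse', '')]

# A later release adds --revision, and the same command line now fails.
try:
    getopt.getopt(['--rev'], '', ['reverse', 'revision'])
    ambiguous = False
except getopt.GetoptError:
    ambiguous = True  # "--rev" is no longer a unique prefix
```

Scripts that relied on the abbreviation break even though the option they meant still exists.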

Go’s flag package intentionally deviates from the conventions. It only supports long option semantics, via a single hyphen. This makes it impossible to support grouping even if all options are only one letter. Also, the only way to combine option and argument into a single command line argument is with =. It’s sound, but I miss both features every time I write programs in Go. That’s why I wrote my own argument parser. Not only does it have a nicer feature set, I like the API a lot more, too.

Python’s primary option parsing library is argparse, and I just can’t stand it. Despite appearing to follow convention, it actually breaks convention and its behavior is unsound. For instance, the following program has two options, --foo and --bar. The --foo option accepts an optional argument, and the --bar option is a simple flag.

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--foo', type=str, nargs='?', default='X')
parser.add_argument('--bar', action='store_true')
print(parser.parse_args(sys.argv[1:]))

Here are some example runs:

$ python parse.py
Namespace(bar=False, foo='X')

$ python parse.py --foo
Namespace(bar=False, foo=None)

$ python parse.py --foo=arg
Namespace(bar=False, foo='arg')

$ python parse.py --bar --foo
Namespace(bar=True, foo=None)

$ python parse.py --foo arg
Namespace(bar=False, foo='arg')

Everything looks good except the last. If the --foo argument is optional then why did it consume arg? What happens if I follow it with --bar? Will it consume it as the argument?

$ python parse.py --foo --bar
Namespace(bar=True, foo=None)

Nope! Unlike arg, it left --bar alone, so instead of following the unambiguous conventions, it has its own ambiguous semantics and attempts to remedy them with a “smart” heuristic: “If an optional argument looks like an option, then it must be an option!” Non-option arguments can never follow an option with an optional argument, which makes that feature pretty useless. Since argparse does not properly support --, that does not help.

$ python parse.py --foo -- arg
usage: parse.py [-h] [--foo [FOO]] [--bar]
parse.py: error: unrecognized arguments: -- arg

Please, stick to the conventions unless you have really good reasons to break them!

Fibers: the Most Elegant Windows API

2019-03-28T22:26:05Z

This article was discussed on Hacker News.

The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly, and lacking good taste. Microsoft has done a pretty commendable job with backwards compatibility, but the trade-off is that the API is filled to the brim with historical cruft. Every hasty, poor design over the decades is carried forward forever, and, in many cases, even built upon, which essentially doubles down on past mistakes. POSIX certainly has its own ugly corners, but those are the exceptions. In the Windows API, elegance is the exception.

That’s why, when I recently revisited the Fibers API, I was pleasantly surprised. It’s one of the exceptions — much cleaner than the optional, deprecated, and now obsolete POSIX equivalent. It’s not quite an apples-to-apples comparison since the POSIX version is slightly more powerful, and more complicated as a result. I’ll cover the difference in this article.

For the last part of this article, I’ll walk through an async/await framework built on top of fibers. The framework allows coroutines in C programs to await on arbitrary kernel objects.

Fiber Async/await Demo

Fibers

Windows fibers are really just stackful, symmetric coroutines. From a different point of view, they’re cooperatively scheduled threads, which is the source of the analogous name, fibers. They’re symmetric because all fibers are equal, and no fiber is the “main” fiber. If any fiber returns from its start routine, the program exits. (Older versions of Wine will crash when this happens, but it was recently fixed.) It’s equivalent to the process’ main thread returning from main(). The initial fiber is free to create a second fiber, yield to it, then the second fiber destroys the first.

For now I’m going to focus on the core set of fiber functions. There are some additional capabilities I’m going to ignore, including support for fiber local storage. The important functions are just these five:

void *CreateFiber(size_t stack_size, void (*proc)(void *), void *arg);
void  SwitchToFiber(void *fiber);
bool  ConvertFiberToThread(void);
void *ConvertThreadToFiber(void *arg);
void  DeleteFiber(void *fiber);

To emphasize its simplicity, I’ve shown them here with more standard prototypes than seen in their formal documentation. That documentation uses the clunky Windows API typedefs still burdened with its 16-bit heritage, e.g. LPVOID being a “long pointer” from the segmented memory of the 8086.

Fibers are represented using opaque, void pointers. Maybe that’s a little too simple since it’s easy to misuse in C, but I like it. The return values for CreateFiber() and ConvertThreadToFiber() are void pointers since these both create fibers.

The fiber start routine returns nothing and takes a void “user pointer”. That’s nearly what I’d expect, except that it would probably make more sense for a fiber to return int, which is more in line with main / WinMain / mainCRTStartup / WinMainCRTStartup. As I said, when any fiber returns from its start routine, it’s like returning from the main function, so it should probably have returned an integer.

A fiber may delete itself, which is the same as exiting the thread. However, a fiber cannot yield (e.g. SwitchToFiber()) to itself. That’s undefined behavior.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

void
coup(void *king)
{
    puts("Long live the king!");
    DeleteFiber(king);
    ConvertFiberToThread(); /* seize the main thread */
    /* ... */
}

int
main(void)
{
    void *king = ConvertThreadToFiber(0);
    void *pretender = CreateFiber(0, coup, king);
    SwitchToFiber(pretender);
    abort(); /* unreachable */
}

Only fibers can yield to fibers, but when the program starts up, there are no fibers. At least one thread must first convert itself into a fiber using ConvertThreadToFiber(), which returns the fiber object that represents itself. It takes one argument analogous to the last argument of CreateFiber(), except that there’s no start routine to accept it. The process is reversed with ConvertFiberToThread().

Fibers don’t belong to any particular thread and can be scheduled on any thread if properly synchronized. Obviously one should never yield to the same fiber in two different threads at the same time.

Contrast with POSIX

The equivalent on POSIX systems was context switching. It’s also stackful and symmetric, but it has just three important functions: getcontext(3), makecontext(3), and swapcontext(3).

int  getcontext(ucontext_t *ucp);
void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
int  swapcontext(ucontext_t *oucp, const ucontext_t *ucp);

These are roughly equivalent to GetCurrentFiber(), CreateFiber(), and SwitchToFiber(). There is no need for ConvertThreadToFiber() or ConvertFiberToThread() since threads can context switch without preparation. There’s also no DeleteFiber() because the resources are managed by the program itself. That’s where POSIX contexts are a little bit more powerful.

The first argument to CreateFiber() is the desired stack size, with zero indicating the default stack size. The stack is allocated and freed by the operating system. The downside is that the caller doesn’t have a choice in managing the lifetime of this stack and how it’s allocated. If you’re frequently creating and destroying coroutines, those stacks are constantly being allocated and freed.

In makecontext(3), the caller allocates and supplies the stack. Freeing that stack is equivalent to destroying the context. A program that frequently creates and destroys contexts can maintain a stack pool or otherwise more efficiently manage their allocation. This makes it more powerful, but it also makes it a little more complicated. It would be hard to remember how to do all this without a careful reading of the documentation:

/* Create a context */
ucontext_t ctx;
getcontext(&ctx);  /* initialize first, then fill in the stack */
ctx.uc_stack.ss_sp = malloc(SIGSTKSZ);
ctx.uc_stack.ss_size = SIGSTKSZ;
ctx.uc_link = 0;   /* no successor context */
makecontext(&ctx, proc, 0);

/* Destroy a context */
free(ctx.uc_stack.ss_sp);

Note how makecontext(3) is variadic (...), passing its arguments on to the start routine of the context. This seems like it might be better than a user pointer. Unfortunately it’s not, since those arguments are strictly limited to integers.

Ultimately I like the fiber API better. The first time I tried it out, I could guess my way through it without looking closely at the documentation.

Async / await with fibers

Why was I looking at the Fiber API? I’ve known about coroutines for years but I didn’t understand how they could be useful. Sure, the function can yield, but what other coroutine should it yield to? It wasn’t until I was recently bit by the async/await bug that I finally saw a “killer feature” that justified their use. Generators come pretty close, though.

Windows fibers are a coroutine primitive suitable for async/await in C programs, where it can also be useful. To prove that it’s possible, I built async/await on top of fibers in 95 lines of code.

The alternatives are to use a third-party coroutine library or to do it myself with some assembly programming. However, having it built into the operating system is quite convenient! It’s unfortunate that it’s limited to Windows. Ironically, though, everything I wrote for this article, including the async/await demonstration, was originally written on Linux using Mingw-w64 and tested using Wine. Only after I was done did I even try it on Windows.

Before diving into how it works, there’s a general concept about the Windows API that must be understood: All kernel objects can be in either a signaled or unsignaled state. The API provides functions that block on a kernel object until it is signaled. The two important ones are WaitForSingleObject() and WaitForMultipleObjects(). The latter behaves very much like poll(2) in POSIX.

Usually the signal is tied to some useful event, like a process or thread exiting, the completion of an I/O operation (i.e. asynchronous overlapped I/O), a semaphore being incremented, etc. It’s a generic way to wait for some event. However, instead of blocking the thread, wouldn’t it be nice to await on the kernel object? In my aio library for Emacs, the fundamental “wait” object was a promise. For this API it’s a kernel object handle.

So, the await function will take a kernel object, register it with the scheduler, then yield to the scheduler. The scheduler — which is a global variable, so there’s only one scheduler per process — looks like this:

struct {
    void *main_fiber;
    HANDLE handles[MAXIMUM_WAIT_OBJECTS];
    void *fibers[MAXIMUM_WAIT_OBJECTS];
    void *dead_fiber;
    int count;
} async_loop;

While fibers are symmetric, coroutines in my async/await implementation are not. One fiber is the scheduler, main_fiber, and the other fibers always yield to it.

There is an array of kernel object handles, handles, and an array of fibers. The elements in these arrays are paired with each other, but it’s convenient to store them separately, as I’ll show soon. fibers[0] is waiting on handles[0], and so on.

The array is a fixed size, MAXIMUM_WAIT_OBJECTS (64), because there’s a hard limit on the number of fibers that can wait at once. This pathetically small limitation is an unfortunate, hard-coded restriction of the Windows API. It kills most practical uses of my little library. Fortunately there’s no limit on the number of handles we might want to wait on, just the number of co-existing fibers.

When a fiber is about to return from its start routine, it yields one last time and registers itself on the dead_fiber member. The scheduler will delete this fiber as soon as it’s given control. Fibers never truly return since that would terminate the program.

With this, the await function, async_await(), is pretty simple. It registers the handle with the scheduler, then yields to the scheduler fiber.

void
async_await(HANDLE h)
{
    async_loop.handles[async_loop.count] = h;
    async_loop.fibers[async_loop.count] = GetCurrentFiber();
    async_loop.count++;
    SwitchToFiber(async_loop.main_fiber);
}

Caveat: The scheduler destroys this handle with CloseHandle() after it signals, so don’t try to reuse it. This made my demonstration simpler, but it might be better to not do this.

A fiber can exit at any time. Such an exit is inserted implicitly before a fiber actually returns:

void
async_exit(void)
{
    async_loop.dead_fiber = GetCurrentFiber();
    SwitchToFiber(async_loop.main_fiber);
}

The start routine given to async_start() is actually wrapped in the real start routine. This is how async_exit() is injected:

struct fiber_wrapper {
    void (*func)(void *);
    void *arg;
};

static void
fiber_wrapper(void *arg)
{
    struct fiber_wrapper *fw = arg;
    fw->func(fw->arg);
    async_exit();
}

int
async_start(void (*func)(void *), void *arg)
{
    if (async_loop.count == MAXIMUM_WAIT_OBJECTS) {
        return 0;
    } else {
        struct fiber_wrapper fw = {func, arg};
        SwitchToFiber(CreateFiber(0, fiber_wrapper, &fw));
        return 1;
    }
}

The library provides a single awaitable function, async_sleep(). It creates a “waitable timer” object, starts the countdown, and returns it. (Notice how SetWaitableTimer() is a typically-ugly Win32 function with excessive parameters.)

HANDLE
async_sleep(double seconds)
{
    HANDLE promise = CreateWaitableTimer(0, 0, 0);
    LARGE_INTEGER t;
    t.QuadPart = (long long)(seconds * -10000000.0);
    SetWaitableTimer(promise, &t, 0, 0, 0, 0);
    return promise;
}
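The multiplication by -10000000.0 deserves a note: SetWaitableTimer() takes its due time in 100-nanosecond units, and a negative value means a time relative to now rather than an absolute time. The same arithmetic, sketched in Python with a made-up function name:

```python
def relative_due_time(seconds):
    """Convert seconds to a relative due time in 100 ns units.

    There are 10,000,000 such units per second. The sign is flipped
    because negative due times are interpreted as relative.
    """
    return int(seconds * -10_000_000.0)  # truncates, like the C cast
```

So a 1.5 second sleep becomes -15000000.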

A more realistic example would be overlapped I/O. For example, you’d open a file (CreateFile()) in overlapped mode, then when you, say, read from that file (ReadFile()) you create an event object (CreateEvent()), populate an overlapped I/O structure with the event, offset, and length, then finally await on the event object. The fiber will be resumed when the operation is complete.

Side note: Unfortunately overlapped I/O doesn’t work correctly for files, and many operations can’t be done asynchronously, like opening files. When it comes to files, you’re better off using dedicated threads as libuv does instead of overlapped I/O. You can still await on these operations. You’d just await on the signal from the thread doing synchronous I/O, not from overlapped I/O.

The most complex part is the scheduler, and it’s really not complex at all:

void
async_run(void)
{
    while (async_loop.count) {
        /* Wait for next event */
        DWORD nhandles = async_loop.count;
        HANDLE *handles = async_loop.handles;
        DWORD r = WaitForMultipleObjects(nhandles, handles, 0, INFINITE);

        /* Remove event and fiber from waiting array */
        void *fiber = async_loop.fibers[r];
        CloseHandle(async_loop.handles[r]);
        async_loop.handles[r] = async_loop.handles[nhandles - 1];
        async_loop.fibers[r] = async_loop.fibers[nhandles - 1];
        async_loop.count--;

        /* Run the fiber */
        SwitchToFiber(fiber);

        /* Destroy the fiber if it exited */
        if (async_loop.dead_fiber) {
            DeleteFiber(async_loop.dead_fiber);
            async_loop.dead_fiber = 0;
        }
    }
}

This is why the handles are in their own array. The array can be passed directly to WaitForMultipleObjects(). The return value indicates which handle was signaled. The handle is closed, the entry removed from the scheduler, and then the fiber is resumed.
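For readers without a Windows toolchain, the shape of this loop can be modeled in Python. This is only a toy under loud assumptions: generators (stackless and asymmetric) stand in for fibers, wake-up times on a simulated clock stand in for kernel handles, and picking the earliest wake-up time stands in for WaitForMultipleObjects(). The names Loop and worker are invented. What it preserves is the paired arrays and the swap-remove.

```python
class Loop:
    """Toy scheduler pairing each waiting task with its wake-up time."""

    def __init__(self):
        self.now = 0.0      # simulated clock, nothing actually sleeps
        self.handles = []   # wake-up times, standing in for handles
        self.tasks = []     # generators, standing in for fibers

    def _resume(self, task):
        # Run the task until its next "await" (yield) or until it ends.
        try:
            delay = next(task)
        except StopIteration:
            return          # task finished, nothing to register
        self.handles.append(self.now + delay)
        self.tasks.append(task)

    def start(self, task):
        self._resume(task)

    def run(self):
        while self.handles:
            # Stand-in for the wait: index of the first handle to "signal".
            r = min(range(len(self.handles)), key=self.handles.__getitem__)
            self.now = self.handles[r]
            task = self.tasks[r]
            # Swap-remove: move the last entry into slot r.
            self.handles[r] = self.handles[-1]
            self.tasks[r] = self.tasks[-1]
            self.handles.pop()
            self.tasks.pop()
            self._resume(task)


def worker(name, delay, count, log):
    for i in range(count):
        yield delay         # plays the role of awaiting a timer
        log.append((name, i))


log = []
loop = Loop()
loop.start(worker('a', 1.0, 2, log))
loop.start(worker('b', 1.5, 2, log))
loop.run()
# log is [('a', 0), ('b', 0), ('a', 1), ('b', 1)]
```

The two workers interleave according to their timers, and the loop ends once nothing is left waiting.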

That WaitForMultipleObjects() is what limits the number of fibers. It’s not possible to wait on more than 64 handles at once! This is hard-coded into the API. How? A return value of 64 is an error code, and changing this would break the API. Remember what I said about being locked into bad design decisions of the past?

To be fair, WaitForMultipleObjects() was a doomed API anyway, just like select(2) and poll(2) in POSIX. It scales very poorly since the entire array of objects being waited on must be traversed on each call. That’s terribly inefficient when waiting on large numbers of objects. This sort of problem is solved by interfaces like kqueue (BSD), epoll (Linux), and IOCP (Windows). Unfortunately IOCP doesn’t really fit this particular problem well — awaiting on kernel objects — so I couldn’t use it.

When the awaiting fiber count is zero and the scheduler has control, all fibers must have completed and there’s nothing left to do. However, the caller can schedule more fibers and then restart the scheduler if desired.

That’s all there is to it. Have a look at demo.c to see how the API looks in some trivial examples. On Linux you can see it in action with make check. On Windows, you just need to compile it, then run it like a normal program. If there was a better function than WaitForMultipleObjects() in the Windows API, I would have considered turning this demonstration into a real library.

Endlessh: an SSH Tarpit

2019-03-22T17:26:45Z

This article was discussed on Hacker News (later) and on reddit (also), and was featured in BSD Now 294. Also check out this Endlessh analysis.

I’m a big fan of tarpits: a network service that intentionally inserts delays in its protocol, slowing down clients by forcing them to wait. This arrests the speed at which a bad actor can attack or probe the host system, and it ties up some of the attacker’s resources that might otherwise be spent attacking another host. When done well, a tarpit imposes more cost on the attacker than the defender.

The Internet is a very hostile place, and anyone who’s ever stood up an Internet-facing IPv4 host has witnessed the immediate and continuous attacks against their server. I’ve maintained such a server for nearly six years now, and more than 99% of my incoming traffic has ill intent. One part of my defenses has been tarpits in various forms. The latest addition is an SSH tarpit I wrote a couple of months ago:

Endlessh: an SSH tarpit

This program opens a socket and pretends to be an SSH server. However, it actually just ties up SSH clients with false promises indefinitely — or at least until the client eventually gives up. After cloning the repository, here’s how you can try it out for yourself (default port 2222):

$ make
$ ./endlessh &
$ ssh -p2222 localhost

Your SSH client will hang there and wait for at least several days before finally giving up. Like a mammoth in the La Brea Tar Pits, it got itself stuck and can’t get itself out. As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.

My Internet-facing Endlessh server listens on port 22, which is the standard SSH port. I long ago moved my real SSH server off to another port where it sees a whole lot less SSH traffic — essentially none. This makes the logs a whole lot more manageable. And (hopefully) Endlessh convinces attackers not to look around for an SSH server on another port.

How does it work? Endlessh exploits a little paragraph in RFC 4253, the SSH protocol specification. Immediately after the TCP connection is established, and before negotiating the cryptography, both ends send an identification string:

SSH-protoversion-softwareversion SP comments CR LF

The RFC also notes:

The server MAY send other lines of data before sending the version string.

There is no limit on the number of lines, just that these lines must not begin with “SSH-” since that would be ambiguous with the identification string, and lines must not be longer than 255 characters including CRLF. So Endlessh sends an endless stream of randomly-generated “other lines of data” without ever intending to send a version string. By default it waits 10 seconds between each line. This slows down the protocol, but prevents it from actually timing out.
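Those two constraints are easy to satisfy. Here is a minimal sketch in Python of a line generator that respects them. It is not Endlessh's actual generator: the name generate_line echoes the C snippet later in this article, but the length range, the printable-ASCII payload, and the way a stray “SSH-” prefix is patched are all assumptions made for the example.

```python
import random

def generate_line(rng=random):
    """Return one random banner line as bytes, terminated by CRLF.

    The payload is at most 253 bytes so the whole line, including
    CRLF, never exceeds 255 bytes.
    """
    length = rng.randint(3, 253)
    payload = bytes(rng.randint(32, 126) for _ in range(length))
    if payload.startswith(b'SSH-'):
        # Would look like an identification string, so change one byte.
        payload = b'X' + payload[1:]
    return payload + b'\r\n'
```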

This means Endlessh need not know anything about cryptography or the vast majority of the SSH protocol. It’s dead simple.

Implementation strategies

Ideally the tarpit’s resource footprint should be as small as possible. It’s just a security tool, and the server does have an actual purpose that doesn’t include being a tarpit. It should tie up the attacker’s resources, not the server’s, and should generally be unnoticeable. (Take note all those who write the awful “security” products I have to tolerate at my day job.)

Even when many clients have been trapped, Endlessh spends more than 99.999% of its time waiting around, doing nothing. It wouldn’t even be accurate to call it I/O-bound. If anything, it’s timer-bound, waiting around before sending off the next line of data. The most precious resource to conserve is memory.

Processes

The most straightforward way to implement something like Endlessh is a fork server: accept a connection, fork, and the child simply alternates between sleep(3) and write(2):

for (;;) {
    ssize_t r;
    char line[256];

    sleep(DELAY);
    generate_line(line);
    r = write(fd, line, strlen(line));
    if (r == -1 && errno != EINTR) {
        exit(0);
    }
}

A process per connection is a lot of overhead when connections are expected to be up hours or even weeks at a time. An attacker who knows about this could exhaust the server’s resources with little effort by opening up lots of connections.

Threads

A better option is, instead of processes, to create a thread per connection. On Linux this is practically the same thing, but it’s still better. However, you still have to allocate a stack for the thread and the kernel will have to spend some resources managing the thread.

Poll

For Endlessh I went for an even more lightweight version: a single-threaded poll(2) server, analogous to stackless green threads. The overhead per connection is about as low as it gets.

Clients that are being delayed are not registered in poll(2). Their only overhead is the socket object in the kernel, and another 78 bytes to track them in Endlessh. Most of those bytes are used only for accurate logging. Only those clients that are overdue for a new line are registered for poll(2).

When clients are waiting, but no clients are overdue, poll(2) is essentially used in place of sleep(3). Though since it still needs to manage the accept server socket, it (almost) never actually waits on nothing.
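The bookkeeping behind this can be sketched with a simple queue. The following Python is an illustration only, not Endlessh's code: the names, the millisecond clock, and the fixed DELAY_MS are assumptions. Because every client gets the same delay, a FIFO queue stays sorted by deadline, so only the head needs to be examined to compute the poll(2) timeout.

```python
from collections import deque

DELAY_MS = 10_000  # assumed fixed delay between lines for every client

def poll_timeout_ms(queue, now_ms):
    """Timeout for poll(2): -1 blocks indefinitely, 0 returns at once."""
    if not queue:
        return -1                      # nobody waiting, block on accept
    send_at, _client = queue[0]        # head has the earliest deadline
    return max(0, send_at - now_ms)

def pop_overdue(queue, now_ms):
    """Remove and return clients whose next line is due."""
    due = []
    while queue and queue[0][0] <= now_ms:
        due.append(queue.popleft()[1])
    return due

def requeue(queue, client, now_ms):
    """Schedule the client's next line DELAY_MS from now."""
    queue.append((now_ms + DELAY_MS, client))
```

In a real server, the clients returned by pop_overdue() would be the ones registered with poll(2) for writing, then passed back to requeue() after their line is sent.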

There’s an option to limit the total number of client connections so that it doesn’t get out of hand. In this case it will stop polling the accept socket until a client disconnects. I probably shouldn’t have bothered with this option and instead relied on ulimit, a feature already provided by the operating system.

I could have used epoll (Linux) or kqueue (BSD), which would be much more efficient than poll(2). The problem with poll(2) is that it’s constantly registering and unregistering Endlessh on each of the overdue sockets each time around the main loop. This is by far the most CPU-intensive part of Endlessh, and it’s all inflicted on the kernel. Most of the time, even with thousands of clients trapped in the tarpit, only a small number of them are polled at once, so I opted for better portability instead.

One consequence of not polling connections that are waiting is that disconnections aren’t noticed in a timely fashion. This makes the logs less accurate than I like, but otherwise it’s pretty harmless. Unfortunately even if I wanted to fix this, the poll(2) interface isn’t quite equipped for it anyway.

Raw sockets

With a poll(2) server, the biggest overhead remaining is in the kernel, where it allocates send and receive buffers for each client and manages the proper TCP state. The next step to reducing this overhead is Endlessh opening a raw socket and speaking TCP itself, bypassing most of the operating system’s TCP/IP stack.

Much of the TCP connection state doesn’t matter to Endlessh and doesn’t need to be tracked. For example, it doesn’t care about any data sent by the client, so no receive buffer is needed, and any data that arrives could be dropped on the floor.

Even more, raw sockets would allow for some even nastier tarpit tricks. Despite the long delays between data lines, the kernel itself responds very quickly on the TCP layer and below. ACKs are sent back quickly and so on. An astute attacker could detect that the delay is artificial, imposed above the TCP layer by an application.

If Endlessh worked at the TCP layer, it could tarpit the TCP protocol itself. It could introduce artificial “noise” to the connection that requires packet retransmissions, delay ACKs, etc. It would look a lot more like network problems than a tarpit.

I haven’t taken Endlessh this far, nor do I plan to do so. At the moment attackers either have a hard timeout, so this wouldn’t matter, or they’re pretty dumb and Endlessh already works well enough.

asyncio and other tarpits

Since writing Endlessh I’ve learned about Python’s asyncio, and it’s actually a near perfect fit for this problem. I should have just used it in the first place. The hard part is already implemented within asyncio, and the problem isn’t CPU-bound, so being written in Python doesn’t matter.

Here’s a simplified (no logging, no configuration, etc.) version of Endlessh implemented in about 20 lines of Python 3.7:

import asyncio
import random

async def handler(_reader, writer):
    try:
        while True:
            await asyncio.sleep(10)
            writer.write(b'%x\r\n' % random.randint(0, 2**32))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 2222)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Since Python coroutines are stackless, the per-connection memory overhead is comparable to the C version. So it seems asyncio is perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up attackers trying to exploit HTTP servers. It slowly sends a random, endless HTTP header:

import asyncio
import random

async def handler(_reader, writer):
    writer.write(b'HTTP/1.1 200 OK\r\n')
    try:
        while True:
            await asyncio.sleep(5)
            header = random.randint(0, 2**32)
            value = random.randint(0, 2**32)
            writer.write(b'X-%x: %x\r\n' % (header, value))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Try it out for yourself. Firefox and Chrome will spin on that server for hours before giving up. I have yet to see curl actually time out on its own in the default settings (--max-time/-m does work correctly, though).

Parting exercise for the reader: Using the examples above as a starting point, implement an SMTP tarpit using asyncio. Bonus points for using TLS connections and testing it against real spammers.

A JIT Compiler Skirmish with SELinux

2018-11-15T18:57:47Z

This is a debugging war story.

Once upon a time I wrote a fancy data conversion utility. The input was a complex binary format defined by a data dictionary supplied at run time by the user alongside the input data. Since the converter was typically used to process massive quantities of input, and the nature of that input wasn’t known until run time, I wrote an x86-64 JIT compiler to speed it up. The converter generated a fast, native binary parser in memory according to the data dictionary specification. Processing data now took much less time and everyone rejoiced.

Then along came SELinux, Sheriff of Pedantry. Not liking all the shenanigans with page protections, SELinux huffed and puffed and made mprotect(2) return EACCES (“Permission denied”). Believing I was following all the rules and so this would never happen, I foolishly did not check the result and the converter was now crashing for its users. What made SELinux so unhappy, and could this somehow be resolved?

Allocating memory

Before going further, let’s back up and review how this works. Suppose I want to generate code at run time and execute it. In the old days this was as simple as writing some machine code into a buffer and jumping to that buffer — e.g. by converting the buffer to a function pointer and calling it.

typedef int (*jit_func)(void);

/* NOTE: This doesn't work anymore! */
jit_func
jit_compile(int retval)
{
    unsigned char *buf = malloc(6);
    if (buf) {
        /* mov eax, retval */
        buf[0] = 0xb8;
        buf[1] = retval >>  0;
        buf[2] = retval >>  8;
        buf[3] = retval >> 16;
        buf[4] = retval >> 24;
        /* ret */
        buf[5] = 0xc3;
    }
    return (jit_func)buf;
}

int
main(void)
{
    jit_func f = jit_compile(1001);
    printf("f() = %d\n", f());
    free(f);
}

This situation was far too easy for malicious actors to abuse. An attacker could supply instructions of their own choosing — i.e. shell code — as input and exploit a buffer overflow vulnerability to execute the input buffer. These exploits were trivial to craft.

Modern systems have hardware checks to prevent this from happening. Memory containing instructions must have its execute protection bit set before those instructions can be executed. This is useful both for making attackers work harder and for catching bugs in programs: no more executing data by accident.

This is further complicated by the fact that memory protections have page granularity. You can’t adjust the protections for a 6-byte buffer. You do it for the entire surrounding page — typically 4kB, but sometimes as large as 2MB. This requires replacing that malloc(3) with a more careful allocation strategy. There are a few ways to go about this.
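To make the page-granularity point concrete, here is a small sketch of the rounding involved, written in Python purely for illustration. The helper name `page_span` is hypothetical, and the code assumes the page size is a power of two. Given a buffer address and length, it computes the whole-page region that a protection change would actually cover.

```python
import mmap

def page_span(addr, length, pagesize=mmap.PAGESIZE):
    """Return (start, size) of the whole pages covering [addr, addr + length).

    Hypothetical helper for illustration; assumes pagesize is a power of two.
    """
    mask = ~(pagesize - 1)
    start = addr & mask                           # round the address down
    end = (addr + length + pagesize - 1) & mask   # round the end up
    return start, end - start

# A 6-byte buffer still drags in its entire surrounding page.
start, size = page_span(0x1077010, 6, 4096)
print(hex(start), size)
```

A buffer that straddles a page boundary covers two pages, which is one reason a JIT compiler wants its own page-aligned allocation rather than an arbitrary pointer from malloc(3).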

Anonymous memory mapping

The most common and most sensible is to create an anonymous memory mapping: a file memory map that’s not actually backed by a file. The mmap(2) function has a flag specifically for this purpose: MAP_ANONYMOUS.

#include <sys/mman.h>

void *
anon_alloc(size_t len)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    void *p = mmap(0, len, prot, flags, -1, 0);
    return p != MAP_FAILED ? p : 0;
}

void
anon_free(void *p, size_t len)
{
    munmap(p, len);
}

Unfortunately, MAP_ANONYMOUS is not part of POSIX. If you’re being super strict with your includes (as I tend to be) this flag won’t be defined, even on systems where it’s supported.

#define _POSIX_C_SOURCE 200112L
#include <sys/mman.h>
// MAP_ANONYMOUS undefined!

To get the flag, you must use the _BSD_SOURCE, or, more recently, the _DEFAULT_SOURCE feature test macro to explicitly enable that feature.

#define _POSIX_C_SOURCE 200112L
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#include <sys/mman.h>

The POSIX way to do this is to instead map /dev/zero. So, wanting to be Mr. Portable, this is what I did in my tool. Take careful note of this.

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void *
anon_alloc(size_t len)
{
    int fd = open("/dev/zero", O_RDWR);
    if (fd == -1)
        return 0;
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_PRIVATE;
    void *p = mmap(0, len, prot, flags, fd, 0);
    close(fd);
    return p != MAP_FAILED ? p : 0;
}

Aligned allocation

Another, less common (and less portable) strategy is to lean on the existing C memory allocator, being careful to allocate on page boundaries so that the page protections don’t affect other allocations. The classic allocation functions, like malloc(3), don’t allow for this kind of control. However, there are a couple of aligned allocation alternatives.

The first is posix_memalign(3):

int posix_memalign(void **ptr, size_t alignment, size_t size);

By choosing page alignment and a size that’s a multiple of the page size, it’s guaranteed to return whole pages. When done, pages are freed with free(3). Though, unlike unmapping, the original page protections must first be restored since those pages may be reused.

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>

void *
anon_alloc(size_t len)
{
    void *p;
    long pagesize = sysconf(_SC_PAGE_SIZE); // TODO: cache this
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
    return posix_memalign(&p, pagesize, roundup) ? 0 : p;
}

If you’re using C11, there’s also aligned_alloc(3). This is the most uncommon of all since most C programmers refuse to switch to a new standard until it’s at least old enough to drive a car.

Changing page protections

So we’ve allocated our memory, but it’s not going to start in an executable state. Why? Because a W^X (“write xor execute”) policy is becoming increasingly common. Attempting to set both write and execute protections at the same time may be denied. (In fact, there’s an SELinux policy for this.)

As a JIT compiler, we need to write to a page and execute it. Again, there are two strategies. The complicated strategy is to map the same memory at two different places, one with the execute protection, one with the write protection. This allows the page to be modified as it’s being executed without violating W^X.

The simpler and more secure strategy is to write the machine instructions, then swap the page over to executable using mprotect(2) once it’s ready. This is what I was doing in my tool.

unsigned char *buf = anon_alloc(len);
/* ... write instructions into the buffer ... */
mprotect(buf, len, PROT_EXEC);
jit_func func = (jit_func)buf;
func();

At a high level, that’s pretty close to what I was actually doing. That includes neglecting to check the result of mprotect(2). This worked fine and dandy for several years, when suddenly (shown here in the style of strace):

mprotect(ptr, len, PROT_EXEC) = -1 EACCES (Permission denied)

Then the program would crash trying to execute the buffer. Suddenly it wasn’t allowed to make this buffer executable. My program hadn’t changed. What had changed was the SELinux security policy on this particular system.

Asking for help

The problem is that I don’t administer this (Red Hat) system. I can’t access the logs and I didn’t set the policy. I don’t have any insight on why this call was suddenly being denied. To make this more challenging, the folks that manage this system didn’t have the necessary knowledge to help with this either.

So to figure this out, I need to treat it like a black box and probe at system calls until I can figure out just what SELinux policy I’m up against. I only have practical experience administrating Debian systems (and its derivatives like Ubuntu), which means I’ve hardly ever had to deal with SELinux. I’m flying fairly blind here.

Since my real application is large and complicated, I code up a minimal example, around a dozen lines of code: allocate a single page of memory, write a single return (ret) instruction into it, set it as executable, and call it. The program checks for errors, and I can run it under strace if that’s not insightful enough. This program is also something simple I could provide to the system administrators, since they were willing to turn some of the knobs to help narrow down the problem.

However, here’s where I made a major mistake. Assuming the problem was solely in mprotect(2), and wanting to keep this as absolutely simple as possible, I used posix_memalign(3) to allocate that page. I saw the same EACCES as before, and assumed I was demonstrating the same problem. Take note of this, too.

Finding a resolution

Eventually I’d need to figure out what policy was blocking my JIT compiler, then see if there was an alternative route. The system loader still worked after all, and I could plainly see that with strace. So it wasn’t a blanket policy that completely blocked the execute protection. Perhaps the loader was given an exception?

However, the very first order of business was to actually check the result from mprotect(2) and do something more graceful rather than crash. In my case, that meant falling back to executing a byte-code virtual machine. I added the check, and now the program ran slower instead of crashing.

The program runs on both Linux and Windows, and the allocation and page protection management is abstracted. On Windows it uses VirtualAlloc() and VirtualProtect() instead of mmap(2) and mprotect(2). Neither implementation checked that the protection change succeeded, so I fixed the Windows implementation while I was at it.

Thanks to Mingw-w64, I actually do most of my Windows development on Linux. And, thanks to Wine, I mean everything, including running and debugging. Calling VirtualProtect() in Wine would ultimately call mprotect(2) in the background, which I expected would be denied. So running the Windows version with Wine under this SELinux policy would be the perfect test. Right?

Except that mprotect(2) succeeded under Wine! The Windows version of my JIT compiler was working just fine on Linux. Huh?

This system doesn’t have Wine installed. I had built and packaged it myself. This Wine build definitely has no SELinux exceptions. Not only did the Wine loader work correctly, it can change page protections in ways my own Linux programs could not. What’s different?

Debugging this with all these layers is starting to look silly, but this is exactly why doing Windows development on Linux is so useful. I run my program under Wine under strace:

$ strace wine ./mytool.exe

I study the system calls around mprotect(2). Perhaps there’s some stricter alignment issue? No. Perhaps I need to include PROT_READ? No. The only difference I can find is they’re using the MAP_ANONYMOUS flag. So, armed with this knowledge, I modify my minimal example to allocate 1024 pages instead of just one, and suddenly it works correctly. I was most of the way to figuring this all out.

Inside glibc allocation

Why did increasing the allocation size change anything? This is a typical Linux system, so my program is linked against the GNU C library, glibc. This library allocates memory from two places depending on the allocation size.

For small allocations, glibc uses brk(2) to extend the executable image — i.e. to extend the .bss section. These resources are not returned to the operating system after they’re freed with free(3). They’re reused.

For large allocations, glibc uses mmap(2) to create a new, anonymous mapping for that allocation. When freed with free(3), that memory is unmapped and its resources are returned to the operating system.

By increasing the allocation size, it became a “large” allocation and was backed by an anonymous mapping. Even though I didn’t use mmap(2), to the operating system this would be indistinguishable from what Wine was doing (and succeeding at).

Consider this little example program:

int
main(void)
{
    printf("%p\n", malloc(1));
    printf("%p\n", malloc(1024 * 1024));
}

When not compiled as a Position Independent Executable (PIE), here’s what the output looks like. The first pointer is near where the program was loaded, low in memory. The second pointer is a randomly selected address high in memory.

0x1077010
0x7fa9b998e010

And if you run it under strace, you’ll see that the first allocation comes from brk(2) and the second comes from mmap(2).

Two SELinux policies

With a little bit of research, I found the two SELinux policies at play here. In my minimal example, I was blocked by allow_execheap.

/selinux/booleans/allow_execheap

This prohibits programs from setting the execute protection on any “heap” page.

The POSIX specification does not permit it, but the Linux implementation of mprotect allows changing the access protection of memory on the heap (e.g., allocated using malloc). This error indicates that heap memory was supposed to be made executable. Doing this is really a bad idea. If anonymous, executable memory is needed it should be allocated using mmap which is the only portable mechanism.

Obviously this is pretty loose since I was still able to do it with posix_memalign(3), which, technically speaking, allocates from the heap. So this policy applies to pages mapped by brk(2).

The second policy was allow_execmod.

/selinux/booleans/allow_execmod

The program mapped from a file with mmap and the MAP_PRIVATE flag and write permission. Then the memory region has been written to, resulting in copy-on-write (COW) of the affected page(s). This memory region is then made executable […]. The mprotect call will fail with EACCES in this case.

I don’t understand what purpose this policy serves, but this is what was causing my original problem. Pages mapped to /dev/zero are not actually considered anonymous by Linux, at least as far as this policy is concerned. I think this is a mistake, and that mapping the special /dev/zero device should result in effectively anonymous pages.

From this I learned a little lesson about baking assumptions — that mprotect(2) was solely at fault — into my minimal debugging examples. And the fix was ultimately easy: I just had to suck it up and use the slightly less pure MAP_ANONYMOUS flag.

A Crude Personal Package Manager

2018-03-27T02:10:35Z

For the past couple of months I’ve been using a custom package manager to manage a handful of software packages within various unix-like environments. Packages are installed in my home directory under ~/.local/bin, and the package manager itself is just a 110 line Bourne shell script. It is not intended to replace the system’s package manager but, instead, to complement it in some cases where I need more flexibility. I use it to run custom versions of specific pieces of software (newer or older than the system-installed versions, or with my own patches and modifications) without interfering with the rest of the system, and without a need for root access. It’s worked out really well so far and I expect to continue making heavy use of it in the future.

It’s so simple that I haven’t even bothered putting the script in its own repository. It sits unadorned within my dotfiles repository with the name qpkg (“quick package”):

https://github.com/skeeto/dotfiles/blob/master/bin/qpkg

Sitting alongside my dotfiles means it’s always there when I need it, just as if it was a built-in command.

I say it’s crude because its “install” (-I) procedure is little more than a wrapper around tar. It doesn’t invoke libtool after installing a library, and there’s no post-install script — or postinst as Debian calls it. It doesn’t check for conflicts between packages, though there’s a command for doing so manually ahead of time. It doesn’t manage dependencies, nor even have them as a concept. That’s all on the user to screw up.

In other words, it doesn’t attempt to solve most of the hard problems tackled by package managers… except for three important issues:

It provides a clean, guaranteed-to-work uninstall procedure. Some Makefiles do have a token “uninstall” target, but it’s often unreliable.
Unlike blindly using a Makefile “install” target, I can check for conflicts before installing the software. I’ll know if and how a package clobbers an already-installed package, and I can manage, or ignore, that conflict manually as needed.
It produces a compact, reusable package file that I can reinstall later, even on a different machine (with a couple of caveats). I don’t need to keep around the original source and build directories should I want to install or uninstall later. I can also rapidly switch back and forth between different builds of the same software.

The first caveat is that the package will be configured for exactly my own home directory, so I usually can’t share it with other users, or install it on machines where I have a different home directory. Though I could still create packages for different installation prefixes.

The second caveat is that some builds tailor themselves by default to the host (e.g. -march=native). If care isn’t taken, those packages may not be very portable. This is more common than I had expected and has mildly annoyed me.

Birth of a package manager

While the package manager is new, I’ve been building and installing software in my home directory for years. I’d follow the normal process of setting the install prefix to $HOME/.local, running the build, and then letting the “install” target do its thing.

$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install

This worked well enough for years. However, I’ve come to rely a lot on this technique, and I’m using it for increasingly sophisticated purposes, such as building custom cross-compiler toolchains.

A common difficulty has been handling the release of new versions of software. I’d like to upgrade to the new version, but lack a way to cleanly uninstall the previous version. Simply clobbering the old version by installing the new one on top usually worked. Occasionally it didn’t, and I’d have to blow away ~/.local and start all over again. With more and more software installed in my home directory, restarting has become more and more of a chore that I’d like to avoid.

What I needed was a way to track exactly which files were installed so that I could remove them later when I needed to uninstall. Fortunately there’s a widely-used convention for exactly this purpose: DESTDIR.

It’s expected that when a Makefile provides an “install” target, it prefixes the installation path with the DESTDIR macro, which is assigned to the empty string by default. This allows the user to install the software to a temporary location for the purposes of packaging. Unlike the installation prefix (--prefix) configured before the build began, the software is not expected to function properly when run in the DESTDIR location.

$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=$DESTDIR install

A different tool will be used to copy these files into place and actually install it. This tool can track what files were installed, allowing them to be removed later when uninstalling. My package manager uses the tar program for both purposes. First it creates a package by packing up the DESTDIR (at the root of the actual install prefix):

$ tar czf package.tgz -C $DESTDIR$HOME/.local .

So a package is nothing more than a gzipped tarball. To install, it unpacks the tarball in ~/.local.

$ cd $HOME/.local
$ tar xzf ~/package.tgz

But how does it uninstall a package? It didn’t keep track of what was installed. Easy! The tarball itself contains the package list, and it’s printed with tar’s t mode.

cd $HOME/.local
for file in $(tar tzf package.tgz | grep -v '/$'); do
    rm -f "$file"
done

I’m using grep to skip directories, which are conveniently listed with a trailing slash. Note that in the example above, there are a couple of issues with file names containing whitespace. If the file contains a space character, it will word split incorrectly in the for loop. A Makefile couldn’t handle such a file in the first place, but, in case it’s still necessary, my package manager sets IFS to just a newline.

If the file name contains a newline, then my package manager relies on a cosmic ray striking just the right bit at just the right instant to make it all work out, because no version of tar can unambiguously print such file names. Crossing your fingers during this process may help.
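For comparison, here is a rough Python sketch of the same "uninstall by listing the tarball" idea. It is not how qpkg works (qpkg is a shell script built on tar); the function name `uninstall` is hypothetical, and the sketch assumes the package is a trusted gzipped tarball with relative member names. Because the standard tarfile module returns each member name as its own string, names containing spaces or even newlines need no word-splitting workarounds.

```python
import os
import tarfile

def uninstall(package, prefix):
    """Remove every non-directory entry listed in the tarball from prefix.

    Hypothetical helper; assumes a trusted package with relative names.
    Returns the member names that were actually removed.
    """
    removed = []
    with tarfile.open(package, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isdir():
                continue  # like grep -v '/$': directories are left in place
            try:
                os.remove(os.path.join(prefix, member.name))
                removed.append(member.name)
            except FileNotFoundError:
                pass  # like rm -f: a missing file is not an error
    return removed
```

The trade-off is the dependency on Python; the shell version only needs tar, grep, and rm.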

Commands

There are five commands, each assigned to a capital letter: -B, -C, -I, -V, and -U. It’s an interface pattern inspired by Ted Unangst’s signify (see signify(1)). I also used this pattern with Blowpipe and, in retrospect, wish I had also used it with Enchive.

Build (`-B`)

Unlike the other four commands, the “build” command isn’t essential, and is just for convenience. It assumes the build uses an Autoconf-like configure script and runs it automatically, followed by make with the appropriate -j (jobs) option. It automatically sets the --prefix argument when running the configure script.

If the build uses something other than an Autoconf-like configure script, such as CMake, then you can’t use the “build” command and must perform the build yourself. For example, I must do this when building LLVM and Clang.

Before using the “build” command, the package must first be unpacked and patched if necessary. Then the package manager can take over to run the build.

$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 < ../0001.patch
$ patch -p1 < ../0002.patch
$ patch -p1 < ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/

In this example I’m doing an out-of-source build by invoking the configure script from a different directory. Did you know Autoconf scripts support this? I didn’t know until recently! Unfortunately some hand-written Autoconf-like scripts don’t, though this will be immediately obvious.

Once qpkg returns, the program will be fully built — or stuck on a build error if you’re unlucky. If you need to pass custom configure options, just tack them on the qpkg command:

$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses

Since the second and third steps (creating the build directory and moving into it) are so common, there’s an optional switch for them: -d. This option’s argument is the build directory. qpkg creates that directory and runs the build inside it. In practice I just use “x” for the build directory since it’s so quick to add “dx” to the command.

$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/

With the software compiled, the next step is creating the package.

Create (`-C`)

The “create” command creates the DESTDIR (_destdir in the working directory) and runs the “install” Makefile target to fill it with files. Continuing with the example above and its x/ build directory:

$ qpkg -Cdx name

Where “name” is the name of the package, without any file name extension. Like with “build”, extra arguments after the package name are passed to make in case there needs to be any additional tweaking.

When the “create” command finishes, there will be a new package named name.tgz in the working directory. At this point the source and build directories are no longer needed, assuming everything went fine.

$ rm -rf name-version/
$ rm -rf x/

This package is ready to install, though you may want to verify it first.

Verify (`-V`)

The “verify” command checks for collisions against installed packages. It works like uninstallation, but rather than deleting files, it checks if any of the files already exist. If they do, it means there’s a conflict with an existing package. These file names are printed.

$ qpkg -V name.tgz

The most common conflict I’ve seen is in the info index (info/dir) file, which is safe to ignore since I don’t care about it.

If the package has already been installed, there will of course be tons of conflicts. This is the easiest way to check if a package has been installed.
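The collision check can be sketched in a few lines of Python as well. This is only an illustration of the idea, not qpkg's actual shell implementation; the function name `conflicts` is hypothetical and the package is assumed to be a gzipped tarball with relative member names.

```python
import os
import tarfile

def conflicts(package, prefix):
    """List files in the tarball that already exist under prefix.

    Hypothetical helper mirroring the "verify" idea: same walk as an
    uninstall, but it only reports instead of deleting.
    """
    found = []
    with tarfile.open(package, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isdir():
                continue  # shared directories such as bin/ are not conflicts
            if os.path.lexists(os.path.join(prefix, member.name)):
                found.append(member.name)
    return found
```

An empty result means nothing under the prefix would be clobbered; a result listing every file in the package suggests the package is already installed.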

Install (`-I`)

The “install” command is just the dumb tar xzf explained above. It will clobber anything in its way without warning, which is why, if that matters, “verify” should be used first.

$ qpkg -I name.tgz

When qpkg returns, the package has been installed and is probably ready to go. A lot of packages complain that you need to run libtool to finalize an installation, but I’ve never had a problem skipping it. This dumb unpacking generally works fine.

Uninstall (`-U`)

Obviously the last command is “uninstall”. As explained above, this needs the original package that was given to the “install” command.

$ qpkg -U name.tgz

Just as “install” is dumb, so is “uninstall,” blindly deleting anything listed in the tarball. One thing I like about dumb tools is that there are no surprises.

I typically suffix the package name with the version number to help keep the packages organized. When upgrading to a new version of a piece of software, I build the new package, which, thanks to the version suffix, will have a distinct name. Then I uninstall the old package, and, finally, install the new one in its place. So far I’ve been keeping the old package around in case I still need it, though I could always rebuild it in a pinch.

Package by accumulation

Building a GCC cross-compiler toolchain is a tricky case that doesn’t fit so well with the build, create, and install process illustrated above. It would be nice for the cross-compiler to be a single, big package, but due to the way it’s built, it would need to be five or so packages, a couple of which will conflict (one being a subset of another):

binutils
C headers
core GCC
C runtime
rest of GCC

Each step needs to be installed before the next step will work. (I don’t even want to think about cross-compiling a cross-compiler.)

To deal with this, I added a “keep” (-k) option that leaves the DESTDIR around after creating the package. To keep things tidy, the intermediate packages exist and are installed, but the final, big cross-compiler package accumulates into the DESTDIR. The final package at the end is actually the whole cross compiler in one package, a superset of them all.

Complicated situations like these are where I can really understand the value of Debian’s fakeroot tool.

My use case, and an alternative

The role filled by my package manager is actually pretty well suited for pkgsrc, which is NetBSD’s ports system made available to other unix-like systems. However, I just need something really lightweight that gives me absolute control — even more than I get with pkgsrc — in the dozen or so cases where I really need it.

All I need is a standard C toolchain in a unix-like environment (even a really old one), the source tarballs for the software I need, my 110 line shell script package manager, and one to two cans of elbow grease. From there I can bootstrap everything I might need without root access, even in a disaster. If the software I need isn’t written in C, it can ultimately get bootstrapped from some crusty old C compiler, which might even involve building some newer C compilers in between. After a certain point it’s C all the way down.

Blowpipe: a Blowfish-encrypted, Authenticated Pipe

2017-09-15T23:59:59Z

Blowpipe is a toy crypto tool that creates a Blowfish-encrypted pipe. It doesn’t open any files and instead encrypts and decrypts from standard input to standard output. This pipe can encrypt individual files or even encrypt a network connection (à la netcat).

Most importantly, since Blowpipe is intended to be used as a pipe (duh), it will never output decrypted plaintext that hasn’t been authenticated. That is, it will detect tampering of the encrypted stream and truncate its output, reporting an error, without producing the manipulated data. Some very similar tools that aren’t considered toys lack this important feature, such as aespipe.

Purpose

Blowpipe came about because I wanted to study Blowfish, a 64-bit block cipher designed by Bruce Schneier in 1993. It’s played an important role in the history of cryptography and has withstood cryptanalysis for 24 years. Its major weakness is its small block size, leaving it vulnerable to birthday attacks regardless of any other property of the cipher. Even in 1993 the 64-bit block size was a bit on the small side, but Blowfish was intended as a drop-in replacement for the Data Encryption Standard (DES) and the International Data Encryption Algorithm (IDEA), other 64-bit block ciphers.

The main reason I’m calling this program a toy is that, outside of legacy interfaces, it’s simply not appropriate to deploy a 64-bit block cipher in 2017. Blowpipe shouldn’t be used to encrypt more than a few tens of GBs of data at a time. Otherwise I’m fairly confident in both my message construction and my implementation. One detail is a little uncertain, and I’ll discuss it later when describing message format.
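The "few tens of GBs" figure lines up with the usual birthday-bound rule of thumb for a 64-bit block. The arithmetic below is a back-of-the-envelope sketch, not a statement about Blowpipe's exact safety margin: with n-bit blocks, repeated ciphertext blocks become likely after roughly 2^(n/2) blocks.

```python
BLOCK_BITS = 64                      # Blowfish block size in bits
BLOCK_BYTES = BLOCK_BITS // 8

# Birthday bound: collisions among random n-bit values become likely
# after roughly 2**(n/2) of them.
birthday_blocks = 2 ** (BLOCK_BITS // 2)
birthday_bytes = birthday_blocks * BLOCK_BYTES

print(birthday_blocks)               # 4294967296
print(birthday_bytes / 2**30)        # 32.0 (GiB)
```

For comparison, the same calculation for a 128-bit block gives 2^64 blocks of 16 bytes each, far beyond anything a pipe will carry.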

A tool that I am confident about is Enchive, though since it’s intended for file encryption, it’s not appropriate for use as a pipe. It doesn’t authenticate until after it has produced most of its output. Enchive does try its best to delete files containing unauthenticated output when authentication fails, but this doesn’t prevent you from consuming this output before it can be deleted, particularly if you pipe the output into another program.

Usage

As you might expect, there are two modes of operation: encryption (-E) and decryption (-D). The simplest usage is encrypting and decrypting a file:

$ blowpipe -E < data.gz > data.gz.enc
$ blowpipe -D < data.gz.enc | gunzip > data.txt

In both cases you will be prompted for a passphrase which can be up to 72 bytes in length. The only verification for the key is the first Message Authentication Code (MAC) in the datastream, so Blowpipe cannot tell the difference between damaged ciphertext and an incorrect key.

In a script it would be smart to check Blowpipe’s exit code after decrypting. The output will be truncated should authentication fail somewhere in the middle. Since Blowpipe isn’t aware of files, it can’t clean up for you.

Another use case is securely transmitting files over a network with netcat. In this example I’ll use a pre-shared key file, keyfile. Rather than prompt for a key, Blowpipe will use the raw bytes of a given file. Here’s how I would create a key file:

$ head -c 32 /dev/urandom > keyfile

First the receiver listens on a socket (bind(2)):

$ nc -lp 2000 | blowpipe -D -k keyfile > data.zip

Then the sender connects (connect(2)) and pipes Blowpipe through:

$ blowpipe -E -k keyfile < data.zip | nc -N hostname 2000

If all went well, Blowpipe will exit with 0 on the receiver side.

Blowpipe doesn’t buffer its output (but see -w). It performs one read(2), encrypts whatever it got, prepends a MAC, and calls write(2) on the result. This means it can comfortably transmit live sensitive data across the network:

$ nc -lp 2000 | blowpipe -D

# dmesg -w | blowpipe -E | nc -N hostname 2000

Kernel messages will appear on the other end as they’re produced by dmesg. Though keep in mind that the size of each line will be known to eavesdroppers. Blowpipe doesn’t pad it with noise or otherwise try to disguise the length. Those lengths may leak useful information.
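The "authenticate each chunk before releasing it" behavior can be sketched as follows. This is emphatically not Blowpipe's wire format: the frame layout (2-byte length, then MAC, then payload), the function names, and the use of HMAC-SHA256 are all assumptions chosen so the example runs with only the Python standard library, and encryption is omitted entirely. A real design would also need to guard against reordered or dropped chunks.

```python
import hashlib
import hmac
import struct

MAC_LEN = 32  # HMAC-SHA256 output size in bytes

def seal(key, chunk):
    """Frame one chunk as: 2-byte big-endian length, MAC, payload."""
    header = struct.pack(">H", len(chunk))
    mac = hmac.new(key, header + chunk, hashlib.sha256).digest()
    return header + mac + chunk

def open_stream(key, data):
    """Yield payloads one by one, raising before a bad chunk is released."""
    pos = 0
    while pos < len(data):
        header = data[pos:pos + 2]
        mac = data[pos + 2:pos + 2 + MAC_LEN]
        if len(header) < 2 or len(mac) < MAC_LEN:
            raise ValueError("truncated frame")
        (length,) = struct.unpack(">H", header)
        start = pos + 2 + MAC_LEN
        chunk = data[start:start + length]
        expected = hmac.new(key, header + chunk, hashlib.sha256).digest()
        if len(chunk) < length or not hmac.compare_digest(mac, expected):
            raise ValueError("authentication failed")
        yield chunk  # only authenticated data ever leaves this function
        pos = start + length
```

Note that the length field travels in the clear in this sketch, which mirrors the caveat above: an eavesdropper learns the size of every chunk.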

Blowfish

This whole project started when I wanted to play with Blowfish as a small drop-in library. I wasn’t satisfied with the selection, so I figured it would be a good exercise to write my own. Besides, the specification is both an enjoyable and easy read (and recommended). It justifies the need for a new cipher and explains the various design decisions.

I coded from the specification, including writing a script to generate the subkey initialization tables. Subkeys are initialized to the binary representation of pi (the first ~10,000 decimal digits). After a couple hours of work I hooked up the official test vectors to see how I did, and all the tests passed on the first run. This wasn’t reasonable, so I spent a while longer figuring out how I screwed up my tests. Turns out I absolutely nailed it on my first shot. It’s a really great sign for Blowfish that it’s so easy to implement correctly.

Blowfish’s key schedule produces five subkeys requiring 4,168 bytes of storage. The key schedule is unusually complex: Subkeys are repeatedly encrypted with themselves as they are being computed. This complexity inspired the bcrypt password hashing scheme, which essentially works by iterating the key schedule many times in a loop, then encrypting a constant 24-byte string. My bcrypt implementation wasn’t nearly as successful on my first attempt, and it took hours of debugging in order to match OpenBSD’s outputs.
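The 4,168 byte figure falls out of the sizes of the five subkeys: one 18-entry P-array and four 256-entry S-boxes, all 32-bit words. Here’s a small sketch of that arithmetic. The struct name and layout are hypothetical and not necessarily how the library’s own struct blowfish is declared.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only. */
struct subkeys {
    uint32_t p[18];      /* P-array: 18 * 4 = 72 bytes */
    uint32_t s[4][256];  /* four S-boxes: 4 * 256 * 4 = 4,096 bytes */
};

/* 72 + 4,096 = 4,168 bytes; every member is a 32-bit word, so there
 * is no padding. */
static size_t
subkeys_size(void)
{
    return sizeof(struct subkeys);
}
```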

The encryption and decryption algorithms are nearly identical, as is typical for, and a feature of, Feistel ciphers. There are no branches (preventing some side-channel attacks), and the only operations are 32-bit XOR and 32-bit addition. This makes it ideal for implementation on 32-bit computers.
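To illustrate why the two directions are nearly identical, here’s a toy Feistel network. This is a sketch, not Blowfish: the round function f() and the round keys are made up, and only the overall structure (16 rounds over a pair of 32-bit halves) mirrors the real thing. Decryption is the same loop with the round keys taken in reverse order.

```c
#include <stdint.h>

#define ROUNDS 16

/* Hypothetical round function using only 32-bit addition and XOR. It
 * need not be invertible; the Feistel structure supplies that.
 * Blowfish's real round function is built from its S-boxes. */
static uint32_t
f(uint32_t x, uint32_t k)
{
    return (x + k) ^ ((x ^ k) + 0x9e3779b9U);
}

/* Encrypts when decrypt is 0, decrypts otherwise. The only difference
 * between the two is the order in which round keys are consumed. */
static void
feistel(uint32_t *l, uint32_t *r, const uint32_t *keys, int decrypt)
{
    uint32_t a = *l;
    uint32_t b = *r;
    for (int i = 0; i < ROUNDS; i++) {
        uint32_t k = keys[decrypt ? ROUNDS - 1 - i : i];
        uint32_t t = b ^ f(a, k);
        b = a;
        a = t;
    }
    *l = b;  /* undo the final swap */
    *r = a;
}
```

Note the absence of data-dependent branches: the only conditional selects a key index from the loop counter, never from the data being encrypted.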

One tricky point is that encryption and decryption operate on a pair of 32-bit integers (another giveaway that it’s a Feistel cipher). To put the cipher to practical use, these integers have to be serialized into a byte stream. The specification doesn’t choose a byte order, even for mixing the key into the subkeys. The official test vectors are also 32-bit integers, not byte arrays. An implementer could choose little endian, big endian, or even something else.

However, there’s one place in which this decision is formally made: the official test vectors mix the key into the first subkey in big endian byte order. By luck I happened to choose big endian as well, which is why my tests passed on the first try. OpenBSD’s version of bcrypt also uses big endian for all integer encoding steps, further cementing big endian as the standard way to encode Blowfish integers.
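In C, big endian encoding of a 32-bit integer is a matter of shifting, which works identically regardless of the host’s own byte order. A minimal sketch, with hypothetical helper names:

```c
#include <stdint.h>

/* Read four bytes as a big endian 32-bit integer. */
static uint32_t
load_u32be(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3];
}

/* Write a 32-bit integer as four big endian bytes. */
static void
store_u32be(unsigned char *p, uint32_t x)
{
    p[0] = (unsigned char)(x >> 24);
    p[1] = (unsigned char)(x >> 16);
    p[2] = (unsigned char)(x >>  8);
    p[3] = (unsigned char)(x >>  0);
}
```

Helpers like these would sit between a byte stream and the uint32_t pairs that the cipher operates on.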

Blowfish library

The Blowpipe repository contains a ready-to-use, public domain Blowfish library written in strictly conforming C99. The interface is just three functions:

void blowfish_init(struct blowfish *, const void *key, int len);
void blowfish_encrypt(struct blowfish *, uint32_t *, uint32_t *);
void blowfish_decrypt(struct blowfish *, uint32_t *, uint32_t *);

Technically the key can be up to 72 bytes long, but the last 16 bytes have an incomplete effect on the subkeys, so only the first 56 bytes should matter. Since bcrypt runs the key schedule multiple times, all 72 bytes have full effect.

The library also includes a bcrypt implementation, though it will only produce the raw password hash, not the base-64 encoded form. The main reason for including bcrypt is to support Blowpipe.

Message format

The main goal of Blowpipe was to build a robust, authenticated encryption tool using only Blowfish as a cryptographic primitive.

It uses bcrypt with a moderately-high cost as a key derivation function (KDF). Not terrible, but bcrypt is not a memory-hard KDF, and memory hardness is important for protecting against cheap hardware brute force attacks.
Encryption is Blowfish in “counter” (CTR) mode. A 64-bit counter is incremented and encrypted, producing a keystream. The plaintext is XORed with this keystream like a stream cipher. This allows the last block to be truncated when output and eliminates some padding issues. Since CTR mode is trivially malleable, the MAC becomes even more important. In CTR mode, blowfish_decrypt() is never called. In fact, Blowpipe never uses it.
The authentication scheme is Blowfish-CBC-MAC with a unique key and encrypt-then-authenticate (something I harmlessly got wrong with Enchive). It essentially encrypts the ciphertext again with a different key, this time in Cipher Block Chaining (CBC) mode, but it only saves the final block. The final block is prepended to the ciphertext as the MAC. On decryption the same block is computed again to ensure that it matches. Only someone who knows the MAC key can compute it.
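Counter mode is simple enough to sketch in a few lines. The following is an illustration under stated assumptions, not Blowpipe’s code: toy_encrypt() is a made-up placeholder standing in for blowfish_encrypt() and is not a real cipher, and the starting counter value and the big endian keystream serialization are assumptions. The point is that one function covers both directions and that the last block may be cut short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for blowfish_encrypt(): a tiny keyed Feistel
 * scramble of one 64-bit block. It is NOT a real cipher. */
static void
toy_encrypt(uint32_t key, uint32_t *l, uint32_t *r)
{
    for (uint32_t i = 0; i < 8; i++) {
        uint32_t t = *r ^ ((*l + key + i) ^ ((*l << 5) | (*l >> 27)));
        *r = *l;
        *l = t;
    }
}

/* XOR buf with the keystream starting at the given counter value. The
 * same call both encrypts and decrypts; no decrypt function exists. */
static void
ctr_xor(uint32_t key, uint64_t counter, unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint32_t half[2];
        half[0] = (uint32_t)(counter >> 32);
        half[1] = (uint32_t)counter;
        toy_encrypt(key, &half[0], &half[1]);
        counter++;
        /* Serialize the keystream block big endian, stopping early on
         * the last, possibly partial, block. */
        for (int j = 0; j < 8 && i < len; j++, i++)
            buf[i] ^= (unsigned char)(half[j / 4] >> (24 - 8 * (j % 4)));
    }
}
```

Applying ctr_xor() twice with the same key and counter restores the input, which is also why CTR mode is malleable: flipping a ciphertext bit flips the same plaintext bit.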

Of all three Blowfish uses, I’m least confident about authentication. CBC-MAC is tricky to get right, though I am following the rules: fixed length messages using a different key than encryption.

Wait a minute. Blowpipe is pipe-oriented and can output data without buffering the entire pipe. How can there be fixed-length messages?

The pipe datastream is broken into 64kB chunks. Each chunk is authenticated with its own MAC. Both the MAC and chunk length are written in the chunk header, and the length is authenticated by the MAC. Furthermore, just like the keystream, the MAC is continued from the previous chunk, preventing chunks from being reordered. Blowpipe can output the content of a chunk and discard it once it’s been authenticated. If any chunk fails to authenticate, it aborts.
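Here’s a rough sketch of a chained CBC-MAC. It is not Blowpipe’s implementation: toy_encrypt() is a made-up placeholder for blowfish_encrypt() under the MAC key, the helper names are hypothetical, and how Blowpipe actually feeds the chunk length into the MAC is not shown. What it does show is the running state being carried from one chunk to the next, so each MAC depends on everything that came before.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for blowfish_encrypt(). NOT a real cipher. */
static void
toy_encrypt(uint32_t key, uint32_t *l, uint32_t *r)
{
    for (uint32_t i = 0; i < 8; i++) {
        uint32_t t = *r ^ ((*l + key + i) ^ ((*l << 5) | (*l >> 27)));
        *r = *l;
        *l = t;
    }
}

static uint32_t
load_u32be(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3];
}

/* Fold nblocks 8-byte blocks into the running MAC state. The state
 * starts at zero and is carried into the next chunk rather than
 * being reset, so swapping two chunks changes every later MAC. */
static void
cbc_mac(uint32_t key, uint32_t state[2],
        const unsigned char *msg, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i++) {
        state[0] ^= load_u32be(msg + i * 8);
        state[1] ^= load_u32be(msg + i * 8 + 4);
        toy_encrypt(key, &state[0], &state[1]);
    }
}
```

Only the final state is kept as the MAC; the intermediate blocks are thrown away.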

This also leads to another useful trick: The pipe is terminated with a zero length chunk, preventing an attacker from appending to the datastream. Everything after the zero-length chunk is discarded. Since the length is authenticated by the MAC, the attacker also cannot truncate the pipe since that would require knowledge of the MAC key.

The pipe itself has a 17 byte header: a 16 byte random bcrypt salt and 1 byte for the bcrypt cost. The salt is like an initialization vector (IV) that allows keys to be safely reused in different Blowpipe instances. The cost byte is the only distinguishing byte in the stream. Since even the chunk lengths are encrypted, everything else in the datastream should be indistinguishable from random data.
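Parsing such a header is trivial. A minimal sketch, assuming the salt comes first and the cost byte last (the order they’re listed in above); the struct and function names are hypothetical:

```c
#include <string.h>

#define SALT_LEN   16
#define HEADER_LEN (SALT_LEN + 1)

/* Hypothetical in-memory form of the 17 byte pipe header. */
struct pipe_header {
    unsigned char salt[SALT_LEN];  /* random bcrypt salt */
    int cost;                      /* bcrypt cost */
};

/* buf must hold at least HEADER_LEN bytes. Field order is assumed. */
static void
header_parse(struct pipe_header *h, const unsigned char *buf)
{
    memcpy(h->salt, buf, SALT_LEN);
    h->cost = buf[SALT_LEN];
}
```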

Portability

Blowpipe runs on POSIX systems and Windows (Mingw-w64 and MSVC). I initially wrote it for POSIX (on Linux) of course, but I took an unusual approach when it came time to port it to Windows. Normally I’d invent a generic OS interface that makes the appropriate host system calls. This time I kept the POSIX interface (read(2), write(2), open(2), etc.) and implemented the tiny subset of POSIX that I needed in terms of Win32. That implementation can be found under w32-compat/. I even dropped in a copy of my own getopt().

One really cool feature of this technique is that, on Windows, Blowpipe will still “open” /dev/urandom. It’s intercepted by my own open(2), which in response to that filename actually calls CryptAcquireContext() and pretends like it’s a file. It’s all hidden behind the file descriptor. That’s the unix way.

I’m considering giving Enchive the same treatment since it would simplify and reduce much of the interface code. In fact, this project has taught me a number of ways that Enchive could be improved. That’s the value of writing “toys” such as Blowpipe.

A Tutorial on Portable Makefiles

2017-08-20T03:03:51Z

In my first decade writing Makefiles, I developed the bad habit of liberally using GNU Make’s extensions. I didn’t know the line between GNU Make and the portable features guaranteed by POSIX. Usually it didn’t matter much, but it would become an annoyance when building on non-Linux systems, such as on the various BSDs. I’d have to specifically install GNU Make, then remember to invoke it (i.e. as gmake) instead of the system’s make.

I’ve since become familiar and comfortable with make’s official specification, and I’ve spent the last year writing strictly portable Makefiles. Not only are my builds now portable across all unix-like systems, my Makefiles are cleaner and more robust. Many of the common make extensions, conditionals in particular, lead to fragile, complicated Makefiles and are best avoided anyway. It’s important to be able to trust your build system to do its job correctly.

This tutorial should be suitable for make beginners who have never written their own Makefiles before, as well as experienced developers who want to learn how to write portable Makefiles. Regardless, in order to understand the examples you must be familiar with the usual steps for building programs on the command line (compiler, linker, object files, etc.). I’m not going to suggest any fancy tricks nor provide any sort of standard starting template. Makefiles should be dead simple when the project is small, and grow in a predictable, clean fashion alongside the project.

I’m not going to cover every feature. You’ll need to read the specification for yourself to learn it all. This tutorial will go over the important features as well as the common conventions. It’s important to follow established conventions so that people using your Makefiles will know what to expect and how to accomplish the basic tasks.

If you’re running Debian, or a Debian derivative such as Ubuntu, the bmake and freebsd-buildutils packages will provide the bmake and fmake programs respectively. These alternative make implementations are very useful for testing your Makefiles’ portability, should you accidentally make use of a GNU Make feature. It’s not perfect since each implements some of the same extensions as GNU Make, but it will catch some common mistakes.

What’s in a Makefile?

I am free, no matter what rules surround me. If I find them tolerable, I tolerate them; if I find them too obnoxious, I break them. I am free because I know that I alone am morally responsible for everything I do. ―Robert A. Heinlein

At make’s core are one or more dependency trees, constructed from rules. Each vertex in the tree is called a target. The final products of the build (executable, document, etc.) are the tree roots. A Makefile specifies the dependency trees and supplies the shell commands to produce a target from its prerequisites.

In this illustration, the “.c” files are source files that are written by hand, not generated by commands, so they have no prerequisites. The syntax for specifying one or more edges in this dependency tree is simple:

target [target...]: [prerequisite...]

While technically multiple targets can be specified in a single rule, this is unusual. Typically each target is specified in its own rule. To specify the tree in the illustration above:

game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

The order of these rules doesn’t matter. The entire Makefile is parsed before any actions are taken, so the tree’s vertices and edges can be specified in any order. There’s one exception: the first non-special target in a Makefile is the default target. This target is selected implicitly when make is invoked without choosing a target. It should be something sensible, so that a user can blindly run make and get a useful result.
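To make the tree idea concrete, here’s a toy model in C of how a build order could be derived from it. This is only a sketch of the concept, not how any particular make is implemented: the struct, the fixed prerequisite limit, and the function name are all hypothetical, and timestamps are ignored entirely.

```c
#include <stddef.h>

#define MAX_PREREQS 4

/* Hypothetical model of one vertex in the dependency tree. */
struct target {
    const char *name;
    struct target *prereqs[MAX_PREREQS];  /* unused slots are NULL */
    int visited;
};

/* Post-order walk: every prerequisite is appended to order[] before
 * the target that depends on it. Returns the new entry count. */
static size_t
plan(struct target *t, const char **order, size_t n)
{
    if (t->visited)
        return n;
    t->visited = 1;
    for (int i = 0; i < MAX_PREREQS && t->prereqs[i]; i++)
        n = plan(t->prereqs[i], order, n);
    order[n++] = t->name;
    return n;
}
```

Since the walk starts from the requested target and follows edges, it doesn’t matter in which order the vertices were declared.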

A target can be specified more than once. Any new prerequisites are appended to the previously-given prerequisites. For example, this Makefile is identical to the previous, though it’s typically not written this way:

game: graphics.o
game: physics.o
game: input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

There are six special targets that are used to change the behavior of make itself. All have uppercase names and start with a period. Names fitting this pattern are reserved for use by make. According to the standard, in order to get reliable POSIX behavior, the first non-comment line of the Makefile must be .POSIX. Since this is a special target, it’s not a candidate for the default target, so game will remain the default target:

.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

In practice, even a simple program will have header files, and sources that include a header file should also have an edge on the dependency tree for it. If the header file changes, targets that include it should also be rebuilt.

.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h

Adding commands to rules

We’ve constructed a dependency tree, but we still haven’t told make how to actually build any targets from its prerequisites. The rules also need to specify the shell commands that produce a target from its prerequisites.

If you were to create the source files in the example and invoke make, you would find that it actually does know how to build the object files. This is because make is initially configured with certain inference rules, a topic which will be covered later. For now, we’ll add the .SUFFIXES special target to the top, erasing all the built-in inference rules.

Commands immediately follow the target/prerequisite line in a rule. Each command line must start with a tab character. This can be awkward if your text editor isn’t configured for it, and it will be awkward if you try to copy the examples from this page.

Each line is run in its own shell, so be mindful of using commands like cd, which won’t affect later lines.

The simplest thing to do is literally specify the same commands you’d type at the shell:

.POSIX:
.SUFFIXES:
game: graphics.o physics.o input.o
    cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
    cc -c graphics.c
physics.o: physics.c physics.h
    cc -c physics.c
input.o: input.c input.h graphics.h physics.h
    cc -c input.c

Invoking make and choosing targets

I tried to walk into Target, but I missed. ―Mitch Hedberg

When invoked, make accepts zero or more targets from the dependency tree, and it will build these targets (i.e. run the commands in the target’s rule) if the target is out-of-date. A target is out-of-date if it doesn’t exist or is older than any of its prerequisites.

# build the "game" binary (default target)
$ make

# build just the object files
$ make graphics.o physics.o input.o

This effect cascades up the dependency tree and causes further targets to be rebuilt until all of the requested targets are up-to-date. There’s a lot of room for parallelism since different branches of the tree can be updated independently. It’s common for make implementations to support parallel builds with the -j option. This is non-standard, but it’s a fantastic feature that doesn’t require anything special in the Makefile to work correctly.
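The out-of-date test itself is just a comparison of file modification times. Here’s a rough C sketch of the idea using stat(2). The function name and return convention are hypothetical, it compares at whole-second resolution only, and a real make does more (for instance, it would try to build a missing prerequisite first).

```c
#include <sys/stat.h>

/* Returns 1 if target is missing or older than any prerequisite, 0 if
 * it is up-to-date, and -1 if a prerequisite doesn't exist. */
static int
out_of_date(const char *target, const char **prereqs, int n)
{
    struct stat t, p;
    int missing = stat(target, &t) != 0;
    for (int i = 0; i < n; i++) {
        if (stat(prereqs[i], &p) != 0)
            return -1;
        if (!missing && t.st_mtime < p.st_mtime)
            return 1;
    }
    return missing;
}
```

For the example Makefile, checking game would mean calling this with the three object files as prerequisites, after first applying the same test to each object file.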

Similar to parallel builds is make’s -k (“keep going”) option, which is standard. This tells make not to stop on the first error, and to continue updating targets that are unaffected by the error. This is nice for fully populating Vim’s quickfix list or Emacs’ compilation buffer.

It’s common to have multiple targets that should be built by default. If the first rule selects the default target, how do we solve the problem of needing multiple default targets? The convention is to use phony targets. These are called “phony” because there is no corresponding file, and so phony targets are never up-to-date. It’s convention for a phony “all” target to be the default target.

I’ll make game a prerequisite of a new “all” target. More real targets could be added as necessary to turn them into defaults. Users of this Makefile will also expect make all to build the entire project.

Another common phony target is “clean” which removes all of the built files. Users will expect make clean to delete all generated files.

.POSIX:
.SUFFIXES:
all: game
game: graphics.o physics.o input.o
    cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
    cc -c graphics.c
physics.o: physics.c physics.h
    cc -c physics.c
input.o: input.c input.h graphics.h physics.h
    cc -c input.c
clean:
    rm -f game graphics.o physics.o input.o

Customize the build with macros

So far the Makefile hardcodes cc as the compiler, and doesn’t use any compiler flags (warnings, optimization, hardening, etc.). The user should be able to easily control all these things, but right now they’d have to edit the entire Makefile to do so. Perhaps the user has both gcc and clang installed, and wants to choose one or the other without changing which is installed as cc.

To solve this, make has macros that expand into strings when referenced. The convention is to use the macro named CC when talking about the C compiler, CFLAGS when talking about flags passed to the C compiler, LDFLAGS for flags passed to the C compiler when linking, and LDLIBS for flags about libraries when linking. The Makefile should supply defaults as needed.

A macro is expanded with $(...). It’s valid (and normal) to reference a macro that hasn’t been defined, which will be an empty string. This will be the case with LDFLAGS below.

Macro values can contain other macros, which will be expanded recursively each time the macro is expanded. Some make implementations allow the name of the macro being expanded to itself be a macro, which makes macro expansion Turing complete, but this behavior is non-standard.

.POSIX:
.SUFFIXES:
CC     = cc
CFLAGS = -W -O
LDLIBS = -lm

all: game
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
    $(CC) -c $(CFLAGS) graphics.c
physics.o: physics.c physics.h
    $(CC) -c $(CFLAGS) physics.c
input.o: input.c input.h graphics.h physics.h
    $(CC) -c $(CFLAGS) input.c
clean:
    rm -f game graphics.o physics.o input.o

Macros are overridden by macro definitions given as command line arguments in the form name=value. This allows the user to select their own build configuration. This is one of make’s most powerful and under-appreciated features.

$ make CC=clang CFLAGS='-O3 -march=native'

If the user doesn’t want to specify these macros on every invocation, they can (cautiously) use make’s -e flag to set overriding macro definitions from the environment.

$ export CC=clang
$ export CFLAGS=-O3
$ make -e all

Some make implementations have other special kinds of macro assignment operators beyond simple assignment (=). These are unnecessary, so don’t worry about them.

Inference rules so that you can stop repeating yourself

The road itself tells us far more than signs do. ―Tom Vanderbilt, Traffic: Why We Drive the Way We Do

There’s repetition across the three different object files. Wouldn’t it be nice if there was a way to communicate this pattern? Fortunately there is, in the form of inference rules. An inference rule says that a target with a certain extension, with a prerequisite with another certain extension, is built a certain way. This will make more sense with an example.

In an inference rule, the target indicates the extensions. The $< macro expands to the prerequisite, which is essential to making inference rules work generically. Unfortunately this macro is not available in target rules, as much as that would be useful.

For example, here’s an inference rule that teaches make how to build an object file from a C source file. This particular rule is one that is pre-defined by make, so you’ll never need to write this one yourself. I’ll include it for completeness.

.c.o:
    $(CC) $(CFLAGS) -c $<

These extensions must be added to .SUFFIXES before they will work. With that, the commands for the rules about object files can be omitted.

.POSIX:
.SUFFIXES:
CC     = cc
CFLAGS = -W -O
LDLIBS = -lm

all: game
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
    rm -f game graphics.o physics.o input.o

.SUFFIXES: .c .o
.c.o:
    $(CC) $(CFLAGS) -c $<

The first empty .SUFFIXES clears the suffix list. The second one adds .c and .o to the now-empty suffix list.
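Conceptually, applying a .c.o rule amounts to swapping one suffix for another to find the prerequisite. The following C sketch models only that naming step. It is an illustration with a hypothetical function name, not a description of how make is implemented, and it skips the check that the inferred file actually exists.

```c
#include <stddef.h>
#include <string.h>

/* Given a target and a suffix pair such as (".c", ".o"), write the
 * inferred prerequisite name into out. Returns 0 on success, or -1 if
 * the target doesn't end in `to` or out is too small. */
static int
infer_prereq(const char *target, const char *from, const char *to,
             char *out, size_t outlen)
{
    size_t tlen = strlen(target);
    size_t slen = strlen(to);
    if (tlen <= slen || strcmp(target + tlen - slen, to) != 0)
        return -1;
    size_t stem = tlen - slen;
    if (stem + strlen(from) + 1 > outlen)
        return -1;
    memcpy(out, target, stem);
    strcpy(out + stem, from);
    return 0;
}
```

With (".c", ".o"), the target input.o maps to input.c, and a target like game matches nothing, so its rule needs explicit commands.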

Other target conventions

Conventions are, indeed, all that shield us from the shivering void, though often they do so but poorly and desperately. ―Robert Aickman

Users usually expect an “install” target that installs the built program, libraries, man pages, etc. By convention this target should use the PREFIX and DESTDIR macros.

The PREFIX macro should default to /usr/local, and since it’s a macro the user can override it to install elsewhere, such as in their home directory. The user should override it for both building and installing, since the prefix may need to be built into the binary (e.g. -DPREFIX=$(PREFIX)).
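On the C side, a bare -DPREFIX=/usr/local defines PREFIX as a sequence of tokens, not a string literal, so one way to use it is stringification. This is a sketch of that approach; the STR macros, the share/game path, and the function name are hypothetical and only for illustration.

```c
/* Fallback matching the Makefile default when -DPREFIX=... is absent. */
#ifndef PREFIX
#  define PREFIX /usr/local
#endif

/* Two-level stringification: PREFIX is expanded first, then quoted. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* "share/game" is a hypothetical data directory. Adjacent string
 * literals are concatenated at compile time. */
static const char datadir[] = STR(PREFIX) "/share/game";

static const char *
game_datadir(void)
{
    return datadir;
}
```

One caveat with this approach: any path component that happens to be a defined macro name gets expanded before quoting, so passing an already-quoted string from the Makefile may be the more robust choice.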

The DESTDIR macro is used for staged builds, so that the program gets installed under a fake root directory for the sake of packaging. Unlike PREFIX, the program will not actually be run from this directory.

.POSIX:
CC     = cc
CFLAGS = -W -O
LDLIBS = -lm
PREFIX = /usr/local

all: game
install: game
    mkdir -p $(DESTDIR)$(PREFIX)/bin
    mkdir -p $(DESTDIR)$(PREFIX)/share/man/man1
    cp -f game $(DESTDIR)$(PREFIX)/bin
    gzip < game.1 > $(DESTDIR)$(PREFIX)/share/man/man1/game.1.gz
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
    rm -f game graphics.o physics.o input.o

You may also want to provide an “uninstall” phony target that does the opposite.

make PREFIX=$HOME/.local install

Other common targets are “mostlyclean” (like “clean” but don’t delete some slow-to-build targets), “distclean” (delete even more than “clean”), “test” or “check” (run the test suite), and “dist” (create a package).

Complexity and growing pains

One of make’s big weak points is scaling up as a project grows in size.

Recursive Makefiles

As your growing project is broken into subdirectories, you may be tempted to put a Makefile in each subdirectory and invoke them recursively.

Don’t use recursive Makefiles. It breaks the dependency tree across separate instances of make and typically results in a fragile build. There’s nothing good about it. Have one Makefile at the root of your project and invoke make there. You may have to teach your text editor how to do this.

When talking about files in subdirectories, just include the subdirectory in the name. Everything will work the same as far as make is concerned, including inference rules.

src/graphics.o: src/graphics.c
src/physics.o: src/physics.c
src/input.o: src/input.c

Out-of-source builds

Keeping your object files separate from your source files is a nice idea. When it comes to make, there’s good news and bad news.

The good news is that make can do this. You can pick whatever file names you like for targets and prerequisites.

obj/input.o: src/input.c

The bad news is that inference rules are not compatible with out-of-source builds. You’ll need to repeat the same commands for each rule as if inference rules didn’t exist. This is tedious for large projects, so you may want to have some sort of “configure” script, even if hand-written, to generate all this for you. This is essentially what CMake is all about. That, plus dependency management.

Dependency management

Another problem with scaling up is tracking the project’s ever-changing dependencies across all the source files. Missing a dependency means the build may not be correct unless you make clean first.

If you go the route of using a script to generate the tedious parts of the Makefile, both GCC and Clang have a nice feature for generating all the Makefile dependencies for you (-MM, -MT), at least for C and C++. There are lots of tutorials for doing this dependency generation on the fly as part of the build, but it’s fragile and slow. Much better to do it all up front and “bake” the dependencies into the Makefile so that make can do its job properly. If the dependencies change, rebuild your Makefile.

For example, here’s what it looks like invoking gcc’s dependency generator against the imaginary input.c for an out-of-source build:

$ gcc $CFLAGS -MM -MT '$(BUILD)/input.o' input.c
$(BUILD)/input.o: input.c input.h graphics.h physics.h

Notice the output is in Makefile’s rule format.

Unfortunately this feature strips the leading paths from the target, so, in practice, using it is always more complicated than it should be (e.g. it requires the use of -MT).

Microsoft’s Nmake

Microsoft has an implementation of make called Nmake, which comes with Visual Studio. It’s nearly a POSIX-compatible make, but necessarily breaks from the standard in some places. Their cl.exe compiler uses .obj as the object file extension and .exe for binaries, both of which differ from the unix world, so it has different built-in inference rules. Windows also lacks a Bourne shell and the standard unix tools, so all of the commands will necessarily be different.

There’s no equivalent of rm -f on Windows, so good luck writing a proper “clean” target. No, del /f isn’t the same.

So while it’s close to POSIX make, it’s not practical to write a Makefile that will simultaneously work properly with both POSIX make and Nmake. These need to be separate Makefiles.

May your Makefiles be portable

It’s nice to have reliable, portable Makefiles that just work anywhere. Code to the standards and you don’t need feature tests or other sorts of special treatment.

Stack Clashing for Fun and Profit

2017-06-21T05:28:56Z

Stack clashing has been in the news lately due to some recently discovered vulnerabilities along with proof-of-concept exploits. As the announcement itself notes, this is not a new issue, though this appears to be the first time it’s been given this particular name. I do know of one “good” use of stack clashing, where it’s used for something productive rather than as part of an attack. In this article I’ll explain how it works.

You can find the complete code for this article here, ready to run:

https://github.com/skeeto/stack-clash-coroutine

But first, what is a stack clash? Here’s a rough picture of the typical way process memory is laid out. The stack starts at a high memory address and grows downwards. Code and static data sit at low memory, with a brk pointer growing upward to make small allocations. In the middle is the heap, where large allocations and memory mappings take place.

Below the stack is a slim guard page that divides the stack and the region of memory reserved for the heap. Reading or writing to that memory will trap, causing the program to crash or some special action to be taken. The goal is to prevent the stack from growing into the heap, which could cause all sorts of trouble, like security issues.

The problem is that this thin guard page isn’t enough. It’s possible to put a large allocation on the stack, never read or write to it, and completely skip over the guard page, such that the heap and stack overlap without detection.

Once this happens, writes into the heap will change memory on the stack and vice versa. If an attacker can cause the program to make such a large allocation on the stack, then legitimate writes into memory on the heap can manipulate local variables or return pointers, changing the program’s control flow. This can bypass buffer overflow protections, such as stack canaries.

Binary trees and coroutines

Now, I’m going to abruptly change topics to discuss binary search trees. We’ll get back to stack clash in a bit. Suppose we have a binary tree which we would like to iterate depth-first. For this demonstration, here’s the C interface to the binary tree.

struct tree {
    struct tree *left;
    struct tree *right;
    char *key;
    char *value;
};

void  tree_insert(struct tree **, char *k, char *v);
char *tree_find(struct tree *, char *k);
void  tree_visit(struct tree *, void (*f)(char *, char *));
void  tree_destroy(struct tree *);

An empty tree is the NULL pointer, hence the double-pointer for insert. In the demonstration it’s an unbalanced search tree, but this could very well be a balanced search tree with the addition of another field on the structure.
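For reference, here’s one way the insert and find functions could look for the unbalanced case. This is a minimal sketch and not necessarily the code in the repository: it assumes keys are ordered with strcmp(), that key and value pointers are stored rather than copied, and that inserting an existing key replaces its value.

```c
#include <stdlib.h>
#include <string.h>

struct tree {
    struct tree *left;
    struct tree *right;
    char *key;
    char *value;
};

/* Walk down to the empty slot where the key belongs. The double
 * pointer lets the same code handle the empty tree and every child. */
void
tree_insert(struct tree **t, char *k, char *v)
{
    while (*t) {
        int cmp = strcmp(k, (*t)->key);
        if (cmp == 0) {
            (*t)->value = v;  /* assumed: replace the existing value */
            return;
        }
        t = cmp < 0 ? &(*t)->left : &(*t)->right;
    }
    struct tree *node = malloc(sizeof(*node));
    if (!node)
        abort();
    node->left = NULL;
    node->right = NULL;
    node->key = k;
    node->value = v;
    *t = node;
}

/* Returns the value for k, or NULL if k is not in the tree. */
char *
tree_find(struct tree *t, char *k)
{
    while (t) {
        int cmp = strcmp(k, t->key);
        if (cmp == 0)
            return t->value;
        t = cmp < 0 ? t->left : t->right;
    }
    return NULL;
}
```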

For the traversal, first visit the root node, then traverse its left tree, and finally traverse its right tree. It makes for a simple, recursive definition — the sort of thing you’d teach a beginner. Here’s a definition that accepts a callback, which the caller will use to visit each key/value in the tree. This really is as simple as it gets.

void
tree_visit(struct tree *t, void (*f)(char *, char *))
{
    if (t) {
        f(t->key, t->value);
        tree_visit(t->left, f);
        tree_visit(t->right, f);
    }
}

Unfortunately this isn’t so convenient for the caller, who has to split off a callback function that lacks context, then hand over control to the traversal function.

void
printer(char *k, char *v)
{
    printf("%s = %s\n", k, v);
}

void
print_tree(struct tree *tree)
{
    tree_visit(tree, printer);
}

Usually it’s much nicer for the caller if instead it’s provided an iterator, which the caller can invoke at will. Here’s an interface for it, just two functions.

struct tree_it *tree_iterator(struct tree *);
int             tree_next(struct tree_it *, char **k, char **v);

The first constructs an iterator object, and the second one visits a key/value pair each time it’s called. It returns 0 when traversal is complete, automatically freeing any resources associated with the iterator.

The caller now looks like this:

    char *k, *v;
    struct tree_it *it = tree_iterator(tree);
    while (tree_next(it, &k, &v))
        printf("%s = %s\n", k, v);

Notice I haven’t defined struct tree_it. That’s because I’ve got four different implementations, each taking a different approach. The last one will use stack clashing.

Manual State Tracking

With just the standard facilities provided by C, there’s some manual bookkeeping that has to take place in order to convert the recursive definition into an iterator. Depth-first traversal is a stack-oriented process, and with recursion the stack is implicit in the call stack. As an iterator, the traversal stack needs to be managed explicitly. The iterator needs to keep track of the path it took so that it can backtrack, which means keeping track of parent nodes as well as which branch was taken.

Here’s my little implementation, which, to keep things simple, has a hard depth limit of 32. Its structure definition includes a stack of node pointers, and 2 bits of information per visited node, stored across a 64-bit integer.

struct tree_it {
    struct tree *stack[32];
    unsigned long long state;
    int nstack;
};

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->stack[0] = t;
    it->state = 0;
    it->nstack = 1;
    return it;
}

The 2 bits track three different states for each visited node:

Visit the current node
Traverse the left tree
Traverse the right tree

It works out to the following. Don’t worry too much about trying to understand how this works. My point is to demonstrate that converting the recursive definition into an iterator complicates the implementation.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    while (it->nstack) {
        int shift = (it->nstack - 1) * 2;
        int state = 3u & (it->state >> shift);
        struct tree *t = it->stack[it->nstack - 1];
        it->state += 1ull << shift;
        switch (state) {
            case 0:
                *k = t->key;
                *v = t->value;
                if (t->left) {
                    it->stack[it->nstack++] = t->left;
                    it->state &= ~(3ull << (shift + 2));
                }
                return 1;
            case 1:
                if (t->right) {
                    it->stack[it->nstack++] = t->right;
                    it->state &= ~(3ull << (shift + 2));
                }
                break;
            case 2:
                it->nstack--;
                break;
        }
    }
    free(it);
    return 0;
}

Wouldn’t it be nice to keep both the recursive definition while also getting an iterator? There’s an exact solution to that: coroutines.

Coroutines

C doesn’t come with coroutines, but there are a number of libraries available. We can also build our own coroutines. One way to do that is with user contexts (<ucontext.h>) provided by the X/Open System Interfaces Extension (XSI), an extension to POSIX. This set of functions allows programs to create their own call stacks and switch between them. That’s the key ingredient for coroutines. Caveat: These functions aren’t widely available, and probably shouldn’t be used in new code.

Here’s my iterator structure definition.

#define _XOPEN_SOURCE 600
#include <ucontext.h>

struct tree_it {
    char *k;
    char *v;
    ucontext_t coroutine;
    ucontext_t yield;
};

It needs one context for the original stack and one context for the iterator’s stack. Each time the iterator is invoked, the program will switch to the other stack, find the next value, then switch back. This process is called yielding. Values are passed between contexts using the k (key) and v (value) fields on the iterator.

Before I get into initialization, here’s the actual traversal coroutine. It’s nearly the same as the original recursive definition except for the swapcontext(). This is the yield, pausing execution and sending control back to the caller. The current context is saved in the first argument, and the second argument becomes the current context.

static void
coroutine(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        swapcontext(&it->coroutine, &it->yield);
        coroutine(t->left, it);
        coroutine(t->right, it);
    }
}

While the actual traversal is simple again, initialization is more complicated. The first problem is that there’s no way to pass pointer arguments to the coroutine. Technically only int arguments are permitted. (All the online tutorials get this wrong.) To work around this problem, I smuggle the arguments in as global variables. This would cause problems should two different threads try to create iterators at the same time, even on different trees.

static struct tree *tree_arg;
static struct tree_it *tree_it_arg;

static void
coroutine_init(void)
{
    coroutine(tree_arg, tree_it_arg);
}

The stack has to be allocated manually, which I do with a call to malloc(). Nothing fancy is needed, though this means the new stack won’t have a guard page. For the stack size, I use the suggested value of SIGSTKSZ. The makecontext() function is what creates the new context from scratch, but the new context must first be initialized with getcontext(), even though that particular snapshot won’t actually be used.

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    getcontext(&it->coroutine);  // initialize first, then customize
    it->coroutine.uc_stack.ss_sp = malloc(SIGSTKSZ);
    it->coroutine.uc_stack.ss_size = SIGSTKSZ;
    it->coroutine.uc_link = &it->yield;
    makecontext(&it->coroutine, coroutine_init, 0);
    tree_arg = t;
    tree_it_arg = it;
    return it;
}

Notice I gave it a function pointer, a lot like I’m starting a new thread. This is no coincidence. There’s a lot of similarity between coroutines and multiple threads, as you’ll soon see.

Finally the iterator function itself. Since NULL isn’t a valid key, it initializes the key to NULL before yielding to the iterator context. If the iterator has no more nodes to visit, it doesn’t set the key, which can be detected when control returns.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    swapcontext(&it->yield, &it->coroutine);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        free(it->coroutine.uc_stack.ss_sp);
        free(it);
        return 0;
    }
}

That’s all it takes to create and operate a coroutine in C, provided you’re on a system with these XSI extensions.
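To see the moving parts in isolation, here’s a minimal sketch of the same yield pattern driving a counter instead of a tree. The names counter, counter_init(), and counter_next(), along with the 64 KiB stack size, are assumptions made up for this illustration. The argument is smuggled through a global as before, so it has the same threading caveat.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define COUNTER_STACK_SIZE (64 * 1024)  /* assumed size for this sketch */

struct counter {
    int value;             /* most recently yielded value */
    int limit;             /* yields 0 .. limit-1 */
    int done;              /* set once the coroutine has finished */
    void *stack;           /* malloc()ed coroutine stack */
    ucontext_t coroutine;
    ucontext_t yield;
};

static struct counter *counter_arg;  /* smuggled argument (not thread-safe) */

static void
counter_body(void)
{
    struct counter *c = counter_arg;
    for (int i = 0; i < c->limit; i++) {
        c->value = i;
        swapcontext(&c->coroutine, &c->yield);  /* yield to the caller */
    }
    c->done = 1;  /* returning resumes uc_link, i.e. c->yield */
}

static int
counter_init(struct counter *c, int limit)
{
    assert(limit >= 0);
    c->limit = limit;
    c->done = 0;
    c->stack = malloc(COUNTER_STACK_SIZE);
    if (!c->stack)
        return 0;
    if (getcontext(&c->coroutine) == -1) {
        free(c->stack);
        return 0;
    }
    c->coroutine.uc_stack.ss_sp = c->stack;
    c->coroutine.uc_stack.ss_size = COUNTER_STACK_SIZE;
    c->coroutine.uc_link = &c->yield;
    makecontext(&c->coroutine, counter_body, 0);
    counter_arg = c;
    return 1;
}

/* Stores the next value in *out and returns 1, or returns 0 when done. */
static int
counter_next(struct counter *c, int *out)
{
    if (!c->done)
        swapcontext(&c->yield, &c->coroutine);
    if (c->done) {
        free(c->stack);  /* free(NULL) is a no-op on repeat calls */
        c->stack = 0;
        return 0;
    }
    *out = c->value;
    return 1;
}
```

A caller would run counter_init() once and then loop on counter_next() until it returns 0, just like tree_next(). The struct must not move after initialization because uc_link points into it.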

Semaphores

Instead of a coroutine, we could just use actual threads and a couple of semaphores to synchronize them. This is a heavy implementation and also probably shouldn’t be used in practice, but at least it’s fully portable.

Here’s the structure definition:

struct tree_it {
    struct tree *t;
    char *k;
    char *v;
    sem_t visitor;
    sem_t main;
    pthread_t thread;
};

The main thread will wait on one semaphore and the iterator thread will wait on the other. This should sound very familiar.

The actual traversal function looks the same, but with sem_post() and sem_wait() as the yield.

static void
visit(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        sem_post(&it->main);
        sem_wait(&it->visitor);
        visit(t->left, it);
        visit(t->right, it);
    }
}

There’s a separate function to initialize the iterator context again.

static void *
thread_entrance(void *arg)
{
    struct tree_it *it = arg;
    sem_wait(&it->visitor);
    visit(it->t, it);
    sem_post(&it->main);
    return 0;
}

Creating the iterator only requires initializing the semaphores and creating the thread:

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->t = t;
    sem_init(&it->visitor, 0, 0);
    sem_init(&it->main, 0, 0);
    pthread_create(&it->thread, 0, thread_entrance, it);
    return it;
}

The iterator function looks just like the coroutine version.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    sem_post(&it->visitor);
    sem_wait(&it->main);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        pthread_join(it->thread, 0);
        sem_destroy(&it->main);
        sem_destroy(&it->visitor);
        free(it);
        return 0;
    }
}

Overall, this is almost identical to the coroutine version.

Coroutines using stack clashing

Finally I can tie this back into the topic at hand. Without either XSI extensions or Pthreads, we can (usually) create coroutines by abusing setjmp() and longjmp(). Technically this violates two of C’s rules and relies on undefined behavior, but it generally works. This is not my own invention, and it dates back to at least 2010.

From the very beginning, C has provided a crude “exception” mechanism that allows the stack to be abruptly unwound back to a previous state. It’s a sort of non-local goto. Call setjmp() to capture an opaque jmp_buf object to be used in the future. This function returns 0 this first time. Hand that value to longjmp() later, even in a different function, and setjmp() will return again, this time with a non-zero value.
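For reference, here’s a small sketch of the well-defined, one-way use of these functions: bailing out of a nested call on error. The names on_error, digit_value(), and digit_sum() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf on_error;

/* Deep in the call stack: bail out on anything that isn't a digit. */
static int
digit_value(int c)
{
    if (c < '0' || c > '9')
        longjmp(on_error, 1);  /* setjmp() below returns 1 */
    return c - '0';
}

/* Returns the sum of the digits in s, or -1 if s holds a non-digit. */
static int
digit_sum(const char *s)
{
    assert(s);
    if (setjmp(on_error))
        return -1;             /* arrived here via longjmp() */
    int sum = 0;
    for (; *s; s++)
        sum += digit_value(*s);
    return sum;
}
```

The jump only ever goes up the stack to a frame that’s still alive, which is the case the standard actually supports.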

It’s technically unsuitable for coroutines because the jump is a one-way trip. The unwound stack invalidates any jmp_buf that was created after the target of the jump. In practice, though, you can still use these jumps, which is one rule being broken.

That’s where stack clashing comes into play. In order for it to be a proper coroutine, it needs to have its own stack. But how can we do that with these primitive C utilities? Extend the stack to overlap the heap, call setjmp() to capture a coroutine on it, then return. Generally we can get away with using longjmp() to return to this heap-allocated stack.

Here’s my iterator definition for this one. Like the XSI context struct, this has two jmp_buf “contexts.” The stack field holds the iterator’s stack buffer so that it can be freed, and the gap field will be used to prevent the optimizer from spoiling our plans.

struct tree_it {
    char *k;
    char *v;
    char *stack;
    volatile char *gap;
    jmp_buf coroutine;
    jmp_buf yield;
};

The coroutine looks familiar again. This time the yield is performed with setjmp() and longjmp(), just like swapcontext(). Remember that setjmp() returns twice, hence the branch. The longjmp() never returns.

static void
coroutine(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        if (!setjmp(it->coroutine))
            longjmp(it->yield, 1);
        coroutine(t->left, it);
        coroutine(t->right, it);
    }
}

Next is the tricky part to cause the stack clash. First, allocate the new stack with malloc() so that we can get its address. Then use a local variable on the stack to determine how much the stack needs to grow in order to overlap with the allocation. Taking the difference between these pointers is illegal as far as the language is concerned, making this the second rule I’m breaking. I can imagine an implementation where the stack and heap are in two separate kinds of memory, and it would be meaningless to take the difference. I don’t actually have to imagine very hard, because this is actually how it used to work on the 8086 with its segmented memory architecture.

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->stack = malloc(STACK_SIZE);
    char marker;
    char gap[&marker - it->stack - STACK_SIZE];
    it->gap = gap; // prevent optimization
    if (!setjmp(it->yield))
        coroutine_init(t, it);
    return it;
}

I’m using a variable-length array (VLA) named gap to indirectly control the stack pointer, moving it over the heap. I’m assuming the stack grows downward, since otherwise the sign would be wrong.

The compiler is smart and will notice I’m not actually using gap, and it’s happy to throw it away. In fact, it’s vitally important that I don’t touch it since the guard page, along with a bunch of unmapped memory, is actually somewhere in the middle of that array. I only want the array for its side effect, but that side effect isn’t officially supported, which means the optimizer doesn’t need to consider it in its decisions. To inhibit the optimizer, I store the array’s address where someone might potentially look at it, meaning the array has to exist.

Finally, the iterator function looks just like the others, again.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    if (!setjmp(it->yield))
        longjmp(it->coroutine, 1);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        free(it->stack);
        free(it);
        return 0;
    }
}

And that’s it: a nasty hack using a stack clash to create a context for a setjmp()+longjmp() coroutine.

How to Write Portable C Without Complicating Your Build

2017-03-30T04:06:58Z

Suppose you’re writing a non-GUI C application intended to run on a number of operating systems: Linux, the various BSDs, macOS, classical unix, and perhaps even something as exotic as Windows. It might sound like a rather complicated problem. These operating systems have slightly different interfaces (or very different in one case), and they run different variants of the standard unix tools — a problem for portable builds.

With some up-front attention to detail, this is actually not terribly difficult. Unix-like systems are probably the least diverse and least buggy they’ve ever been. Writing portable code is really just a matter of coding to the standards and ignoring extensions unless absolutely necessary. Knowing what’s standard and what’s extension is the tricky part, but I’ll explain how to find this information.

You might be tempted to reach for an overly complicated solution such as GNU Autoconf. Sure, it creates a configure script with the familiar, conventional interface. This has real value. But do you really need to run a single-threaded gauntlet of hundreds of feature/bug tests for things that sometimes worked incorrectly in some weird unix variant back in the 1990s? On a machine with many cores (parallel build, -j), this may very well be the slowest part of the whole build process.

For example, the configure script for Emacs checks that the compiler supplies stdlib.h, string.h, and getenv — things that were standardized nearly 30 years ago. It also checks for a slew of POSIX functions that have been standard since 2001.

There’s a much easier solution: Document that the application requires, say, C99 and POSIX.1-2001. It’s the responsibility of the person building the application to supply these implementations, so there’s no reason to waste time testing for it.

How to code to the standards

Suppose there’s some function you want to use, but you’re not sure if it’s standard or an extension. Or maybe you don’t know what standard it comes from. Luckily the man pages document this stuff very well, especially on Linux. Check the friendly “CONFORMING TO” section. For example, look at getenv(3). Here’s what that section has to say:

CONFORMING TO
    getenv(): SVr4, POSIX.1-2001, 4.3BSD, C89, C99.

    secure_getenv() is a GNU extension.

This says this function comes from the original C standard. It’s always available on anything that claims to be a C implementation. The man page also documents secure_getenv(), which is a GNU extension: to be avoided in anything intended to be portable.

What about sleep(3)?

CONFORMING TO
    POSIX.1-2001.

This function isn’t part of standard C, but it’s available on any system claiming to implement POSIX.1-2001 (the POSIX standard from 2001). If the program needs to run on an operating system not implementing this POSIX standard (e.g. Windows), you’ll need to call an alternative function, probably inside a different #if .. #endif branch. More on this in a moment.

If you’re coding to POSIX, you must define the _POSIX_C_SOURCE feature test macro to the standard you intend to use prior to any system header includes:

A POSIX-conforming application should ensure that the feature test macro _POSIX_C_SOURCE is defined before inclusion of any header.

For example, to properly access POSIX.1-2001 functions in your application, define _POSIX_C_SOURCE to 200112L. With this defined, it’s safe to assume access to all of C and everything from that standard of POSIX. You can do this at the top of your sources, but I personally like the tidiness of a global config.h that gets included before everything.
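As a rough sketch of what that looks like in a single source file, the define below stands in for a hypothetical config.h. The posix_version() helper and the #error check are illustrative additions, not something the standard requires you to write.

```c
/* Stand-in for a hypothetical config.h: this line must precede every
 * system header so that the headers expose POSIX.1-2001 declarations. */
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <unistd.h>

#if !defined(_POSIX_VERSION) || _POSIX_VERSION < 200112L
#  error "this program requires POSIX.1-2001 or later"
#endif

/* Runtime counterpart of the compile-time check above. */
static long
posix_version(void)
{
    long v = sysconf(_SC_VERSION);
    assert(v >= 200112L);
    return v;
}
```

The compile-time check fails loudly on a system that doesn’t meet the documented requirement, which is about all the “feature testing” this approach needs.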

How to create a portable build

So you’ve written clean, portable C to the standards. How do you build this application? The natural choice is make. It’s available everywhere and it’s part of POSIX.

Again, the tricky part is teasing apart the standard from the extension. I’m a long-time sinner in this regard, having far too often written Makefiles that depend on GNU Make extensions. This is a real pain when building programs on systems without the GNU utilities. I’ve been making amends (and finding some bugs as a result).

No implementation makes the division clear in its documentation, and especially don’t bother looking at the GNU Make manual. Your best resource is the standard itself. If you’re already familiar with make, coding to the standard is largely a matter of unlearning the various extensions you know.

Outside of some hacks, this means you don’t get conditionals (if, else, etc.). With some practice, both with sticking to portable code and writing portable Makefiles, you’ll find that you don’t really need them. Following the macro conventions will cover most situations. For example:

CC: the C compiler program
CFLAGS: flags to pass to the C compiler
LDFLAGS: flags to pass to the linker (via the C compiler)
LDLIBS: libraries to pass to the linker

You don’t need to do anything weird with the assignments. The user invoking make can override them easily. For example, here’s part of a Makefile:

CC     = c99
CFLAGS = -Wall -Wextra -Os

But the user wants to use clang, and their system needs to explicitly link -lsocket (e.g. Solaris). The user can override the macro definitions on the command line:

$ make CC=clang LDLIBS=-lsocket

The same rules apply to the programs you invoke from the Makefile. Read the standards documents and ignore your system’s man pages so as to avoid accidentally using an extension. It’s especially valuable to learn the Bourne shell language and avoid any accidental bashisms in your Makefiles and scripts. The dash shell is good for testing your scripts.

Makefiles conforming to the standard will, unfortunately, be more verbose than those taking advantage of a particular implementation. If you know how to code Bourne shell — which is not terribly difficult to learn — then you might even consider hand-writing a configure script to generate the Makefile (a la metaprogramming). This gives you a more flexible language with conditionals, and, being generated, redundancy in the Makefile no longer matters.

As someone who frequently dabbles with BSD systems, my life has gotten a lot easier since learning to write portable Makefiles and scripts.

But what about Windows?

It’s the elephant in the room and I’ve avoided talking about it so far. If you want to build with Visual Studio’s command line tools — something I do on occasion — build portability goes out the window. Visual Studio has nmake.exe, which nearly conforms to POSIX make. However, without the standard unix utilities and with the completely foreign compiler interface for cl.exe, there’s absolutely no hope of writing a Makefile portable to this situation.

The nice alternative is MinGW(-w64) with MSYS or Cygwin supplying the unix utilities, though it has the problem of linking against msvcrt.dll. Another option is a separate Makefile dedicated to nmake.exe and the Visual Studio toolchain. Good luck defining a correctly working “clean” target with del.exe.

My preferred approach lately is an amalgamation build (as seen in Enchive): Carefully concatenate all the application’s sources into one giant source file. First concatenate all the headers in the right order, followed by all the C files. Use sed to remove any local includes. You can do this all on a unix system with the nice utilities, then point cl.exe at the amalgamation for the Visual Studio build. It’s not very useful for actual development (i.e. you don’t want to edit the amalgamation), but that’s what MinGW-w64 resolves.

What about all those POSIX functions? You’ll need to find Win32 replacements on MSDN. I prefer to do this by abstracting those operating system calls. For example, compare POSIX sleep(3) and Win32 Sleep().

#if defined(_WIN32)
#include <windows.h>

void
my_sleep(int s)
{
    Sleep(s * 1000);  // TODO: handle overflow, maybe
}

#else /* __unix__ */
#include <unistd.h>

void
my_sleep(int s)
{
    sleep(s);  // TODO: fix signal interruption
}
#endif

Then the rest of the program calls my_sleep(). There’s another example in the OpenMP article with pwrite(2) and WriteFile(). This demonstrates that supporting a bunch of different unix-like systems is really easy, but introducing Windows portability adds a disproportionate amount of complexity.

Caveat: paths and filenames

There’s one major complication with filenames for applications portable to Windows. In the unix world, filenames are null-terminated bytestrings. Typically these are Unicode strings encoded as UTF-8, but it’s not necessarily so. The kernel just sees bytestrings. A bytestring doesn’t necessarily have a formal Unicode representation, which can be a problem for languages that want filenames to be Unicode strings (also).

On Windows, filenames are somewhere between UCS-2 and UTF-16, but end up being neither. They’re really null-terminated unsigned 16-bit integer arrays. It’s almost UTF-16 except that Windows allows unpaired surrogates. This means Windows filenames also don’t have a formal Unicode representation, but in a completely different way than unix. Some heroic efforts have gone into working around this issue.

As a result, it’s highly non-trivial to correctly support all possible filenames on both systems in the same program, especially when they’re passed as command line arguments.

Summary

The key points are:

Document the standards your application requires and strictly stick to them.
Ignore the vendor documentation if it doesn’t clearly delineate extensions.

This was all a discussion of non-GUI applications, and I didn’t really touch on libraries. Many libraries are simple to access in the build (just add it to LDLIBS), but some libraries — GUIs in particular — are particularly complicated to manage portably and will require a more complex solution (pkg-config, CMake, Autoconf, etc.).

OpenMP and pwrite()

2017-03-01T21:22:24Z

The most common way I introduce multi-threading to small C programs is with OpenMP (Open Multi-Processing). It’s typically used as compiler pragmas to parallelize computationally expensive loops — iterations are processed by different threads in some arbitrary order.

Here’s an example that computes the frames of a video in parallel. Despite being computed out of order, each frame is written in order to a large buffer, then written to standard output all at once at the end.

size_t size = sizeof(struct frame) * num_frames;
struct frame *output = malloc(size);
float beta = DEFAULT_BETA;

/* schedule(dynamic, 1): treat the loop like a work queue */
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    float theta = compute_theta(i);
    compute_frame(&output[i], theta, beta);
}

write(STDOUT_FILENO, output, size);
free(output);

Adding OpenMP to this program is much simpler than introducing low-level threading semantics with, say, Pthreads. With care, there’s often no need for explicit thread synchronization. It’s also fairly well supported by many vendors, even Microsoft (up to OpenMP 2.0), so a multi-threaded OpenMP program is quite portable without #ifdef.

There’s real value in this pragma API: The above example would still compile and run correctly even when OpenMP isn’t available. The pragma is ignored and the program just uses a single core like it normally would. It’s a slick fallback.
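Here’s a stripped-down sketch of that fallback property. The fill_squares() function is a made-up stand-in for the frame computation. It produces the same array whether or not the compiler was asked to enable OpenMP (e.g. -fopenmp on GCC).

```c
#include <assert.h>

/* Each iteration writes only its own slot, so no synchronization is
 * needed. Without OpenMP the pragma is ignored and the loop runs
 * serially on one core. */
static void
fill_squares(long *out, int n)
{
    assert(out || n == 0);
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; i++)
        out[i] = (long)i * i;
}
```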

When a program really does require synchronization there’s omp_lock_t (mutex lock) and the expected set of functions to operate on them. This doesn’t have the nice fallback, so I don’t like to use it. Instead, I prefer #pragma omp critical. It nicely maintains the OpenMP-unsupported fallback.

/* schedule(dynamic, 1): treat the loop like a work queue */
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    struct frame *frame = malloc(sizeof(*frame));
    float theta = compute_theta(i);
    compute_frame(frame, theta, beta);
    #pragma omp critical
    {
        write(STDOUT_FILENO, frame, sizeof(*frame));
    }
    free(frame);
}

This would append the output to some output file in an arbitrary order. The critical section prevents interleaving of outputs.

There are a couple of problems with this example:

Only one thread can write at a time. If the write takes too long, other threads will queue up behind the critical section and wait.
The output frames will be out of order, which is probably inconvenient for consumers. If the output is seekable this can be solved with lseek(), but that only makes the critical section even more important.

There’s an easy fix for both, and it eliminates the need for a critical section: POSIX pwrite().

ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

It’s like write() but has an offset parameter. Unlike lseek() followed by a write(), multiple threads and processes can, in parallel, safely write to the same file descriptor at different file offsets. The catch is that the output must be a file, not a pipe.

#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    size_t size = sizeof(struct frame);
    struct frame *frame = malloc(size);
    float theta = compute_theta(i);
    compute_frame(frame, theta, beta);
    pwrite(STDOUT_FILENO, frame, size, size * i);
    free(frame);
}

There’s no critical section, the writes can interleave, and the output is in order.
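The loop above ignores the return value of pwrite(). As a sketch of a more careful version, the hypothetical write_record() helper below retries short writes, and roundtrip() shows records written out of order landing in order. I’m assuming _POSIX_C_SOURCE of 200809L here, and a temporary file from tmpfile() stands in for the redirected standard output.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write record number `index` at its own offset, retrying short writes.
 * Returns 1 on success, 0 on error. */
static int
write_record(int fd, const void *rec, size_t size, size_t index)
{
    const char *p = rec;
    off_t offset = (off_t)(size * index);
    assert(rec);
    while (size) {
        ssize_t n = pwrite(fd, p, size, offset);
        if (n <= 0)
            return 0;
        p += n;
        offset += n;
        size -= (size_t)n;
    }
    return 1;
}

/* Usage: write three records out of order, then read them back in order. */
static int
roundtrip(void)
{
    int in[3] = {10, 20, 30}, out[3] = {0, 0, 0};
    FILE *f = tmpfile();  /* a seekable regular file */
    if (!f)
        return 0;
    int fd = fileno(f);
    int ok = write_record(fd, &in[2], sizeof(in[2]), 2)
          && write_record(fd, &in[0], sizeof(in[0]), 0)
          && write_record(fd, &in[1], sizeof(in[1]), 1)
          && pread(fd, out, sizeof(out), 0) == (ssize_t)sizeof(out)
          && out[0] == 10 && out[1] == 20 && out[2] == 30;
    fclose(f);
    return ok;
}
```

This sketch doesn’t retry on EINTR, which a real program might want to handle.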

If you’re concerned about standard output not being seekable (it often isn’t), keep in mind that it will work just fine when invoked like so:

$ ./compute_frames > frames.ppm

Windows Portability

I talked about OpenMP being really portable, then used POSIX functions. Fortunately the Win32 WriteFile() function has an “overlapped” parameter that works just like pwrite(). Typically rather than call either directly, I’d wrap the write like so:

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>

static int
write_frame(struct frame *f, int i)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written;
    OVERLAPPED offset = {.Offset = sizeof(*f) * i};
    return WriteFile(out, f, sizeof(*f), &written, &offset);
}

#else /* POSIX */
#include <unistd.h>

static int
write_frame(struct frame *f, int i)
{
    size_t count = sizeof(*f);
    size_t offset = sizeof(*f) * i;
    return pwrite(STDOUT_FILENO, f, count, offset) == count;
}
#endif

Except for switching to write_frame(), the OpenMP part remains untouched.

Real World Example

Here’s an example in a real program:

julia.c

Notice because of pwrite() there’s no piping directly into ppmtoy4m:

$ ./julia > output.ppm
$ ppmtoy4m -F 60:1 < output.ppm > output.y4m
$ x264 -o output.mp4 output.y4m

output.mp4

Portable Structure Access with Member Offset Constants

2016-11-22T12:55:29Z

Suppose you need to write a C program to access a long sequence of structures from a binary file in a specified format. These structures have different lengths and contents, but share a common header identifying each structure’s type and size. Here’s the definition of that header (no padding):

struct event {
    uint64_t time;   // unix epoch (microseconds)
    uint32_t size;   // including this header (bytes)
    uint16_t source;
    uint16_t type;
};

The size member is used to find the offset of the next structure in the file without knowing anything else about the current structure. Just add size to the offset of the current structure.

The type member indicates what kind of data follows this structure. The program is likely to switch on this value.

The actual structures might look something like this (in the spirit of X-COM). Note how each structure begins with struct event as header. All angles are expressed using binary scaling.

#define EVENT_TYPE_OBSERVER            10
#define EVENT_TYPE_UFO_SIGHTING        20
#define EVENT_TYPE_SUSPICIOUS_SIGNAL   30

struct observer {
    struct event event;
    uint32_t latitude;   // binary scaled angle
    uint32_t longitude;  //
    uint16_t source_id;  // later used for event source
    uint16_t name_size;  // not including null terminator
    char name[];
};

struct ufo_sighting {
    struct event event;
    uint32_t azimuth;    // binary scaled angle
    uint32_t elevation;  //
};

struct suspicious_signal {
    struct event event;
    uint16_t num_channels;
    uint16_t sample_rate;  // Hz
    uint32_t num_samples;  // per channel
    int16_t samples[];
};

If all integers are stored in little endian byte order (least significant byte first), there’s a strong temptation to lay the structures directly over the data. After all, this will work correctly on most computers.

struct event header;
fread(&header, sizeof(header), 1, file);
switch (header.type) {
    // ...
}

This code will not work correctly when:

The host machine doesn’t use little endian byte order, though this is now uncommon. Sometimes developers will attempt to detect the byte order at compile time and use the preprocessor to byte-swap if needed. This is a mistake.
The host machine has different alignment requirements and so introduces additional padding to the structure. Sometimes this can be resolved with a non-standard #pragma pack.

Integer extraction functions

Fortunately it’s easy to write fast, correct, portable code for this situation. First, define some functions to extract little endian integers from an octet buffer (uint8_t). These will work correctly regardless of the host’s alignment and byte order.

static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)buf[1] << 8 |
           (uint16_t)buf[0] << 0;
}

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 |
           (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 |
           (uint32_t)buf[0] <<  0;
}

static inline uint64_t
extract_u64le(const uint8_t *buf)
{
    return (uint64_t)buf[7] << 56 |
           (uint64_t)buf[6] << 48 |
           (uint64_t)buf[5] << 40 |
           (uint64_t)buf[4] << 32 |
           (uint64_t)buf[3] << 24 |
           (uint64_t)buf[2] << 16 |
           (uint64_t)buf[1] <<  8 |
           (uint64_t)buf[0] <<  0;
}

The big endian version is identical, but with shifts in reverse order.
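As a quick sanity check, here’s a sketch that reads from a deliberately misaligned address and round-trips a value. The store_u32le() function is a hypothetical inverse added for this illustration, built the same way with shifts.

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 |
           (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 |
           (uint32_t)buf[0] <<  0;
}

/* Hypothetical inverse of extract_u32le(), for illustration only. */
static inline void
store_u32le(uint8_t *buf, uint32_t x)
{
    buf[0] = (uint8_t)(x >>  0);
    buf[1] = (uint8_t)(x >>  8);
    buf[2] = (uint8_t)(x >> 16);
    buf[3] = (uint8_t)(x >> 24);
}

static void
check_u32le(void)
{
    /* buf + 1 is deliberately misaligned for a 32-bit load. */
    uint8_t buf[5] = {0xff, 0x78, 0x56, 0x34, 0x12};
    assert(extract_u32le(buf + 1) == 0x12345678);

    store_u32le(buf + 1, 0xdeadbeef);
    assert(buf[1] == 0xef && buf[4] == 0xde);
    assert(extract_u32le(buf + 1) == 0xdeadbeef);
}
```

Because only individual octets are ever loaded, neither the host’s byte order nor its alignment rules come into play.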

A common concern is that these functions are a lot less efficient than they could be. On x86 where alignment is very relaxed, each could be implemented as a single load instruction. However, on GCC 4.x and earlier, extract_u32le compiles to something like this:

extract_u32le:
        movzx   eax, [rdi+3]
        sal     eax, 24
        mov     edx, eax
        movzx   eax, [rdi+2]
        sal     eax, 16
        or      eax, edx
        movzx   edx, [rdi]
        or      eax, edx
        movzx   edx, [rdi+1]
        sal     edx, 8
        or      eax, edx
        ret

It’s tempting to fix the problem with the following definition:

// Note: Don't do this.
static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return *(uint32_t *)buf;
}

It’s unportable, it’s undefined behavior, and worst of all, it might not work correctly even on x86. Fortunately I have some great news. On GCC 5.x and above, the correct definition compiles to the desired, fast version. It’s the best of both worlds.

extract_u32le:
        mov     eax, [rdi]
        ret

It’s even smart about the big endian version:

static inline uint32_t
extract_u32be(const uint8_t *buf)
{
    return (uint32_t)buf[0] << 24 |
           (uint32_t)buf[1] << 16 |
           (uint32_t)buf[2] <<  8 |
           (uint32_t)buf[3] <<  0;
}

Is compiled to exactly what you’d want:

extract_u32be:
        mov     eax, [rdi]
        bswap   eax
        ret

Or, even better, if your system supports movbe (gcc -mmovbe):

extract_u32be:
        movbe   eax, [rdi]
        ret

Unfortunately, Clang/LLVM is not this smart as of 3.9, but I’m betting it will eventually learn how to do this, too.

Member offset constants

For this next technique, that struct event from above need not actually be in the source. It’s purely documentation. Instead, let’s define the structure in terms of member offset constants — a term I just made up for this article. I’ve included the integer types as part of the name to aid in their correct use.

#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14

Given a buffer, the integer extraction functions, and these offsets, structure members can be plucked out on demand.

uint8_t *buf;
// ...
uint64_t time   = extract_u64le(buf + EVENT_U64LE_TIME);
uint32_t size   = extract_u32le(buf + EVENT_U32LE_SIZE);
uint16_t source = extract_u16le(buf + EVENT_U16LE_SOURCE);
uint16_t type   = extract_u16le(buf + EVENT_U16LE_TYPE);

On x86 with GCC 5.x, each member access will be inlined and compiled to a one-instruction extraction. As far as performance is concerned, it’s identical to using a structure overlay, but this time the C code is clean and portable. A slight downside is the lack of type checking on member access: it’s easy to mismatch the types and accidentally read garbage.
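One way to soften that downside, sketched here under my own naming, is to pair each offset with its extractor exactly once in a tiny accessor function. The event_*() names are hypothetical. The rest of the program then never touches an offset directly.

```c
#include <assert.h>
#include <stdint.h>

#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14

static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)(buf[1] << 8 | buf[0]);
}

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 | (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 | (uint32_t)buf[0] <<  0;
}

static inline uint64_t
extract_u64le(const uint8_t *buf)
{
    return (uint64_t)extract_u32le(buf + 4) << 32 | extract_u32le(buf);
}

/* Hypothetical typed accessors: each offset meets its extractor once. */
static inline uint64_t
event_time(const uint8_t *e)   { return extract_u64le(e + EVENT_U64LE_TIME); }
static inline uint32_t
event_size(const uint8_t *e)   { return extract_u32le(e + EVENT_U32LE_SIZE); }
static inline uint16_t
event_source(const uint8_t *e) { return extract_u16le(e + EVENT_U16LE_SOURCE); }
static inline uint16_t
event_type(const uint8_t *e)   { return extract_u16le(e + EVENT_U16LE_TYPE); }

static void
check_event(void)
{
    /* time = 1, size = 24, source = 2, type = 20 */
    uint8_t e[16] = {1, 0, 0, 0, 0, 0, 0, 0,  24, 0, 0, 0,  2, 0,  20, 0};
    assert(event_time(e) == 1);
    assert(event_size(e) == 24);
    assert(event_source(e) == 2);
    assert(event_type(e) == 20);
}
```

A mismatch between offset and type can still happen, but only inside these four one-liners where it’s easy to review.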

Memory mapping and iteration

There’s a real advantage to memory mapping the input file and using its contents directly. On a system with a huge virtual address space, such as x86-64 or AArch64, this memory is almost “free.” Already being backed by a file, paging out this memory costs nothing (i.e. it’s discarded). The input file can comfortably be much larger than physical memory without straining the system.

Unportable structure overlay can take advantage of memory mapping this way, but has the previously-described issues. An approach with member offset constants will take advantage of it just as well, all while remaining clean and portable.

I like to wrap the memory mapping code into a simple interface, which makes porting to non-POSIX platforms, such as Windows, easier. Caveat: This won’t work with files whose size exceeds the available contiguous virtual memory of the system, a real problem for 32-bit systems.

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

uint8_t *
map_file(const char *path, size_t *length)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return 0;

    struct stat stat;
    if (fstat(fd, &stat) == -1) {
        close(fd);
        return 0;
    }

    *length = stat.st_size;  // TODO: possible overflow
    uint8_t *p = mmap(0, *length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return p != MAP_FAILED ? p : 0;
}

void
unmap_file(uint8_t *p, size_t length)
{
    munmap(p, length);
}

Next, here’s an example that iterates over all the structures in input_file, in this case counting each. The size member is extracted in order to stride to the next structure.

size_t length;
uint8_t *data = map_file(input_file, &length);
if (!data)
    FATAL();

size_t event_count = 0;
uint8_t *p = data;
while (p < data + length) {
    event_count++;
    uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
    if (size == 0 || size > length - (size_t)(p - data))
        FATAL();  // invalid size
    p += size;
}
printf("I see %zu events.\n", event_count);

unmap_file(data, length);

This is the basic structure for navigating this kind of data. A deeper dive would involve a switch inside the loop, extracting the relevant members for whatever use is needed.
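As a rough sketch of that deeper dive, the loop below tallies events by type with a switch. The member offsets, the type member, and the type codes are hypothetical placeholders rather than the real layout, and tally_events is a name made up for this illustration. It also rejects sizes too small to hold the assumed header, so a corrupt zero size cannot stall the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a u32le size at offset 0 and a u32le type at
 * offset 4. These offsets are placeholders for illustration only. */
enum { EVENT_U32LE_SIZE = 0, EVENT_U32LE_TYPE = 4, EVENT_HEADER_SIZE = 8 };
enum { EVENT_TYPE_CREATE = 1, EVENT_TYPE_DELETE = 2 };

static uint32_t
extract_u32le(const uint8_t *p)
{
    return (uint32_t)p[0]       | (uint32_t)p[1] <<  8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Walk every structure in the buffer, counting by type.
 * Returns 0 on success or -1 if a size member is invalid. */
int
tally_events(const uint8_t *data, size_t length,
             size_t *creates, size_t *deletes, size_t *others)
{
    assert(data != NULL || length == 0);
    const uint8_t *p = data;
    const uint8_t *end = data + length;
    *creates = *deletes = *others = 0;
    while (p < end) {
        size_t remaining = (size_t)(end - p);
        if (remaining < EVENT_HEADER_SIZE)
            return -1;  /* truncated header */
        uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
        if (size < EVENT_HEADER_SIZE || size > remaining)
            return -1;  /* invalid size */
        switch (extract_u32le(p + EVENT_U32LE_TYPE)) {
        case EVENT_TYPE_CREATE:
            (*creates)++;
            break;
        case EVENT_TYPE_DELETE:
            (*deletes)++;
            break;
        default:
            (*others)++;  /* unknown type: skip over it */
            break;
        }
        p += size;
    }
    return 0;
}
```

Unknown types are skipped rather than treated as errors, which is one possible design choice: the size member alone is enough to stride past a structure the program doesn’t understand.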

Fast, correct, simple. Pick three.

Appending to a File from Multiple Processes

2016-08-03T16:17:44Z

Suppose you have multiple processes appending output to the same file without explicit synchronization. These processes might be working in parallel on different parts of the same problem, or these might be threads blocked individually reading different external inputs. There are two concerns that come into play:

1) The append must be atomic such that it doesn’t clobber previous appends by other threads and processes. For example, suppose a write requires two separate operations: first moving the file pointer to the end of the file, then performing the write. There would be a race condition should another process or thread intervene in between with its own write.

2) The output will be interleaved. The primary solution is to design the data format as atomic records, where the ordering of records is unimportant — like rows in a relational database. This could be as simple as a text file with each line as a record. The concern is then ensuring records are written atomically.

This article discusses processes, but the same applies to threads when directly dealing with file descriptors.

Appending

The first concern is solved by the operating system, with one caveat. On POSIX systems, opening a file with the O_APPEND flag will guarantee that writes always safely append.

If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.

However, this says nothing about interleaving. Two processes successfully appending to the same file will result in all their bytes in the file in order, but not necessarily contiguously.

The caveat is that not all filesystems are POSIX-compatible. Two famous examples are NFS and the Hadoop Distributed File System (HDFS). On these networked filesystems, appends are simulated and subject to race conditions.
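For illustration, here is a minimal sketch of appending one record through open(2) with O_APPEND. The helper name append_record is made up for this example, and it assumes a local POSIX filesystem where the guarantee quoted above holds. It only addresses the first concern: the write lands at the end of the file, but nothing here says the record stays contiguous.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one record to path with a single write(2).
 * Returns 0 on success, -1 on any failure or short write. */
int
append_record(const char *path, const char *record)
{
    assert(path != NULL && record != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
    if (fd == -1)
        return -1;
    size_t len = strlen(record);
    ssize_t n = write(fd, record, len);  /* offset moved to EOF by the OS */
    int ok = n >= 0 && (size_t)n == len;
    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```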

On POSIX systems, fopen(3) with the a flag will use O_APPEND, so you don’t necessarily need to use open(2). On Linux this can be verified for any language’s standard library with strace.

#include <stdio.h>

int main(void)
{
    fopen("/dev/null", "a");
    return 0;
}

And the result of the trace:

$ strace -e open ./a.out
open("/dev/null", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3

For Win32, the equivalent is the FILE_APPEND_DATA access right, and similarly only applies to “local files.”

Interleaving and Pipes

The interleaving problem has two layers, and gets more complicated the more correct you want to be. Let’s start with pipes.

On POSIX, a pipe is unseekable and doesn’t have a file position, so appends are the only kind of write possible. When writing to a pipe (or FIFO), writes of no more than the system-defined PIPE_BUF bytes are guaranteed to be atomic and non-interleaving.

Write requests of PIPE_BUF bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than PIPE_BUF bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, […]

The minimum value for PIPE_BUF for POSIX systems is 512 bytes. On Linux it’s 4kB, and on other systems it’s as high as 32kB. As long as each record is no larger than 512 bytes, a simple write(2) will do. None of this depends on a filesystem since no files are involved.

If PIPE_BUF bytes isn’t enough, the POSIX writev(2) can be used to atomically write up to IOV_MAX buffers of PIPE_BUF bytes. The minimum value for IOV_MAX is 16, but it is typically 1024. This means the maximum safe atomic write size for pipes (and therefore the largest record size) for a perfectly portable program is 8kB (16✕512). On Linux it’s 4MB.

That’s all at the system call level. There’s another layer to contend with: buffered I/O in your language’s standard library. Your program may pass data in appropriately-sized pieces for atomic writes to the I/O library, but it may be undoing your hard work, concatenating all these writes into a buffer, splitting apart your records. For this part of the article, I’ll focus on single-threaded C programs.

Suppose you’re writing a simple space-separated format with one line per record.

int foo, bar;
float baz;
while (condition) {
    // ...
    printf("%d %d %f\n", foo, bar, baz);
}

Whether or not this works depends on how stdout is buffered. C standard library streams (FILE *) have three buffering modes: unbuffered, line buffered, and fully buffered. Buffering is configured through setbuf(3) and setvbuf(3), and the initial buffering state of a stream depends on various factors. For buffered streams, the default buffer is at least BUFSIZ bytes, itself at least 256 (C99 §7.19.2¶7). Note: threads share this buffer.

Since each record in the above program easily fits inside 256 bytes, if stdout is a line buffered pipe then this program will interleave correctly on any POSIX system without further changes.

If instead your output is comma-separated values (CSV) and your records may contain new line characters, there are two approaches. In each, the record must still be no larger than PIPE_BUF bytes.

Unbuffered pipe: construct the record in a buffer (e.g. with sprintf(3)) and output the entire buffer in a single fwrite(3). While I believe this will always work in practice, it’s not guaranteed by the C specification, which defines fwrite(3) as a series of fputc(3) calls (C99 §7.19.8.2¶2).
Fully buffered pipe: set a sufficiently large stream buffer and follow each record with a fflush(3). Unlike fwrite(3) on an unbuffered stream, the specification says the buffer will be “transmitted to the host environment as a block” (C99 §7.19.3¶3), so this should be perfectly correct on any POSIX system.
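The fully buffered approach might look something like the following sketch. The helper names are invented for this example, and it assumes the caller supplies a buffer that is larger than any single record and outlives the stream.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: switch a stream to full buffering using a
 * caller-supplied buffer. Must be called before any other I/O on the
 * stream. Returns 0 on success. */
int
use_full_buffering(FILE *out, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return setvbuf(out, buf, _IOFBF, size);
}

/* Hypothetical helper: queue one complete record, then flush so the
 * buffer holds exactly one record each time it is handed to the OS.
 * Returns 0 on success, -1 on failure. */
int
emit_record(FILE *out, const char *record)
{
    if (fputs(record, out) == EOF)
        return -1;
    return fflush(out) == EOF ? -1 : 0;
}
```

A caller would typically pass stdout and a static buffer of at least PIPE_BUF bytes, then call emit_record once per CSV row, embedded newlines and all.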

If your situation is more complicated than this, you’ll probably have to bypass your standard library buffered I/O and call write(2) or writev(2) yourself.

Practical Application

If interleaving writes to a pipe stdout sounds contrived, here’s the real life scenario: GNU xargs with its --max-procs (-P) option to process inputs in parallel.

xargs -n1 -P$(nproc) myprogram < inputs.txt | cat > outputs.csv

The | cat ensures the output of each myprogram process is connected to the same pipe rather than to the same file.

A non-portable alternative to | cat, especially if you’re dispatching processes and threads yourself, is the splice(2) system call on Linux. It efficiently moves the output from the pipe to the output file without an intermediate copy to userspace. GNU Coreutils’ cat doesn’t use this.

Win32 Pipes

On Win32, anonymous pipes have no semantics regarding interleaving. Named pipes have per-client buffers that prevent interleaving. However, the pipe buffer size is unspecified, and requesting a particular size is only advisory, so it comes down to trial and error, though the unstated limits should be comparatively generous.

Interleaving and Files

Suppose instead of a pipe we have an O_APPEND file on POSIX. Common wisdom states that the same PIPE_BUF atomic write rule applies. While this often works, especially on Linux, this is not correct. The POSIX specification doesn’t require it and there are systems where it doesn’t work.

If you know the particular limits of your operating system and filesystem, and you don’t care much about portability, then maybe you can get away with interleaving appends. For full portability, pipes are required.

On Win32, writes on local files up to the underlying drive’s sector size (typically 512 bytes to 4kB) are atomic. Otherwise the only options are deprecated Transactional NTFS (TxF), or manually synchronizing your writes. All in all, it’s going to take more work to get correct.

Conclusion

My true use case for mucking around with clean, atomic appends is to compute giant CSV tables in parallel, with the intention of later loading into a SQL database (i.e. SQLite) for analysis. A more robust and traditional approach would be to write results directly into the database as they’re computed. But I like the platform-neutral intermediate CSV files — good for archival and sharing — and the simplicity of programs generating the data — concerned only with atomic write semantics rather than calling into a particular SQL database API.

Makefile Assignments are Turing-Complete

2016-04-30T03:01:22Z

For over a decade now, GNU Make has almost exclusively been my build system of choice, either directly or indirectly. Unfortunately this means I unnecessarily depend on some GNU extensions — an annoyance when porting to the BSDs. In an effort to increase the portability of my Makefiles, I recently read the POSIX make specification. I learned two important things: 1) ~~POSIX make is so barren it’s not really worth striving for~~ (update: I’ve changed my mind), and 2) make’s macro assignment mechanism is Turing-complete.

If you want to see it in action for yourself before reading further, here’s a Makefile that implements Conway’s Game of Life (40x40) using only macro assignments.

life.mak (174kB) [or generate your own]

Run it with any make program in an ANSI terminal. It must literally be named life.mak. Beware: if you run it longer than a few minutes, your computer may begin thrashing.

make -f life.mak

It’s 100% POSIX-compatible except for the sleep 0.1 (fractional sleep), which is only needed for visual effect.

A POSIX workaround

Unlike virtually every real world implementation, POSIX make doesn’t support conditional parts. For example, you might want your Makefile’s behavior to change depending on the value of certain variables. In GNU Make it looks like this:

ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
else
    EXTRA_FLAGS = -Wbar
endif

Or BSD-style:

.ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
.else
    EXTRA_FLAGS = -Wbar
.endif

If the goal is to write a strictly POSIX Makefile, how could I work around the lack of conditional parts and maintain a similar interface? The selection of macro/variable to evaluate can be dynamically selected, allowing for some useful tricks. First define the option’s default:

USE_FOO = 0

Then define both sets of flags:

EXTRA_FLAGS_0 = -Wbar
EXTRA_FLAGS_1 = -ffoo -lfoo

Now dynamically select one of these macros for assignment to EXTRA_FLAGS.

EXTRA_FLAGS = $(EXTRA_FLAGS_$(USE_FOO))

The assignment on the command line overrides the assignment in the Makefile, so the user gets to override USE_FOO.

$ make              # EXTRA_FLAGS = -Wbar
$ make USE_FOO=0    # EXTRA_FLAGS = -Wbar
$ make USE_FOO=1    # EXTRA_FLAGS = -ffoo -lfoo

Before reading the POSIX specification, I didn’t realize that the left side of an assignment can get the same treatment. If I really want the “if defined” behavior back, I can use the macro to mangle the left-hand side. For example,

EXTRA_FLAGS = -O0 -g3
EXTRA_FLAGS$(DEBUG) = -O3 -DNDEBUG

Caveat: If DEBUG is set to empty, it may still result in true for ifdef depending on which make flavor you’re using, but will always appear to be unset in this hack.

$ make             # EXTRA_FLAGS = -O3 -DNDEBUG
$ make DEBUG=yes   # EXTRA_FLAGS = -O0 -g3

This last case had me thinking: This is very similar to the (ab)use of the x86 mov instruction in mov is Turing-complete. These macro assignments alone should be enough to compute any algorithm.

Macro Operations

Macro names are just keys to a global associative array. This can be used to build lookup tables. Here’s a Makefile to “compute” the square root of integers between 0 and 10.

sqrt_0  = 0.000000
sqrt_1  = 1.000000
sqrt_2  = 1.414214
sqrt_3  = 1.732051
sqrt_4  = 2.000000
sqrt_5  = 2.236068
sqrt_6  = 2.449490
sqrt_7  = 2.645751
sqrt_8  = 2.828427
sqrt_9  = 3.000000
sqrt_10 = 3.162278
result := $(sqrt_$(n))

The BSD flavors of make have a -V option for printing variables, which is an easy way to retrieve output. I used an “immediate” assignment (:=) for result since some versions of make won’t evaluate the expression before -V printing.

$ make -f sqrt.mak -V result n=8
2.828427

Without -V, a default target could be used instead:

output :
        @printf "$(result)\n"

There are no math operators, so performing arithmetic requires some creativity. For example, integers could be represented as a series of x characters. The number 4 is xxxx, the number 6 is xxxxxx, etc. Addition is concatenation (note: macros can have + in their names):

A      = xxx
B      = xxxx
A+B    = $(A)$(B)

However, since there’s no way to “slice” a value, subtraction isn’t possible. A more realistic approach to arithmetic would require lookup tables.

Branching

Branching could be achieved through more lookup tables. For example,

square_0  = 0
square_1  = 1
square_2  = 4
# ...
result := $($(op)_$(n))

And called as:

$ make n=5 op=sqrt    # 2.236068
$ make n=5 op=square  # 25

Or using the DEBUG trick above, use the condition to mask out the results of the unwanted branch. This is similar to the mov paper.

result           := $(op)($(n)) = $($(op)_$(n))
result$(verbose) := $($(op)_$(n))

And its usage:

$ make n=5 op=square             # 25
$ make n=5 op=square verbose=1   # square(5) = 25

What about loops?

Looping is a tricky problem. However, one of the most common build (anti?)patterns is the recursive Makefile. Borrowing from the mov paper, which used an unconditional jump to restart the program from the beginning, I can achieve Turing-completeness in a Makefile by invoking it recursively, restarting the program with a new set of inputs.

Remember the output target above? I can loop by invoking make again with new inputs in this target,

output :
    @printf "$(result)\n"
    @$(MAKE) $(args)

Before going any further, now that loops have been added, the natural next question is halting. In reality, the operating system will take care of that after some millions of make processes have carelessly been invoked by this horribly inefficient scheme. However, we can do better. The program can clobber the MAKE variable when it’s ready to halt. Let’s formalize it.

loop = $(MAKE) $(args)
output :
    @printf "$(result)\n"
    @$(loop)

To halt, the program just needs to clear loop.

Suppose we want to count down to 0. There will be an initial count:

count = 6

A decrement table:

6 = 5
5 = 4
4 = 3
3 = 2
2 = 1
1 = 0
0 = loop

The last line will be used to halt by clearing the name on the right side. This is three star territory.

$($($(count))) =

The result (current iteration) loop value is computed from the lookup table.

result = $($(count))

The next loop value is passed via args. If loop was cleared above, this result will be discarded.

args = count=$(result)

With all that in place, invoking the Makefile will print a countdown from 5 to 0 and quit. This is the general structure for the Game of Life macro program.

Game of Life

A universal Turing machine has been implemented in Conway’s Game of Life. With all that heavy lifting done, one of the easiest methods today to prove a language’s Turing-completeness is to implement Conway’s Game of Life. Ignoring the criminal inefficiency of it, the Game of Life Turing machine could be run on the Game of Life simulation running on make’s macro assignments.

In the Game of Life program (the one linked at the top of this article) each cell is stored in a macro named xxyy, after its position. The top-leftmost cell is named 0000, then going left to right, 0100, 0200, etc. Providing input is a matter of assigning each of these macros. I chose X for alive and - for dead, but, as you’ll see, any two characters permitted in macro names would work as well.

$ make 0000=X 0100=- 0200=- 0300=X ...

The next part should be no surprise: The rules of the Game of Life are encoded as a 512-entry lookup table. The key is formed by concatenating the cell’s value along with all its neighbors, with itself in the center.
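Nobody would want to type 512 entries by hand. The following C sketch shows one way such a table could be generated. The function names are made up for this example, and the bit order (the first cell of the key varies fastest, the fifth is the cell itself) is an assumption chosen to agree with the excerpt of the table in this article, not necessarily how life.mak was produced.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build the 9-character key for table entry i
 * (0..511) and return the cell's next state under the standard rules.
 * Bit 4 of i is the cell itself; the other eight bits are neighbors. */
char
life_rule(unsigned i, char key[10])
{
    assert(i < 512);
    int neighbors = 0;
    for (int bit = 0; bit < 9; bit++) {
        int alive = (i >> bit) & 1;
        key[bit] = alive ? 'X' : '-';
        if (bit != 4)
            neighbors += alive;
    }
    key[9] = 0;
    int self = (i >> 4) & 1;
    /* Birth on exactly three neighbors, survival on two or three. */
    return (neighbors == 3 || (self && neighbors == 2)) ? 'X' : '-';
}

/* Print all 512 macro assignments, one per line. */
void
print_life_table(FILE *out)
{
    for (unsigned i = 0; i < 512; i++) {
        char key[10];
        char next = life_rule(i, key);
        fprintf(out, "%s = %c\n", key, next);
    }
}
```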

The “beginning” of the table looks like this:

--------- = -
X-------- = -
-X------- = -
XX------- = -
--X------ = -
X-X------ = -
-XX------ = -
XXX------ = X
---X----- = -
X--X----- = -
-X-X----- = -
XX-X----- = X
# ...

Note: The two right-hand X values here are the cell coming to life (exactly three living neighbors). Computing the next value (n0101) for 0101 is done like so:

n0101 = $($(0000)$(0100)$(0200)$(0001)$(0101)$(0201)$(0002)$(0102)$(0202))

Given these results, constructing the input to the next loop is simple:

args = 0000=$(n0000) 0100=$(n0100) 0200=$(n0200) ...

The display output, to be given to printf, is built similarly:

output = $(n0000)$(n0100)$(n0200)$(n0300)...

In the real version, this is decorated with an ANSI escape code that clears the terminal. The printf interprets the escape byte (\033) so that it doesn’t need to appear literally in the source.

And that’s all there is to it: Conway’s Game of Life running in a Makefile. Life, uh, finds a way.

Mapping Multiple Memory Views in User Space

2016-04-10T21:59:16Z

Modern operating systems run processes within virtual memory using a piece of hardware called a memory management unit (MMU). The MMU contains a page table that defines how virtual memory maps onto physical memory. The operating system is responsible for maintaining this page table, mapping and unmapping virtual memory to physical memory as needed by the processes it’s running. If a process accesses a page that is not currently mapped, it will trigger a page fault and the execution of the offending thread will be paused until the operating system maps that page.

This functionality allows for a neat hack: A physical memory address can be mapped to multiple virtual memory addresses at the same time. A process running with such a mapping will see these regions of memory as aliased — views of the same physical memory. A store to one of these addresses will simultaneously appear across all of them.

Some useful applications of this feature include:

An extremely fast, large memory “copy” by mapping the source memory over top of the destination memory.
Trivial interoperability between code instrumented with baggy bounds checking [PDF] and non-instrumented code. A few bits of each pointer are reserved to tag the pointer with the size of its memory allocation. For compactness, the stored size is rounded up to a power of two, making it “baggy.” Instrumented code checks this tag before making a possibly-unsafe dereference. Normally, instrumented code would need to clear (or set) these bits before dereferencing or before passing it to non-instrumented code. Instead, the allocation could be mapped simultaneously at each location for every possible tag, making the pointer valid no matter its tag bits.
Two responses to my last post on hotpatching suggested that, instead of modifying the instruction directly, memory containing the modification could be mapped over top of the code. I would copy the code to another place in memory, safely modify it in private, switch the page protections from write to execute (both for W^X and for other hardware limitations), then map it over the target. Restoring the original behavior would be as simple as unmapping the change.

Both POSIX and Win32 allow user space applications to create these aliased mappings. The original purpose for these APIs is for shared memory between processes, where the same physical memory is mapped into two different processes’ virtual memory. But the OS doesn’t stop us from mapping the shared memory to a different address within the same process.

POSIX Memory Mapping

On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions are shm_open(3), ftruncate(2), and mmap(2).

First, create a file descriptor to shared memory using shm_open. It has very similar semantics to open(2).

int shm_open(const char *name, int oflag, mode_t mode);

The name works much like a filesystem path, but is actually a different namespace (though on Linux it is a tmpfs mounted at /dev/shm). Resources created here (O_CREAT) will persist until explicitly deleted (shm_unlink(3)) or until the system reboots. It’s an oversight in POSIX that a name is required even if we never intend to access it by name. File descriptors can be shared with other processes via fork(2) or through UNIX domain sockets, so a name isn’t strictly required.

OpenBSD introduced shm_mkstemp(3) to solve this problem, but it’s not widely available. On Linux, as of this writing, the O_TMPFILE flag may or may not provide a fix (it’s undocumented).

The portable workaround is to attempt to choose a unique name, open the file with O_CREAT | O_EXCL (either atomically create the file or fail), shm_unlink the shared memory object as soon as possible, then cross our fingers. The shared memory object will still exist (the file descriptor keeps it alive) but will no longer be accessible by name.

int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd == -1)
    handle_error(); // non-local exit
shm_unlink("/example");

The shared memory object is brand new (O_EXCL) and is therefore of zero size. ftruncate sets it to the desired size. This does not need to be a multiple of the page size. Failing to allocate memory will result in a bus error on access.

size_t size = sizeof(uint32_t);
ftruncate(fd, size);

Finally mmap the shared memory into place just as if it were a file. We can choose an address (aligned to a page) or let the operating system choose one for us (NULL). If we don’t plan on making any more mappings, we can also close the file descriptor. The shared memory object will be freed as soon as it is completely unmapped (munmap(2)).

int prot = PROT_READ | PROT_WRITE;
uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
close(fd);

At this point both a and b have different addresses but point (via the page table) to the same physical memory. Changes to one are reflected in the other. So this:

*a = 0xdeafbeef;
printf("%p %p 0x%x\n", a, b, *b);

Will print out something like:

0x6ffffff0000 0x6fffffe0000 0xdeafbeef

It’s also possible to do all this only with open(2) and mmap(2) by mapping the same file twice, but you’d need to worry about where to put the file, where it’s going to be backed, and the operating system will have certain obligations about syncing it to storage somewhere. Using POSIX shared memory is simpler and faster.

Windows Memory Mapping

Windows is very similar, but directly supports anonymous shared memory. The key functions are CreateFileMapping and MapViewOfFileEx.

First create a file mapping object from an invalid handle value. Like POSIX, the word “file” is used without actually involving files.

size_t size = sizeof(uint32_t);
HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE,
                             NULL,
                             PAGE_READWRITE,
                             0, size,
                             NULL);

There’s no truncate step because the space is allocated at creation time via the two-part size argument.

Then, just like mmap:

uint32_t *a = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
uint32_t *b = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
CloseHandle(h);

If I wanted to choose the target address myself, I’d call MapViewOfFileEx instead, which takes the address as an additional argument.

From here on it’s the same as above.

Generalizing the API

Having some fun with this, I came up with a general API to allocate an aliased mapping at an arbitrary number of addresses.

int  memory_alias_map(size_t size, size_t naddr, void **addrs);
void memory_alias_unmap(size_t size, size_t naddr, void **addrs);

Values in the address array must either be page-aligned or NULL to allow the operating system to choose, in which case the map address is written to the array.

It returns 0 on success. It may fail if the size is too small (0), too large, too many file descriptors, etc.

Pass the same pointers back to memory_alias_unmap to free the mappings. When called correctly it cannot fail, so there’s no return value.

The full source is here: memalias.c

POSIX

Starting with the simpler of the two functions, the POSIX implementation looks like so:

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    for (size_t i = 0; i < naddr; i++)
        munmap(addrs[i], size);
}

The complex part is creating the mapping:

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    char path[128];
    snprintf(path, sizeof(path), "/%s(%ld,%p)",
             __FUNCTION__, (long)getpid(), (void *)addrs);
    int fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    shm_unlink(path);
    if (ftruncate(fd, size) == -1) {
        close(fd);
        return -1;
    }
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = mmap(addrs[i], size,
                        PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, 0);
        if (addrs[i] == MAP_FAILED) {
            memory_alias_unmap(size, i, addrs);
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}

The shared object name includes the process ID and pointer array address, so there really shouldn’t be any non-malicious name collisions, even if called from multiple threads in the same process.

Otherwise it just walks the array setting up the mappings.

Windows

The Windows version is very similar.

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    (void)size;
    for (size_t i = 0; i < naddr; i++)
        UnmapViewOfFile(addrs[i]);
}

Since Windows tracks the size internally, it’s unneeded and ignored.

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    HANDLE m = CreateFileMapping(INVALID_HANDLE_VALUE,
                                 NULL,
                                 PAGE_READWRITE,
                                 0, size,
                                 NULL);
    if (m == NULL)
        return -1;
    DWORD access = FILE_MAP_ALL_ACCESS;
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = MapViewOfFileEx(m, access, 0, 0, size, addrs[i]);
        if (addrs[i] == NULL) {
            memory_alias_unmap(size, i, addrs);
            CloseHandle(m);
            return -1;
        }
    }
    CloseHandle(m);
    return 0;
}

In the future I’d like to find some unique applications of these multiple memory views.

A Basic Just-In-Time Compiler

2015-03-19T04:57:55Z

This article was discussed on Hacker News and on reddit.

Monday’s /r/dailyprogrammer challenge was to write a program to read a recurrence relation definition and, through interpretation, iterate it to some number of terms. It’s given an initial term (u(0)) and a sequence of operations, f, to apply to the previous term (u(n + 1) = f(u(n))) to compute the next term. Since it’s an easy challenge, the operations are limited to addition, subtraction, multiplication, and division, with one operand each.

For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5 would be input as +2 *3 -5. If u(0) = 0 then,

u(1) = 1
u(2) = 4
u(3) = 13
u(4) = 40
u(5) = 121
…

Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86-64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I’ll go through how it works and how I did it.

Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).

Allocating Executable Memory

Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren’t allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.

The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).

Second, it catches program bugs sooner and reduces their impact, so there’s less chance for a flawed program to accidentally corrupt user data. Accessing memory in an invalid way will cause a segmentation fault, usually leading to program termination. For example, NULL points to a special page with read, write, and execute disabled.

An Instruction Buffer

Memory returned by malloc() and friends will be writable and readable, but non-executable. If the JIT compiler allocates memory through malloc(), fills it with machine instructions, and jumps to it without doing any additional work, there will be a segmentation fault. So some different memory allocation calls will be made instead, with the details hidden behind an asmbuf struct.

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

To keep things simple here, I’m just assuming the page size is 4kB. In a real program, we’d use sysconf(_SC_PAGESIZE) to discover the page size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this program will work correctly as-is regardless.
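For reference, a run-time query might look like the sketch below. Both function names are hypothetical, and the 4096 fallback is an arbitrary assumption for the unlikely case that sysconf() reports an error.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: ask the OS for the page size, falling back to
 * 4kB if sysconf() fails. */
size_t
query_page_size(void)
{
    long n = sysconf(_SC_PAGESIZE);
    return n > 0 ? (size_t)n : 4096;
}

/* Hypothetical helper: round a length up to a whole number of pages,
 * e.g. when sizing a buffer that will be handed to mmap(). */
size_t
round_up_to_page(size_t len)
{
    size_t page = query_page_size();
    assert((page & (page - 1)) == 0);  /* page sizes are powers of two */
    return (len + page - 1) & ~(page - 1);
}
```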

Instead of malloc(), the compiler allocates memory as an anonymous memory map (mmap()). It’s anonymous because it’s not backed by a file.

struct asmbuf *
asmbuf_create(void)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}
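One detail worth keeping in mind is that mmap() reports failure with MAP_FAILED rather than NULL. A caller that prefers the usual NULL convention could wrap it along these lines. This is a sketch with a made-up name (asmbuf_create_checked), repeating the struct definition only to be self-contained.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Hypothetical variant of asmbuf_create() that returns NULL on failure.
 * Anonymous mappings start zero-filled, so count begins at 0. */
struct asmbuf *
asmbuf_create_checked(void)
{
    assert(sizeof(struct asmbuf) == PAGE_SIZE);
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    void *p = mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```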

Windows doesn’t have POSIX mmap(), so on that platform we use VirtualAlloc() instead. Here’s the equivalent in Win32.

struct asmbuf *
asmbuf_create(void)
{
    DWORD type = MEM_RESERVE | MEM_COMMIT;
    return VirtualAlloc(NULL, PAGE_SIZE, type, PAGE_READWRITE);
}

Anyone reading closely should notice that I haven’t actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: “write xor execute.” That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.

The POSIX mprotect() function is used to change memory protections.

void
asmbuf_finalize(struct asmbuf *buf)
{
    mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}

Or on Win32 (that last parameter is not allowed to be NULL),

void
asmbuf_finalize(struct asmbuf *buf)
{
    DWORD old;
    VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}

Finally, instead of free() it gets unmapped.

void
asmbuf_free(struct asmbuf *buf)
{
    munmap(buf, PAGE_SIZE);
}

And on Win32,

void
asmbuf_free(struct asmbuf *buf)
{
    VirtualFree(buf, 0, MEM_RELEASE);
}

I won’t list the definitions here, but there are two “methods” for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.

void asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
void asmbuf_immediate(struct asmbuf *, int size, const void *value);
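For readers who want something concrete, here is one plausible sketch of those two functions. The struct layout (code bytes followed by a write cursor named count) is an assumption made for illustration, and there is no bounds checking.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Assumed layout: code bytes fill most of the page, followed by a cursor. */
struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Append the low `size` bytes of `ins`, most significant byte first, so
   that 0x4889f8 lands in memory as 48 89 f8. */
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

/* Append `size` raw bytes exactly as they sit in memory. On x86-64 that
   is already the little-endian order an immediate operand needs. */
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}
```

Writing instructions most-significant-byte first lets the caller paste the hex string straight from a disassembler listing.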

Calling Conventions

We’re only going to be concerned with three of x86-64’s many registers: rdi, rax, and rdx. These are 64-bit (r) extensions of the original 16-bit 8086 registers. The sequence of operations will be compiled into a function that we’ll be able to call from C like a normal function. Here’s what its prototype will look like. It takes a signed 64-bit integer and returns a signed 64-bit integer.

long recurrence(long);

The System V AMD64 ABI calling convention says that the first integer/pointer function argument is passed in the rdi register. When our JIT compiled program gets control, that’s where its input will be waiting. According to the ABI, the C program will be expecting the result to be in rax when control is returned. If our recurrence relation is merely the identity function (it has no operations), the only thing it will do is copy rdi to rax.

mov   rax, rdi

There’s a catch, though. You might think all the mucky platform-dependent stuff was encapsulated in asmbuf. Not quite. As usual, Windows is the oddball and has its own unique calling convention. For our purposes here, the only difference is that the first argument comes in rcx rather than rdi. Fortunately this only affects the very first instruction and the rest of the assembly remains the same.

The very last thing it will do, assuming the result is in rax, is return to the caller.

ret

So we know the assembly, but what do we pass to asmbuf_ins()? This is where we get our hands dirty.

Finding the Code

If you want to do this the Right Way, you go download the x86-64 documentation, look up the instructions we’re using, and manually work out the bytes we need and how the operands fit into it. You know, like they used to do out of necessity back in the 60’s.

Fortunately there’s a much easier way. We’ll have an actual assembler do it and just copy what it does. Put both of the instructions above in a file peek.s and hand it to nasm. It will produce a raw binary with the machine code, which we’ll disassemble with ndisasm (the NASM disassembler).

$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret

That’s straightforward. The first instruction is 3 bytes and the return is 1 byte.

asmbuf_ins(buf, 3, 0x4889f8);  // mov   rax, rdi
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3);      // ret
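The Windows difference mentioned earlier shows up in that first instruction. Running mov rax, rcx through the same nasm and ndisasm round trip gives the bytes 48 89 C8, so the choice can be made at compile time. The helper name first_arg_to_rax() below is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: encoding of the instruction that copies the first
   integer argument into rax under each calling convention. */
uint64_t
first_arg_to_rax(void)
{
#ifdef _WIN32
    return 0x4889c8;  /* mov rax, rcx (Microsoft x64) */
#else
    return 0x4889f8;  /* mov rax, rdi (System V AMD64) */
#endif
}
```

The first emitted instruction would then be asmbuf_ins(buf, 3, first_arg_to_rax()), with everything after it unchanged.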

For each operation, we’ll set it up so the operand will already be loaded into rdi regardless of the operator, similar to how the argument was passed in the first place. A smarter compiler would embed the immediate in the operator’s instruction if it’s small (32-bits or fewer), but I’m keeping it simple. To sneakily capture the “template” for this instruction I’m going to use 0x0123456789abcdef as the operand.

mov   rdi, 0x0123456789abcdef

Which disassembled with ndisasm is,

00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301

Notice the operand listed little endian immediately after the instruction. That’s also easy!

long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf);         // mov   rdi, operand
asmbuf_immediate(buf, 8, &operand);

Apply the same discovery process individually for each operator you want to support, accumulating the result in rax for each.

switch (operator) {
    case '+':
        asmbuf_ins(buf, 3, 0x4801f8);   // add   rax, rdi
        break;
    case '-':
        asmbuf_ins(buf, 3, 0x4829f8);   // sub   rax, rdi
        break;
    case '*':
        asmbuf_ins(buf, 4, 0x480fafc7); // imul  rax, rdi
        break;
    case '/':
        asmbuf_ins(buf, 2, 0x4899);     // cqo   (sign-extend rax into rdx)
        asmbuf_ins(buf, 3, 0x48f7ff);   // idiv  rdi
        break;
}

As an exercise, try adding support for the modulus operator (%), XOR (^), and bit shifts (<, >). With the addition of these operators, you could define a decent PRNG as a recurrence relation. It will also eliminate the closed form solution to this problem so that we actually have a reason to do all this! Or, alternatively, switch it all to floating point.

Calling the Generated Code

Once we’re all done generating code, finalize the buffer to make it executable, cast it to a function pointer, and call it. (I cast it as a void * just to avoid repeating myself, since that will implicitly cast to the correct function pointer prototype.)

asmbuf_finalize(buf);
long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);

That’s pretty cool if you ask me! Now this was an extremely simplified situation. There’s no branching, no intermediate values, no function calls, and I didn’t even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it’s a 1:1 translation.

I’d like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that’s more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.

Global State: a Tale of Two Bad C APIs

2014-10-12T22:48:00Z

Mutable global variables are evil. You’ve almost certainly heard that before, but it’s worth repeating. It makes programs, and libraries especially, harder to understand, harder to optimize, more fragile, more error prone, and less useful. If you’re using global state in a way that’s visible to users of your API, and it’s not essential to the domain, you’re almost certainly doing something wrong.

In this article I’m going to use two well-established C APIs to demonstrate why global state is bad for APIs: BSD regular expressions and POSIX Getopt.

BSD Regular Expressions

The BSD regular expression API dates back to 4.3BSD, released in 1986. It’s just a pair of functions: one compiles the regex, the other executes it on a string.

char *re_comp(const char *regex);
int   re_exec(const char *string);

It’s immediately obvious that there’s hidden internal state. Where else would the resulting compiled regex object be? Also notice there’s no re_free(), or similar, for releasing resources held by the compiled result. That’s because, due to its limited design, it doesn’t hold any. It’s entirely in static memory, which means there’s some upper limit on the complexity of the regex given to this API. Suppose an implementation does use dynamically allocated memory. It seems this might not matter when only one compiled regex is allowed. However, this would create warnings in Valgrind and make it harder to use for bug testing.

This API is not thread-safe. Only one thread can use it at a time. It’s not reentrant. While using a regex, calling another function that might use a regex means you have to recompile when it returns, just in case. The global state being entirely hidden, there’s no way to tell if another part of the program used it.

Fixing BSD Regular Expressions

This API has been deprecated for some time now, so hopefully no one’s using it anymore. 15 years after the BSD regex API came out, POSIX standardized a much better API. It operates on an opaque regex_t object, on which all state is stored. There’s no global state.

int    regcomp(regex_t *preg, const char *regex, int cflags);
int    regexec(const regex_t *preg, const char *string, ...);
size_t regerror(int errcode, const regex_t *preg, ...);
void   regfree(regex_t *preg);

This is what a good API looks like.
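A small sketch of what the explicit state buys: two compiled patterns can be alive at the same time, something the BSD pair cannot express. The function name matches_both() is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text matches both patterns, 0 if it does not, and -1 if
   either pattern fails to compile. */
int
matches_both(const char *pat_a, const char *pat_b, const char *text)
{
    regex_t a, b;
    if (regcomp(&a, pat_a, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    if (regcomp(&b, pat_b, REG_EXTENDED | REG_NOSUB) != 0) {
        regfree(&a);
        return -1;
    }
    int result = regexec(&a, text, 0, NULL, 0) == 0 &&
                 regexec(&b, text, 0, NULL, 0) == 0;
    regfree(&a);
    regfree(&b);
    return result;
}
```

Because each regex_t is an ordinary local variable, the function is also safe to call from several threads at once.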

Getopt

POSIX defines a C API called Getopt for parsing command line arguments. It’s a single function that operates on the argc and argv values provided to main(). An option string specifies which options are valid and whether or not they require an argument. Typical use looks like this,

int main(int argc, char **argv)
{
    int option;
    while ((option = getopt(argc, argv, "ab:c:d")) != -1) {
        switch (option) {
            case 'a':
            /* ... */
        }
    }
    /* ... */
    return 0;
}

The b and c options require an argument, indicated by the colons. When encountered, this argument is passed through a global variable optarg. There are four external global variables in total.

extern char *optarg;
extern int optind, opterr, optopt;

If an invalid option is found, getopt() will automatically print a locale-specific error message and return '?'. The opterr variable can be used to disable this message, and the optopt variable is used to get the actual invalid option character.

The optind variable keeps track of Getopt’s progress. It slides along argv as each option is processed. In a minimal, strictly POSIX-compliant Getopt, this is all the global state required.

The argc value in main(), and therefore the same parameter in getopt(), is completely redundant and serves no real purpose. Just like the C strings it points to, the argv vector is guaranteed to be NULL-terminated. At best it’s a premature optimization.
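To illustrate the redundancy, the count can always be recovered from the vector alone. The helper name count_args() is hypothetical.

```c
#include <stddef.h>

/* Walk argv until the terminating NULL pointer, the same way strlen()
   walks a string until the terminating zero byte. */
int
count_args(char **argv)
{
    int n = 0;
    while (argv[n] != NULL)
        n++;
    return n;
}
```

Inside main(), count_args(argv) gives the same number as argc.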

Threading and Reentrancy

The most immediate problem is that the entire program can only parse one argument vector at a time. It’s not thread-safe. This leaves out the possibility of parsing argument vectors in other threads. For example, if the program is a server that exposes a shell-like interface to remote users, and multiple threads are used to handle those requests, it won’t be able to take advantage of Getopt.

The second problem is that, even in a single-threaded application, the program can’t pause to parse a different argument vector before returning. It’s not reentrant. For example, suppose one of the arguments to the program is a string containing more arguments to be parsed for some subsystem.

#  -s    Provide a set of sub-options to pass to XXX.
$ myprogram -s "-a -b -c foo"

In theory, the value of optind could be saved and restored. However, this isn’t portable. POSIX doesn’t explicitly declare that the entire state is captured by optind, nor is it required to be. Implementations are allowed to have internal, hidden global state. This has implications in resetting Getopt.

Resetting Getopt

In a minimal, strict Getopt, resetting Getopt for parsing another argument vector is just a matter of setting optind back to its original value of 1. However, this idiom isn’t portable, and POSIX provides no portable method for resetting the global parser state.

Real implementations of Getopt go beyond POSIX. Probably the most popular extra feature is option grouping. Typically, multiple options can be grouped into a single argument, so long as only the final option requires an argument.

$ myprogram -adb foo

After processing a, optind cannot be incremented, because it’s still working on the first argument. This means there’s another internal counter for stepping across the group. In glibc this is called nextchar. Setting optind to 1 will not reset this internal counter, nor would it be detectable by Getopt if it was already set to 1. The glibc way to reset Getopt is to set optind to 0, which is otherwise an invalid value. Some other Getopt implementations follow this idiom, but it’s not entirely portable.

Not only does Getopt have nasty global state, the user has no way to reliably control it!

Error Printing

I mentioned that Getopt will automatically print an error message unless disabled with opterr. There’s no way to get at this error message, should you want to redirect it somewhere else. It’s more hidden, internal state. You could write your own message, but you’d lose out on the automatic locale support.

Fixing Getopt

The way Getopt should have been designed was to accept a context argument and store all state on that context. Following other POSIX APIs (pthreads, regex), the context itself would be an opaque object. In typical use it would have automatic (i.e. stack) duration. The context would either be zero initialized or a function would be provided to initialize it. It might look something like this (in the zero-initialized case).

int getopt(getopt_t *ctx, char **argv, const char *optstring);

Instead of optarg and optopt global variables, these values would be obtained by interrogating the context. The same applies for optind and the diagnostic message.

const char *getopt_optarg(getopt_t *ctx);
int         getopt_optopt(getopt_t *ctx);
int         getopt_optind(getopt_t *ctx);
const char *getopt_opterr(getopt_t *ctx);

Alternatively, instead of getopt_optind() the API could have a function that continues processing, but returns non-option arguments instead of options. It would return NULL when no more arguments are left. This is the API I’d prefer, because it would allow for argument permutation (allow options to come after non-options, per GNU Getopt) without actually modifying the argument vector. This common extension to Getopt could be added cleanly. The real Getopt isn’t designed well for extension.

const char *getopt_next_arg(getopt_t *ctx);

This API eliminates the global state and, as a result, solves all of the problems listed above. It’s essentially the same API defined by Popt and my own embeddable Optparse. They’re much better options if the limitations of POSIX-style Getopt are an issue.
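To make the idea concrete, here is a minimal sketch of a context-based short-option parser. It is not the Popt or Optparse API. The names struct optctx and optctx_next() are hypothetical, the context is exposed rather than opaque to keep the example short, and it handles only flags, grouping, and required arguments.

```c
#include <stddef.h>
#include <string.h>

/* All parser state lives here. A zero-initialized struct is ready to use. */
struct optctx {
    int optind;          /* next argv index; 0 means "not started" */
    int subopt;          /* position inside a group such as -abc */
    int optopt;          /* option character most recently examined */
    const char *optarg;  /* argument of the last option, or NULL */
};

/* Returns the option character, '?' on error, or -1 when done. */
int
optctx_next(struct optctx *ctx, char **argv, const char *optstring)
{
    if (ctx->optind == 0)
        ctx->optind = 1;                 /* skip argv[0] */
    ctx->optarg = NULL;
    char *arg = argv[ctx->optind];
    if (arg == NULL || arg[0] != '-' || arg[1] == '\0')
        return -1;                       /* end, non-option, or lone "-" */
    if (arg[1] == '-' && arg[2] == '\0') {
        ctx->optind++;                   /* consume "--" and stop */
        return -1;
    }
    const char *p = arg + 1 + ctx->subopt;
    int c = (unsigned char)*p;
    ctx->optopt = c;
    const char *spec = (c == ':') ? NULL : strchr(optstring, c);
    if (spec != NULL && spec[1] == ':') {
        /* Option requires an argument: rest of this arg, or the next one. */
        ctx->subopt = 0;
        ctx->optind++;
        if (p[1] != '\0')
            ctx->optarg = p + 1;
        else if (argv[ctx->optind] != NULL)
            ctx->optarg = argv[ctx->optind++];
        else
            return '?';
        return c;
    }
    /* Plain flag or unknown option: step within the group or move on. */
    if (p[1] != '\0') {
        ctx->subopt++;
    } else {
        ctx->subopt = 0;
        ctx->optind++;
    }
    return spec != NULL ? c : '?';
}
```

Two contexts can be interleaved freely, for example parsing the sub-option string from the -s example while the outer parse is still in progress, and resetting is just a matter of zeroing the struct.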

Pseudo-terminals

2012-04-23T00:00:00Z

My dad recently had an interesting problem at work related to serial ports. Since I use serial ports at work, he asked me for advice. They have third-party software which reads and analyzes sensor data from the serial port. It’s the only method this program has of inputting a stream of data and they’re unable to patch it. Unfortunately, they have another piece of software that needs to massage the data before this final program gets it. The data needs to be intercepted coming on the serial port somehow.

The solution they were aiming for was to create a pair of virtual serial ports. The filter software would read data in on the real serial port, output the filtered data into a virtual serial port which would be virtually connected to a second virtual serial port. The analysis software would then read from this second serial port. They couldn’t figure out how to set this up, short of buying a couple of USB/serial port adapters and plugging them into each other.

It turns out this is very easy to do on Unix-like systems. POSIX defines two functions, posix_openpt(3) and ptsname(3). The first one creates a pseudo-terminal — a virtual serial port — and returns a “master” file descriptor used to talk to it. The second provides the name of the pseudo-terminal device on the filesystem, usually named something like /dev/pts/5.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int fd = posix_openpt(O_RDWR | O_NOCTTY);
    printf("%s\n", ptsname(fd));
    /* ... read and write to fd ... */
    return 0;
}

The printed device name can be opened by software that’s expecting to access a serial port, such as minicom, and it can be communicated with as if by a pipe. This could be useful in testing a program’s serial port communication logic virtually.

The reason for the unusually long name is that the function wasn’t added to POSIX until 1998 (Unix98). They were probably afraid of name collisions with software already using openpt() as a function name. The GNU C Library provides an extension getpt(3), which is just shorthand for the above.

int fd = getpt();

Pseudo-terminal functionality was available much earlier, of course. It could be done through the poorly designed openpty(3), added in BSD Unix.

int openpty(int *amaster, int *aslave, char *name,
            const struct termios *termp,
            const struct winsize *winp);

It accepts NULL for the last three arguments, allowing the user to ignore them. What makes it so bad is that string name. The user would pass it a chunk of allocated space and hope it was long enough for the file name. If not, openpty() would overwrite the end of the string and trash some memory. It’s highly unlikely to ever exceed something like 32 bytes, but it’s still a correctness problem.

The newer ptsname() is only slightly better however. It returns a string that doesn’t need to be free()d, because it’s static memory. However, that means the function is not re-entrant; it has issues in multi-threaded programs, since that string could be trashed at any instant by another call to ptsname(). Consider this case,

int fd0 = getpt();
int fd1 = getpt();
printf("%s %s\n", ptsname(fd0), ptsname(fd1));

ptsname() will be returning the same char * pointer each time it’s called, merely filling the pointed-to space before returning. Rather than printing two different device filenames, the above would print the same filename twice. The GNU C Library provides an extension to correct this flaw, as ptsname_r(), where the user provides the memory as before but also indicates its maximum size.

To make a one-way virtual connection between our pseudo-terminals, create two of them and do the typical buffer thing between the file descriptors (for succinctness, no checking for errors),

while (1) {
    char buffer;
    int in = read(pt0, &buffer, 1);
    write(pt1, &buffer, in);
}

Making a two-way connection would require the use of threads or select(2), but it wouldn’t be much more complicated.
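A rough sketch of the select(2) variant follows. The names copy_chunk(), relay_once(), and relay_selfcheck() are hypothetical, and the self-check uses socket pairs as stand-ins for the two pseudo-terminal descriptors so that it runs anywhere.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Copy one chunk from src to dst. Returns the number of bytes copied,
   0 on end of file, or -1 on error. */
ssize_t
copy_chunk(int src, int dst)
{
    char buf[256];
    ssize_t n = read(src, buf, sizeof(buf));
    ssize_t off = 0;
    while (off < n) {
        ssize_t w = write(dst, buf + off, n - off);
        if (w < 0)
            return -1;
        off += w;
    }
    return n;
}

/* Block until either descriptor is readable, then forward what arrived.
   Returns 1 to keep going, 0 on end of file or error, -1 if select fails. */
int
relay_once(int a, int b)
{
    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(a, &fds);
    FD_SET(b, &fds);
    if (select((a > b ? a : b) + 1, &fds, NULL, NULL, NULL) < 0)
        return -1;
    if (FD_ISSET(a, &fds) && copy_chunk(a, b) <= 0)
        return 0;
    if (FD_ISSET(b, &fds) && copy_chunk(b, a) <= 0)
        return 0;
    return 1;
}

/* Returns 1 if one byte makes it across in each direction. */
int
relay_selfcheck(void)
{
    int x[2], y[2];
    char c = 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, x) != 0)
        return 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, y) != 0) {
        close(x[0]);
        close(x[1]);
        return 0;
    }
    int ok = write(x[0], "A", 1) == 1 && relay_once(x[1], y[0]) == 1 &&
             read(y[1], &c, 1) == 1 && c == 'A' &&
             write(y[1], "B", 1) == 1 && relay_once(x[1], y[0]) == 1 &&
             read(x[0], &c, 1) == 1 && c == 'B';
    close(x[0]); close(x[1]); close(y[0]); close(y[1]);
    return ok;
}
```

With real pseudo-terminals the main loop would simply be while (relay_once(pt0, pt1) == 1) with an empty body.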

While all this was new and interesting to me, it didn’t help my dad at all because they’re using Windows. These functions don’t exist there and creating virtual serial ports is a highly non-trivial, less-interesting process. Buying the two adapters and connecting them together is my recommended solution for Windows.